But how is the first assembly code retrieving and storing the struct returned from the subroutine?
First of all, it doesn't return a struct, it returns a pointer to a struct in EAX. The function's return type is struct _A*. You don't show what it's pointing to; perhaps some static buffer in a non-thread-safe function?
It looks like you left out a rep movsd in the first example after setting up esi, edi, and ecx (your esx is obviously a typo). This will memcpy 4*8 = 32 bytes from the pointer returned in EAX to the static storage for A. (Note the mov edi, offset A to get the actual address of A into EDI.)
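As a rough source-level picture of that copy, here is a minimal sketch. The struct layout (eight 32-bit fields), the type name Payload, and the functions get_payload and store_result are all hypothetical, since the question doesn't show them; only the 32-byte size and the global A come from the asm.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical layout: eight 32-bit fields, 32 bytes in total.
struct Payload {
    std::int32_t v[8];
};
static_assert(sizeof(Payload) == 32, "expected 8 dwords");

Payload A;  // global with static storage, standing in for A in the question

// Hypothetical callee: returns a pointer to a static buffer (not thread-safe).
Payload* get_payload() {
    static Payload buf;
    for (int i = 0; i < 8; ++i) buf.v[i] = i + 1;
    return &buf;
}

// Source-level equivalent:  A = *get_payload();
void store_result() {
    Payload* src = get_payload();    // the pointer comes back in EAX
    std::memcpy(&A, src, sizeof A);  // copies 8 dwords, the job rep movsd does
    assert(A.v[0] == 1 && A.v[7] == 8);
}
```

In real code you would just write A = *get_payload(); the explicit memcpy is only there to make the 32-byte block copy visible.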
With a smaller struct, the compiler copies it with a few mov instructions instead of setting up for a rep movsd (which has considerable startup overhead and is a bad choice for a 32-byte copy if SSE is available), i.e. it fully unrolls the copy loop.
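For comparison, a sketch of the small-struct case, assuming a hypothetical 8-byte struct Pair and hypothetical functions get_pair and store_pair. The register names in the comments are only a guess at what a compiler might pick, not your compiler's actual output.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 8-byte struct: small enough to copy with two dword moves.
struct Pair {
    std::int32_t lo;
    std::int32_t hi;
};

Pair P;  // global destination

// Hypothetical callee returning a pointer to a static buffer.
Pair* get_pair() {
    static Pair buf{11, 22};
    return &buf;
}

// Source-level equivalent:  P = *get_pair();
void store_pair() {
    Pair* src = get_pair();  // pointer returned in EAX
    P.lo = src->lo;          // roughly: mov ecx, [eax]   / mov [P], ecx
    P.hi = src->hi;          // roughly: mov edx, [eax+4] / mov [P+4], edx
    assert(P.lo == 11 && P.hi == 22);
}
```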
(In the first version of this answer, I didn't look closely enough at the code, and based on the wording thought you were actually returning a struct by value when you talked about returning a struct. It seems a shame to delete what I wrote about that related case. Instead of a hidden pointer, you have an explicit pointer to an object that exists in the C++ source, not just in the asm implementation of what the C++ abstract machine does.)
Large structs returned by value are returned via a hidden pointer: the caller passes a pointer to the return-value buffer as a hidden first arg, and the function returns that same pointer in EAX for the convenience of the caller. This is typical for most calling conventions; see the links to calling convention docs in the x86 tag wiki.
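To make the hidden pointer concrete, here is a sketch that writes the lowering out by hand. Big, make_big, make_big_lowered, and compare_forms are hypothetical names; the second function is only an approximation of what the compiler generates for the first, not something you'd write yourself.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 32-byte struct, too large to return in registers.
struct Big {
    std::int32_t v[8];
};

// What the source says: return by value.
Big make_big() {
    Big b{};
    for (int i = 0; i < 8; ++i) b.v[i] = 10 * i;
    return b;
}

// Roughly what the calling convention turns it into: the caller supplies
// the storage, the callee fills it in and hands the same pointer back.
Big* make_big_lowered(Big* ret) {
    for (int i = 0; i < 8; ++i) ret->v[i] = 10 * i;
    return ret;
}

// Both forms produce the same bytes; only the plumbing differs.
void compare_forms() {
    Big by_value = make_big();
    Big slot;
    Big* p = make_big_lowered(&slot);
    assert(p == &slot);
    for (int i = 0; i < 8; ++i) assert(by_value.v[i] == slot.v[i]);
}
```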
The value of A itself is 32 bytes, and doesn't fit in a register. Often in asm you need a pointer to an object. push OFFSET A is probably part of calling a function that takes A by reference (probably explicitly in the C++ source; I don't think any of the standard x86 calling conventions implement pass-by-value as pass-by-const-reference, only by non-const reference e.g. for Windows x64, and maybe others).
Your compiler probably couldn't optimize A = foo(); (returning a large struct by value) by passing the address of A directly as the output pointer.
A is global, and the callee is allowed to assume that its return-value buffer doesn't alias the global A. The caller can't assume that the function doesn't access A directly, but according to the C++ abstract machine the value of A doesn't change until after the function returns.
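A sketch of why that shortcut would be unsafe, using a hand-lowered return (explicit pointer instead of the hidden one). The names Big, G, reset, reversed_lowered, assign_via_temp, and assign_in_place are hypothetical; G plays the role of the global A, and the callee reads it while building its result.

```cpp
#include <cassert>
#include <cstdint>

struct Big {
    std::int32_t v[8];
};

Big G;  // global, standing in for A

void reset() {
    for (int i = 0; i < 8; ++i) G.v[i] = i;
}

// Hand-lowered form of "Big reversed()": reads the global while writing
// its result through the return-buffer pointer.
Big* reversed_lowered(Big* ret) {
    for (int i = 0; i < 8; ++i) ret->v[i] = G.v[7 - i];
    return ret;
}

// What the abstract machine requires: G changes only after the call returns.
void assign_via_temp() {
    Big tmp;
    reversed_lowered(&tmp);
    G = tmp;
}

// The tempting shortcut: use G itself as the return buffer.
void assign_in_place() {
    reversed_lowered(&G);
}

// The two strategies disagree, so the compiler can't use the shortcut
// unless it can prove the callee never touches G.
void show_difference() {
    reset();
    assign_via_temp();
    assert(G.v[0] == 7 && G.v[7] == 0);  // properly reversed

    reset();
    assign_in_place();
    assert(G.v[0] == 7 && G.v[7] == 7);  // second half read clobbered data
}
```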