How can I specify xmm/ymm registers as input/output for inline asm in C++?

Question

    asm (   "movd %2,   %%xmm0;" // valueToAdd
            "movd %3,   %%xmm1;" // accumulatedError
            "movd %4,   %%xmm2;" // accumulator
            "movss %%xmm2,%%xmm3;" // save a copy of accumulator
            "subss %%xmm1,%%xmm0;" // valueToAdd - accumulatedError ---> y
            "addss %%xmm0,%%xmm2;" // accumulator + y ----> t
            "movss %%xmm2,%%xmm4;" // save a copy of t
            "subss %%xmm3,%%xmm2;" // t - accumulator ----> (t2)
            "subss %%xmm0,%%xmm2;"  // t2-y ---> (t3)[accumulated error]

            "movd %%xmm2,%0;"
            "movd %%xmm4,%1;"
            : "=r"(accumulatedError),"=r"(accumulator)
            : "r"(valueToAdd),"r"(accumulatedError),"r"(accumulator)
            : "%xmm0","%xmm1","%xmm2","%xmm3","%xmm4"
    );

Here I can pass 32-bit floats into the asm block by using the "r" constraint. Now I need to vectorize this block, and I don't know which constraint selects SSE/AVX registers.

The variable types I need to pass are __m128 and __m256. How can I hand them to inline asm in xmm/ymm registers, using a constraint other than "r"?

Compiler is G++ 6.3.
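
A minimal sketch of what the vectorized block might look like for __m128, assuming x86-64 and GCC's x86 machine constraints, where "x" means "any SSE register". The names `kahan_step_ps` and `kahan_step_ps_selfcheck` are hypothetical and not from the original code. Because the compiler picks the registers, the `movd` copies and the clobber list are no longer needed.

```cpp
#include <cassert>
#include <xmmintrin.h>  // __m128, _mm_set1_ps, _mm_setzero_ps, _mm_storeu_ps

// Hypothetical helper: one Kahan summation step on four floats at once.
// "+x"  = read-write operand kept in an SSE register chosen by the compiler
// "=&x" = write-only scratch register, early-clobbered because it is written
//         while the other operands are still being read
inline void kahan_step_ps(__m128 valueToAdd, __m128 &accumulatedError, __m128 &accumulator)
{
    __m128 y = valueToAdd;  // turned into y = valueToAdd - accumulatedError
    __m128 t;               // receives t = accumulator + y
    asm("subps %[err], %[y]\n\t"    // y   = valueToAdd - accumulatedError
        "movaps %[acc], %[t]\n\t"   // t   = accumulator
        "addps %[y], %[t]\n\t"      // t   = accumulator + y
        "movaps %[t], %[err]\n\t"   // err = t
        "subps %[acc], %[err]\n\t"  // err = t - accumulator
        "subps %[y], %[err]\n\t"    // err = (t - accumulator) - y
        "movaps %[t], %[acc]\n\t"   // accumulator = t
        : [y] "+x"(y), [t] "=&x"(t),
          [err] "+x"(accumulatedError), [acc] "+x"(accumulator));
}

// Self-check: 2^24 + 1.0f is not representable as a float, so the sum stays
// at 2^24, and the lost 1.0f shows up in accumulatedError as -1.0f.
inline void kahan_step_ps_selfcheck()
{
    __m128 err = _mm_setzero_ps();
    __m128 acc = _mm_set1_ps(16777216.0f);
    kahan_step_ps(_mm_set1_ps(1.0f), err, acc);
    float a[4], e[4];
    _mm_storeu_ps(a, acc);
    _mm_storeu_ps(e, err);
    for (int i = 0; i < 4; ++i) {
        assert(a[i] == 16777216.0f);
        assert(e[i] == -1.0f);
    }
}
```

The sketch uses legacy SSE encodings (`subps`, `addps`); in a build with -mavx the VEX forms (`vsubps`, `vaddps`) would probably be the better fit, to avoid mixing the two encodings.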

possible duplicate of https://stackoverflow.com/questions/34459803/in-gnu-c-inline-asm-whatre-the-modifiers-for-xmm-ymm-zmm-for-a-single-operand — Andreas H., Aug 06 '17 at 15:12
Which processor? Assembly code is processor dependent. Your code won't work with a 32-bit ARM processor. — Thomas Matthews, Aug 06 '17 at 16:02
Also, somebody should ask: Have you considered using intrinsics instead of inline asm? `_mm_sub_ss`, `_mm_add_ss`, etc might produce better code. — David Wohlferd, Aug 07 '17 at 00:40
With the volatile keyword, the generated code is scalar AVX code that fetches its operands from memory. Without volatile, the optimizer reorders the code, and the algorithm loses its ability to keep adding 1.0 all the way up to 1e9. — huseyin tugrul buyukisik, Aug 07 '17 at 00:42
I looked here https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html#Machine-Constraints but it doesn't say whether it applies to a specific version. Is it for all versions? (Anyway, it says w is for sse and v is for avx.) — huseyin tugrul buyukisik, Aug 08 '17 at 10:05
That link takes you to gcc's 'current development.' If you want to find the docs for a specific version of gcc, try https://gcc.gnu.org/onlinedocs/ — David Wohlferd, Aug 09 '17 at 09:18
The docs for 6.4 don't say anything about ymm registers. I'm using 6.3, and that manual says v is for ymm, which works. — huseyin tugrul buyukisik, Aug 09 '17 at 11:36
According to [this](https://stackoverflow.com/questions/34459803/in-gnu-c-inline-asm-whatre-the-modifiers-for-xmm-ymm-zmm-for-a-single-operand), you can replace %0 with %x0 or %t0 for xmm or ymm respectively. As @DavidWohlferd said, try implementing this with intrinsics; a good compiler may then have more room to optimize overall performance for you. — sandthorn, Nov 12 '17 at 09:32
@sandthorn Yes, that's helpful, thanks. I wish there were a vectorized form of memory fence. I was doing this for performance reasons, but it may not be necessary in the future once I learn CUDA. — huseyin tugrul buyukisik, Nov 12 '17 at 11:09
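
To make the suggestions in the comments concrete, here is a hedged sketch of the same Kahan step for eight floats, once as AVX inline asm and once with intrinsics. It assumes x86-64 with GCC (or Clang). All function names are hypothetical. Pointers are used at the function boundary and `__attribute__((target("avx")))` is applied so the file builds without -mavx; with -mavx on the command line the attribute should not be needed and __m256 values could be passed directly. An "x" constraint on a __m256 operand makes the compiler print a ymm register name, so no operand modifier is used here.

```cpp
#include <cassert>
#include <immintrin.h>  // __m256 and the _mm256_* intrinsics

// Hypothetical helper: one Kahan step on eight floats using AVX inline asm.
// AT&T operand order for the VEX forms is "op src2, src1, dst".
__attribute__((target("avx")))
void kahan_step_avx_asm(const float *valueToAdd, float *accumulatedError, float *accumulator)
{
    __m256 y   = _mm256_loadu_ps(valueToAdd);
    __m256 err = _mm256_loadu_ps(accumulatedError);
    __m256 acc = _mm256_loadu_ps(accumulator);
    __m256 t;  // early-clobber output: written while acc is still needed
    asm("vsubps %[err], %[y], %[y]\n\t"    // y   = valueToAdd - accumulatedError
        "vaddps %[y], %[acc], %[t]\n\t"    // t   = accumulator + y
        "vsubps %[acc], %[t], %[err]\n\t"  // err = t - accumulator
        "vsubps %[y], %[err], %[err]\n\t"  // err = (t - accumulator) - y
        : [y] "+x"(y), [err] "+x"(err), [t] "=&x"(t)
        : [acc] "x"(acc));
    _mm256_storeu_ps(accumulatedError, err);
    _mm256_storeu_ps(accumulator, t);
}

// The same step written with intrinsics instead of inline asm.
__attribute__((target("avx")))
void kahan_step_avx_intrin(const float *valueToAdd, float *accumulatedError, float *accumulator)
{
    __m256 err = _mm256_loadu_ps(accumulatedError);
    __m256 acc = _mm256_loadu_ps(accumulator);
    __m256 y = _mm256_sub_ps(_mm256_loadu_ps(valueToAdd), err);
    __m256 t = _mm256_add_ps(acc, y);
    err = _mm256_sub_ps(_mm256_sub_ps(t, acc), y);
    _mm256_storeu_ps(accumulatedError, err);
    _mm256_storeu_ps(accumulator, t);
}

// Self-check: both versions should agree lane by lane. Skipped without AVX.
void kahan_step_avx_selfcheck()
{
    if (!__builtin_cpu_supports("avx")) return;
    float v[8], e1[8], a1[8], e2[8], a2[8];
    for (int i = 0; i < 8; ++i) {
        v[i] = 1.0f;
        e1[i] = e2[i] = 0.0f;
        a1[i] = a2[i] = 16777216.0f;  // 2^24: adding 1.0f is lost in the sum
    }
    kahan_step_avx_asm(v, e1, a1);
    kahan_step_avx_intrin(v, e2, a2);
    for (int i = 0; i < 8; ++i) {
        assert(a1[i] == a2[i]);
        assert(e1[i] == e2[i]);
    }
}
```

As far as I know, GCC does not reassociate floating-point operations unless options such as -ffast-math are given, so the intrinsics version should keep the compensation step intact in a default build; that is worth confirming in the generated assembly.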
