    asm (   "movd %2,   %%xmm0;" // valueToAdd
            "movd %3,   %%xmm1;" // accumulatedError
            "movd %4,   %%xmm2;" // accumulator
            "movss %%xmm2,%%xmm3;" // save a copy of the old accumulator
            "subss %%xmm1,%%xmm0;" // valueToAdd - accumulatedError ---> y
            "addss %%xmm0,%%xmm2;" // accumulator + y ----> t
            "movss %%xmm2,%%xmm4;" // save a copy of t (the new accumulator)
            "subss %%xmm3,%%xmm2;" // t - accumulator ----> (t2)
            "subss %%xmm0,%%xmm2;"  // t2-y ---> (t3)[accumulated error]

            "movd %%xmm2,%0;"
            "movd %%xmm4,%1;"
            : "=r"(accumulatedError),"=r"(accumulator)
            : "r"(valueToAdd),"r"(accumulatedError),"r"(accumulator)
            : "%xmm0","%xmm1","%xmm2","%xmm3","%xmm4"
    );

Here I can pass 32-bit floats into the asm block by using the "r" constraint. Now I need to "vectorize" this block, and I don't know which constraint selects SSE/AVX registers.

The variable types I need to pass are __m128 and __m256. How can I pass them (in xmm/ymm registers) to inline asm with a constraint other than "r"?

The compiler is G++ 6.3.

huseyin tugrul buyukisik
  • Possible duplicate of https://stackoverflow.com/questions/34459803/in-gnu-c-inline-asm-whatre-the-modifiers-for-xmm-ymm-zmm-for-a-single-operand – Andreas H. Aug 06 '17 at 15:12
  • Which processor? Assembly code is processor dependent. Your code won't work with a 32-bit ARM processor. – Thomas Matthews Aug 06 '17 at 16:02
  • @ThomasMatthews Only SSE4.2 and AVX-1 CPUs. Should it work there? – huseyin tugrul buyukisik Aug 06 '17 at 16:03
  • Also, somebody should ask: Have you considered using intrinsics instead of inline asm? `_mm_sub_ss`, `_mm_add_ss`, etc might produce better code. – David Wohlferd Aug 07 '17 at 00:40
  • With the volatile keyword, the generated code is scalar AVX code that fetches from memory. Without volatile, optimization reorders the code and the algorithm loses its ability to keep adding 1.0 until the sum reaches 1e9. – huseyin tugrul buyukisik Aug 07 '17 at 00:42
  • Have you tried reading the documentation? – fuz Aug 08 '17 at 10:03
  • I looked here https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html#Machine-Constraints but it doesn't say which version it applies to. Is it for all versions? (Anyway, it says w is for sse and v is for avx.) – huseyin tugrul buyukisik Aug 08 '17 at 10:05
  • That link takes you to gcc's 'current development.' If you want to find the docs for a specific version of gcc, try https://gcc.gnu.org/onlinedocs/ – David Wohlferd Aug 09 '17 at 09:18
  • They don't say anything about ymm registers for 6.4. I'm using 6.3, and that manual says v is for ymm, and it works. – huseyin tugrul buyukisik Aug 09 '17 at 11:36
  • According to [this](https://stackoverflow.com/questions/34459803/in-gnu-c-inline-asm-whatre-the-modifiers-for-xmm-ymm-zmm-for-a-single-operand), you might replace %0 with %x0 or %t0 for xmm and ymm respectively. As @DavidWohlferd said, try implementing this with intrinsics; a good compiler then has a better chance of optimizing the overall code for you. – sandthorn Nov 12 '17 at 09:32
  • @sandthorn Yes, that's helpful, thanks. I wish there were a vectorized form of a memory fence. I was doing this for performance reasons, but it may not be necessary in the future once I learn CUDA. – huseyin tugrul buyukisik Nov 12 '17 at 11:09
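
For reference, a minimal sketch of what the comments point toward, assuming x86-64 with GCC or Clang. In GCC's x86 machine constraints, "x" selects any SSE register, so a `__m128` can be passed directly instead of going through "r" and `movd`. The function names `kahan_step_asm`, `kahan_step_intrin`, `repeat_add` and `check_variants_agree` are hypothetical and not from the question. The same constraint is expected to work for `__m256` when AVX is enabled (with VEX-encoded instructions and the operand printed as a ymm register), but that variant is not shown or tested here.

```cpp
#include <cassert>
#include <cstddef>
#include <immintrin.h>

// One Kahan step on four packed floats, written with GNU extended asm.
// "+x" = read-write operand in any SSE register,
// "=&x" = early-clobber output (written before all inputs are consumed).
static inline void kahan_step_asm(__m128 &acc, __m128 &err, __m128 value)
{
    __m128 old_acc;
    asm("subps  %[err], %[val]\n\t"   // val = value - err      (y)
        "movaps %[acc], %[old]\n\t"   // old = acc
        "addps  %[val], %[acc]\n\t"   // acc = acc + y          (t)
        "movaps %[acc], %[err]\n\t"   // err = t
        "subps  %[old], %[err]\n\t"   // err = t - old
        "subps  %[val], %[err]"       // err = (t - old) - y
        : [acc] "+x"(acc), [err] "+x"(err), [val] "+x"(value),
          [old] "=&x"(old_acc));
}

// The same step with intrinsics, as suggested in the comments.
static inline void kahan_step_intrin(__m128 &acc, __m128 &err, __m128 value)
{
    __m128 y = _mm_sub_ps(value, err);        // y = value - err
    __m128 t = _mm_add_ps(acc, y);            // t = acc + y
    err = _mm_sub_ps(_mm_sub_ps(t, acc), y);  // err = (t - acc) - y
    acc = t;
}

// Hypothetical helper: adds `value` to four independent lanes `count`
// times and returns lane 0 of the accumulator.
static float repeat_add(float value, std::size_t count, bool use_asm)
{
    __m128 acc = _mm_setzero_ps();
    __m128 err = _mm_setzero_ps();
    const __m128 v = _mm_set1_ps(value);
    for (std::size_t i = 0; i < count; ++i) {
        if (use_asm)
            kahan_step_asm(acc, err, v);
        else
            kahan_step_intrin(acc, err, v);
    }
    return _mm_cvtss_f32(acc);
}

// Usage sketch: both variants perform the same operations in the same
// order, so they should agree when compiled without -ffast-math.
static void check_variants_agree()
{
    assert(repeat_add(0.1f, 1000, true) == repeat_add(0.1f, 1000, false));
}
```

Letting the compiler pick the registers through named operands also removes the need for a clobber list. Note that plain `float` accumulation of 1.0f stalls at 16777216 (2^24); the compensated step is what lets the sum keep growing past that point, provided the compiler is not allowed to reassociate the arithmetic (for example via -ffast-math).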

0 Answers