6

I have 2 variables to emulate X86 XMM & YMM, like below:

uint64_t xmm_value[2];
uint64_t ymm_value[4];

Now I want to use inline assembly to read & write to/from XMM/YMM registers.

  • How to write GCC inline assembly to copy xmm_value to register XMM0?
  • How to write GCC inline assembly to copy register YMM0 to ymm_value?

I already tried to search for sample inline assembly doing this, but could not find any good answer. Thanks!


So with some helps, I wrote this code, and it compiled OK. I use movups for XMM, and vmovups for YMM, like below. Is this correct, and can I still optimize my code?

__m128 xmm0;
__m256 ymm0;

// write to XMM0, and read from YMM0
__asm__("movups %1, %%xmm0\n\t"
        "vmovups %%ymm0, %0"
        : "=m"(ymm0)
        : "m"(xmm0)
        : "xmm0", "ymm0");

Update 2: here is my full code (with vpbroadcastb added)

__m128 xmm0;
__m256 ymm0;

// write to XMM0, and read from YMM0
__asm__("movups %1, %%xmm0\n\t"
        "vpbroadcastb %%xmm0, %%ymm0\n\t"
        "vmovups %%ymm0, %0"
        : "=m"(ymm0)
        : "m"(xmm0)
        : "xmm0", "ymm0");

The idea is that I want to copy xmm0 (variable) to XMM0, then run vpbroadcastb, then copy out the result in YMM0 to ymm0 (variable). Now I realize that XMM0 is a lower part of YMM0, so this code can still be improved?

aqua2019
  • 81
  • 1
  • 7
  • 2
    Why do you need to use inline assembly for this? – Michael Petch Aug 01 '19 at 16:24
  • There are a few howto available like [1](https://wiki.osdev.org/Inline_Assembly), [2](https://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html), etc.... – U880D Aug 01 '19 at 16:38
  • 1
    Possible duplicate of [In GNU C inline asm, what're the modifiers for xmm/ymm/zmm for a single operand?](https://stackoverflow.com/questions/34459803/in-gnu-c-inline-asm-whatre-the-modifiers-for-xmm-ymm-zmm-for-a-single-operand) – paulsm4 Aug 01 '19 at 16:39
  • Thanks for the pointers, but those links do not directly answer my issue, so I think others can still benefit from this question. – aqua2019 Aug 02 '19 at 02:39
  • 1
    Why would you want to use memory operands instead of XMM registers? Also, you probably want a `vmovups %1, %%xmm0` to zero-extend into YMM0 (just like writing EAX implicitly zero-extends into RAX). Writing XMM0 with a legacy SSE instruction leaves the upper lane unmodified. See also [Why is this SSE code 6 times slower without VZEROUPPER on Skylake?](//stackoverflow.com/q/41303780) for XMM false dependencies or SSE/AVX transition stalls. (*This* won't cause a transition stall on Haswell unless there are any YMM registers with dirty uppers, but mixing SSE and AVX requires care) – Peter Cordes Aug 02 '19 at 02:50
  • I think your attempt is safe from a correctness/safety point of view, but from any other POV (performance, maintainability, sanity of variable names) it makes zero sense vs. [`ymm0 = _mm256_castps128_ps256(xmm0)`](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SSE,SSE2,SSE3,SSSE3,SSE4_1,SSE4_2,AVX,AVX2,FMA,Other&expand=620&text=castps). Since XMM0 is the low lane of YMM0, having 2 separate C variables is just confusing. – Peter Cordes Aug 02 '19 at 02:53
  • Ah now I realize that XMM0 is a lower part of YMM0! Peter, actually I have another instruction in between those movpus/vmovups, see in my latest update. What is your suggestion? Thanks! – aqua2019 Aug 02 '19 at 03:23
  • Use `__m256 ymm0 = _mm256_set1_epi8( some_char );` Or in inline asm, use a memory source operand like `vpbroadcastb %1, %%ymm0`. There's zero point in having a `__m128i xmm0` variable exist at all if you're going to force it to be in memory and to do this with it. Look at the full compiler asm output (including compiler-generated asm around your inline asm) for a function using this and see how obviously bad it is. [How to remove "noise" from GCC/clang assembly output?](//stackoverflow.com/q/38552116). – Peter Cordes Aug 02 '19 at 03:28
  • what do you mean by "some_char" here? i still need to copy input variable to XMM0, so "vpbroadcast %xmm0, %ymm0" works. – aqua2019 Aug 02 '19 at 03:35
  • or if there is any intrinsic for `vpbroadcastb`, i would be happy to use that instead of writing this inline assembly code. – aqua2019 Aug 02 '19 at 04:10
  • "some char" - vpbroadcastb will *Broadcast an 8-bit value from a GPR to all bytes in the 256-bit destination.* An 8 bit value is a char. BTW, if that's really all you're doing, how is that different than `memset(ymm0, "some char", 16)` and get rid of all that inline asm crap? I have completely lost sight of what you are trying to accomplish here. – David Wohlferd Aug 02 '19 at 05:00
  • oh that memset() makes a lot more sense, I will try that, thanks! Btw, this "vpbroadcastb" is from libc, I have no idea why the compiler broadcast XMM0, but not a char. – aqua2019 Aug 02 '19 at 07:16

1 Answers1

3

The first step is to #include <immintrin.h>, which includes all the definitions for the needed types as well as all the Intel Intrinsics for accessing all the MMX/SSE/AVX instructions. For most purposes, you want to use those intrinsics and not inline assembly, as they are clearer and more portable, but if you really want to use inline asm, you can use the intrinsic types (__m64, __m128, __m128d, __m256, etc) along with an x constraint to bind to the correct kind of xmm/ymm register.

Chris Dodd
  • 119,907
  • 13
  • 134
  • 226
  • Chris, that is a nice pointer, but i cannot find any intrinsic let me read/write to a particular register, such as xmm0 or ymm0. Or do i miss something? Thanks! – aqua2019 Aug 01 '19 at 17:56
  • 1
    Why do you need to access a "particular register?" If you declare a variable as `__m128 myvar;` then you just use myvar. You don't know (or care) which number register it ends up in, you just pass it to the appropriate intrinsic to perform the desired function. Or if for some reason you do care, you need to give us more info telling us why. – David Wohlferd Aug 01 '19 at 21:43
  • David, i *have to* access to a particular register, for very special request of my low level project. This works at machine level, that i cannot avoid. The project is quite complicated, involving JIT, if you still want to know, thanks. – aqua2019 Aug 01 '19 at 23:35
  • 1
    It sounds like you are calling a function that takes a 128 bit value as an input, and returns a 256 bit value, and the function you are calling uses xmm0 for input and ymm0 for output. However, depending on your environment, those can actually be the registers the compiler normally uses to pass values for functions declared like `extern "C" __m256 example(__m128 a);`. I'd certainly experiment with that first. Solutions using inline asm are tricky, error prone, difficult to support, and should always be your last choice. Calling functions from inline asm (as here) are particularly bad. – David Wohlferd Aug 02 '19 at 00:43
  • 2
    @aq2019: you can use `register __m128 foo asm("xmm0")` to force an `"x"` constraint to pick XMM0. This sounds like a really terrible idea vs. JITing functions that follow the standard calling convention so you can write C prototypes, but if you want to write hard-to-maintain error-prone asm, that's how. Don't forget that you can't safely clobber the red-zone in inline asm so you need `add $-128, %rsp` before a `call`, for example. – Peter Cordes Aug 02 '19 at 01:24
  • I updated my question with my solution. Do you guys have any comments? – aqua2019 Aug 02 '19 at 02:36
  • Do you intend to invoke some code between those 2 statements? Looks pretty bizarre as written. – David Wohlferd Aug 02 '19 at 02:44
  • yes, just some simple code. but since I declared clobbered registers, there is no issue, right? – aqua2019 Aug 02 '19 at 03:10
  • Does that "simple code" include a `call` instruction? As peter and I have indicated, doing that within inline asm can be tricky. – David Wohlferd Aug 02 '19 at 03:20
  • David, yes there are some `call` between those variables (xmm0, ymm0), and the inline assembly. What do I need to pay attention to? – aqua2019 Aug 02 '19 at 03:25
  • 1
    Is this a single asm with mov, call, mov? That's a problem since the function you are calling can clobber registers (like eax), and you don't list them as clobbered. Also there are stack alignment questions and red zone issues. If you are doing this with multiple asm statements, the order of execution is not guaranteed, and the contents of the first assignment aren't guaranteed to be preserved past the asm instruction. It might work anyway, but designing your code based on breaking rules is a bad idea. Which is why I keep trying to steer you away from this approach. – David Wohlferd Aug 02 '19 at 04:55
  • but the inline assembly above does not clobber any registers. 3 lines of inline code above just do: copy stack variable to XMM0, then vpbroadcast XMM0 to YMM0, then copy YMM0 to another stack variable. So this is safe, regardless of any code around it. – aqua2019 Aug 02 '19 at 07:21
  • @PeterCordes Re: "`"x"` constraint to pick XMM0": thanks! Indeed, per [manual](https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html) the `x` stands for "Any SSE register". – pmor Aug 09 '23 at 10:11