GNU C inline asm input constraint for AVX512 mask registers (k1...k7)?

Question

AVX512 introduced opmask feature for its arithmetic commands. A simple example: godbolt.org.

#include <immintrin.h>
__m512i add(__m512i a, __m512i b) {
    __m512i sum;
    asm(
        "mov ebx, 0xAAAAAAAA;                                   \n\t"
        "kmovw k1, ebx;                                         \n\t"
        "vpaddd %[SUM] %{k1%}%{z%}, %[A], %[B];  # conditional add   "
        :   [SUM]   "=v"(sum)
        :   [A]     "v" (a),
            [B]     "v" (b)
        : "ebx", "k1"  // clobbers
       );
    return sum;
}

-march=skylake-avx512 -masm=intel -O3

 mov ebx,0xaaaaaaaa
 kmovw k1,ebx
 vpaddd zmm0{k1}{z},zmm0,zmm1

The problem is that k1 has to be specified.

Is there an input constraint like "r" for integers except that it picks a k register instead of a general-purpose register?

How about `Yk`? From [here](https://github.com/gcc-mirror/gcc/blob/master/gcc/config/i386/constraints.md#L81). — David Wohlferd, May 02 '19 at 05:55

Peter Cordes · Accepted Answer · 2021-02-25T04:19:19.370

__mmask16 is literally a typedef for unsigned short (and other mask types for other plain integer types), so we just need a constraint for passing it in a k register.

We have to go digging in the gcc sources config/i386/constraints.md to find it:

The constraint for any mask register is "k". Or use "Yk" for k1..k7 (which can be used as a predicate, unlike k0). You'd use an "=k" operand as the destination for a compare-into-mask, for example.

Obviously you can use "=Yk"(tmp) with a __mmask16 tmp to get the compiler to do register allocation for you, instead of just declaring clobbers on whichever "k" registers you decide to use.

Prefer intrinsics like `_mm512_maskz_add_epi32`

First of all, https://gcc.gnu.org/wiki/DontUseInlineAsm if you can avoid it. Understanding asm is great, but use that to read compiler output and/or figure out what would be optimal, then write intrinsics that can compile the way you want. Performance tuning info like https://agner.org/optimize/ and https://uops.info/ list things by asm mnemonic, and they're shorter / easier to remember than intrinsics, but you can search by mnemonic to find intrinsics on https://software.intel.com/sites/landingpage/IntrinsicsGuide/

Intrinsics will also let the compiler fold loads into memory source operands for other instructions; with AVX512 those can even be broadcast loads! Your inline asm forces the compiler to use a separate load instruction. Even a "vm" input won't let the compiler pick a broadcast-load as the memory source, because it wouldn't know the broadcast element width of the instruction(s) you were using it with.

Use _mm512_mask_add_epi32 or _mm512_maskz_add_epi32 especially if you're already using __m512i types from <immintrin.h>.

Also, your asm has a bug: you're using {k1} merge-masking not {k1}{z} zero-masking, but you used uninitialized __m512i sum; with an output-only "=v" constraint as the merge destination! As a stand-alone function, it happens to merge into a because the calling convention has ZMM0 = first input = return value register. But when inlining into other functions, you definitely can't assume that sum will pick the same register as a. Your best bet is to use a read/write operand for "+v"(a) and use is as the destination and first source.

Merge-masking only makes sense with a "+v" read/write operand. (Or in an asm statement with multiple instructions where you've already written an output once, and want to merge another result into it.)

Intrinsics would stop you from making this mistake; the merge-masking version has an extra input for the merge-target. (The asm destination operand).

Example using "Yk"

// works with -march=skylake-avx512 or -march=knl
// or just -mavx512f but don't do that.
// also needed: -masm=intel
#include <immintrin.h>
__m512i add_zmask(__m512i a, __m512i b) {
    __m512i sum;
    asm(
        "vpaddd %[SUM] %{%[mask]%}%{z%}, %[A], %[B];  # conditional add   "
        :   [SUM]   "=v"(sum)
        :   [A]     "v" (a),
            [B]     "v" (b),
            [mask]  "Yk" ((__mmask16)0xAAAA)
         // no clobbers needed, unlike your question which I fixed with an edit
       );
    return sum;
}

Note that all the { and } are escaped with % (https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Special-format-strings), so they're not parsed as dialect-alternatives {AT&T | Intel-syntax}.

This compiles with gcc as early as 4.9, but don't actually do that because it doesn't understand -march=skylake-avx512, or even have tuning settings for Skylake or KNL. Use a more recent GCC that knows about your CPU for best results.

Godbolt compiler explorer:

# gcc8.3 -O3 -march=skylake-avx512 or -march=knl  (and -masm=intel)
add(long long __vector, long long __vector):
        mov     eax, -21846
        kmovw   k1, eax         # compiler-generated
       # inline asm starts
        vpaddd zmm0 {k1}{z}, zmm0, zmm1;  # conditional add   
       # inline asm ends
        ret

-mavx512bw (implied by -march=skylake-avx512 but not knl) is required for "Yk" to work on an int. If you're compiling with -march=knl, integer literals need a cast to __mmask16 or __mask8, because unsigned int = __mask32 isn't available for masks.

[mask] "Yk" (0xAAAA) requires AVX512BW even though the constant does fit in 16 bits, just because bare integer literals always have type int. (vpaddd zmm has 16 elements per vector, so I shortened your constant to 16-bit.) With AVX512BW, you can pass wider constants or leave out the cast for narrow ones.

gcc6 and later support -march=skylake-avx512. Use that to set tuning as well as enabling everything. Preferably gcc8 or at least gcc7. Newer compilers generate less clunky code with new ISA extensions like AVX512 if you're ever using it outside of inline asm.
gcc5 supports -mavx512f -mavx512bw but doesn't know about Skylake.
gcc4.9 doesn't support -mavx512bw.

"Yk" is unfortunately not yet documented in https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html.

I knew where to look in the GCC source thanks to Ross's answer on In GNU C inline asm, what are the size-override modifiers for xmm/ymm/zmm for a single operand?

Just to leave a comment for anyone interested. Here is a naive example on the "Yk" constraint and relevant effects. https://godbolt.org/z/B8VLPz — tert, May 02 '19 at 08:04
@tert: there was a bug in your function I only just noticed: you merge-mask (not zero-mask) into `sum`, but it's uninitialized and you used an output-only `"=v"` for it. I updated my answer to zero-mask instead. And point out some advantages of intrinsics, like letting the compiler use a broadcast load as a memory operand, which is probably very hard to let the compiler choose with inline asm. — Peter Cordes, May 02 '19 at 19:35
This is correct. {z} is necessary. Thank you for pointing it out. — tert, May 03 '19 at 00:06
Is there a constraint to specify that an integer operand can NOT be a k register? I have a problem with the Clang compiler putting a k register into an inline assembly instruction that requires a g.p register. I don't know if this is a bug in Clang or I need to specify the register type. — A Fog, Jun 19 '19 at 08:56

score 6 · Answer 2 · answered May 02 '19 at 06:16

6

While it is undocumented, looking here we see:

(define_register_constraint "Yk" "TARGET_AVX512F ? MASK_REGS : NO_REGS" "@internal Any mask register that can be used as predicate, i.e. k1-k7.")

Editing your godbolt to this:

asm(
"vpaddd %[SUM] %{%[k]}, %[A], %[B]" 
: [SUM] "=v"(sum) 
: [A] "v" (a), [B] "v" (b), [k] "Yk" (0xaaaaaaaa) );

seems to produce the correct output.

That said, I usually try to discourage people from using inline asm (and undocumented features). Can you use _mm512_mask_add_epi32?

answered May 02 '19 at 06:16

David Wohlferd

7,110
2
29
56

Thank you for the answer and the suggestion. I am fully discouraged on this. But when it comes down to SIMD, its either intrinsics or inline asm. Both could produce pain. Understanding one a bit more just might help avoid the other. ; ) – tert May 02 '19 at 06:44
@tert: My advice: *understand* the asm design, then write C intrinsics that can compile to the asm you want. Sometimes you have to look up the right names for intrinsic functions if you only remember the shorter + simpler asm mnemonics, but if you want to look at any serious performance tuning you know that the asm is what really matters, and stuff like https://agner.org/optimize/ or https://uops.info/ have instruction tables that use asm mnemonics. And Intel's intrinsics finder is searchable by mnemonic to find intrinsics. – Peter Cordes May 02 '19 at 18:52
1

@tert: But it's still early days for AVX512 mask registers, so compilers almost always copy them to integer registers and back even if a single `kshift` or `kor` would have worked. But Intel has finally added mask intrinsics that really do compile to mask instructions like `kadd`, so if you need to help the compiler make better asm you can use those. Supported in current versions of all the major compilers. [Missing AVX-512 intrinsics for masks?](//stackoverflow.com/a/45174179) – Peter Cordes May 02 '19 at 18:54

GNU C inline asm input constraint for AVX512 mask registers (k1...k7)?

2 Answers2

Prefer intrinsics like `_mm512_maskz_add_epi32`

Example using "Yk"

Linked

GNU C inline asm input constraint for AVX512 mask registers (k1...k7)?

2 Answers2

Prefer intrinsics like _mm512_maskz_add_epi32

Example using "Yk"

Linked

Prefer intrinsics like `_mm512_maskz_add_epi32`