GCC inline assembler, mixing register sizes (x86)

Question

Does anyone know how I can get rid of the following assembler warning?

Code is x86, 32 bit:

int test (int x)
{
  int y;
  // do a bit-rotate by 8 on the lower word. leave upper word intact.
  asm ("rorw $8, %0\n\t": "=q"(y) :"0"(x));
  return y;
}

If I compile it I get the following (very valid) warning:

Warning: using `%ax' instead of `%eax' due to `w' suffix

What I'm looking for is a way to tell the compiler/assembler that I want to access the lower 16 bit sub-register of %0. Accessing the byte sub-registers (in this case AL and AH) would be nice to know as well.

I've already chosen the "q" constraint, so the compiler is forced to use EAX, EBX, ECX or EDX. That way I've made sure the compiler has to pick a register that has sub-registers.

I know that I can force the asm-code to use a specific register (and its sub-registers), but I want to leave the register-allocation job up to the compiler.

score 21 · C. K. Young · Accepted Answer · 2020-08-24T06:47:49.957

You can use %w0 if I remember right. I just tested it, too. :-)

int
test(int x)
{
    int y;
    asm ("rorw $8, %w0" : "=q" (y) : "0" (x));
    return y;
}

Edit: In response to the OP, yes, you can do the following too:

int
test(int x)
{
    int y;
    asm ("xchg %b0, %h0" : "=Q" (y) : "0" (x));
    return y;
}

For x86 it's documented in the x86 Operand Modifiers section of the Extended Asm part of the manual.
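
To put the two variants side by side, here is a minimal sketch of the %w, %b and %h modifiers in use. The function names, the "cc" clobber and the portable fallback for non-x86 targets are my own additions for illustration, not part of the answer above.

```c
#include <stdint.h>

#if defined(__i386__) || defined(__x86_64__)
#define SKETCH_HAVE_X86 1   /* hypothetical helper macro for this sketch */
#else
#define SKETCH_HAVE_X86 0
#endif

/* Swap the two low bytes of x and leave the upper 16 bits untouched. */
static uint32_t swap_low_ror(uint32_t x)
{
#if SKETCH_HAVE_X86
    uint32_t y;
    /* %w0 prints the 16-bit name of operand 0, e.g. %ax instead of %eax. */
    __asm__ ("rorw $8, %w0" : "=q" (y) : "0" (x) : "cc");
    return y;
#else
    /* Plain C fallback so the sketch still builds on other targets. */
    return (x & 0xffff0000u) | ((x & 0xffu) << 8) | ((x >> 8) & 0xffu);
#endif
}

static uint32_t swap_low_xchg(uint32_t x)
{
#if SKETCH_HAVE_X86
    uint32_t y;
    /* %b0 and %h0 print the low and high byte names, e.g. %al and %ah.
       "Q" limits the choice to a, b, c or d, the registers that have
       a high-byte name at all. */
    __asm__ ("xchg %b0, %h0" : "=Q" (y) : "0" (x));
    return y;
#else
    return (x & 0xffff0000u) | ((x & 0xffu) << 8) | ((x >> 8) & 0xffu);
#endif
}
```

Both functions should map 0x12345678 to 0x12347856; the register itself is still picked by the compiler.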

For non-x86 instruction sets, you may have to dig through their .md files in the GCC source. For example, gcc/config/i386/i386.md was the only place to find this before it was officially documented.

(Related: In GNU C inline asm, what are the size-override modifiers for xmm/ymm/zmm for a single operand? for vector registers.)

edited Aug 24 '20 at 06:47

answered Sep 23 '08 at 02:01

C. K. Young

I tested as well.. Do you know the modifiers for the low and high bytes as well? – Nils Pipenbrinck Sep 23 '08 at 02:06
`xchg %al, %ah` is 3 uops on Intel CPUs, and reading the 16-bit ax causes a partial-register stall or extra uops on some CPUs. `ror $8, %ax` is 1 uop, so it's definitely preferable. Also, operand modifiers are [now documented in the manual](https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#x86Operandmodifiers) (using this same example, probably not a coincidence :P). See also: operand modifiers for vector regs: http://stackoverflow.com/questions/34459803/in-gnu-c-inline-asm-whatre-the-modifiers-for-xmm-ymm-zmm-for-a-single-operand – Peter Cordes Aug 17 '16 at 22:21

Nathan Kurz · Answer 2 · 2013-07-24T09:39:15.567

Long ago, but I'll likely need this for my own future reference...

Adding on to Chris's fine answer: the key is using a modifier between the '%' and the number of the operand. For example, "MOV %1, %0" might become "MOV %q1, %w0".

I couldn't find anything in constraints.md, but /gcc/config/i386/i386.c had this potentially useful comment in the source for print_reg():

/* Print the name of register X to FILE based on its machine mode and number.
   If CODE is 'w', pretend the mode is HImode.
   If CODE is 'b', pretend the mode is QImode.
   If CODE is 'k', pretend the mode is SImode.
   If CODE is 'q', pretend the mode is DImode.
   If CODE is 'x', pretend the mode is V4SFmode.
   If CODE is 't', pretend the mode is V8SFmode.
   If CODE is 'h', pretend the reg is the 'high' byte register.
   If CODE is 'y', print "st(0)" instead of "st", if the reg is stack op.
   If CODE is 'd', duplicate the operand for AVX instruction.
 */

A comment further down, for ix86_print_operand(), offers an example:

b -- print the QImode name of the register for the indicated operand.

%b0 would print %al if operands[0] is reg 0.
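
As a rough illustration of how these codes change the printed register name, the sketch below extracts parts of a 32-bit value with movzx. The function names and the non-x86 fallback are assumptions of mine; only the b, h, w and k codes come from the comment quoted above.

```c
#include <stdint.h>

#if defined(__i386__) || defined(__x86_64__)
#define SKETCH_HAVE_X86 1   /* hypothetical helper macro for this sketch */
#else
#define SKETCH_HAVE_X86 0
#endif

/* Bits 0..7: %b1 prints the QImode (byte) name of the input register. */
static uint32_t low_byte(uint32_t x)
{
#if SKETCH_HAVE_X86
    uint32_t y;
    __asm__ ("movzbl %b1, %k0" : "=r" (y) : "q" (x));
    return y;
#else
    return x & 0xffu;
#endif
}

/* Bits 8..15: %h1 prints the high-byte name (%ah, %bh, %ch or %dh).
   Both operands use "Q" so that, in 64-bit mode, a high-byte register
   is never paired with a register that needs a REX prefix. */
static uint32_t second_byte(uint32_t x)
{
#if SKETCH_HAVE_X86
    uint32_t y;
    __asm__ ("movzbl %h1, %k0" : "=Q" (y) : "Q" (x));
    return y;
#else
    return (x >> 8) & 0xffu;
#endif
}

/* Bits 0..15: %w1 prints the HImode (16-bit) name; %k0 the SImode name. */
static uint32_t low_word(uint32_t x)
{
#if SKETCH_HAVE_X86
    uint32_t y;
    __asm__ ("movzwl %w1, %k0" : "=r" (y) : "r" (x));
    return y;
#else
    return x & 0xffffu;
#endif
}
```

Compiling with -S and reading the generated assembly is the easiest way to see which name each code produced.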

A few more useful options are listed under Output Template of the GCC Internals documentation:

‘%cdigit’ can be used to substitute an operand that is a constant value without the syntax that normally indicates an immediate operand.

‘%ndigit’ is like ‘%cdigit’ except that the value of the constant is negated before printing.

‘%adigit’ can be used to substitute an operand as if it were a memory reference, with the actual operand treated as the address. This may be useful when outputting a “load address” instruction, because often the assembler syntax for such an instruction requires you to write the operand as if it were a memory reference.

‘%ldigit’ is used to substitute a label_ref into a jump instruction.

‘%=’ outputs a number which is unique to each instruction in the entire compilation. This is useful for making local labels to be referred to more than once in a single template that generates multiple assembler instructions.
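
A small sketch of '%=' in practice: it makes the local label unique, so the asm statement can be inlined or duplicated without label clashes. The function name, the label name and the non-x86 fallback are made up for this example.

```c
#include <stdint.h>

/* Clamp x to at most 100 (unsigned comparison). */
static inline uint32_t clamp_to_100(uint32_t x)
{
#if defined(__i386__) || defined(__x86_64__)
    __asm__ ("cmpl $100, %0\n\t"
             "jbe .Lclamp_done%=\n\t"   /* %= expands to a unique number */
             "movl $100, %0\n"
             ".Lclamp_done%=:"
             : "+r" (x)
             :
             : "cc");
    return x;
#else
    /* Plain C fallback so the sketch builds on other targets. */
    return x > 100u ? 100u : x;
#endif
}
```

Without '%=' a second copy of this statement in the same translation unit would define the label twice and the assembler would reject it.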

The '%c1' construct allows one to properly format an LEA instruction using an offset:

#define ASM_LEA_ADD_BYTES(ptr, bytes)                            \
    __asm volatile("lea %c1(%0), %0" :                           \
                   /* reads/writes %0 */  "+r" (ptr) :           \
                   /* reads */ "i" (bytes));

Note the crucial but sparsely documented 'c' in '%c1'. This macro is equivalent to

ptr = (char *)ptr + bytes

but without making use of the usual integer arithmetic execution ports.
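
For reference, a usage sketch of the same idea. ADVANCE_PTR and skip_prefix are hypothetical names, and the non-x86 branch is a plain C stand-in I added so the example builds elsewhere.

```c
#include <stddef.h>

#if defined(__i386__) || defined(__x86_64__)
/* %c1 prints the constant as a bare "4" rather than "$4", which is what
   the displacement in "4(%reg)" requires in AT&T syntax. */
#define ADVANCE_PTR(ptr, bytes) \
    __asm__ volatile ("lea %c1(%0), %0" : "+r" (ptr) : "i" (bytes))
#else
#define ADVANCE_PTR(ptr, bytes) \
    ((ptr) = (void *)((char *)(ptr) + (bytes)))
#endif

/* Skip a fixed 4-byte prefix; the byte count must be a compile-time
   constant because of the "i" constraint. */
static const char *skip_prefix(const char *p)
{
    ADVANCE_PTR(p, 4);
    return p;
}
```

Calling skip_prefix(s) should give the same pointer as s + 4.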

Edit to add:

Making direct calls in x64 can be difficult, as it requires yet another undocumented modifier: '%P0' (which seems to be for PIC)

#define ASM_CALL_FUNC(func)                                         \
    __asm volatile("call %P0" :                                     \
              /* no writes */ :                                     \
              /* reads %0 */ "i" (func))

A lower case 'p' modifier also seems to function the same in GCC, although only the capital 'P' is recognized by ICC. More details are probably available at /gcc/config/i386/i386.c. Search for "'p'".

The "full" table is now in that sourcefile as well, as comment just before the function `ix86_print_operand()`. It also mentions (amongst others) the `%p..`/`%P..`. — FrankH., Oct 19 '13 at 07:26
For future readers: QI = quarter-int, HI = half, SI = single-int, DI=double-int, TI=tetra-int width. — Peter Cordes, Aug 22 '20 at 22:10
Just for historical interest, the fact that these are now documented in the GCC manual, using the examples from this question, is likely due to the OP reporting http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37621 - initially a GCC dev didn't want to set these internals in stone by documenting them, but at some point it got done. — Peter Cordes, May 01 '22 at 04:48

score 3 · Answer 3 · answered Sep 23 '08 at 17:12

While I'm thinking about it ... you should replace the "q" constraint with a capital "Q" constraint in Chris's second solution:

int
test(int x)
{
    int y;
    asm ("xchg %b0, %h0" : "=Q" (y) : "0" (x));
    return y;
}

"q" and "Q" are slightly different in 64-bit mode, where you can get the lowest byte for all of the integer registers (ax, bx, cx, dx, si, di, sp, bp, r8-r15). But you can only get the second-lowest byte (e.g. ah) for the four original 386 registers (ax, bx, cx, dx).

score 0 · Answer 4 · answered Sep 23 '08 at 06:20

So apparently there are tricks to do this... but it may not be so efficient. 32-bit x86 processors are generally slow at manipulating 16-bit data in general purpose registers. You ought to benchmark it if performance is important.

Unless this is (a) performance critical and (b) proves to be much faster, I would save myself some maintenance hassle and just do it in C:

uint32_t y, hi=(x&~0xffff), lo=(x&0xffff);
y = hi + (((lo >> 8) + (lo << 8))&0xffff);

With GCC 4.2 and -O2 this gets optimized down to six instructions...
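
Wrapped up as a complete function, the C version might look like the sketch below. The function name is my own, and I use | where the snippet above uses +; the two give the same result here because the combined bit fields never overlap.

```c
#include <stdint.h>

/* Swap the two low bytes of x, keep the upper 16 bits as they are. */
static uint32_t swap_low_bytes_c(uint32_t x)
{
    uint32_t hi = x & ~UINT32_C(0xffff);   /* upper word, unchanged */
    uint32_t lo = x & UINT32_C(0xffff);    /* lower word to be rotated */
    return hi | (((lo >> 8) | (lo << 8)) & UINT32_C(0xffff));
}
```

For example, swap_low_bytes_c(0x12345678) yields 0x12347856.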

answered Sep 23 '08 at 06:20

Dan Lenski

How is 6 instructions supposed to be faster than 1 instruction?! My timing tests (for a billion runs, 5 trials) were: my version = (4.38, 4.48, 5.03, 4.10, 4.18), your version = (5.33, 6.21, 5.62, 5.32, 5.29). – C. K. Young Sep 23 '08 at 11:21
So, we're looking at a 20% speed improvement. Isn't that "much faster"? – C. K. Young Sep 23 '08 at 11:23
Chris, absolutely right... your version *is* faster it seems. But not nearly as much as 6-instructions-vs.-1-instruction would lead you to expect, and that's what I was warning about. I didn't actually do the comparison myself, so props to you for testing it!! – Dan Lenski Sep 23 '08 at 16:38
@Dan, I need that lower byte swapping primitive for a larger tweak. I know that 16 bit operations in 32 bit code have been slow and frowned upon, but the code will be surrounded with other 32 bit operations. I hope that the slowness of the 16 bit code will just get lost in the out of order scheduling. What I want to achieve in the end is a mechanism to do all 24 possible byte permutations of a dword in-place. For this you need only three instructions at most: low-byte swap (e.g. xchg al, ah), bswap and 32 bit rotates. The in-place way does not need any constants (faster code fetch / decode time) – Nils Pipenbrinck Sep 23 '08 at 12:01
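
A quick way to convince yourself that those three primitives do reach all 24 byte orders is to model them in portable C and search. This is only a sketch with names of my own choosing; it checks reachability and says nothing about how many instructions a given permutation needs.

```c
#include <stdint.h>
#include <stddef.h>

/* Model of "xchg al, ah": swap the two low bytes. */
static uint32_t op_swap_low(uint32_t x)
{
    return (x & 0xffff0000u) | ((x & 0xffu) << 8) | ((x >> 8) & 0xffu);
}

/* Model of "bswap": reverse all four bytes. */
static uint32_t op_bswap(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0xff00u) | ((x << 8) & 0xff0000u) | (x << 24);
}

/* Model of a 32-bit rotate left by n bits, n in {8, 16, 24}. */
static uint32_t op_rol(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32u - n));
}

/* Breadth-first search over byte orders, starting from four distinct
   byte labels. At most 4! = 24 orders exist, so the array cannot overflow. */
static size_t count_reachable_permutations(void)
{
    uint32_t seen[24];
    size_t count = 0, head = 0;

    seen[count++] = 0x03020100u;
    while (head < count) {
        uint32_t cur = seen[head++];
        uint32_t next[5] = {
            op_swap_low(cur), op_bswap(cur),
            op_rol(cur, 8), op_rol(cur, 16), op_rol(cur, 24)
        };
        for (size_t i = 0; i < 5; i++) {
            int known = 0;
            for (size_t j = 0; j < count; j++)
                if (seen[j] == next[i])
                    known = 1;
            if (!known)
                seen[count++] = next[i];
        }
    }
    return count;
}
```

The search ends with 24 distinct values, which matches the group-theory view: a swap of two adjacent bytes plus a rotation of all four generates every permutation of four elements.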
The difference will be much bigger on Sandybridge-family CPUs than on 2008-era Core2 or Nehalem CPUs, which stall for 2 or 3 cycles while inserting a merging uop, vs. no stall on SnB. On Haswell, partial-register slowdowns are completely eliminated. See Agner Fog's microarch pdf for info about partial-register penalties. http://stackoverflow.com/tags/x86/info – Peter Cordes Aug 17 '16 at 22:26
