Optimizing register usage in dot product

Question

I'm developing a kernel function with several vector operations like scalar and vector products. The kernel uses a large amount of registers so that occupancy is very low. I'm trying to reduce the amount of used registers to improve occupancy.

Consider for example the following __device__ function performing a scalar product between two float3:

__device__ float dot(float3 in1, float3 in2) { return in1.x * in2.x + in1.y * in2.y + in1.z * in2.z; }

If I generate the .ptx file using

nvcc -ptx -gencode arch=compute_52,code=sm_52 -rdc=true simpleDot2.cu

(the file simpleDot2.cu contains only the definition of the __device__ function), I essentially obtain

    // .globl   _Z3dot6float3S_
.visible .func  (.param .b32 func_retval0) _Z3dot6float3S_(
    .param .align 4 .b8 _Z3dot6float3S__param_0[12],
    .param .align 4 .b8 _Z3dot6float3S__param_1[12]
)
{
    .reg .f32   %f<10>;


    ld.param.f32    %f1, [_Z3dot6float3S__param_0+8];
    ld.param.f32    %f2, [_Z3dot6float3S__param_0];
    ld.param.f32    %f3, [_Z3dot6float3S__param_0+4];
    ld.param.f32    %f4, [_Z3dot6float3S__param_1+8];
    ld.param.f32    %f5, [_Z3dot6float3S__param_1];
    ld.param.f32    %f6, [_Z3dot6float3S__param_1+4];
    mul.f32     %f7, %f3, %f6;
    fma.rn.f32  %f8, %f2, %f5, %f7;
    fma.rn.f32  %f9, %f1, %f4, %f8;
    st.param.f32    [func_retval0+0], %f9;
    ret;
}

From the .ptx code, it seems that a number of 9 registers are used, which perhaps can be lowered. I understand that the .ptx code is not the ultimate code executed by a GPU.

Question

Is there any chance to rearrange the register usage in the .ptx code, for example recycling registers f1-f6, so to reduce the overall number of occupied registers?

Thank you very much for any help.

I see your caveat about PTX, but you really have to compile and disassemble to see the definitive register footprint of the code and then start thinking about how to bend the assembler output to your will — talonmies, Mar 12 '18 at 13:16
@talonmies Following your suggestion, I have done several attempts. Now, with Norbert's answer, I'm definitely convinced that this is not the way to go. Thank you for your help. — Vitality, Mar 14 '18 at 16:03
Another comment in addition to the excellent answer by @njuffa. I notice you are using rdc. This will definitely have a negative impact on register pressure, for at least 2 reasons I can think of. If you have lots of these low-level vector operations, you might investigate re-structuring the library in such a way that you can avoid the rdc impact on these functions. — Robert Crovella, Mar 15 '18 at 13:49
@RobertCrovella Thank you Robert. Is relocatable compilation responsible of a higher register pressure due to missed inlining? — Vitality, Mar 15 '18 at 16:16
The two reasons I know of are the impact of needing to create a procedure call stack as well as the result of missed optimization since it can't be inlined. — Robert Crovella, Mar 15 '18 at 16:23
@RobertCrovella Would you like to add a short answer complementing that by @njuffa? I would upvote it. — Vitality, Mar 16 '18 at 13:40
[This answer](https://stackoverflow.com/questions/34046227/does-calling-device-functions-impact-the-number-of-registers-used-in-cuda/34046510#34046510) covers what I would say generally. — Robert Crovella, Mar 16 '18 at 14:58

score 3 · Accepted Answer · answered Mar 14 '18 at 15:16

TL;DR To first order, no.

PTX is both a virtual ISA and a compiler intermediate representation. The registers used in PTX code are virtual registers and there is no fixed relation to the physical registers of the GPU. The PTX code generated by the CUDA toolchain follows the SSA (static single assignment) convention. This means that every virtual register is written to exactly once. Stated differently: When an instruction produces a result it is assigned to a new register. This means that longer kernels may use thousands of registers.

In the CUDA toolchain, PTX code is compiled to machine code (SASS) by the ptxas component. So despite the name, this is not an assembler, but an optimizing compiler that can do loop unrolling, CSE (common subexpression elimination), and so on. Most importantly, ptxas is responsible for register allocation and instruction scheduling, plus all optimizations specific to a particular GPU architecture.

As a consequence, any examination of register usage issues needs to be focused on the machine code, which can be extracted with cuobjdump --dump-sass. Furthermore, the programmer has very limited influence on the number of registers used, because ptxas uses numerous heuristics when determining register allocation, in particular to trade off register usage with performance: scheduling loads early tends to increases register pressure by extension of the life range, so does the creation of temporary variable during CSE or the creation of induction variable for strength reduction in loops.

Modern versions of CUDA that target compute capability of 3.0 and higher usually make excellent choices when determining these trade-offs, and it is rarely necessary for programmers to consider register pressure. It is not clear what motivates asker's question in this regard.

The documented mechanisms in CUDA to control maximum register usage are the -maxrregcount command-line flag of nvcc, which applies to an entire compilation unit, and the __launch_bounds__ attribute that allows control on a per-kernel basis. See the CUDA documentation for details. Beyond that, one can try to influence register usage by choosing the pxtas optimization level with -Xptxas -O{1|2|3} (default is -O3), or by re-arranging source code, or by use of compiler flags that tend to simplify the generated code, such as -use_fast_math.

Of course such indirect methods could have numerous other effects that are generally unpredictable, and any desirable result achieved will be "brittle", e.g. easily destroyed by changing to a new version of the toolchain.

Thank you very much for your answer. The reason for asking is that I have a code with many vector operations (dot, cross) which consume many registers and this reduces occupancy to about `20%`. Anyway, since issuing the post, I have done several attempts and, with your answer, I'm now convinced that this is not the way to go. Thank you again. — Vitality, Mar 14 '18 at 16:02

Optimizing register usage in dot product

1 Answers1