I'm developing a kernel function with several vector operations, such as scalar and vector products. The kernel uses so many registers that occupancy is very low, so I'm trying to reduce its register usage to improve occupancy.
Consider, for example, the following __device__ function, which computes the scalar product of two float3 values:
__device__ float dot(float3 in1, float3 in2) { return in1.x * in2.x + in1.y * in2.y + in1.z * in2.z; }
If I generate the .ptx file using
nvcc -ptx -gencode arch=compute_52,code=sm_52 -rdc=true simpleDot2.cu
(the file simpleDot2.cu contains only the definition of the __device__ function), I essentially obtain
// .globl _Z3dot6float3S_
.visible .func (.param .b32 func_retval0) _Z3dot6float3S_(
.param .align 4 .b8 _Z3dot6float3S__param_0[12],
.param .align 4 .b8 _Z3dot6float3S__param_1[12]
)
{
.reg .f32 %f<10>;
ld.param.f32 %f1, [_Z3dot6float3S__param_0+8];
ld.param.f32 %f2, [_Z3dot6float3S__param_0];
ld.param.f32 %f3, [_Z3dot6float3S__param_0+4];
ld.param.f32 %f4, [_Z3dot6float3S__param_1+8];
ld.param.f32 %f5, [_Z3dot6float3S__param_1];
ld.param.f32 %f6, [_Z3dot6float3S__param_1+4];
mul.f32 %f7, %f3, %f6;
fma.rn.f32 %f8, %f2, %f5, %f7;
fma.rn.f32 %f9, %f1, %f4, %f8;
st.param.f32 [func_retval0+0], %f9;
ret;
}
From the .ptx code, it appears that nine registers (%f1 through %f9) are used, which could perhaps be lowered. I understand that the .ptx code is not the final code executed by the GPU.
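Since the .ptx registers may not match what the hardware ends up using, here is a minimal sketch of how I could check the register count of the compiled code instead of counting registers in the .ptx. The names `dot3`, `dotKernel`, and `registersPerThread` are hypothetical and exist only for this check; `dot3` is the same function as above, renamed and called from a trivial kernel so that there is something to query. It relies on `cudaFuncGetAttributes` from the CUDA runtime API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Same scalar product as in the question, renamed (hypothetical name) for this test.
__device__ float dot3(float3 a, float3 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Hypothetical test kernel, not part of my real code: one scalar product per thread.
__global__ void dotKernel(const float3* a, const float3* b, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = dot3(a[i], b[i]);
    }
}

// Returns the number of registers per thread used by the compiled kernel,
// or -1 if the query fails (for example, when no CUDA device is available).
int registersPerThread()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, dotKernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
        return -1;
    }
    return attr.numRegs;
}
```

Calling `registersPerThread()` from `main` and printing the result should give the per-thread register count for the whole kernel, not for `dot3` alone. As far as I know, compiling with `nvcc -arch=sm_52 -Xptxas -v` reports the same figure at build time.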
Question
Is there any way to rearrange the register usage in the .ptx code, for example by recycling registers %f1-%f6, so as to reduce the overall number of registers used?
Thank you very much for any help.