An Introduction to Writing FP16 code for NVIDIA’s GPUs

FP16 (half precision) is an IEEE 754 format that uses half the bits of the traditional single-precision floating-point format (i.e. the 32-bit "float" keyword we use in C/C++). The main reasons for using this reduced-precision FP16 format are that many GPUs offer a hardware speedup for FP16 (if you are okay with the precision loss, that is) and that it gives a 2X savings in memory and bandwidth.

Specifically, certain GPUs offer anywhere between a 2X and 8X speedup on FP16 compared to FP32. Despite this, as beginners we often stick to FP32, because getting started with FP16 can be a bit tricky, mainly for two reasons:

  1. There is no built-in support for FP16 in standard C/C++. The ARM version of C/C++ apparently has pretty good support, but on x86 we need to use either special instructions or libraries to convert from FP32 to FP16 and back (which we cover in this article).
  2. As a beginner, it is easy to be intimidated by all the type conversions, especially when writing host/CPU code and passing data/pointers to the device/GPU side.

So in this article I try to give a brief introduction to writing code that uses FP16. The first hiccup in writing FP16 kernels is writing the host code, and for that we have two options for creating FP16 arrays on the CPU.

Option #1 :
uint16_t :
If you think about it, FP16 values can actually be stored on the host/CPU side as unsigned short ints (uint16_t, which also uses 16 bits), but we need some mechanism to convert from "float" to this 16-bit format. On x86 machines this can be achieved using Intel's FP16 conversion intrinsics, available as part of "immintrin.h"; if you are specifically interested in this, you can read more here, but note that these intrinsics require compiler support for the F16C instruction set (with gcc/clang you need a flag such as -mf16c, or you can use icc, Intel's C++ compiler, which nvcc can be told to use as its host compiler; have a look here and here).

Option #2 :
__half
: This is a datatype available as part of NVIDIA's FP16 header "cuda_fp16.h". In my opinion, this is the easiest way to get FP16 working on the CPU side, as the header readily provides functions to convert to and from the FP16 format. I'll be discussing this option here; for more details on the library itself and the functions it provides, have a look here.

For example, have a look at the basic code sample below; it shows how easy it is to use the __half datatype seamlessly across both the CPU and the GPU.

#include <iostream>
#include <cstdio>
#include <cstdlib>
#include <cuda_fp16.h>
using namespace std;

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

// Adds 1.0 to every element; the math is done in FP32 after converting up.
__global__ void half_plus1(__half *in_array)
{
    const int idx = threadIdx.x + blockDim.x * blockIdx.x;
    in_array[idx] = __float2half(__half2float(in_array[idx]) + 1.0f);
}

int main(void)
{
    const int n = 64;
    __half *h_in, *d_in;
    h_in = (__half*) malloc(n * sizeof(__half));
    gpuErrchk( cudaMalloc(&d_in, n * sizeof(__half)) );
    for (int i = 0; i < n; i++)
        h_in[i] = __float2half(1.5f);
    gpuErrchk( cudaMemcpy(d_in, h_in, n * sizeof(__half), cudaMemcpyHostToDevice) );
    dim3 block_dims(2, 1, 1);
    dim3 thread_dims(32, 1, 1);   // 2 blocks x 32 threads = 64 elements
    half_plus1<<<block_dims, thread_dims>>>(d_in);
    gpuErrchk( cudaPeekAtLastError() );
    gpuErrchk( cudaDeviceSynchronize() );
    gpuErrchk( cudaMemcpy(h_in, d_in, n * sizeof(__half), cudaMemcpyDeviceToHost) );
    for (int i = 0; i < n; i++)
    {
        if (__half2float(h_in[i]) != 2.5f)
        {
            cout << "Mismatch at " << i << " Expected = 2.5 "
                 << "Actual = " << __half2float(h_in[i]) << endl;
            exit(1);
        }
    }
    cout << "TEST PASSES" << endl;
    cudaFree(d_in);
    free(h_in);
    return 0;
}

The half-float or FP16 datatype provides us with two major benefits:
1. Space savings
2. Arithmetic speedup (on supporting hardware, FP16 math can run up to 2X faster than FP32)

But one important point to note is that on NVIDIA's GPU architectures (and most likely other GPUs as well), to achieve good memory bandwidth it is critical to use the __half2 datatype rather than __half on the GPU side. This is because:

  1. With __half and 32 threads/warp, we only hit ~64B per load, whereas to achieve good bandwidth we need to strive for at least 128B per load (this is favorable from both an L2 and a DRAM access-pattern perspective). Ideally, we would go for even bigger vector loads, which can be achieved by putting several __half or __half2 values in a struct and casting the loads to uint4 or float4.
  2. Also, to get the 2X math throughput, we need to pack two __half values together in a register to leverage the paired FP16 arithmetic instructions (more info here and here). The packing is needed because that is the format in which the FPU expects the data.

So, for the reasons above, the code changes to:

// Adds 1.0 to both halves of each __half2 with a single packed instruction.
__global__ void half2_plus1(__half2 *in_array)
{
    const int idx = threadIdx.x + blockDim.x * blockIdx.x;
    in_array[idx] = __hadd2(in_array[idx], __float2half2_rn(1.0f));
}

int main(void)
{
    const int n = 64;
    __half2 *h_in, *d_in;
    h_in = (__half2*) malloc(n * sizeof(__half2));
    gpuErrchk( cudaMalloc(&d_in, n * sizeof(__half2)) );
    for (int i = 0; i < n; i++)
        h_in[i] = __float2half2_rn(1.5f);
    gpuErrchk( cudaMemcpy(d_in, h_in, n * sizeof(__half2), cudaMemcpyHostToDevice) );
    dim3 block_dims(2, 1, 1);
    dim3 thread_dims(32, 1, 1);
    half2_plus1<<<block_dims, thread_dims>>>(d_in);
    gpuErrchk( cudaPeekAtLastError() );
    gpuErrchk( cudaDeviceSynchronize() );
    gpuErrchk( cudaMemcpy(h_in, d_in, n * sizeof(__half2), cudaMemcpyDeviceToHost) );
    for (int i = 0; i < n; i++)
    {
        if ((__high2float(h_in[i]) != 2.5f) || (__low2float(h_in[i]) != 2.5f))
        {
            cout << "Mismatch at " << i << " Expected = 2.5 "
                 << "Actual = " << __low2float(h_in[i]) << " " << __high2float(h_in[i]) << endl;
            exit(1);
        }
    }
    cout << "TEST PASSES" << endl;
    cudaFree(d_in);
    free(h_in);
    return 0;
}

Also note that some intrinsics like __hadd2 are supported only from certain GPU architectures onward (half-precision arithmetic requires compute capability 5.3 or higher), so don't forget to pass an appropriate "-arch" flag during nvcc compilation.
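For example, assuming the second program above is saved as half2_plus1.cu (a filename chosen here just for illustration), a compile line targeting an sm_53-class GPU might look like:

```shell
# Target compute capability 5.3, the minimum for __hadd2 and friends.
nvcc -arch=sm_53 -o half2_plus1 half2_plus1.cu
./half2_plus1
```

On newer hardware you would substitute your GPU's actual architecture (e.g. a higher sm_ value); without the flag, nvcc's default target may be too old and the half-precision intrinsics will fail to compile.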