CUDA Libs Intro : cuBLAS

Ion Thruster
Nov 16, 2019

In this article I would like to introduce the popular cuBLAS library. "BLAS" here stands for Basic Linear Algebra Subprograms, a specification for a set of commonly used linear-algebra computations.

Motivation :
You might be thinking : why use a library for "basic" linear algebra computations like matrix multiply ? To give you some perspective, this is the performance difference you'll observe between a naive implementation on the CPU and a library like cuBLAS :

>>>  /usr/local/cuda/bin/nvcc cublas_matmul.cu -o matmul -lcublas -lcurand  -std=c++11
>>> ./matmul
TEST COMPLETED
Naive CPU Time : 8777.51 ms
cuBLAS GPU Time : 0.304928 ms
Speedup : 28785x !!

With thousands of man-hours spent over the past three to four decades optimizing BLAS implementations, it should be a no-brainer that the gains are extraordinary, and that you should use libraries like cuBLAS whenever possible. As for why GPUs in particular, I hope articles like this one help convince you; you can also have a peek at the performance measurements reported by NVIDIA [5].

Background :
Optimized linear-algebra computations have been around for a long time, but a specification for a subset of them was standardized in 1979 so that different hardware vendors and library writers could conform to the same high-level API while providing different low-level implementations [1]. The BLAS specification is split into three levels, depending on whether the operands are vectors or matrices, and a reference implementation is available for free [2]. Apart from the reference implementation, there are about two dozen implementations of BLAS today, differentiated primarily by the hardware they are optimized for (Intel, NVIDIA, ARM, POWER, etc.), by language, or by operating system. In this article we focus on cuBLAS, the official NVIDIA library for doing BLAS on NVIDIA GPUs.
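To make the three-level split concrete, each level is named after the shape of its operands :

// Level 1 : vector-vector operations,  e.g. SAXPY : y := alpha*x + y
// Level 2 : matrix-vector operations,  e.g. SGEMV : y := alpha*A*x + beta*y
// Level 3 : matrix-matrix operations,  e.g. SGEMM : C := alpha*A*B + beta*C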

cuBLAS :
There are actually several versions of the cuBLAS library [3] :
1. Legacy cuBLAS library (no longer recommended)
2. cuBLAS_v2 : the modern API, which we use in the example below
3. cuBLASXt : multi-GPU BLAS (level 3 BLAS only)
4. cuBLASLt : lightweight, covers only GEMM

In the example below we focus on one of the level 3 BLAS routines, SGEMM (Single Precision General Matrix Multiply). Looking at the SGEMM specification we see :

// SGEMM  performs one of the matrix-matrix operations
//
// C := alpha * op( A ) * op( B ) + beta * C
//
// where op( X ) is one of :
// op( X ) = X or op( X ) = X**T,
// alpha and beta are scalars, and A, B and C are matrices, with op( A ) an "m by k" matrix, op( B ) a "k by n" matrix and C an "m by n" matrix.
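
In cuBLAS this operation is exposed through the cublasSgemm entry point [3], whose signature mirrors the specification above. Note that cuBLAS, following the original Fortran BLAS, expects matrices in column-major storage, which is why the code below labels its inputs "col Major" :

cublasStatus_t cublasSgemm(cublasHandle_t handle,
                           cublasOperation_t transa, cublasOperation_t transb,
                           int m, int n, int k,
                           const float *alpha,
                           const float *A, int lda,
                           const float *B, int ldb,
                           const float *beta,
                           float *C, int ldc);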

Now, to use the cuBLAS API, at least three basic steps are needed (apart from setting up the input data) :
1. Initialize a handle to the cuBLAS library context using cublasCreate()
2. Call one of the available cuBLAS routines
3. Destroy the handle with cublasDestroy() once you are done

One important point to note is that the handle created in step 1 can be re-used as many times as needed (with different data inputs). Since creating a handle is a costly step performance-wise, it is recommended to reuse it for as long as the relevant context parameters (say, the device ID to run on) don't change. In terms of code, the three steps translate to just ~4 lines. That's it !

// STEP 1: Create cuBLAS Handle  
cublasHandle_t handle;
cublasCreate(&handle);
// STEP 2 : Call SGEMM
cublasSgemm(handle, ..<options>..);
// STEP 3 : Destroy Handle
cublasDestroy(handle);

The rest of the code is basically structured to feed this API. I've also taken the opportunity to use the cuRAND API (which I will cover in another article) to generate the random numbers that initialize the input data directly on the GPU. Finally, to simplify the code, I've used the Thrust library, which I also plan to cover in a future article. The overall code goes like this :

#include <cstdio>
#include <cublas_v2.h>
#include <curand.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>

#define DATATYPE float
#define A_T false // A is col Major
#define B_T false // B is col Major
#define C_T false // C is col Major

// Matrix dimensions : C is m x n, A is m x k, B is k x n
// (example sizes, pick whatever you want to benchmark)
#define MATRIX_M 1024
#define MATRIX_N 1024
#define MATRIX_K 1024

// Fill values using cuRAND (uniform random numbers, generated on the GPU)
void init_vals(DATATYPE *in, int N)
{
    curandGenerator_t prng;
    CURAND_CALL( curandCreateGenerator(&prng, CURAND_RNG_PSEUDO_DEFAULT) );
    CURAND_CALL( curandSetPseudoRandomGeneratorSeed(prng, 1234ULL) );
    CURAND_CALL( curandGenerateUniform(prng, in, N) );
    CURAND_CALL( curandDestroyGenerator(prng) );
}

// cuBLAS call
void cublas_matmul(const DATATYPE *A, const DATATYPE *B, DATATYPE *C,
                   const int m, const int n, const int k)
{
    // Leading dimensions depend on whether each matrix is transposed
    int lda = A_T ? k : m;
    int ldb = B_T ? n : k;
    int ldc = C_T ? n : m;
    const DATATYPE alpha = 1;
    const DATATYPE beta = 0;

    // STEP 1 : Create cuBLAS Handle
    cublasHandle_t handle;
    CUBLAS_CALL( cublasCreate(&handle) );

    // STEP 2 : Call SGEMM (the op flags must agree with the *_T settings above)
    CUBLAS_CALL( cublasSgemm(handle,
                             A_T ? CUBLAS_OP_T : CUBLAS_OP_N,
                             B_T ? CUBLAS_OP_T : CUBLAS_OP_N,
                             m, n, k, &alpha, A, lda, B, ldb, &beta, C, ldc) );

    // STEP 3 : Destroy Handle
    CUBLAS_CALL( cublasDestroy(handle) );
}

int main()
{
    // Create device vectors (GPU memory allocation)
    thrust::device_vector<DATATYPE> d_A(MATRIX_M * MATRIX_K);
    thrust::device_vector<DATATYPE> d_B(MATRIX_K * MATRIX_N);
    thrust::device_vector<DATATYPE> d_C(MATRIX_M * MATRIX_N);

    // Init values using cuRAND
    init_vals(thrust::raw_pointer_cast(d_A.data()), MATRIX_M * MATRIX_K);
    init_vals(thrust::raw_pointer_cast(d_B.data()), MATRIX_K * MATRIX_N);

    // Call cuBLAS
    cublas_matmul(thrust::raw_pointer_cast(d_A.data()),
                  thrust::raw_pointer_cast(d_B.data()),
                  thrust::raw_pointer_cast(d_C.data()),
                  MATRIX_M, MATRIX_N, MATRIX_K);

    // Copy back results to the CPU
    thrust::host_vector<DATATYPE> h_C_computed = d_C;

    // Do something with the computed result
    ...
}

Nice and simple, right ? This is one of the reasons I love using CUDA C++ and the NVIDIA libraries : the focus is on simplicity and clarity, both of which are essential for beginners (especially) and experienced programmers alike !
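
If you want to reproduce timing numbers like the ones shown at the top, one simple approach is to bracket the call with CUDA events. This is a sketch of my own rather than the article's measurement code (which lives in the full sample [4]). Two caveats : cuBLAS calls are asynchronous with respect to the host, so you must synchronize before reading the timer, and this particular measurement also includes handle creation and destruction, since cublas_matmul wraps all three steps :

// Timing sketch using CUDA events (illustrative only)
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
cublas_matmul(thrust::raw_pointer_cast(d_A.data()),
              thrust::raw_pointer_cast(d_B.data()),
              thrust::raw_pointer_cast(d_C.data()),
              MATRIX_M, MATRIX_N, MATRIX_K);
cudaEventRecord(stop);

// cuBLAS work is launched asynchronously : wait for it to finish
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("cuBLAS GPU Time : %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);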

Note : The CUBLAS_CALL and CURAND_CALL in the code above are just short macros which help with error checking. Also note that while Thrust adds simplicity, it can introduce some undesired overheads if you are not careful with it; once you gain some experience with profiling tools like nvprof, you can decide for yourself whether to use Thrust or handle the memory allocation and copying yourself.
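
For reference, here is a minimal sketch of what such error-checking macros might look like. The exact definitions used in this article are in the full code sample [4], so treat these as illustrative of the common idiom :

// Minimal error-checking macros (a common idiom; requires <cstdio> and <cstdlib>)
#define CUBLAS_CALL(x) do { \
    cublasStatus_t s = (x); \
    if (s != CUBLAS_STATUS_SUCCESS) { \
        fprintf(stderr, "cuBLAS error %d at %s:%d\n", (int)s, __FILE__, __LINE__); \
        exit(EXIT_FAILURE); \
    } \
} while (0)

#define CURAND_CALL(x) do { \
    curandStatus_t s = (x); \
    if (s != CURAND_STATUS_SUCCESS) { \
        fprintf(stderr, "cuRAND error %d at %s:%d\n", (int)s, __FILE__, __LINE__); \
        exit(EXIT_FAILURE); \
    } \
} while (0)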

Summary :
As you've seen above, the cuBLAS library is pretty easy to use, and the performance benefits far outweigh the additional work involved in using it. I hope this short introduction motivates you to get started with cuBLAS in your own projects; the full code sample, along with the code to measure performance, can be found here [4].

References :
[1] : https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms
[2] : http://www.netlib.org/blas/
[3] : https://docs.nvidia.com/cuda/cublas/index.html
[4] : https://github.com/IonThruster/CudaTutorials/blob/master/cublas_matmul.cu
[5] : https://developer.nvidia.com/cublas

#cuda #cuda-libraries #cuBLAS #thrust #cuRAND #NVIDIA
