30 Jul 2024
GPU Programming
To set up Google Colab, set the runtime type to T4 GPU and run:
!nvcc --version
!pip install nvcc4jupyter
%load_ext nvcc4jupyter
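With the extension loaded, CUDA code in a cell can be compiled and run using the `%%cuda` cell magic that nvcc4jupyter provides, for example:

```cuda
%%cuda
#include <cstdio>

__global__ void hello() {
    // printf is supported inside device code
    printf("Hello from the GPU!\n");
}

int main() {
    hello<<<1, 1>>>();        // launch 1 block with 1 thread
    cudaDeviceSynchronize();  // wait for the kernel to finish so its output appears
    return 0;
}
```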
GPU Programming basics
- Every CUDA program manages its own GPU (device) memory, which is separate from CPU (host) memory.
- To allocate memory on the GPU use `cudaMalloc(void **ptr, size_t size)`; `cudaFree(void *ptr)` frees that memory.
- GPU code cannot access memory allocated on the CPU: referencing a host variable directly in device code fails to compile, and dereferencing a host pointer inside a kernel fails at runtime.
- To copy memory between the CPU and GPU, use `cudaMemcpy(void *dst, const void *src, size_t count, cudaMemcpyKind kind)`, where `kind` is either `cudaMemcpyHostToDevice` or `cudaMemcpyDeviceToHost`.
- Every GPU function needs to be prefixed with `__global__` for it to be recognised by nvcc.
- Such a function is called a kernel. It is launched with the syntax `function<<<x, y>>>(args...)`, where `x` is the number of thread blocks and `y` is the number of threads per block (see the sketch after this list).
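As a sketch of how these pieces fit together, here is a minimal program (the kernel name, array size, and variable names are illustrative) that copies an array to the GPU, increments each element in a kernel, and copies the result back:

```cuda
#include <cstdio>

// Kernel: each thread increments one element of the array.
__global__ void add_one(int *data, int n) {
    int i = threadIdx.x;       // this thread's index within the block
    if (i < n) data[i] += 1;
}

int main() {
    const int n = 8;
    int host[n] = {0, 1, 2, 3, 4, 5, 6, 7};

    int *device;
    cudaMalloc((void **)&device, n * sizeof(int));                      // allocate GPU memory
    cudaMemcpy(device, host, n * sizeof(int), cudaMemcpyHostToDevice);  // CPU -> GPU

    add_one<<<1, n>>>(device, n);  // 1 thread block with n threads

    cudaMemcpy(host, device, n * sizeof(int), cudaMemcpyDeviceToHost);  // GPU -> CPU
    cudaFree(device);                                                   // free GPU memory

    for (int i = 0; i < n; i++) printf("%d ", host[i]);
    printf("\n");
    return 0;
}
```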
Thread Organization
- A kernel is launched as a grid of thread blocks, and each thread block in turn contains a grid of threads.
- Since both the grid and each block can be up to three-dimensional, this effectively acts like a 6D grid of threads. We can elide one or more dimensions for either.
- A maximum of 1024 threads is supported per thread block, so the product of the dimensions of the second launch argument cannot exceed 1024.
- We can pass a `dim3` argument for both kernel launch parameters, which lets us configure up to three dimensions for the grid and for each block. To access these dimensions, both size-wise and index-wise, CUDA defines built-in variables as follows (each also has `.y` and `.z` components):

|          | For the full grid | For the block  |
|----------|-------------------|----------------|
| Size     | `gridDim.x`       | `blockDim.x`   |
| Indexing | `blockIdx.x`      | `threadIdx.x`  |
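For example, here is a small sketch of a 2D launch (the grid and block dimensions are chosen arbitrarily) showing `dim3` and the usual global-index computation from these built-in variables:

```cuda
#include <cstdio>

// Each thread computes its global (x, y) coordinate from the built-in variables.
__global__ void print_coords(int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // global column index
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // global row index
    if (x < width && y < height)
        printf("thread (%d, %d)\n", x, y);
}

int main() {
    dim3 block(4, 4);  // 4 * 4 = 16 threads per block (product must stay <= 1024)
    dim3 grid(2, 2);   // 2 x 2 = 4 thread blocks
    print_coords<<<grid, block>>>(8, 8);
    cudaDeviceSynchronize();  // wait for all kernel output before exiting
    return 0;
}
```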