30 Jul 2024
GPU Programming
To set up Google Colab, set the runtime type to T4 GPU and run:
!nvcc --version
!pip install nvcc4jupyter
%load_ext nvcc4jupyter
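With the extension loaded, CUDA code in a cell can be compiled and run using the `%%cuda` cell magic that nvcc4jupyter provides, for example:

```cuda
%%cuda
#include <cstdio>

__global__ void hello() {
    // printf is supported inside device code
    printf("Hello from the GPU!\n");
}

int main() {
    hello<<<1, 1>>>();        // launch 1 block with 1 thread
    cudaDeviceSynchronize();  // wait for the kernel to finish so its output appears
    return 0;
}
```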
GPU Programming basics
- Every CUDA program manages its own GPU (device) memory, which is separate from CPU (host) memory.
- To allocate memory on the GPU use `cudaMalloc(void **ptr, size_t size)`; `cudaFree(void *ptr)` frees that memory.
- GPU code cannot access memory allocated on the CPU: referencing a host variable directly in device code fails to compile, and dereferencing a host pointer inside a kernel fails at runtime.
- To copy memory between the CPU and GPU, use `cudaMemcpy(void *dst, const void *src, size_t count, cudaMemcpyKind kind)`, where `kind` is either `cudaMemcpyHostToDevice` or `cudaMemcpyDeviceToHost`.
- Every GPU function needs to be prefixed with `__global__` for it to be recognised by nvcc.
- Such a function is called a kernel. It is launched with the syntax `function<<<x, y>>>(args...)`, where `x` is the number of thread blocks and `y` is the number of threads per block (see the sketch after this list).
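As a sketch of how these pieces fit together, here is a minimal program (the kernel name, array size, and variable names are illustrative) that copies an array to the GPU, increments each element in a kernel, and copies the result back:

```cuda
#include <cstdio>

// Kernel: each thread increments one element of the array.
__global__ void add_one(int *data, int n) {
    int i = threadIdx.x;       // this thread's index within the block
    if (i < n) data[i] += 1;
}

int main() {
    const int n = 8;
    int host[n] = {0, 1, 2, 3, 4, 5, 6, 7};

    int *device;
    cudaMalloc((void **)&device, n * sizeof(int));                      // allocate GPU memory
    cudaMemcpy(device, host, n * sizeof(int), cudaMemcpyHostToDevice);  // CPU -> GPU

    add_one<<<1, n>>>(device, n);  // 1 thread block with n threads

    cudaMemcpy(host, device, n * sizeof(int), cudaMemcpyDeviceToHost);  // GPU -> CPU
    cudaFree(device);                                                   // free GPU memory

    for (int i = 0; i < n; i++) printf("%d ", host[i]);
    printf("\n");
    return 0;
}
```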
Thread Organization
- A kernel is launched as a grid of thread blocks, and each thread block in turn contains a grid of threads.
- Since both the grid and each block can be up to three-dimensional, this effectively acts like a 6D grid of threads. We can elide one or more dimensions for either.
- A maximum of 1024 threads is supported per thread block, so the product of the dimensions of the second launch argument cannot exceed 1024.
- We can pass a `dim3` argument for both kernel launch parameters, which lets us configure up to three dimensions for the grid and for each block. To access these dimensions, both size-wise and index-wise, CUDA defines built-in variables as follows (each also has `.y` and `.z` components):

|          | For the full grid | For the block  |
|----------|-------------------|----------------|
| Size     | `gridDim.x`       | `blockDim.x`   |
| Indexing | `blockIdx.x`      | `threadIdx.x`  |
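For example, here is a small sketch of a 2D launch (the grid and block dimensions are chosen arbitrarily) showing `dim3` and the usual global-index computation from these built-in variables:

```cuda
#include <cstdio>

// Each thread computes its global (x, y) coordinate from the built-in variables.
__global__ void print_coords(int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // global column index
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // global row index
    if (x < width && y < height)
        printf("thread (%d, %d)\n", x, y);
}

int main() {
    dim3 block(4, 4);  // 4 * 4 = 16 threads per block (product must stay <= 1024)
    dim3 grid(2, 2);   // 2 x 2 = 4 thread blocks
    print_coords<<<grid, block>>>(8, 8);
    cudaDeviceSynchronize();  // wait for all kernel output before exiting
    return 0;
}
```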