InnocentZero's Treasure Chest


30 Jul 2024

GPU Programming

To set up Google Colab, set the runtime type to T4 and run:

!nvcc --version

!pip install nvcc4jupyter

%load_ext nvcc4jupyter
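
Once the extension is loaded, CUDA cells are marked with the %%cuda cell magic, which compiles and runs the cell with nvcc. As a quick sanity check, something like the sketch below should print from the GPU (the kernel name and message are just placeholders):

    %%cuda
    #include <cstdio>

    // Trivial kernel: each GPU thread prints its own index.
    __global__ void hello() {
        printf("Hello from thread %d\n", threadIdx.x);
    }

    int main() {
        hello<<<1, 4>>>();        // 1 thread block with 4 threads
        cudaDeviceSynchronize();  // wait for the kernel so its output is flushed
        return 0;
    }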

GPU Programming basics

  • Every CUDA program manages its own GPU memory, which is separate from the CPU's (host) memory.
    • To allocate memory on the GPU, use cudaMalloc(void **devPtr, size_t size);
    • cudaFree(void *ptr) frees that memory.
    • A pointer to CPU memory cannot be dereferenced on the GPU: such code compiles, but the access fails at runtime.
    • To copy memory between the CPU and the GPU, use cudaMemcpy(void *dst, const void *src, size_t count, cudaMemcpyKind kind), where kind is either cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost.
  • Every GPU function needs to be marked with __global__ for nvcc to treat it as a kernel callable from the host.
  • A kernel is launched with the syntax function<<<x, y>>>(args...), where x and y are the number of thread blocks and the number of threads per block respectively (see the sketch after this list).
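
Putting these pieces together, a minimal vector-add sketch could look like the following (the kernel name, sizes, and the omitted error checks are illustrative, not from the post):

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // __global__ marks this as a kernel; each thread adds one element.
    __global__ void add(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1024;
        const size_t bytes = n * sizeof(float);

        float *ha = (float *)malloc(bytes), *hb = (float *)malloc(bytes), *hc = (float *)malloc(bytes);
        for (int i = 0; i < n; ++i) { ha[i] = i; hb[i] = 2.0f * i; }

        // Allocate GPU memory; these pointers are only usable on the device.
        float *da, *db, *dc;
        cudaMalloc((void **)&da, bytes);
        cudaMalloc((void **)&db, bytes);
        cudaMalloc((void **)&dc, bytes);

        // Copy the inputs from the CPU (host) to the GPU (device).
        cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

        // Launch 4 thread blocks of 256 threads each (4 * 256 = 1024 = n).
        add<<<4, 256>>>(da, db, dc, n);

        // Copy the result back from the GPU to the CPU.
        cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
        printf("c[42] = %f\n", hc[42]);

        cudaFree(da); cudaFree(db); cudaFree(dc);
        free(ha); free(hb); free(hc);
        return 0;
    }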

Thread Organization

  • A kernel is launched as a grid of thread blocks. Each thread block is in turn a grid of threads.
  • Since both the grid and the block can have up to three dimensions, this acts like a 6D grid of threads. We can elide one or more dimensions for both.
  • A maximum of 1024 threads is supported in each thread block, so the product of the dimensions of the second launch argument cannot exceed 1024.

    We can pass a dim3 argument for both kernel launch parameters, which lets us configure all three dimensions of the grid and of each block. To access the sizes and indices along each dimension, CUDA defines the following built-in variables (a usage sketch follows the table):

                           Size          Indexing
    For the full grid      gridDim.x     blockIdx.x
    For the block          blockDim.x    threadIdx.x

    (and the same for the y and z components)
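
For instance (the 2D image size below is made up), a kernel launched over a two-dimensional grid can recover its global coordinates from these variables:

    #include <cuda_runtime.h>

    // Each thread computes its global (x, y) position across the whole grid.
    __global__ void fill(int *out, int width, int height) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height)
            out[y * width + x] = x + y;
    }

    int main() {
        const int width = 1920, height = 1080;
        dim3 block(16, 16);                          // 16 * 16 = 256 threads per block (<= 1024)
        dim3 grid((width + block.x - 1) / block.x,   // ceil(width / 16) blocks in x
                  (height + block.y - 1) / block.y); // ceil(height / 16) blocks in y

        int *d_out;
        cudaMalloc((void **)&d_out, width * height * sizeof(int));
        fill<<<grid, block>>>(d_out, width, height);
        cudaDeviceSynchronize();
        cudaFree(d_out);
        return 0;
    }
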
Tags: programming CUDA parallel

This website by Md Isfarul Haque is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.