
Today’s workshop introduced the essentials of GPU programming and why GPUs have become such an important part of scientific computing.

Until now I had understood GPUs only in a general sense: something related to graphics, acceleration, or machine learning. But this lecture made the architecture and programming model much clearer. It helped me see why GPU programming is not just “faster computing,” but a completely different way of thinking about parallelism.

Why GPU Computing Matters

GPUs were originally designed for graphics tasks, where many pixels could be processed independently in parallel. Over time, they evolved into highly programmable compute devices that are now widely used in scientific computing, simulation, and machine learning.

What makes GPUs so powerful is their focus on throughput rather than low-latency execution. Unlike CPUs, which invest heavily in control logic and caches to make a single thread fast, GPUs are designed to run a very large number of threads at once. This makes them especially effective for workloads with high data parallelism.

The lecture also highlighted that GPUs usually provide much higher memory bandwidth and floating-point throughput than CPUs. That is why they are so effective for compute-heavy algorithms.

Host and Device

A very important concept in GPU programming is the distinction between the host and the device.

  • The host is the CPU side
  • The device is the GPU side

The host is responsible for:

  • allocating memory on the GPU
  • transferring data between CPU and GPU memory
  • launching GPU programs
  • synchronising execution

This was an important reminder that the GPU does not just “see” CPU memory directly. Data movement has to be handled explicitly.
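
A minimal host-side sketch of that workflow might look like this (assuming a HIP toolchain is available; error checking omitted for brevity):

```cpp
#include <hip/hip_runtime.h>

int main() {
    const int N = 256;
    const size_t bytes = N * sizeof(float);

    float h_x[N];                        // host (CPU) buffer
    for (int i = 0; i < N; ++i) h_x[i] = 1.0f;

    float *d_x = nullptr;
    hipMalloc(&d_x, bytes);                             // allocate device memory
    hipMemcpy(d_x, h_x, bytes, hipMemcpyHostToDevice);  // explicit copy to the GPU

    // ... launch a kernel here ...
    hipDeviceSynchronize();                             // wait for the GPU to finish

    hipMemcpy(h_x, d_x, bytes, hipMemcpyDeviceToHost);  // copy results back
    hipFree(d_x);                                       // free device memory
    return 0;
}
```

Every hipMalloc is paired with a hipFree, and every transfer is an explicit hipMemcpy with a direction flag: nothing moves between host and device implicitly.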

HIP Programming Model

The lecture introduced HIP (Heterogeneous-compute Interface for Portability), AMD’s GPU programming framework, which is deliberately similar in style to CUDA.

One thing I found interesting was the thread hierarchy. Computation is organised into:

  • a grid
  • made of blocks
  • which contain many threads

Threads inside the same block can cooperate and synchronise, but threads in different blocks cannot directly do so.

Another key concept was the execution unit called a wavefront on AMD GPUs, which is similar to a warp on NVIDIA GPUs. A wavefront on the AMD GPUs covered in the lecture contains 64 threads (an NVIDIA warp contains 32), and maximum efficiency is achieved when all of them follow the same path of execution.

This leads to the idea of thread divergence. If threads in the same wavefront branch differently because of an if statement, execution becomes serialised. That means some threads are effectively idle while others execute, reducing performance.

Memory Hierarchy

The GPU memory hierarchy is also very different from the CPU model.

  • Global memory: large and accessible by all threads, but relatively slow
  • Shared memory: fast memory shared by threads within the same block
  • Local memory: private to each thread, but as slow as global memory
  • Registers: the fastest storage, but limited in number

This made it clear that performance on the GPU depends not only on parallelism, but also on how carefully memory is used.
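
As a rough sketch, all four kinds of storage can appear in a single (illustrative) kernel:

```cpp
__global__ void memory_spaces(float *g_data) {  // g_data points into global memory
    __shared__ float s_tile[256];      // shared: fast, visible to the whole block
    float r_val = g_data[threadIdx.x]; // scalar locals usually live in registers
    float l_big[64];                   // large per-thread arrays may be placed in
                                       // "local" memory, as slow as global
    l_big[threadIdx.x % 64] = r_val;

    s_tile[threadIdx.x] = r_val;       // stage the value in fast shared memory
    __syncthreads();
    g_data[threadIdx.x] = s_tile[threadIdx.x] + l_big[threadIdx.x % 64];
}
```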

Writing HIP Code

The lecture also introduced the basic structure of HIP programs.

GPU functions are called kernels, and they are defined with the __global__ keyword. These kernels are launched by the host and executed in parallel on the GPU.

For example, threads can determine which piece of data they should work on using built-in variables such as:

  • threadIdx.x
  • blockIdx.x
  • blockDim.x

This is how each thread figures out its own role in a much larger parallel computation.

Another practical point was that memory must always be allocated and deallocated explicitly on both the CPU and GPU sides. Otherwise, memory leaks can occur.

Advanced Thread Cooperation

The lecture moved beyond basic kernel launches and covered more advanced cooperation mechanisms.

Synchronisation

The __syncthreads() function allows all threads in a block to wait until everyone has reached the same point. This ensures that shared memory updates are visible to all block members before continuing.
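
A typical pattern is write, barrier, then read (an illustrative kernel, assuming a block size of at most 256):

```cpp
__global__ void rotate(float *in, float *out) {
    __shared__ float buf[256];
    buf[threadIdx.x] = in[threadIdx.x];   // each thread writes its own slot
    __syncthreads();                      // barrier: all writes are now visible
    // now safe to read a slot written by a *different* thread
    out[threadIdx.x] = buf[(threadIdx.x + 1) % blockDim.x];
}
```

Without the barrier, a thread might read its neighbour’s slot before the neighbour has written it.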

Atomic Operations

Functions such as atomicAdd() allow safe updates to shared or global memory without race conditions. This is essential when many threads need to update the same variable.

Parallel Reduction

A classic example shown was summing an array using parallel reduction. Instead of one thread doing all the work, multiple threads compute partial sums in parallel, then combine them step by step.

Warp Shuffle

For even better optimisation, warp-level functions such as __shfl_down() allow threads within the same warp or wavefront to exchange values directly without using shared memory.

This gave me a better sense of how performance optimisation on GPUs often requires careful coordination between threads.

A Useful Clarification: What “Kernel” Means

One thing that can be confusing at first is the word kernel, because it has different meanings depending on context.

In GPU programming, a kernel means:

  • a function that runs on the GPU
  • launched by the CPU
  • executed in parallel by many threads

For example:

```cpp
__global__ void add(int *a, int *b, int *c) {
    int i = threadIdx.x;   // each thread picks one element
    c[i] = a[i] + b[i];    // all elements are added in parallel
}
```
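
Launching this kernel from the host might look like the following sketch, where d_a, d_b and d_c are hypothetical device pointers already allocated with hipMalloc:

```cpp
const int N = 256;
add<<<1, N>>>(d_a, d_b, d_c);   // one block of N threads
hipDeviceSynchronize();         // wait for the kernel to complete
```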

This is completely different from an operating system kernel, such as the Linux kernel or Windows kernel.

The two terms share the same word, but they refer to entirely different ideas.

Reflection

GPU programming is not just about writing code that runs on a faster processor. It is about understanding a different computational model, one built around massive parallelism, explicit memory movement, and careful coordination between threads.

What stood out to me most was how much performance depends on both computation structure and memory behaviour. Concepts like divergence, shared memory, and synchronisation made it clear that simply moving a program onto the GPU does not automatically guarantee speedup.

Overall, this lecture made GPU programming feel much more concrete. It connected the architecture, the programming model, and the reason GPUs are so effective for scientific workloads. It also showed that learning GPU programming is really about learning how to think differently about parallel computation.
