Lab 3: Parallel processing
Welcome to lab 3, which introduces parallel processing using CUDA. You will learn how to write parallel programs that run on a GPU and how to optimize them for performance.
This is a preliminary version of the lab and is subject to change until the lab date.
Table of contents
- Preparation
- Part 1: Matrix multiplication
- Part 2: Tiled matrix multiplication
- Part 3: Shared memory histogram
- BONUS
Preparation
Before this lab, you should have a basic understanding of C++, since it is the basis of CUDA programming. You should also have a basic understanding of parallel programming concepts, such as threads, blocks, and grids. To briefly review these concepts or see more information about CUDA programming, you can check this subset of the CUDA Programming Guide and this Even Easier Introduction to CUDA.
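As a quick refresher, the snippet below shows how threads, blocks, and grids map to array indices in CUDA. It is only a minimal sketch for illustration; the kernel name and launch configuration are not part of the lab template.

```cuda
__global__ void add_one(int *data, int n) {
    // Each thread derives one global index from its block and thread IDs.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {            // Guard against threads past the end of the array
        data[i] += 1;
    }
}

// Launch: 256 threads per block, enough blocks to cover n elements, e.g.
// add_one<<<(n + 255) / 256, 256>>>(d_data, n);
```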
Setup
If you have Linux and an NVIDIA GPU, we recommend you run the CUDA application locally on your computer. This will allow you to test your code faster and debug it more efficiently. Follow the CUDA local setup to install the CUDA Toolkit on your Linux machine.
If you do not have a CUDA capable GPU or Linux, you can use the DelftBlue cluster to run your code. Follow the CUDA DelftBlue cluster setup to access the cluster and run your code there.
For simplicity, these lab instructions are written for the DelftBlue cluster, but you can adapt them if you are running on your local machine.
Submitting jobs on DelftBlue
After you have followed the CUDA DelftBlue cluster setup, you can submit jobs by running sbatch job.sh in the terminal. Each part has its own job.sh file, so you can submit them separately. Once a job is complete you will get an output file with the results, named ce-lab3-part<nr>-<job-id>.out; you can use cat ce-lab3-part<nr>-<job-id>.out to see the results. You can check the status of your job by running squeue -u $USER.
Warning: Do not upload the job_backup.sh file until explicitly told to do so by the TAs.
Part 1: Matrix multiplication
Objective
The purpose of this part is to implement a basic parallel matrix-matrix multiplication algorithm in CUDA. The goal is to understand the basic concepts of CUDA programming, such as memory allocation, memory copying, kernel parameters, and synchronization.
Procedure
- Inside your template folder, navigate to square-matrix-multiplication. Here, in main.cu, you will implement a CUDA kernel for matrix multiplication. Follow these steps:
  - Allocate memory for the matrices on the device (GPU).
  - Copy the matrices to the device.
  - Define kernel parameters and call the kernel function.
  - Implement the kernel function.
  - Copy the result from device to host.
  - Free device memory.

  Hint: Understand the CPU matrix multiplication in main.cu to implement something similar in your kernel. Refer to the lecture slides: CESE4130_9. A sketch of how these steps fit together is shown below, after this procedure.
- Once you have completed the code modifications, compile your code: first load the compiler using module load cuda/12.5, then use make.
- Only when the code compiles without any errors, submit a job to the DelftBlue cluster to run your code with sbatch job.sh. Wait for the results and analyze them as explained in Submitting jobs on DelftBlue.
  - Inside job.sh you can change the ./matmult line to run matrices with different sizes (see note 1 below).
Warning: Only use verify for small matrix sizes, as it is not optimized for large matrices.
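If you are unsure how the steps above fit together, the following is a minimal sketch of one possible structure, assuming square n x n matrices of float stored in row-major order. The names (matmult_kernel, h_A, d_A, and so on) are illustrative and may not match the template exactly.

```cuda
#define BLOCK_SIZE 32

// Naive kernel: one thread computes one element of C = A * B (n x n, row-major).
__global__ void matmult_kernel(const float *A, const float *B, float *C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {                 // Skip threads outside the matrix
        float sum = 0.0f;
        for (int k = 0; k < n; ++k)
            sum += A[row * n + k] * B[k * n + col];
        C[row * n + col] = sum;
    }
}

void gpu_matmult(const float *h_A, const float *h_B, float *h_C, int n) {
    size_t bytes = (size_t)n * n * sizeof(float);
    float *d_A, *d_B, *d_C;

    cudaMalloc(&d_A, bytes);                              // 1. allocate device memory
    cudaMalloc(&d_B, bytes);
    cudaMalloc(&d_C, bytes);
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);  // 2. copy inputs to device
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    dim3 block(BLOCK_SIZE, BLOCK_SIZE);                   // 3. kernel parameters
    dim3 grid((n + BLOCK_SIZE - 1) / BLOCK_SIZE,
              (n + BLOCK_SIZE - 1) / BLOCK_SIZE);
    matmult_kernel<<<grid, block>>>(d_A, d_B, d_C, n);    // 4. launch the kernel
    cudaDeviceSynchronize();                              //    wait for it to finish

    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);  // 5. copy result back to host
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);          // 6. free device memory
}
```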
Once your program is correct, answer the following questions:
Warning: You must synchronize after your kernel call, otherwise the time of your kernel will be inaccurate.
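As an illustration of why this matters: kernel launches are asynchronous, so a host-side timer stopped right after the launch would measure almost nothing. A minimal pattern (host code only; the template may use its own timing helpers) is:

```cuda
matmult_kernel<<<grid, block>>>(d_A, d_B, d_C, n);  // returns immediately on the host
cudaDeviceSynchronize();   // block the host until the kernel has actually finished
// Stop your host-side timer here, after the synchronization.
```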
- Examine the code output with a 16x16 matrix and compare the CPU time vs. the GPU kernel time. Which one is faster and why?
- With the default BLOCK_SIZE of 32, find the number of blocks, total threads, and inactive threads for the following cases (you do not need to run the code):
  (a) 1024 x 1024 matrix size
  (b) 2300 x 2300 matrix size
- If we change the BLOCK_SIZE to 64, would the code still work? Why? You can briefly refer to the output of device-query.
- For ./matmult 2048 with BLOCK_SIZE values of 8 and 32, find the number of blocks and compare the kernel times. Which is faster and why?
Once you are done answering the questions from this part, call a TA to verify your answers and proceed to the next part.
Note 1: ./matmult uses the default matrix size of 1000 x 1000; ./matmult <m> uses square m x m matrices.
Part 2: Tiled matrix multiplication
Objective
The purpose of this part is to get familiar with using shared memory to optimize kernels. The goal is to implement a tiled version of the previous matrix multiplication routine in CUDA. You can learn more about shared memory here.
In this part you will need to allocate shared memory for the tiles and copy the tiles from global memory to shared memory. You can copy unchanged parts from Part 1.
Procedure
- In the same template folder, navigate to square-tiled-matrix-multiplication and modify the kernel in main.cu to implement the tiled matrix multiplication.

  Hint: The lecture slides CESE4130_10 can be very helpful for this part, and a sketch of a tiled kernel is shown below, after this procedure.
- Once you have completed the code modifications, compile your code: first load the compiler using module load cuda/12.5, then use make.
- Submit a job to the DelftBlue cluster to run your code with sbatch job.sh, wait for the results, and analyze them as explained in Submitting jobs on DelftBlue.
  - Inside job.sh you can change the ./matmult-tiled line to run matrices with different sizes (see note 2 below).
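For reference, a minimal sketch of a tiled kernel is shown below. It assumes square n x n matrices in row-major order and a tile width equal to the block size; the names are illustrative and the template's kernel signature may differ.

```cuda
#define TILE_SIZE 32

__global__ void matmult_tiled_kernel(const float *A, const float *B, float *C, int n) {
    __shared__ float tileA[TILE_SIZE][TILE_SIZE];   // tiles staged in shared memory
    __shared__ float tileB[TILE_SIZE][TILE_SIZE];

    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    float sum = 0.0f;

    // Walk over the tiles of A and B that contribute to this output element.
    for (int t = 0; t < (n + TILE_SIZE - 1) / TILE_SIZE; ++t) {
        int aCol = t * TILE_SIZE + threadIdx.x;
        int bRow = t * TILE_SIZE + threadIdx.y;

        // Each thread loads one element per tile, padding with zeros outside the matrix.
        tileA[threadIdx.y][threadIdx.x] = (row < n && aCol < n) ? A[row * n + aCol] : 0.0f;
        tileB[threadIdx.y][threadIdx.x] = (bRow < n && col < n) ? B[bRow * n + col] : 0.0f;
        __syncthreads();                            // wait until the whole tile is loaded

        for (int k = 0; k < TILE_SIZE; ++k)
            sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        __syncthreads();                            // wait before overwriting the tiles
    }

    if (row < n && col < n)
        C[row * n + col] = sum;
}
```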
Once your program is correct, answer the following questions:
Warning: You must synchronize after your kernel call, otherwise the time of your kernel will be inaccurate.
- Find the speedup of the tiled version vs. the basic version (Part 1) for matrices of size 32, 128, and 512. What is the maximum speedup?
- Is it true that the tiled version is always faster than the basic version? Why?
- How does one handle boundary conditions in the tiled matrix multiplication when the matrix size is not a multiple of the tile size?
- What role does synchronization play in the tiled matrix multiplication?
Note 2: ./matmult-tiled uses the default matrix size of 1000 x 1000; ./matmult-tiled <m> uses square m x m matrices.
Part 3: Shared memory histogram
Objective
Now that you have practiced with shared memory, you will implement a histogram calculation in CUDA using shared memory. It needs to be implemented using atomic operations.
Procedure
- Inside your cloned directory, navigate to character-histogram-shared-memory and modify the kernel in main.cu to implement the histogram calculation using shared memory. A sketch of a shared-memory histogram kernel is shown after this procedure.
- Once you have completed the code modifications, compile your code: first load the compiler using module load cuda/12.5, then use make.
- Submit a job to the DelftBlue cluster to run your code with sbatch job.sh, wait for the results, and analyze them as explained in Submitting jobs on DelftBlue.
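One common pattern is to accumulate a private histogram per block in shared memory and then merge it into the global histogram. The sketch below is only an illustration: the number of bins, the input type, and the grid-stride loop are assumptions, and the template's kernel signature may differ.

```cuda
#define NUM_BINS 128   // assumed number of character bins; the template may use another value

__global__ void histogram_kernel(const unsigned char *input, unsigned int *bins, int n) {
    __shared__ unsigned int local_bins[NUM_BINS];   // one private histogram per block

    // Clear the shared-memory histogram cooperatively.
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        local_bins[i] = 0;
    __syncthreads();

    // Grid-stride loop: each thread processes multiple input characters.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        atomicAdd(&local_bins[input[i] % NUM_BINS], 1u);   // atomics avoid lost updates
    __syncthreads();

    // Merge this block's private histogram into the global one.
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        atomicAdd(&bins[i], local_bins[i]);
}
```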
Once your program is correct, answer the following questions:
Warning: You must synchronize after your kernel call, otherwise the time of your kernel will be inaccurate.
- Why does the histogram calculation require atomic operations? What happens if they are not there?
- How would the choice of block size affect the performance of the histogram calculation?
- Explain the use(s) of __syncthreads(); in your kernel.

Hint: Look at the commented-out verify function in support.cu to check how the histogram calculation is done on the CPU.
Note: If you decide to stop your lab here and you used VS Code, proceed to closing down your connection to VS Code.
BONUS
Objective
The purpose of this part is to extend your tiled matrix multiplication to support non-square matrices. You will need to modify the code and the kernel in main.cu to implement the tiled matrix multiplication for non-square matrices.
Procedure
- Inside your template folder, navigate to bonus/rectangular-tiled-matrix-multiplication/, where you will, again, only need to modify main.cu to set up the memory allocations and kernel code. A sketch of the launch configuration for non-square matrices is shown after this procedure.
- Once you have completed the code modifications, compile your code: first load the compiler using module load cuda/12.5, then use make.
- Submit a job to the DelftBlue cluster to run your code with sbatch job.sh, wait for the results, and analyze them as explained in Submitting jobs on DelftBlue.
  - Inside job.sh you can change the ./matmult line to run matrices with different sizes (see note 3 below).
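For a (m x k) times (k x n) multiplication the output C is m x n, so the grid must cover m rows and n columns, and the tile loop runs over k. A sketch of the launch configuration is shown below; the variable names and the kernel signature with m, k, n are illustrative and may differ from the template.

```cuda
// C is m x n: grid.x covers the n columns, grid.y covers the m rows.
dim3 block(TILE_SIZE, TILE_SIZE);
dim3 grid((n + TILE_SIZE - 1) / TILE_SIZE,
          (m + TILE_SIZE - 1) / TILE_SIZE);
matmult_tiled_kernel<<<grid, block>>>(d_A, d_B, d_C, m, k, n);

// Inside the kernel, the tile loop now runs over k instead of n:
//   for (int t = 0; t < (k + TILE_SIZE - 1) / TILE_SIZE; ++t) { ... }
// and the bounds checks use m, k, and n for the rows of A, the shared
// dimension, and the columns of B respectively.
```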
Once your program is correct, answer the following question:
Warning: You must synchronize after your kernel call, otherwise the time of your kernel will be inaccurate.
- With the default BLOCK_SIZE of 32, find the number of blocks, total threads, and inactive threads for ./matmult 1300 2 1300.
Note: If you used VS Code, proceed to closing down your connection to VS Code.
Note 3: ./matmult-tiled uses the default matrix size of 1000 x 1000; ./matmult-tiled <m> uses square m x m matrices; ./matmult-tiled <m> <k> <n> uses (m x k) and (k x n) input matrices.