Lab 3: Parallel processing
Welcome to lab 3, which introduces parallel processing using CUDA. You will learn how to write parallel programs that run on a GPU and how to optimize them for performance.
This is a preliminary version of the lab and is subject to change until the lab date.
Table of contents
- Preparation
- Part 1: Matrix multiplication
- Part 2: Tiled matrix multiplication
- Part 3: Shared memory histogram
- BONUS
Preparation
Before this lab, you should have a basic understanding of C++, since it is the basis of CUDA programming. You should also have a basic understanding of parallel programming concepts, such as threads, blocks, and grids. To briefly review these concepts or see more information about CUDA programming, you can check this subset of the CUDA Programming Guide and this Even Easier Introduction to CUDA.
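As a quick refresher, the snippet below shows how threads, blocks, and grids map to array indices in CUDA. It is only a minimal sketch for illustration; the kernel name and launch configuration are not part of the lab template.

```cuda
__global__ void add_one(int *data, int n) {
    // Each thread derives one global index from its block and thread IDs.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {            // Guard against threads past the end of the array
        data[i] += 1;
    }
}

// Launch: 256 threads per block, enough blocks to cover n elements, e.g.
// add_one<<<(n + 255) / 256, 256>>>(d_data, n);
```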
Setup
If you have Linux and an NVIDIA GPU, we recommend you run the CUDA application locally on your computer. This will allow you to test your code faster and debug it more efficiently. Follow the CUDA local setup to install the CUDA Toolkit on your Linux machine.
If you do not have a CUDA capable GPU or Linux, you can use the DelftBlue cluster to run your code. Follow the CUDA DelftBlue cluster setup to access the cluster and run your code there.
For simplicity, these lab instructions are written for the DelftBlue cluster, but you can adapt them if you are running on your local machine.
Submitting jobs on DelftBlue
After you have followed the CUDA DelftBlue cluster setup, you can submit jobs by running sbatch job.sh in the terminal. Each part has its own job.sh file, so you can submit them separately. Once a job is complete you will get an output file with the results, named ce-lab3-part<nr>-<job-id>.out; you can use cat ce-lab3-part<nr>-<job-id>.out to see the results. You can check the status of your job by running squeue -u $USER.
Warning: Do not upload the job_backup.sh file until explicitly told to do so by the TAs.
Part 1: Matrix multiplication
Objective
The purpose of this part is to implement a basic parallel matrix-matrix multiplication algorithm in CUDA. The goal is to understand the basic concepts of CUDA programming, such as memory allocation, memory copying, kernel parameters, and synchronization.
Procedure
- Inside your template folder, navigate to square-matrix-multiplication. Here, in main.cu, you will implement a CUDA kernel for matrix multiplication. Follow these steps:
  - Allocate memory for the matrices on the device (GPU).
  - Copy the matrices to the device.
  - Define kernel parameters and call the kernel function.
  - Implement the kernel function.
  - Copy the result from device to host.
  - Free device memory.

  Hint: Understand the CPU matrix multiplication in main.cu to implement something similar in your kernel. Refer to the lecture slides: CESE4130_9. A sketch of how these steps fit together is shown below, after this procedure.
- Once you have completed the code modifications, compile your code: first load the compiler using module load cuda/12.5, then use make.
- Only when the code compiles without any errors, submit a job to the DelftBlue cluster to run your code with sbatch job.sh. Wait for the results and analyze them as explained in Submitting jobs on DelftBlue.
  - Inside job.sh you can change the ./matmult line to run matrices with different sizes (see note 1 below).
Warning: Only use verify for small matrix sizes, as it is not optimized for large matrices.
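If you are unsure how the steps above fit together, the following is a minimal sketch of one possible structure, assuming square n x n matrices of float stored in row-major order. The names (matmult_kernel, h_A, d_A, and so on) are illustrative and may not match the template exactly.

```cuda
#define BLOCK_SIZE 32

// Naive kernel: one thread computes one element of C = A * B (n x n, row-major).
__global__ void matmult_kernel(const float *A, const float *B, float *C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {                 // Skip threads outside the matrix
        float sum = 0.0f;
        for (int k = 0; k < n; ++k)
            sum += A[row * n + k] * B[k * n + col];
        C[row * n + col] = sum;
    }
}

void gpu_matmult(const float *h_A, const float *h_B, float *h_C, int n) {
    size_t bytes = (size_t)n * n * sizeof(float);
    float *d_A, *d_B, *d_C;

    cudaMalloc(&d_A, bytes);                              // 1. allocate device memory
    cudaMalloc(&d_B, bytes);
    cudaMalloc(&d_C, bytes);
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);  // 2. copy inputs to device
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    dim3 block(BLOCK_SIZE, BLOCK_SIZE);                   // 3. kernel parameters
    dim3 grid((n + BLOCK_SIZE - 1) / BLOCK_SIZE,
              (n + BLOCK_SIZE - 1) / BLOCK_SIZE);
    matmult_kernel<<<grid, block>>>(d_A, d_B, d_C, n);    // 4. launch the kernel
    cudaDeviceSynchronize();                              //    wait for it to finish

    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);  // 5. copy result back to host
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);          // 6. free device memory
}
```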
Once your program is correct, answer the following questions:
Warning: You must synchronize after your kernel call, otherwise the time of your kernel will be inaccurate.
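As an illustration of why this matters: kernel launches are asynchronous, so a host-side timer stopped right after the launch would measure almost nothing. A minimal pattern (host code only; the template may use its own timing helpers) is:

```cuda
matmult_kernel<<<grid, block>>>(d_A, d_B, d_C, n);  // returns immediately on the host
cudaDeviceSynchronize();   // block the host until the kernel has actually finished
// Stop your host-side timer here, after the synchronization.
```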
- Examine the code output with a 16x16 matrix and compare the CPU time vs. the GPU kernel time. Which one is faster and why?
- With the default BLOCK_SIZE of 32, find the number of blocks, total threads, and inactive threads for the following cases (you do not need to run the code):
  (a) 1024 x 1024 matrix size
  (b) 2300 x 2300 matrix size
- If we change the BLOCK_SIZE to 64, would the code still work? Why? You can briefly refer to the output of device-query.
- For ./matmult 2048 with BLOCK_SIZE values of 8 and 32, find the number of blocks and compare the kernel times. Which is faster and why?
Once you are done answering the questions from this part, call a TA to verify your answers and proceed to the next part.
Note 1: ./matmult uses the default matrix size of 1000 x 1000; ./matmult <m> uses square m x m matrices.
Part 2: Tiled matrix multiplication
Objective
The purpose of this part is to get familiar with using shared memory to optimize kernels. The goal is to implement a tiled version of the previous matrix multiplication routine in CUDA. You can learn more about shared memory here.
In this part you will need to allocate shared memory for the tiles and copy the tiles from global memory to shared memory. You can copy unchanged parts from Part 1.
Procedure
- In the same template folder, navigate to square-tiled-matrix-multiplication and modify the kernel in main.cu to implement the tiled matrix multiplication.

  Hint: The lecture slides CESE4130_10 can be very helpful for this part, and a sketch of a tiled kernel is shown below, after this procedure.
- Once you have completed the code modifications, compile your code: first load the compiler using module load cuda/12.5, then use make.
- Submit a job to the DelftBlue cluster to run your code with sbatch job.sh, wait for the results, and analyze them as explained in Submitting jobs on DelftBlue.
  - Inside job.sh you can change the ./matmult-tiled line to run matrices with different sizes (see note 2 below).
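For reference, a minimal sketch of a tiled kernel is shown below. It assumes square n x n matrices in row-major order and a tile width equal to the block size; the names are illustrative and the template's kernel signature may differ.

```cuda
#define TILE_SIZE 32

__global__ void matmult_tiled_kernel(const float *A, const float *B, float *C, int n) {
    __shared__ float tileA[TILE_SIZE][TILE_SIZE];   // tiles staged in shared memory
    __shared__ float tileB[TILE_SIZE][TILE_SIZE];

    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    float sum = 0.0f;

    // Walk over the tiles of A and B that contribute to this output element.
    for (int t = 0; t < (n + TILE_SIZE - 1) / TILE_SIZE; ++t) {
        int aCol = t * TILE_SIZE + threadIdx.x;
        int bRow = t * TILE_SIZE + threadIdx.y;

        // Each thread loads one element per tile, padding with zeros outside the matrix.
        tileA[threadIdx.y][threadIdx.x] = (row < n && aCol < n) ? A[row * n + aCol] : 0.0f;
        tileB[threadIdx.y][threadIdx.x] = (bRow < n && col < n) ? B[bRow * n + col] : 0.0f;
        __syncthreads();                            // wait until the whole tile is loaded

        for (int k = 0; k < TILE_SIZE; ++k)
            sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        __syncthreads();                            // wait before overwriting the tiles
    }

    if (row < n && col < n)
        C[row * n + col] = sum;
}
```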
Once your program is correct, answer the following questions:
Warning: You must synchronize after your kernel call, otherwise the time of your kernel will be inaccurate.
- Find the speedup of the tiled version vs. the basic version (Part 1) for matrices of size 32, 128, and 512. What is the maximum speedup?
- Is it true that the tiled version is always faster than the basic version? Why?
- How does one handle boundary conditions in the tiled matrix multiplication when the matrix size is not a multiple of the tile size?
- What role does synchronization play in the tiled matrix multiplication?
Note 2: ./matmult-tiled uses the default matrix size of 1000 x 1000; ./matmult-tiled <m> uses square m x m matrices.
Part 3: Shared memory histogram
Objective
Now that you have practiced with shared memory, you will implement a histogram calculation in CUDA using shared memory. It needs to be implemented using atomic operations.
Procedure
- Inside your cloned directory, navigate to character-histogram-shared-memory and modify the kernel in main.cu to implement the histogram calculation using shared memory. A sketch of a shared-memory histogram kernel is shown after this procedure.
- Once you have completed the code modifications, compile your code: first load the compiler using module load cuda/12.5, then use make.
- Submit a job to the DelftBlue cluster to run your code with sbatch job.sh, wait for the results, and analyze them as explained in Submitting jobs on DelftBlue.
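One common pattern is to accumulate a private histogram per block in shared memory and then merge it into the global histogram. The sketch below is only an illustration: the number of bins, the input type, and the grid-stride loop are assumptions, and the template's kernel signature may differ.

```cuda
#define NUM_BINS 128   // assumed number of character bins; the template may use another value

__global__ void histogram_kernel(const unsigned char *input, unsigned int *bins, int n) {
    __shared__ unsigned int local_bins[NUM_BINS];   // one private histogram per block

    // Clear the shared-memory histogram cooperatively.
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        local_bins[i] = 0;
    __syncthreads();

    // Grid-stride loop: each thread processes multiple input characters.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        atomicAdd(&local_bins[input[i] % NUM_BINS], 1u);   // atomics avoid lost updates
    __syncthreads();

    // Merge this block's private histogram into the global one.
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        atomicAdd(&bins[i], local_bins[i]);
}
```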
Once your program is correct, answer the following questions:
Warning: You must synchronize after your kernel call, otherwise the time of your kernel will be inaccurate.
- Why does the histogram calculation require atomic operations? What happens if they are not there?
- How would the choice of block size affect the performance of the histogram calculation?
- Explain the use(s) of __syncthreads(); in your kernel.

Hint: Look at the commented-out verify function in support.cu to check how the histogram calculation is done on the CPU.
Note: If you decide to stop your lab here and you used VS Code, proceed to closing down your connection to VS Code.
BONUS
Objective
The purpose of this part is to extend your tiled matrix multiplication to support non-square matrices. You will need to modify the code and the kernel in main.cu to implement the tiled matrix multiplication for non-square matrices.
Procedure
- Inside your template folder, navigate to bonus/rectangular-tiled-matrix-multiplication/, where you will, again, only need to modify main.cu to set up the memory allocations and kernel code. A sketch of the launch configuration for non-square matrices is shown after this procedure.
- Once you have completed the code modifications, compile your code: first load the compiler using module load cuda/12.5, then use make.
- Submit a job to the DelftBlue cluster to run your code with sbatch job.sh, wait for the results, and analyze them as explained in Submitting jobs on DelftBlue.
  - Inside job.sh you can change the ./matmult line to run matrices with different sizes (see note 3 below).
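For a (m x k) times (k x n) multiplication the output C is m x n, so the grid must cover m rows and n columns, and the tile loop runs over k. A sketch of the launch configuration is shown below; the variable names and the kernel signature with m, k, n are illustrative and may differ from the template.

```cuda
// C is m x n: grid.x covers the n columns, grid.y covers the m rows.
dim3 block(TILE_SIZE, TILE_SIZE);
dim3 grid((n + TILE_SIZE - 1) / TILE_SIZE,
          (m + TILE_SIZE - 1) / TILE_SIZE);
matmult_tiled_kernel<<<grid, block>>>(d_A, d_B, d_C, m, k, n);

// Inside the kernel, the tile loop now runs over k instead of n:
//   for (int t = 0; t < (k + TILE_SIZE - 1) / TILE_SIZE; ++t) { ... }
// and the bounds checks use m, k, and n for the rows of A, the shared
// dimension, and the columns of B respectively.
```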
Once your program is correct, answer the following question:
Warning: You must synchronize after your kernel call, otherwise the time of your kernel will be inaccurate.
- With the default BLOCK_SIZE of 32, find the number of blocks, total threads, and inactive threads for ./matmult 1300 2 1300.
Note: If you used VS Code, proceed to closing down your connection to VS Code.
Note 3: ./matmult-tiled uses the default matrix size of 1000 x 1000; ./matmult-tiled <m> uses square m x m matrices; ./matmult-tiled <m> <k> <n> uses (m x k) and (k x n) input matrices.