Please read the general instructions on this page first, and then check the individual tasks for more details. For each task you can download a zip file that contains the code templates you can use for development.
Task  Attempts  Points  Max  Rating  Man.  Deadline for full points 

CP1: CPU baseline  
Implement a simple sequential baseline solution. Do not try to use any form of parallelism yet; try to make it work correctly first. Please do all arithmetic with double-precision floating point numbers. For this initial exercise, we have disabled auto-vectorization.

–  –  5 + 0  ★  M  2024-08-31 at 23:59:59
CP2a: instruction-level parallelism
Parallelize your solution to CP1 by exploiting instruction-level parallelism. Make sure that the performance-critical operations are pipelined efficiently. Do not use any other form of parallelism yet in this exercise. Please do all arithmetic with double-precision floating point numbers.

–  –  3 + 0  ★  M  2024-08-31 at 23:59:59
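To illustrate the idea on a generic example (not tied to this particular task; the function name is ours), a long chain of dependent additions can be split across several independent accumulators, so that the CPU can keep several arithmetic operations in flight at once:

```cpp
#include <cassert>
#include <cstddef>

// Dot product with 4 independent accumulators: the 4 running sums do not
// depend on each other, so consecutive multiply-add operations can be
// pipelined instead of waiting for the previous addition to finish.
double dot_ilp(const double* a, const double* b, std::size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; ++i) s0 += a[i] * b[i];  // leftover elements
    return (s0 + s1) + (s2 + s3);
}
```

Note that the result may differ from a strictly sequential sum in the last bits, since floating point addition is not associative.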
CP2b: multi-core parallelism
Parallelize your solution to CP1 with the help of OpenMP and multithreading so that you are exploiting multiple CPU cores in parallel. Do not use any other form of parallelism yet in this exercise. Please do all arithmetic with double-precision floating point numbers.

–  –  3 + 0  ★  M  2024-08-31 at 23:59:59
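A minimal sketch of the technique (the function name is hypothetical; assumes the compiler is invoked with OpenMP enabled, e.g. -fopenmp — without it the pragma is simply ignored and the code runs sequentially but stays correct):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Compute the row sums of an ny-by-nx matrix, one row per loop iteration.
// The iterations are independent and write to disjoint elements, so the
// loop can be divided between CPU cores with a single OpenMP pragma.
std::vector<double> row_sums(int ny, int nx, const double* data) {
    std::vector<double> sums(ny, 0.0);
    #pragma omp parallel for
    for (int y = 0; y < ny; ++y) {
        double s = 0.0;
        for (int x = 0; x < nx; ++x) s += data[x + (std::size_t)y * nx];
        sums[y] = s;
    }
    return sums;
}
```

Parallelizing over rows (rather than over the inner loop) keeps each thread's work independent, so no locking or reduction clause is needed.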
CP2c: vectorization
Parallelize your solution to CP1 with the help of vector operations so that you can perform multiple useful arithmetic operations with one instruction. Do not use any other form of parallelism yet in this exercise. Please do all arithmetic with double-precision floating point numbers.

–  –  3 + 0  ★  M  2024-08-31 at 23:59:59
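One way to express this (a sketch, assuming GCC or Clang, whose vector extensions map the element-wise + below to a SIMD instruction; intrinsics are another option):

```cpp
#include <cassert>
#include <cstddef>

// A 4-wide vector of doubles: with GCC/Clang vector extensions,
// arithmetic on this type compiles to SIMD instructions where available.
typedef double double4_t __attribute__((vector_size(4 * sizeof(double))));

// Sum n doubles, 4 elements per vector addition.
double vsum(const double* a, std::size_t n) {
    double4_t acc = {0.0, 0.0, 0.0, 0.0};
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        double4_t v = {a[i], a[i + 1], a[i + 2], a[i + 3]};
        acc += v;  // one vector addition handles 4 doubles
    }
    double s = acc[0] + acc[1] + acc[2] + acc[3];  // horizontal sum
    for (; i < n; ++i) s += a[i];  // scalar tail
    return s;
}
```

In practice you would also pad or align the data so that the scalar tail disappears and loads are aligned, but that is omitted here for brevity.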
CP3a: fast solution with doubles  
Using all resources that you have in the CPU, solve the task as fast as possible. You are encouraged to exploit instruction-level parallelism, multithreading, and vector instructions whenever possible, and also to optimize the memory access pattern. Please do all arithmetic with double-precision floating point numbers.

–  –  5 + 0  ★★  M  2024-08-31 at 23:59:59
CP3b: fast solution with floats  
Using all resources that you have in the CPU, solve the task as fast as possible. You are encouraged to exploit instruction-level parallelism, multithreading, and vector instructions whenever possible, and also to optimize the memory access pattern. In this task, you are permitted to use single-precision floating point numbers.

–  –  5 + 0  ★★  M  2024-08-31 at 23:59:59
CP4: GPU baseline  
Implement a simple baseline solution for the GPU. Make sure it works correctly and that it is reasonably efficient. Make sure that all performance-critical parts are executed on the GPU; you can do some lightweight preprocessing and postprocessing also on the CPU. Remember to check all CUDA operations for errors. In this task, you are permitted to use single-precision floating point numbers.

–  –  5 + 0  ★  M  2024-08-31 at 23:59:59
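For the error-checking requirement, a common pattern is a macro that wraps every CUDA runtime call; a sketch follows (the macro name CHECK and the kernel name in the usage comment are ours, not part of the CUDA API):

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Abort with a readable message if any CUDA runtime call fails.
#define CHECK(call) do {                                            \
    cudaError_t err_ = (call);                                      \
    if (err_ != cudaSuccess) {                                      \
        std::fprintf(stderr, "CUDA error %s at %s:%d\n",            \
                     cudaGetErrorString(err_), __FILE__, __LINE__); \
        std::exit(EXIT_FAILURE);                                    \
    }                                                               \
} while (0)

// Usage, e.g. around allocation, copies, and kernel launches:
//   float* dGPU = nullptr;
//   CHECK(cudaMalloc(&dGPU, n * sizeof(float)));
//   CHECK(cudaMemcpy(dGPU, data, n * sizeof(float), cudaMemcpyHostToDevice));
//   mykernel<<<blocks, threads>>>(dGPU, n);
//   CHECK(cudaGetLastError());  // catches kernel launch errors
//   CHECK(cudaFree(dGPU));
```

Kernel launches themselves do not return an error code, which is why the cudaGetLastError call after the launch is needed.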
CP5: fast GPU solution  
Using all resources that you have in the GPU, solve the task as fast as possible. In this task, you are permitted to use single-precision floating point numbers.

–  –  10 + 0  ★★  M  2024-08-31 at 23:59:59
CP9a: better algorithm  
Try to use Strassen's algorithm to speed up matrix multiplication. Reimplement your solution to CP3b so that it uses the basic idea of Strassen's algorithm. It is sufficient to use the basic idea of Strassen's algorithm (adapted to our task as needed) in the topmost levels of recursion; you can then fall back to the naive algorithm. A pure implementation of the naive algorithm, even if it is sufficiently fast, is not a valid solution for this exercise. Care is needed with rounding errors; you may need to resort to double-precision arithmetic to pass the tests.

–  –  5 + 0  ★★★  M  2024-08-31 at 23:59:59
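The basic idea at one level of recursion can be shown on a 2x2 product with scalar entries (these are the standard Strassen formulas; in the real algorithm each entry is a block of the matrix and each + and * is a matrix operation):

```cpp
#include <cassert>

// One level of Strassen's scheme for C = A * B on 2x2 operands:
// 7 multiplications instead of the naive 8.
void strassen2x2(const double A[2][2], const double B[2][2], double C[2][2]) {
    const double m1 = (A[0][0] + A[1][1]) * (B[0][0] + B[1][1]);
    const double m2 = (A[1][0] + A[1][1]) * B[0][0];
    const double m3 = A[0][0] * (B[0][1] - B[1][1]);
    const double m4 = A[1][1] * (B[1][0] - B[0][0]);
    const double m5 = (A[0][0] + A[0][1]) * B[1][1];
    const double m6 = (A[1][0] - A[0][0]) * (B[0][0] + B[0][1]);
    const double m7 = (A[0][1] - A[1][1]) * (B[1][0] + B[1][1]);
    C[0][0] = m1 + m4 - m5 + m7;
    C[0][1] = m3 + m5;
    C[1][0] = m2 + m4;
    C[1][1] = m1 - m2 + m3 + m6;
}
```

Applied recursively to half-sized blocks, this trades one multiplication for extra additions at each level, which is what makes the algorithm asymptotically faster than the naive one; the extra additions and subtractions are also where rounding errors grow.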
You are given m input vectors, each with n numbers. Your task is to calculate the correlation between every pair of input vectors.
You need to implement the following function:

void correlate(int ny, int nx, const float* data, float* result)

Here data is a pointer to the input matrix, with ny rows and nx columns. For all 0 <= y < ny and 0 <= x < nx, the element at row y and column x is stored in data[x + y*nx].
The function has to solve the following task: for all i and j with 0 <= j <= i < ny, calculate the correlation coefficient between row i of the input matrix and row j of the input matrix, and store the result in result[i + j*ny].
Note that the correlations are symmetric, so we will only compute the upper triangle of the result matrix. You can leave the lower triangle (i < j) undefined.
The arrays data and result are already allocated by whoever calls this function; you do not need to do any memory management related to these arrays. You should not assume that result contains any valid values at the point of call. In particular, it is not guaranteed to be initialized with zeros.
The input and output are always given as single-precision floating point numbers (type float). However, depending on the task, we will do arithmetic with either single- or double-precision numbers:
- In the tasks that ask for double-precision arithmetic (CP1, CP2a, CP2b, CP2c, and CP3a), all intermediate results must be stored in variables of type double, and you will only round the final result to single precision.
- In the remaining tasks (CP3b, CP4, and CP5), you are permitted to use single-precision numbers (type float) whenever possible.
However, in each case you will have to make sure the numerical precision of the results is sufficiently high. The grading tool makes sure the error is sufficiently small. The error thresholds are chosen so that a straightforward and efficient solution is typically good enough, but please feel free to ask the course staff for hints if you are struggling with the rounding errors.
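To see why the accumulator type matters, compare summing the same float data with a float accumulator and with a double accumulator (a toy demonstration of our own, not from the course material):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Sum n copies of x with a float accumulator: every addition is rounded
// to single precision, so the rounding errors accumulate over n steps.
float sum_float(float x, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) s += x;
    return s;
}

// The same sum with a double accumulator: intermediate results keep
// ~53 bits of precision, so the result is essentially exact, and only
// the caller rounds it to single precision at the very end.
double sum_double(float x, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) s += x;
    return s;
}
```

With ten million additions of 0.1f, the double accumulator stays within a tiny fraction of the exact sum, while the float accumulator drifts visibly away from it.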
These examples show what a correct implementation will do if you apply it to a bitmap image. [Example images not reproduced here.]
A reasonable way to calculate all pairwise correlations is the following:
1. First normalize the input rows so that each row has the arithmetic mean 0: subtract the mean of the row from each element of the row.
2. Then normalize the input rows so that for each row the sum of the squares of the elements is 1: divide each element of the row by the square root of the sum of squares of the row.
3. Let X be the normalized input matrix, and calculate the (upper triangle of the) matrix product Y = XX^T.
Now matrix Y contains all pairwise correlations. The only computationally intensive part is the computation of the matrix product; the normalizations can be done in linear time in the input size.
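A sequential CP1-style baseline following the approach above might look like this (a sketch against the interface given earlier, with all intermediate arithmetic in double; it assumes no row is constant, i.e. the sum of squares after centering is nonzero):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Baseline: normalize each row to mean 0 and sum of squares 1 (in double),
// then compute the upper triangle of Y = X * X^T.
void correlate(int ny, int nx, const float* data, float* result) {
    std::vector<double> norm((std::size_t)ny * nx);
    for (int y = 0; y < ny; ++y) {
        double mean = 0.0;
        for (int x = 0; x < nx; ++x) mean += data[x + (std::size_t)y * nx];
        mean /= nx;
        double ss = 0.0;  // sum of squares after centering
        for (int x = 0; x < nx; ++x) {
            double v = data[x + (std::size_t)y * nx] - mean;
            norm[x + (std::size_t)y * nx] = v;
            ss += v * v;
        }
        double inv = 1.0 / std::sqrt(ss);  // assumes ss > 0 (no constant rows)
        for (int x = 0; x < nx; ++x) norm[x + (std::size_t)y * nx] *= inv;
    }
    // Upper triangle of the matrix product: result[i + j*ny] for j <= i.
    for (int j = 0; j < ny; ++j) {
        for (int i = j; i < ny; ++i) {
            double s = 0.0;
            for (int x = 0; x < nx; ++x)
                s += norm[x + (std::size_t)i * nx] * norm[x + (std::size_t)j * nx];
            result[i + j * ny] = (float)s;
        }
    }
}
```

After the normalization, the correlation of each pair of rows is just their dot product, so the inner triple loop is a plain matrix multiplication, which is the part the later tasks ask you to optimize.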