Please read the general instructions on this page first, and then check the individual tasks for more details. For each task you can download a zip file that contains the code templates you can use for development.

| Task | Attempts | Points | Max | Rating | Rec. | Deadline for full points |
|---|---|---|---|---|---|---|
| I8MM2: CPU baseline – Implement a simple sequential baseline solution. Do not try to use any form of parallelism yet; try to make it work correctly first. | – | – | 3 + 0 | ★ | R | 2025-10-08 at 23:59:59 |
| I8MM3: fast CPU – Implement a fast CPU solution using multithreading. Use instruction-level parallelism, vectorization, as well as register and cache reuse. | – | – | 5 + 0 | ★+ | R | 2025-10-22 at 23:59:59 |
| I8MM4: GPU baseline – Implement a GPU baseline solution. | – | – | 3 + 0 | ★ | R | 2025-11-05 at 23:59:59 |
| I8MM5: fast GPU – Implement a fast GPU solution using regular (portable) CUDA. | – | – | 3 + 0 | ★★ | R | 2025-11-19 at 23:59:59 |
| I8MM6: Tensorcore baseline – Implement a tensorcore baseline. Do not try to use any form of data reuse yet. This is a technique exercise; a valid solution must make use of the tensor cores. | – | – | 4 + 0 | ★★ | R | 2025-12-03 at 23:59:59 |
| I8MM9a: CPU AVX512-VNNI – Implement a fast CPU solution using AVX512-VNNI instructions. | – | – | 4 + 2 | ★★ | R | 2025-12-03 at 23:59:59 |
| I8MM9b: SIMD GPU – Implement a fast GPU solution using specialized SIMD instructions available on Turing. This is a technique exercise; a valid solution must make use of these instructions. | – | – | 4 + 2 | ★★ | R | 2025-12-03 at 23:59:59 |
| I8MM9c: fast Tensorcore – Implement a fast tensorcore solution. | – | – | 7 + 2 | ★★★ | R | 2025-12-03 at 23:59:59 |

You are given an m × k matrix and a k × n matrix consisting of 8-bit integers. Your task is to calculate the m × n product of the two matrices.
You need to implement the following function:
```cpp
void gemm(int m, int n, int k, const int8_t* A, const int8_t* B, int32_t* C);
```
Here A and B are pointers to the input matrices, with m rows and k columns for A, and
k rows and n columns for B.
For all 0 <= y < m and 0 <= x < k, the element at row y and column x of matrix A is stored in A[x + y*k].
For all 0 <= y < k and 0 <= x < n, the element at row y and column x of matrix B is stored in B[x + y*n].
The function has to solve the following task:
for all i and j with 0 <= i < m and 0 <= j < n,
calculate the inner product between row i of A and column j of B, and store the result in C[j + i*n].
The arrays A, B, and C are already allocated by whoever calls this function; you do not need to do any memory management related to these arrays.
For the tasks that are to be solved on the CPU, A, B, and C point to CPU memory;
for the tasks that are to be solved on the GPU, they point to GPU device memory.
You should not assume that C contains any valid values at the point of call. In particular, it is not guaranteed to be initialized with zeros.
The reduction dimension k is guaranteed to be less than 65536, so that all results can be represented as 32-bit signed integers.
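The whole specification boils down to a short triple loop; the sketch below is roughly the kind of sequential baseline that the first task asks for, using the index formulas given above, and is not a tuned or official model solution:

```cpp
#include <cstdint>

// Straightforward triple loop over the row-major layouts described above.
void gemm(int m, int n, int k, const int8_t* A, const int8_t* B, int32_t* C) {
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            int32_t acc = 0;  // C is not pre-initialized, so start from zero
            for (int l = 0; l < k; ++l) {
                // element (i, l) of A times element (l, j) of B
                acc += int32_t(A[l + i * k]) * int32_t(B[j + l * n]);
            }
            C[j + i * n] = acc;
        }
    }
}
```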
While floating-point and integer matrix multiplication appear very similar, there is one crucial difference at the micro-architectural level: when multiplying two 32-bit floating-point numbers, the result is again a 32-bit floating-point number that can be added to a 32-bit floating-point number. In contrast, the product of two 8-bit integers is a 16-bit integer, and if you want to add up several of these products, the accumulator needs to be a 32-bit integer.
There cannot be a SIMD instruction that takes two vector registers of packed 8-bit integers and accumulates into a third register (like, e.g., VFMADDPS does for floats); the destination register is simply too small to accumulate all 64 products, assuming 512-bit wide registers. Instead, the hardware implements inner-product-like operations: take pairs (or groups of 4) of 8-bit integers in one operand, multiply each with the corresponding 8-bit integer in the second operand, sum the individual products, and accumulate the sum into the destination operand. This way, the destination can contain fewer, but higher bit-width, integers.
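As a concrete scalar model of what one such operation does per 32-bit output lane, consider the hypothetical helper below; the name is made up for illustration, but the arithmetic mirrors what a groups-of-4 dot-product instruction performs in every lane of a vector register:

```cpp
#include <cstdint>

// Model of one 32-bit destination lane of a "dot product over groups of 4" instruction:
// four 8-bit by 8-bit products are summed and accumulated into a 32-bit value.
inline int32_t dot4_lane(int32_t acc, const int8_t a[4], const int8_t b[4]) {
    for (int i = 0; i < 4; ++i) {
        acc += int32_t(a[i]) * int32_t(b[i]);
    }
    return acc;
}
```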
In generic AVX-512, there is one instruction for computing an 8-bit inner product over pairs of numbers with 16-bit accumulation.
This is not particularly useful, because accumulation needs to happen in 32 bits to prevent overflows.
However, a similar instruction exists for an inner product over two 16-bit numbers, with accumulation in 32 bits.
Expanding the 8-bit numbers to 16 bits and then using _mm512_madd_epi16 can therefore be a viable strategy.
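As an illustration of that strategy, here is a sketch of a single dot product of two int8 vectors, widening to 16 bits and accumulating with _mm512_madd_epi16. It assumes AVX-512BW support and a length that is a multiple of 32; the function name is chosen only for this example:

```cpp
#include <immintrin.h>
#include <cstdint>

int32_t dot_i8_madd(const int8_t* a, const int8_t* b, int len) {
    __m512i acc = _mm512_setzero_si512();
    for (int i = 0; i < len; i += 32) {
        // Load 32 bytes and sign-extend each 8-bit value to 16 bits.
        __m512i va = _mm512_cvtepi8_epi16(_mm256_loadu_si256((const __m256i*)(a + i)));
        __m512i vb = _mm512_cvtepi8_epi16(_mm256_loadu_si256((const __m256i*)(b + i)));
        // Multiply 16-bit lanes and add adjacent pairs into 32-bit lanes.
        acc = _mm512_add_epi32(acc, _mm512_madd_epi16(va, vb));
    }
    // Horizontal sum of the 16 partial 32-bit sums.
    int32_t partial[16];
    _mm512_storeu_si512(partial, acc);
    int32_t result = 0;
    for (int i = 0; i < 16; ++i) result += partial[i];
    return result;
}
```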
For the VNNI task, note that the available instruction _mm512_dpbusds_epi32 only
allows multiplying one unsigned operand with one signed operand. In order to reap the speed benefits of this instruction,
you thus need to implement pre- and postprocessing that maps signed integer matrix multiplication to signed times unsigned
matrix multiplication. The __dp4a intrinsic in CUDA, in contrast, directly supports signed times signed multiplication.
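One common way to implement such a mapping, sketched below for a single dot product, is to add 128 to the signed operand so that it becomes unsigned, and to subtract the resulting error term (128 times the sum of the other operand) afterwards. The helper assumes AVX512-VNNI support and a length divisible by 64; its name is made up for this example, and in a full solution the correction term would be precomputed once per column of B rather than per dot product:

```cpp
#include <immintrin.h>
#include <cstdint>

int32_t dot_i8_vnni(const int8_t* a, const int8_t* b, int len) {
    const __m512i bias = _mm512_set1_epi8((char)0x80);
    __m512i acc = _mm512_setzero_si512();
    for (int i = 0; i < len; i += 64) {
        __m512i va = _mm512_loadu_si512(a + i);
        __m512i vb = _mm512_loadu_si512(b + i);
        // Map a from signed [-128, 127] to unsigned [0, 255]; adding 128 is
        // the same bit-level operation as XOR with 0x80.
        __m512i ua = _mm512_xor_si512(va, bias);
        // acc += groups-of-4 dot products of unsigned(ua) with signed(vb).
        // With k < 65536 the saturating accumulation never actually saturates.
        acc = _mm512_dpbusds_epi32(acc, ua, vb);
    }
    // Undo the bias: (a + 128) . b = a . b + 128 * sum(b).
    int32_t b_sum = 0;
    for (int i = 0; i < len; ++i) b_sum += b[i];
    int32_t partial[16];
    _mm512_storeu_si512(partial, acc);
    int32_t result = 0;
    for (int i = 0; i < 16; ++i) result += partial[i];
    return result - 128 * b_sum;
}
```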
A simple mental model for the basic operation of a tensorcore is that it extends the vector operations of regular SIMD processing to instructions that operate on fixed-size matrix fragments. When using 8-bit integer operands, each warp of the GPU can process the product of two 16 × 16 fragments in a single instruction. The C++ interface to these instructions is documented in the CUDA Programming Guide.
As such, you can consider the input and output matrices as being built up from 16 × 16 tiles, and the algorithm can be implemented in the same way as a scalar matrix multiplication, except that each element is now a matrix fragment. In particular, optimizations like register reuse (now at the level of entire fragments), shared memory, and the choice of the right data layout remain critical for good performance.
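As a starting point, the kernel below sketches this fragment-level structure using the WMMA API from <mma.h>: one warp computes one 16 × 16 tile of C with no data reuse yet, assuming that m, n, and k are multiples of 16 and that the kernel is launched with one 32-thread block per output tile. It only illustrates the technique; the kernel name and launch configuration are choices made for this example:

```cuda
#include <mma.h>
#include <cstdint>

// Launch example: gemm_wmma_baseline<<<dim3(n / 16, m / 16), 32>>>(m, n, k, A, B, C);
__global__ void gemm_wmma_baseline(int m, int n, int k,
                                   const int8_t* A, const int8_t* B, int32_t* C) {
    using namespace nvcuda::wmma;
    int tile_i = blockIdx.y;  // which 16x16 tile row of C this warp computes
    int tile_j = blockIdx.x;  // which 16x16 tile column of C this warp computes

    fragment<matrix_a, 16, 16, 16, signed char, row_major> a_frag;
    fragment<matrix_b, 16, 16, 16, signed char, row_major> b_frag;
    fragment<accumulator, 16, 16, 16, int> c_frag;
    fill_fragment(c_frag, 0);

    const signed char* a = reinterpret_cast<const signed char*>(A);
    const signed char* b = reinterpret_cast<const signed char*>(B);

    for (int kk = 0; kk < k; kk += 16) {
        // The last argument is the leading dimension (row stride) of the source matrix.
        load_matrix_sync(a_frag, a + tile_i * 16 * k + kk, k);
        load_matrix_sync(b_frag, b + kk * n + tile_j * 16, n);
        // c_frag += a_frag * b_frag, executed cooperatively by the whole warp.
        mma_sync(c_frag, a_frag, b_frag, c_frag);
    }
    store_matrix_sync(C + tile_i * 16 * n + tile_j * 16, c_frag, n, mem_row_major);
}
```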