Please read the general instructions on this page first, and then check the individual tasks for more details. For each task you can download a zip file that contains the code templates you can use for development.
| Task | Attempts | Points | Max | Rating | Rec. | Deadline for full points |
|---|---|---|---|---|---|---|
| I8MM2: CPU baseline. Implement a simple sequential baseline solution. Do not try to use any form of parallelism yet; try to make it work correctly first. | – | – | 3 + 0 | ★ | Rec. | 2026-10-07 at 23:59:59 |
| I8MM3: fast CPU. Implement a fast CPU solution using multithreading. Use instruction-level parallelism and vectorization, as well as register and cache reuse. | – | – | 5 + 0 | ★+ | Rec. | 2026-10-21 at 23:59:59 |
| I8MM4: GPU baseline. Implement a GPU baseline solution. | – | – | 3 + 0 | ★ | Rec. | 2026-11-04 at 23:59:59 |
| I8MM5: fast GPU. Implement a fast GPU solution using regular (portable) CUDA. | – | – | 3 + 0 | ★★ | Rec. | 2026-11-18 at 23:59:59 |
| I8MM6: Tensor Core baseline. Implement a tensor core baseline. Do not try to use any form of data reuse yet. This is a technique exercise; a valid solution must make use of the tensor cores. | – | – | 4 + 0 | ★★ | Rec. | 2026-12-02 at 23:59:59 |
| I8MM9a: CPU AVX512-VNNI. Implement a fast CPU solution using AVX512-VNNI instructions. | – | – | 4 + 2 | ★★ | Rec. | 2026-12-02 at 23:59:59 |
| I8MM9b: SIMD GPU. Implement a fast GPU solution using the specialized SIMD instructions available on Turing. This is a technique exercise; a valid solution must make use of these instructions. | – | – | 4 + 2 | ★★ | Rec. | 2026-12-02 at 23:59:59 |
| I8MM9c: fast Tensor Core. Implement a fast Tensor Core solution. | – | – | 7 + 2 | ★★★ | Rec. | 2026-12-02 at 23:59:59 |
You are given an m × k matrix and a k × n matrix consisting of 8-bit integers. Your task is to calculate the m × n product of the two matrices.
You need to implement the following function:
```cpp
void gemm(int m, int n, int k, const int8_t* A, const int8_t* B, int32_t* C);
```
Here A and B are pointers to the input matrices, with m rows and k columns for A, and
k rows and n columns for B.
For all 0 <= y < m and 0 <= x < k, the element at row y and column x of matrix A is stored in A[x + y*k].
For all 0 <= y < k and 0 <= x < n, the element at row y and column x of matrix B is stored in B[x + y*n].
The function has to solve the following task:
for all i and j with 0 <= i < m and 0 <= j < n,
calculate the inner product between row i of A and column j of B, and store the result in C[j + i*n].
The arrays A, B, and C are already allocated by whoever calls this function; you do not need to do any memory management related to these arrays.
For the tasks that are to be solved on the CPU, A, B, and C point to CPU memory; for the tasks that are to be solved on the GPU, they point to device memory.
You should not assume that C contains any valid values at the point of call. In particular, it is not guaranteed to be initialized with zeros.
The reduction dimension k is guaranteed to be less than 65536, so that all results can be represented as 32-bit signed integers.
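To make the expected indexing and accumulation concrete, here is a minimal sequential reference in the spirit of the CPU baseline task; this is a sketch to illustrate the specification, not an optimized solution.

```cpp
#include <cstdint>

// Sequential baseline: C[i][j] = sum over l of A[i][l] * B[l][j],
// with A stored row-major as m x k and B stored row-major as k x n.
void gemm(int m, int n, int k, const int8_t* A, const int8_t* B, int32_t* C) {
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            int32_t acc = 0;
            for (int l = 0; l < k; ++l) {
                // Widen both 8-bit operands so the product and the running
                // sum are computed in 32 bits.
                acc += int32_t(A[l + i * k]) * int32_t(B[j + l * n]);
            }
            C[j + i * n] = acc;
        }
    }
}
```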
While floating-point and integer matrix multiplication appear very similar, there is one crucial difference at the microarchitectural level: when multiplying two 32-bit floating-point numbers, the result is again a 32-bit floating-point number that can be added to another 32-bit floating-point number. In contrast, the product of two 8-bit integers is a 16-bit integer, and if you want to add up multiple such products, the accumulator needs to be a 32-bit integer.
There cannot be a SIMD instruction that takes two vector registers of packed 8-bit integers and accumulates element-wise into a third register, the way VFMADDPS does for 32-bit floats: the destination register is simply too small to accumulate all 64 products, assuming 512-bit wide registers. Instead, the hardware implements inner-product-like operations: take pairs (or groups of 4) of 8-bit integers in one operand, multiply each with the corresponding 8-bit integer in the second operand, sum the individual products, and accumulate into the destination operand. This way, the destination holds fewer integers, but of higher bit width.
In generic AVX-512, there is one instruction for doing an 8-bit inner product over pairs of numbers with 16-bit accumulation.
This is not particularly useful, because accumulation needs to happen in 32 bits to prevent overflows.
However, a similar instruction exists for an inner product over two 16-bit numbers, with accumulation in 32 bits.
Expanding the 8-bit numbers to 16 bits and then using _mm512_madd_epi16 can be a viable strategy.
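As an illustration, the inner loop of that strategy could look roughly like the sketch below. It assumes that k is a multiple of 32 and that the relevant column of B has already been packed into a contiguous buffer; both are simplifications made for the example, not requirements of the task.

```cpp
#include <cstdint>
#include <immintrin.h>

// One inner product using the "widen to 16 bits, then _mm512_madd_epi16" idea.
// a: one row of A (contiguous), b: one column of B packed contiguously.
int32_t dot_i8_madd(const int8_t* a, const int8_t* b, int k) {
    __m512i acc = _mm512_setzero_si512();
    for (int l = 0; l < k; l += 32) {
        // Load 32 bytes from each operand and sign-extend to 32 x int16.
        __m512i va = _mm512_cvtepi8_epi16(
            _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a + l)));
        __m512i vb = _mm512_cvtepi8_epi16(
            _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b + l)));
        // Multiply 16-bit lanes and add adjacent pairs into 16 x int32,
        // then accumulate; since k < 65536 this cannot overflow 32 bits.
        acc = _mm512_add_epi32(acc, _mm512_madd_epi16(va, vb));
    }
    // Horizontal sum of the 16 x int32 lanes.
    return _mm512_reduce_add_epi32(acc);
}
```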
For the VNNI task, note that the available instruction _mm512_dpbusds_epi32 only allows multiplying one unsigned operand with one signed operand. To reap the speed benefits of this instruction, you thus need to implement pre- and postprocessing that maps the signed × signed matrix multiplication onto an unsigned × signed one. In contrast, the __dp4a intrinsic in CUDA directly supports signed × signed multiplication.
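One standard mapping, sketched below, biases the signed left-hand operand by +128 so that it becomes unsigned, and subtracts the bias contribution afterwards, since (a + 128) · b = a · b + 128 · Σb. The helper name, the packed column buffer, and the assumption that k is a multiple of 64 are illustrative choices, not part of the task interface.

```cpp
#include <cstdint>
#include <immintrin.h>

// One inner product with _mm512_dpbusds_epi32 (AVX512-VNNI).
// a_biased: one row of A with 128 added to every byte (now unsigned),
// b: one column of B packed contiguously (still signed),
// b_colsum: precomputed sum of that column of B.
int32_t dot_i8_vnni(const uint8_t* a_biased, const int8_t* b,
                    int32_t b_colsum, int k) {
    __m512i acc = _mm512_setzero_si512();
    for (int l = 0; l < k; l += 64) {
        __m512i va = _mm512_loadu_si512(a_biased + l);
        __m512i vb = _mm512_loadu_si512(b + l);
        // 64 unsigned x signed 8-bit products, summed in groups of 4
        // into the 16 x int32 accumulator.
        acc = _mm512_dpbusds_epi32(acc, va, vb);
    }
    // Undo the +128 bias: (a + 128) . b = a . b + 128 * sum(b).
    return _mm512_reduce_add_epi32(acc) - 128 * b_colsum;
}
```

On the GPU, no such trick is needed; a minimal use of __dp4a (available from compute capability 6.1 onwards) looks like this, where a4 and b4 each hold four consecutive int8 values reinterpreted as one 32-bit word:

```cuda
// Returns acc + (a0*b0 + a1*b1 + a2*b2 + a3*b3), all operands signed 8-bit.
__device__ int32_t dot4(int32_t acc, int32_t a4, int32_t b4) {
    return __dp4a(a4, b4, acc);
}
```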
A simple mental model for the basic operation of a tensor core is that it extends the vector operations of regular SIMD processing to instructions that operate on fixed-size matrix fragments. When using 8-bit integer operands, each warp of the GPU can process the product of two 16 × 16 fragments in a single instruction. The C++ interface to these instructions is documented in the CUDA Programming Guide.
As such, you can consider the input and output matrices as being built up from 16 × 16 tiles, and the algorithm can be implemented the same way as a scalar matrix multiplication, except that each element is now a matrix fragment. In particular, optimizations like register reuse (now at the level of entire fragments), shared memory, and the choice of the right data layout remain critical for good performance.
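As a concrete starting point, a warp-per-tile kernel in the spirit of the Tensor Core baseline task might look like the sketch below. It uses the wmma fragment API from <mma.h> and assumes, purely for simplicity, that m, n, and k are all multiples of 16; these assumptions, the kernel name, and the indexing scheme are illustrative only.

```cuda
#include <cstdint>
#include <mma.h>
using namespace nvcuda;

// One warp computes one 16 x 16 tile of C, with no data reuse.
// A is row-major m x k, B is row-major k x n, C is row-major m x n.
__global__ void gemm_wmma_baseline(int m, int n, int k,
                                   const int8_t* A, const int8_t* B,
                                   int32_t* C) {
    int warp = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int tiles_x = n / 16;
    int tile_i = warp / tiles_x;   // tile row in C
    int tile_j = warp % tiles_x;   // tile column in C
    if (tile_i >= m / 16) return;

    // The wmma API uses signed char / int as element types.
    const signed char* As = reinterpret_cast<const signed char*>(A);
    const signed char* Bs = reinterpret_cast<const signed char*>(B);
    int* Cs = reinterpret_cast<int*>(C);

    wmma::fragment<wmma::matrix_a, 16, 16, 16, signed char, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, signed char, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, int> c_frag;
    wmma::fill_fragment(c_frag, 0);

    for (int l = 0; l < k; l += 16) {
        // Load one 16 x 16 fragment of A and of B straight from global memory.
        wmma::load_matrix_sync(a_frag, As + tile_i * 16 * k + l, k);
        wmma::load_matrix_sync(b_frag, Bs + l * n + tile_j * 16, n);
        // One tensor core operation per warp: c_frag += a_frag * b_frag.
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    }
    wmma::store_matrix_sync(Cs + tile_i * 16 * n + tile_j * 16, c_frag, n,
                            wmma::mem_row_major);
}
```

One possible launch is one warp per output tile, for example 8 warps (256 threads) per block and enough blocks to cover all (m/16) · (n/16) tiles; a fast solution would then add shared memory staging and fragment-level reuse on top of this skeleton.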