Programming Parallel Computers

Open 2025

I8MM: integer matrix multiplication

Please read the general instructions on this page first, and then check the individual tasks for more details. For each task you can download a zip file that contains the code templates you can use for development.

For each task, the points, the rating, and the recommended deadline for full points are listed below.
I8MM2: CPU baseline

Implement a simple sequential baseline solution. Do not try to use any form of parallelism yet; try to make it work correctly first.

Points: 3 · Deadline for full points: 2025-10-08 at 23:59:59
I8MM3: fast CPU

Implement a fast CPU solution using multithreading. Use instruction-level parallelism and vectorization, as well as register and cache reuse.

Points: 5 · Rating: ★+ · Deadline for full points: 2025-10-22 at 23:59:59
I8MM4: GPU baseline

Implement a GPU baseline solution.

Points: 3 · Deadline for full points: 2025-11-05 at 23:59:59
I8MM5: fast GPU

Implement a fast GPU solution using regular (portable) CUDA.

Points: 3 · Rating: ★★ · Deadline for full points: 2025-11-19 at 23:59:59
I8MM6: Tensorcore baseline

Implement a tensorcore baseline. Do not try to use any form of data reuse yet. This is a technique exercise: a valid solution must make use of the tensor cores.

Points: 4 · Rating: ★★ · Deadline for full points: 2025-12-03 at 23:59:59
I8MM9a: CPU AVX512-VNNI

Implement a fast CPU solution using AVX512-VNNI instructions.

Points: 4 + 2 · Rating: ★★ · Deadline for full points: 2025-12-03 at 23:59:59
I8MM9b: SIMD GPU

Implement a fast GPU solution using specialized SIMD instructions available on Turing. This is a technique exercise: a valid solution must make use of the __dp4a intrinsic.

Points: 4 + 2 · Rating: ★★ · Deadline for full points: 2025-12-03 at 23:59:59
I8MM9c: fast Tensorcore

Implement a fast tensorcore solution.

Points: 7 + 2 · Rating: ★★★ · Deadline for full points: 2025-12-03 at 23:59:59

General instructions for this exercise

You are given an m × k matrix and a k × n matrix consisting of 8-bit integers. Your task is to calculate the m × n product of the two matrices.

Interface

You need to implement the following function:

void gemm(int m, int n, int k, const int8_t* A, const int8_t* B, int32_t* C);

Here A and B are pointers to the input matrices, with m rows and k columns for A, and k rows and n columns for B. For all 0 <= y < m and 0 <= x < k, the element at row y and column x of matrix A is stored in A[x + y*k]. For all 0 <= y < k and 0 <= x < n, the element at row y and column x of matrix B is stored in B[x + y*n].

The function has to solve the following task: for all i and j with 0 <= i < m and 0 <= j < n, calculate the inner product between row i of A and column j of B, and store the result in C[j + i*n].

The arrays A, B, and C are already allocated by whoever calls this function; you do not need to do any memory management related to these arrays. For the tasks that are to be solved on the CPU, A, B, and C point to CPU memory; for the tasks that are to be solved on the GPU, they point to device memory. You should not assume that C contains any valid values at the point of call. In particular, it is not guaranteed to be initialized with zeros.
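
For the CPU baseline, a direct translation of this specification into three nested loops is sufficient. The sketch below is a minimal example that only uses the indexing conventions stated above; it makes no attempt at parallelism or performance.

#include <cstdint>

// Minimal sequential baseline: C[i][j] = sum over l of A[i][l] * B[l][j].
// A is m x k, B is k x n, C is m x n, all stored in row-major order.
void gemm(int m, int n, int k, const int8_t* A, const int8_t* B, int32_t* C) {
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            int32_t acc = 0;
            for (int l = 0; l < k; ++l) {
                // Widen both operands to 32 bits before multiplying.
                acc += int32_t(A[l + i*k]) * int32_t(B[j + l*n]);
            }
            C[j + i*n] = acc;  // C is not pre-initialized, so always write the result.
        }
    }
}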

Details

The reduction dimension k is guaranteed to be less than 65536, so that all results can be represented as 32-bit signed integers: each product of two 8-bit integers has magnitude at most 128 · 128 = 16384, and a sum of fewer than 65536 such products stays well below 2^31.

While floating-point and integer matrix multiplication appear very similar, there is one crucial difference at the micro-architectural level: when multiplying two 32-bit floating-point numbers, the result is again a 32-bit floating-point number that can be added to a 32-bit floating-point accumulator. In contrast, the product of two 8-bit integers is a 16-bit integer, and if you want to sum several of these products, the accumulator needs to be a 32-bit integer.

There cannot be a SIMD instruction that takes two vector registers of packed 8-bit integers and accumulates into a third register of the same width (as, e.g., VFMADDPS does for floats); assuming 512-bit wide registers, the destination register is simply too small to accumulate all 64 products. Instead, the hardware implements inner-product-like operations: take pairs (or groups of 4) of 8-bit integers in one operand, multiply each with the corresponding 8-bit integer in the second operand, sum the individual products, and accumulate into the destination operand. This way, the destination can contain fewer, but wider, integers.
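
As a scalar model of what one 32-bit lane of such an instruction computes (the function name is only illustrative):

#include <cstdint>

// One 32-bit lane of an "inner product over groups of 4" instruction:
// four 8-bit products are summed and added to a 32-bit accumulator.
int32_t dot4_accumulate(int32_t acc, const int8_t a[4], const int8_t b[4]) {
    for (int t = 0; t < 4; ++t) {
        acc += int32_t(a[t]) * int32_t(b[t]);
    }
    return acc;
}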

SIMD Hints

In generic AVX-512, there is one instruction for computing an 8-bit inner product over pairs of numbers with 16-bit results. This is not particularly useful here, because accumulation needs to happen in 32 bits to prevent overflows. However, a similar instruction exists for an inner product over pairs of 16-bit numbers with 32-bit results. Expanding the 8-bit numbers to 16 bits and then using _mm512_madd_epi16 can be a viable strategy.
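
A minimal sketch of this strategy for a single dot product is shown below. It assumes that both operands are stored contiguously (for B this typically means transposing or packing the matrix first) and that k is a multiple of 32; the function name and these restrictions are illustrative assumptions, not part of the task.

#include <cstdint>
#include <immintrin.h>

// Dot product of two contiguous int8 vectors of length k (assumes k % 32 == 0):
// widen to int16, multiply-add adjacent pairs into 32-bit lanes, then reduce.
int32_t dot_i8_avx512(const int8_t* a, const int8_t* b, int k) {
    __m512i acc = _mm512_setzero_si512();
    for (int l = 0; l < k; l += 32) {
        // Load 32 bytes and sign-extend each 8-bit value to 16 bits.
        __m512i va = _mm512_cvtepi8_epi16(_mm256_loadu_si256((const __m256i*)(a + l)));
        __m512i vb = _mm512_cvtepi8_epi16(_mm256_loadu_si256((const __m256i*)(b + l)));
        // Multiply 16-bit pairs and sum adjacent products into 32-bit lanes.
        acc = _mm512_add_epi32(acc, _mm512_madd_epi16(va, vb));
    }
    // Sum the sixteen 32-bit lanes into a single scalar.
    return _mm512_reduce_add_epi32(acc);
}

In a real solution this inner loop would of course be embedded in a blocked multiplication that reuses registers and caches, but the widening and multiply-add pattern stays the same.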

For the VNNI task, note that the available instruction _mm512_dpbusds_epi32 only allows multiplying one signed operand and one unsigned operand. In order to reap the speed benefits of this instruction, you thus need to implement pre- and postprocessing that maps signed integer matrix multiplication to signed-times-unsigned matrix multiplication. The __dp4a intrinsic in CUDA directly supports signed-times-signed multiplication.
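
One standard mapping is to bias one operand into the unsigned range: since (a + 128) · b = a · b + 128 · b, you can flip the sign bit of every element of A (which adds 128 when the byte is reinterpreted as unsigned), run the unsigned-times-signed instruction, and afterwards subtract 128 times the corresponding column sum of B. The sketch below illustrates this for a single dot product; the function name, the precomputed column sum, and the assumption k % 64 == 0 are illustrative, not part of the task.

#include <cstdint>
#include <immintrin.h>

// Dot product of a row of A (signed) with a contiguous, packed column of B
// (signed), via the unsigned-times-signed VNNI instruction.
// Assumes k % 64 == 0 and colsum_b = sum of the elements of this column of B.
int32_t dot_i8_vnni(const int8_t* a_row, const int8_t* b_col, int k, int32_t colsum_b) {
    const __m512i bias = _mm512_set1_epi8((char)0x80);  // XOR flips the sign bit, i.e. adds 128
    __m512i acc = _mm512_setzero_si512();
    for (int l = 0; l < k; l += 64) {
        __m512i va = _mm512_xor_si512(_mm512_loadu_si512(a_row + l), bias);  // unsigned: A + 128
        __m512i vb = _mm512_loadu_si512(b_col + l);                          // signed: B
        // Per 32-bit lane: four (unsigned a) * (signed b) products, accumulated.
        acc = _mm512_dpbusds_epi32(acc, va, vb);
    }
    // Undo the bias: the sum of (a + 128) * b equals a . b + 128 * sum(b).
    return _mm512_reduce_add_epi32(acc) - 128 * colsum_b;
}

On the GPU, __dp4a needs no such correction: each call multiplies four pairs of signed 8-bit values packed into two 32-bit words and adds the sum to a 32-bit accumulator, so a per-thread inner loop (assuming 4-byte-aligned data and k % 4 == 0) can look like:

// Inside a CUDA kernel; a_row and b_col point to contiguous int8 data.
int32_t acc = 0;
const int32_t* a4 = reinterpret_cast<const int32_t*>(a_row);  // 4 packed int8 per int32
const int32_t* b4 = reinterpret_cast<const int32_t*>(b_col);
for (int l = 0; l < k / 4; ++l) {
    acc = __dp4a(a4[l], b4[l], acc);  // signed x signed, 32-bit accumulation
}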

Tensorcore Hints

A simple mental model for the basic operation of a tensorcore is that it extends the vector operations of regular SIMD processing to instructions that operate on fixed-size matrix fragments. When using 8-bit integer operands, each warp of the GPU can process the product of two 16 × 16 fragments in a single instruction. The C++ interface to these instructions is documented in the CUDA Programming Guide.

As such, you can consider the input and output matrices as built up out of 16 × 16 tiles, and the algorithm can be implemented the same way as a scalar matrix multiplication, except that each element is now a matrix fragment. In particular, optimizations like register reuse (now at the level of entire fragments), shared memory, and the choice of the right data layout remain critical for good performance.
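
To make the baseline concrete, here is a minimal sketch in which each warp computes one 16 × 16 tile of C with the WMMA interface, loading fragments straight from global memory with no data reuse. The kernel name and launch configuration are illustrative assumptions, and the sketch assumes that m, n, and k are all multiples of 16.

#include <cstdint>
#include <mma.h>
using namespace nvcuda;

// One warp computes one 16 x 16 tile of C (int8 WMMA requires compute
// capability 7.2+; Turing is 7.5). A is m x k, B is k x n, C is m x n, all row-major.
__global__ void gemm_wmma_baseline(int m, int n, int k,
                                   const int8_t* A, const int8_t* B, int32_t* C) {
    int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int tiles_x = n / 16;
    int tile_i = warp_id / tiles_x;   // tile row in C
    int tile_j = warp_id % tiles_x;   // tile column in C
    if (tile_i >= m / 16) return;

    wmma::fragment<wmma::matrix_a, 16, 16, 16, signed char, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, signed char, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, int> c_frag;
    wmma::fill_fragment(c_frag, 0);

    for (int l = 0; l < k; l += 16) {
        // Load one 16 x 16 fragment of A and of B directly from global memory.
        wmma::load_matrix_sync(a_frag, A + tile_i * 16 * k + l, k);
        wmma::load_matrix_sync(b_frag, B + l * n + tile_j * 16, n);
        // c_frag += a_frag * b_frag on the tensor cores.
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    }
    wmma::store_matrix_sync(C + tile_i * 16 * n + tile_j * 16, c_frag, n, wmma::mem_row_major);
}

The kernel would be launched with enough warps to cover all (m / 16) · (n / 16) output tiles, for example with 256 threads per block (8 warps) and a correspondingly sized grid.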