Programming Parallel Computers

Aalto 2024

LLM: large language model

Please read the general instructions on this page first, and then check the individual tasks for more details. For each task you can download a zip file that contains the code templates you can use for development.

Task: LLM9a: CPU optimization

Implement efficient processing of an LLM prompt using all available single-precision processing power of the CPU.

Points: 5 + 2
Rating: ★★
Recommended deadline for full points: 2024-06-02 at 23:59:59

General instructions for this exercise

In this exercise, you will optimize an implementation of a Large Language Model, in this particular case a model of the LLaMA family. A short description of the connection between parallel computation and the success of transformer-based language models such as LLaMA is given below.

Generally, the work of an LLM can be split into two phases: processing the user prompt, and generating new tokens based on the prompt. In this exercise, we concentrate on the former. Even with prompt processing alone, we can build interesting applications, such as predicting how surprising a given piece of text is; see the demo section below for more information.

For this exercise, we have prepared a baseline implementation of the LLaMA 2 model, based on Andrej Karpathy's llama2.c. Your task is to identify the bottlenecks, and apply the techniques you have learned in this course to speed up prompt processing.

Interface

You need to implement the following function:

void llm(LLamaConfig config,
         LLamaParameters params,
         const std::vector<int>& tokens,
         std::vector<float>& logits);

The parameters are as follows:

You can assume that the configuration always describes a valid network (i.e., all sizes > 0), and that the network dimensions (config.dim and config.hidden_dim) are multiples of 16.

Details

The exercise comes with a baseline implementation that handles single-threaded, scalar processing of the LLaMA model. For each token, this consists of a sequence of steps such as normalization, self-attention, and feed-forward computations.

We provide implementations of the required utility functions in the llm.h header. You are free to use these functions, or provide your own implementations if you think this will improve performance.
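The exact set of helper functions depends on the template, but in llama2.c-style implementations the bulk of the per-token work goes into dense matrix-vector products. As a point of reference, such a product has roughly the following shape (a minimal scalar sketch; matvec_scalar, the row-major layout, and the parameter names are illustrative, not the template's actual API):

// Scalar matrix-vector product: out = W * x, with W stored row-major as d rows of n floats.
// Illustrative sketch only; the template's own helpers may differ.
void matvec_scalar(float* out, const float* x, const float* W, int n, int d) {
    for (int i = 0; i < d; ++i) {
        float sum = 0.0f;
        for (int j = 0; j < n; ++j) {
            sum += W[i * n + j] * x[j];
        }
        out[i] = sum;
    }
}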

Hints

Our tests contain setups in which specific submodules of the Transformer are disabled. These might help you track down implementation errors.

Familiarize yourself with the provided reference implementation. Try to identify where most of the time is spent, and which parts of the computation are easily parallelizable. Just adding #pragma omp parallel for in the right places will already give you at least one point for this exercise.
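As a concrete example, if the hot loop is a matrix-vector product like the sketch above, parallelizing the independent output rows could look like this (again a sketch with illustrative names, not the template's code):

// Each output element out[i] is independent of the others, so the outer loop
// can be distributed across threads with a single pragma.
void matvec_omp(float* out, const float* x, const float* W, int n, int d) {
    #pragma omp parallel for
    for (int i = 0; i < d; ++i) {
        float sum = 0.0f;
        for (int j = 0; j < n; ++j) {
            sum += W[i * n + j] * x[j];
        }
        out[i] = sum;
    }
}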

Further improvements are possible if you apply ILP and vectorization, but to achieve full points in this exercise, you need to come up with a more efficient parallelization strategy.
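For the ILP and vectorization part, one possible direction is sketched below. It assumes the inner dimension n is a multiple of 16 (which the task guarantees for config.dim and config.hidden_dim), uses GCC/Clang vector extensions for 8-wide SIMD, and keeps two independent accumulators per row for instruction-level parallelism. This is only one illustrative option, not the full-points strategy by itself.

// Illustrative sketch: 8-wide SIMD via GCC/Clang vector extensions plus two
// independent accumulators per output row. Assumes n is a multiple of 16.
typedef float float8 __attribute__((vector_size(8 * sizeof(float))));

void matvec_simd(float* out, const float* x, const float* W, int n, int d) {
    #pragma omp parallel for
    for (int i = 0; i < d; ++i) {
        const float* row = W + (long)i * n;
        float8 acc0 = {0, 0, 0, 0, 0, 0, 0, 0};
        float8 acc1 = {0, 0, 0, 0, 0, 0, 0, 0};
        for (int j = 0; j < n; j += 16) {
            float8 w0, w1, x0, x1;
            __builtin_memcpy(&w0, row + j, sizeof(w0));      // unaligned loads via memcpy
            __builtin_memcpy(&w1, row + j + 8, sizeof(w1));
            __builtin_memcpy(&x0, x + j, sizeof(x0));
            __builtin_memcpy(&x1, x + j + 8, sizeof(x1));
            acc0 += w0 * x0;   // two independent dependency chains -> more ILP
            acc1 += w1 * x1;
        }
        float8 acc = acc0 + acc1;
        float sum = 0.0f;
        for (int k = 0; k < 8; ++k) sum += acc[k];           // horizontal reduction
        out[i] = sum;
    }
}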

Demo

While it is generally their ability to generate human-like text that makes LLMs impressive, there are also interesting things that can be done purely with prompt processing. In particular, since we can compare the predicted probability distributions with the actual tokens in the text, we can find out which parts of the prompt were most surprising to the network.

In the exercise template, we provide a demo tool, invoked with ./grading demo, that visualizes how surprising each token in the given text is. For example, running ./grading demo "Once upon a time, there was a test story." produces the following output:

Once upon a time, there was a test story.

Note that this involves downloading a pre-trained network over the internet.

Autoregressive Language Modelling, Transformers, and Parallelism

Modern machine learning can achieve amazing results when given a large amount of labeled training data. However, most existing data is unlabeled, yet its structure still contains valuable information that can be learned. This is what an autoregressive language model tries to capture: Given a sequence of words, predict which word is going to be the next one. More generally, the text is split into tokens, which could be words, syllables, or even single letters. All that is needed to train for a task like this is a large corpus of text, which can be gathered by, e.g., scraping Wikipedia.

Possibly the simplest way to model this task is to apply a Markov assumption: The next element in the sequence only depends on the current element. In that case, the language model is just a huge probability table — given this current token, these are the probabilities for the next token. While this is extremely efficient, it is also quite limited in its modeling capacity. It can be used, for example, to generate fake words that capture the characteristics of an actual language, but it cannot generate coherent sentences.
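For concreteness, such a model is literally just a nested lookup table. The sketch below uses hypothetical names (Token, MarkovModel, next_token_probability) purely for illustration:

#include <string>
#include <unordered_map>

// A Markov-assumption language model is nothing more than a probability table:
// for each current token, a distribution over possible next tokens.
using Token = std::string;
using NextDistribution = std::unordered_map<Token, double>;      // next token -> probability
using MarkovModel = std::unordered_map<Token, NextDistribution>; // current token -> distribution

double next_token_probability(const MarkovModel& model, const Token& current, const Token& next) {
    auto row = model.find(current);
    if (row == model.end()) return 0.0;
    auto it = row->second.find(next);
    return it == row->second.end() ? 0.0 : it->second;
}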

To increase expressibility, one could lengthen the context window of the Markov model: instead of considering just the last token, consider the last n tokens. Unfortunately, this makes the memory consumption (and the required training data) grow exponentially with the context size. An alternative is to compress the context into a state vector, and make predictions based on that state. Conceptually:

for token in sequence do
    state = update_state(state, token)
    prediction = output_prediction(state)

To further improve expressibility, in typical deep-learning fashion, one can then stack multiple of these processes together in layers, such that the state state[l, p] of layer l at position p in the sequence is a function of the state at the previous position within the same layer and of the previous layer at the same position: state[l, p] = update_state(state[l, p-1], state[l-1, p]).

There are several challenges when trying to use this approach for longer sequences; in particular, the function update_state needs to be carefully chosen to allow signals to propagate over long time distances. If you are interested in learning more, Andrej Karpathy has a nice summary of the success of recurrent neural networks.

Unfortunately, one challenge remains: How to train these networks with the vast amounts of data available? The structure of the network update means that there are dependencies both on the preceding layer and on the previous state within the same layer. This is in contrast to the more recent Transformer architecture, in which the next state is calculated based on all past states of only the previous layer, removing one axis of dependencies. Consequently, while the layers still have to be processed sequentially, all positions within a layer can be handled in parallel, as sketched below. As a result, we now have large language models trained on trillions of tokens.
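To make the difference in dependency structure concrete, here is a schematic sketch; the state type and the update functions are placeholders, and only the loop structure matters:

#include <vector>

// Placeholder state and update rules; only the dependency pattern is the point here.
using State = float;
State update_state(State prev_position, State prev_layer) { return 0.5f * (prev_position + prev_layer); }
State attend(const std::vector<State>& prev_layer, int p) {
    State s = 0.0f;                   // a transformer position may read all earlier
    for (int j = 0; j <= p; ++j)      // positions of the previous layer
        s += prev_layer[j];
    return s / float(p + 1);
}

// Recurrent network: dependencies along both the layer and the position axis,
// so both loops must run sequentially.
void recurrent_forward(std::vector<std::vector<State>>& state) {
    for (size_t l = 1; l < state.size(); ++l)
        for (size_t p = 1; p < state[l].size(); ++p)
            state[l][p] = update_state(state[l][p - 1], state[l - 1][p]);
}

// Transformer: each position depends only on the previous layer, so the layers
// are still processed sequentially, but the positions within a layer parallelize freely.
void transformer_forward(std::vector<std::vector<State>>& state) {
    for (size_t l = 1; l < state.size(); ++l) {
        #pragma omp parallel for
        for (int p = 0; p < (int)state[l].size(); ++p)
            state[l][p] = attend(state[l - 1], p);
    }
}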

The phenomenon that the success and failure of a machine learning method can be strongly dependent on the capabilities of the current hardware has become known as the Hardware Lottery.