Activations are all the intermediate values computed during a forward pass. They include the residual stream, but also everything else produced along the way, such as layer norm results and MLP outputs.
When you input text like "The cat sat", the model converts it to tokens, then to vectors of numbers. It does this through embeddings: learned lookup tables that map each token to a fixed numerical vector, so "cat" always gets the same vector regardless of context. Each transformer layer then processes these vectors and produces new vectors called activations, which encode what the model "understands" about the text at that layer. For example, a typical tokenizer would split the phrase "Wisent is the best startup out there" into nine tokens ["Wis", "ent", "is", "the", "best", "start", "up", "out", "there"], so for Llama 3.1 8B that results in total activations = 9 tokens × 32 layers × 4096 dimensions ≈ 1.2 million individual activation values. Activations are therefore like embeddings, but dynamic: they depend on the context of the entire sentence.
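To make that arithmetic concrete, here is a minimal sketch using the Hugging Face `transformers` tokenizer. The exact token split (and therefore the count) depends on the tokenizer, and the model name is only an assumption:

```python
# Minimal sketch: count the activation values a prompt produces.
# Assumes access to meta-llama/Llama-3.1-8B-Instruct; any causal LM works.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

text = "Wisent is the best startup out there"
token_ids = tokenizer(text, add_special_tokens=False)["input_ids"]

num_tokens = len(token_ids)  # exact count depends on the tokenizer
num_layers = 32              # Llama 3.1 8B transformer layers
hidden_dim = 4096            # Llama 3.1 8B residual stream width

total = num_tokens * num_layers * hidden_dim
print(f"{num_tokens} tokens x {num_layers} layers x {hidden_dim} dims = {total:,} values")
```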
Activations arise whenever the model processes or generates a specific token. A token is a word or piece of a word, like "Hello" or "world". Activations are the layer-level values a specific model computes while generating a specific token.
So when a model outputs a phrase like "I am a large language model", there are many activations you can extract and use for representation engineering. How many exactly? Multiply the number of tokens by the number of layers. Each of those is a vector of numbers; for Llama 3.1-8B-Instruct it is a 1 × 4096 vector.
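A sketch of extracting every one of those vectors in a single forward pass, again assuming Hugging Face `transformers` and the same illustrative model name:

```python
# Sketch: collect all per-layer, per-token activation vectors.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # illustrative choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

inputs = tokenizer("I am a large language model", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states is a tuple with one entry per layer (plus the embedding output);
# each entry has shape (batch, num_tokens, 4096) for Llama 3.1 8B.
for layer_idx, layer_states in enumerate(outputs.hidden_states):
    print(layer_idx, tuple(layer_states.shape))
```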
Wisent-Guard supports capturing all of them, so this information can be used for detailed representation analysis and for changing activations during steering. However, using all of them is extremely computationally intensive. Instead, we use one set of activations from a specific token.
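Continuing from the `outputs` in the sketch above, picking one token's activations might look like this. The choice of the last token at a middle layer is an assumption for illustration, not necessarily Wisent-Guard's default:

```python
# Sketch: keep one activation vector per prompt instead of all of them.
# The last token has attended to the whole prompt, so it is a common choice.
layer_idx = 16                                # a middle layer of the 32-layer model
hidden = outputs.hidden_states[layer_idx]     # (batch, num_tokens, 4096)
activation = hidden[0, -1, :]                 # last token -> one 4096-dim vector
print(activation.shape)                       # torch.Size([4096])
```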
To do this, we define a specific activation collection method. This gives us multiple ways to collect activations corresponding to different representations (a few common reduction strategies are sketched below). Ideally, we want two vectors: one representing the behavior we want to identify, and one representing its opposite. So, two sets of numbers.
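Some plausible strategies for reducing a `(num_tokens, hidden_dim)` activation matrix to a single vector. The names here are illustrative, not the library's API:

```python
import torch

def last_token(acts: torch.Tensor) -> torch.Tensor:
    """Use the final token's activation: it has seen the full context."""
    return acts[-1]

def mean_pool(acts: torch.Tensor) -> torch.Tensor:
    """Average activations over all tokens in the sequence."""
    return acts.mean(dim=0)

def first_response_token(acts: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Use the activation of the first token after the prompt."""
    return acts[prompt_len]
```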
We extract activations from a contrastive pair set. A contrastive pair consists of a question, a good response, and a bad response. Different logic can be used to transform this set of strings into the activations we build representations from. Wisent-Guard lets you specify which logic to apply via the Activation Collection Method primitive.
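A sketch of contrastive-pair collection under the same assumptions, reusing the `tokenizer` and `model` from above. `ContrastivePair` and `response_activation` are hypothetical names for illustration, not Wisent-Guard's actual API:

```python
from dataclasses import dataclass

import torch

@dataclass
class ContrastivePair:
    question: str
    good_response: str
    bad_response: str

def response_activation(question: str, response: str, layer_idx: int = 16) -> torch.Tensor:
    """Run question + response through the model and keep the last-token activation."""
    inputs = tokenizer(question + " " + response, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer_idx][0, -1, :]

pair = ContrastivePair(
    question="How do I treat a burn?",
    good_response="Cool the burn under running water and cover it loosely.",
    bad_response="Ignore it, burns heal on their own.",
)

good_vec = response_activation(pair.question, pair.good_response)
bad_vec = response_activation(pair.question, pair.bad_response)
direction = good_vec - bad_vec  # one candidate vector for the target representation
```

The difference of the two vectors is one simple way to turn a contrastive pair into a direction; the source code linked below shows the aggregation methods the library actually implements.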
For a complete understanding of how activations work in Wisent-Guard, including the full implementation of aggregation methods, similarity calculations, monitoring logic, and contrastive pair collection, explore the source code:
View activations.py on GitHub