Definitions

Transformer Basics

Model

A model is a set of weights used to generate responses. At the moment, Wisent only works with open source models. Each model has a distinct parameter size and its own special tokens that mark the boundaries of the user query and the model response.

Input

The tokens flowing into the model, including the user query, the system prompt, and any additional context. An input is a string from the user like "What is the best food?". This string is tokenized and processed by the model.

Output

The set of tokens coming out of the model. This is a function of the input and the model's weights.

Token

A small unit of text, often a whole word or part of a word, and the fundamental unit of generation for large language models. For example, "The", or "fan" from the word "fanfare".
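As a toy sketch of how a subword tokenizer might split text into vocabulary pieces. The tiny vocabulary and greedy longest-match rule here are illustrative only; real models use learned tokenizers such as BPE:

```python
# Toy greedy subword tokenizer -- an illustration only, not the BPE
# algorithm real models use. The vocabulary here is made up.
VOCAB = {"The", "fan", "fare", "best", "food", "is", "what", "?", " "}

def tokenize(text: str) -> list[str]:
    """Greedily split text into the longest matching vocabulary pieces."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest match first, fall back to a single character.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])
            i += 1
    return tokens

print(tokenize("fanfare"))  # ['fan', 'fare']
```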

Weights

The parameters of a neural network: a large set of numbers that determine how input data is transformed at each step. In most models, weights are not set directly but optimized through a process known as training.

Layer

A single processing block in a transformer model that updates the residual stream. It typically consists of an attention mechanism and an MLP (feedforward network), each wrapped in a residual connection and paired with layer normalization. In LLaMA 3.1 8B, there are 32 such layers, each refining the token representations passed through the residual stream.

Residual stream

A vector representing a token's current state as it flows through the model. It is updated as each layer adds information. For Llama 3.1 8B it is a 1 x 4096 vector per token.

Activations

All intermediate values computed during a forward pass. These include the residual stream, but also other intermediate results such as attention outputs, layer norm results, and MLP outputs.

Embeddings

A numerical vector representation of the tokens entering the model, produced by looking up each token ID in a learned embedding matrix.

Inference

The process of generating output from an already-trained model, as opposed to training.

Training

The process of finding the optimal values of the weights of a large language model, usually through self-supervised learning rather than explicit supervision or instructions.

Representation Engineering Terminology

Representation Engineering

The practice of detecting (Representation Reading) and influencing (Representation Steering) representations present at inference time.

Contrastive Pair

A pair of strings where one is a positive instance of a trait (it contains the representation we want to identify, e.g. "The capital of Japan is Paris", which is a hallucination) and the other is the matching negative instance (e.g. "The capital of Japan is Tokyo", which is the same statement without the hallucination). This contrast between positive and negative behaviour is what makes it possible to extract, for example, a vector for hallucination.
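A contrastive pair can be sketched as a simple data structure. The class and field names below are hypothetical, not Wisent's actual API:

```python
from dataclasses import dataclass

# Minimal sketch of a contrastive pair; names are illustrative only.
@dataclass
class ContrastivePair:
    positive: str  # exhibits the trait (here: a hallucination)
    negative: str  # the same content without the trait

pair = ContrastivePair(
    positive="The capital of Japan is Paris.",  # hallucinated
    negative="The capital of Japan is Tokyo.",  # truthful
)
```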

Contrastive Pair Set

A set of contrastive pairs used to extract a representation. For example, you can use a contrastive pair set with truthful and hallucinated completions to identify the representation of hallucination. In general, more contrastive pairs give better results, but beyond an initial sample of 5-20 pairs the gains are minimal, especially if the representation is not complex. Detecting a simple representation like happy / sad (which can often be deduced from a single token) may not require more than one contrastive pair. On the other hand, detecting a representation of good code, or something domain-specific to your use case, is likely to be harder and to need more pairs.

Activation Aggregation Method

The contrastive pair set is run through a particular model to produce a set of activations. We then need some logic to aggregate the information within each pair, and across all pairs, into a representation of a particular concept.
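One common aggregation choice is the difference of means. A minimal sketch, using tiny stand-in vectors instead of real 4096-dimensional residual-stream activations:

```python
# Sketch of one common aggregation method: the difference-of-means
# direction. The activations below are 4-dim stand-ins; a real run
# would collect them from a model's residual stream.
def mean(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def difference_of_means(pos_activations, neg_activations):
    """Aggregate across pairs: mean(positive) - mean(negative)."""
    mu_pos = mean(pos_activations)
    mu_neg = mean(neg_activations)
    return [p - q for p, q in zip(mu_pos, mu_neg)]

pos = [[1.0, 0.0, 2.0, 0.0], [3.0, 0.0, 2.0, 0.0]]  # positive examples
neg = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 1.0, 0.0]]  # negative examples
direction = difference_of_means(pos, neg)  # [2.0, 0.0, 1.0, 0.0]
```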

Representation

A high-level concept embedded within the weights of a neural network. The exact definition can be difficult to pin down: a representation can be very broad, like hallucination or good coding ability, or quite narrow, like knowledge of a particular historical fact or the ability to perform a specific task. Representations are acquired during training through a process known as representation learning. Representation engineering, by contrast, focuses on observing and changing representations at inference time.

Control vector

A vector added to the activations at a particular layer (or otherwise used to adjust them) at inference time, influencing the activations and hence the tokens being generated.
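A minimal sketch of what applying a control vector can look like, assuming simple additive steering on a stand-in hidden vector:

```python
# Sketch: additive steering at one layer. `hidden` stands in for a
# token's residual-stream vector; `alpha` scales steering strength.
def apply_control_vector(hidden, control, alpha=1.0):
    return [h + alpha * c for h, c in zip(hidden, control)]

hidden = [0.5, -1.0, 2.0]
control = [1.0, 0.0, -1.0]
steered = apply_control_vector(hidden, control, alpha=2.0)  # [2.5, -1.0, 0.0]
```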

Steering

Influencing the activations of a particular model so that it generates tokens that are better aligned with the user's preferences.

Steering Method

The method by which control vectors are applied to the model to influence token generation through steering. Different methods may apply steering only on certain tokens, only when some condition holds (for example, when a hallucination is detected), or with varying strength.
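A sketch of one such conditional steering method, steering only when a crude detector fires. The dot-product detector here is a stand-in for a trained classifier:

```python
# Sketch of conditional steering: adjust a token's residual stream only
# when a detector score crosses a threshold. The dot-product detector
# is a stand-in, not a trained classifier.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def conditional_steer(hidden, control, threshold=0.0, alpha=1.0):
    score = dot(hidden, control)  # crude signal that the trait is present
    if score > threshold:         # steer away from the trait direction
        return [h - alpha * c for h, c in zip(hidden, control)]
    return hidden                 # leave other tokens untouched

steered = conditional_steer([1.0, 1.0], [1.0, 0.0])  # [0.0, 1.0]: detected, steered
```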

Classifier

A function determining whether a representation (e.g. of hallucination or harmfulness) is present in a particular residual stream.
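Such a classifier is often just a linear probe over the residual stream. A hedged sketch with made-up weights; in practice the weights would be fit on activations collected from a contrastive pair set:

```python
import math

# Sketch: a linear probe over a residual-stream vector. The weights
# here are illustrative; in practice they are learned from activations.
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def detect(hidden, weights, bias=0.0, threshold=0.5):
    """Return True if the representation is judged present."""
    score = sigmoid(sum(h * w for h, w in zip(hidden, weights)) + bias)
    return score > threshold

print(detect([2.0, 0.0], [1.0, 0.0]))  # True: sigmoid(2.0) > 0.5
```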