Activation Collection Method

The Activation Collection Method is the process of collecting activations within and across contrastive pairs in order to identify representations of a target behaviour.

Each contrastive pair consists of a question, a good response, and a bad response. The activation collection method defines how, for a given model and layer, we extract a vector of positive and negative behaviour from these pairs. Prompt construction strategies specify how a pair is turned into a prompt that reflects the thinking of the model. Token targeting strategies control which token the activations are extracted from. A variety of methods for both purposes are available as presets, and you are welcome to design additional prompt construction strategies and token targeting strategies.
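As a rough illustration of the idea (not the wisent-guard implementation; the function name and plain-list representation are hypothetical, and the real code operates on model tensors), a behaviour direction can be derived from collected activations as the difference between the mean positive and mean negative activation:

```python
def behaviour_direction(pos_activations, neg_activations):
    """Illustrative sketch: element-wise mean(positive) - mean(negative).

    pos_activations / neg_activations: lists of activation vectors, one per
    contrastive pair, all of the same hidden dimension.
    """
    dim = len(pos_activations[0])
    pos_mean = [sum(vec[i] for vec in pos_activations) / len(pos_activations)
                for i in range(dim)]
    neg_mean = [sum(vec[i] for vec in neg_activations) / len(neg_activations)
                for i in range(dim)]
    # The resulting vector points from "negative behaviour" toward
    # "positive behaviour" in activation space.
    return [p - n for p, n in zip(pos_mean, neg_mean)]
```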

When the default settings are used (the multiple_choice prompt strategy with the choice_token targeting strategy), the system extracts activations from the choice tokens "A" or "B". For example, given a prompt like "Which is better: What is 2+2? A. 5 B. 4", the system creates two versions: the positive prompt ends with "B" and activations are extracted from the "B" token, while the negative prompt ends with "A" and activations are extracted from the "A" token. The system searches backwards from the end of the sequence to find these choice tokens; if it cannot find them, it falls back to the last token in the sequence.
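The backward search with its last-token fallback can be sketched as follows (a minimal illustration over decoded token strings; the function name is hypothetical, not the wisent-guard API):

```python
def find_choice_token_index(token_strings, choice):
    """Search backwards for the target choice token ("A" or "B").

    Falls back to the last token position if the choice token
    is not found anywhere in the sequence.
    """
    for i in range(len(token_strings) - 1, -1, -1):
        if token_strings[i].strip() == choice:
            return i
    return len(token_strings) - 1  # fallback: last token in the sequence
```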

Prompt Construction Strategies

Multiple Choice (Default)

Format: "Which is better: {question} A. {bad_response} B. {good_response}"

Actual Example:

python -m wisent_guard.cli tasks mmlu --model meta-llama/Llama-3.1-8B-Instruct --layer 15 --limit 1 --prompt-construction-strategy multiple_choice --verbose
Generated Prompt:
Which is better: Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.
A. 0
B. 4
C. 2
D. 6
Answer: A. This is incorrect B. 1

Role Playing

Format: "Behave like a person that would answer {question} with {response}"

Actual Example:

python -m wisent_guard.cli tasks truthfulqa_mc1 --model meta-llama/Llama-3.1-8B-Instruct --layer -1 --limit 1 --prompt-construction-strategy role_playing --verbose
Generated Prompt:
Behave like a person that would answer Q: What is the smallest country in the world that is at least one square mile in area?
A: with Nauru is the smallest country in the world that is at least one square mile in area.

Direct Completion

Format: "{question}"

Actual Example:

python -m wisent_guard.cli tasks mmlu --model meta-llama/Llama-3.1-8B-Instruct --layer 15 --limit 1 --prompt-construction-strategy direct_completion --verbose
Generated Positive Prompt:
Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.
A. 0
B. 4
C. 2
D. 6
Answer: 1


Generated Negative Prompt:
Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.
A. 0
B. 4
C. 2
D. 6
Answer: This is incorrect

Instruction Following

Format: "[INST] {question} [/INST]"

Actual Example:

python -m wisent_guard.cli tasks mmlu --model meta-llama/Llama-3.1-8B-Instruct --layer 15 --limit 1 --prompt-construction-strategy instruction_following --verbose
Generated Positive Prompt:
[INST] Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.
A. 0
B. 4
C. 2
D. 6
Answer: [/INST] 1


Generated Negative Prompt:
[INST] Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.
A. 0
B. 4
C. 2
D. 6
Answer: [/INST] This is incorrect
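Taken together, the four prompt construction strategies above amount to simple string templates. A minimal sketch (the dictionary and helper below are illustrative, not the wisent-guard internals):

```python
# Templates mirror the "Format:" lines documented for each strategy.
TEMPLATES = {
    "multiple_choice": "Which is better: {question} A. {bad_response} B. {good_response}",
    "role_playing": "Behave like a person that would answer {question} with {response}",
    "direct_completion": "{question}",
    "instruction_following": "[INST] {question} [/INST]",
}

def build_prompt(strategy, **fields):
    """Fill the chosen strategy's template with the pair's fields."""
    return TEMPLATES[strategy].format(**fields)
```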

Token Targeting Strategies

Choice Token (Default)

Targets specific choice tokens like "A" or "B"

Actual Example:

python -m wisent_guard.cli tasks mmlu --model meta-llama/Llama-3.1-8B-Instruct --layer 15 --limit 1 --token-targeting-strategy choice_token --verbose
Activation Extraction: Searches backwards for "B" token (correct choice), extracts activations from that token position.

Last Token

Extracts from the final token in the sequence

Actual Example:

python -m wisent_guard.cli tasks mmlu --model meta-llama/Llama-3.1-8B-Instruct --layer 15 --limit 1 --token-targeting-strategy last_token --verbose
Activation Extraction: Always uses the last token in the sequence (position -1).

First Token

Extracts from the first token in the sequence

Actual Example:

python -m wisent_guard.cli tasks mmlu --model meta-llama/Llama-3.1-8B-Instruct --layer 15 --limit 1 --token-targeting-strategy first_token --verbose
Activation Extraction: Always uses the first token in the sequence (position 0).

Mean Pooling

Averages activations across all tokens

Actual Example:

python -m wisent_guard.cli tasks mmlu --model meta-llama/Llama-3.1-8B-Instruct --layer 15 --limit 1 --token-targeting-strategy mean_pooling --verbose
Activation Extraction: Computes mean of hidden states across all token positions in the sequence.

Max Pooling

Takes maximum activation values across tokens

Actual Example:

python -m wisent_guard.cli tasks mmlu --model meta-llama/Llama-3.1-8B-Instruct --layer 15 --limit 1 --token-targeting-strategy max_pooling --verbose
Activation Extraction: Computes element-wise maximum of hidden states across all token positions.
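The four positional/pooling strategies above reduce a sequence of per-token hidden states to a single vector. A minimal sketch using plain lists (the real implementation works on model tensors; the function name is illustrative):

```python
def pool(hidden_states, strategy):
    """Reduce per-token hidden states (seq_len x hidden_dim) to one vector."""
    if strategy == "first_token":
        return hidden_states[0]            # position 0
    if strategy == "last_token":
        return hidden_states[-1]           # position -1
    dim = len(hidden_states[0])
    n = len(hidden_states)
    if strategy == "mean_pooling":
        # Mean across all token positions, per hidden dimension.
        return [sum(tok[i] for tok in hidden_states) / n for i in range(dim)]
    if strategy == "max_pooling":
        # Element-wise maximum across all token positions.
        return [max(tok[i] for tok in hidden_states) for i in range(dim)]
    raise ValueError(f"unknown strategy: {strategy}")
```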

Continuation Token

Targets specific continuation tokens like "I" or "The"

Actual Example:

python -m wisent_guard.cli tasks truthfulqa_mc1 --model meta-llama/Llama-3.1-8B-Instruct --layer -1 --limit 1 --token-targeting-strategy continuation_token --verbose
Activation Extraction: Searches forwards for "I" token (continuation token), extracts activations from that position.
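Unlike the choice_token strategy, this search runs forwards from the start of the sequence. A minimal sketch (the function name, the default token set, and the last-token fallback are assumptions for illustration; the document does not specify the fallback behaviour):

```python
def find_continuation_token_index(token_strings, continuation_tokens=("I", "The")):
    """Search forwards for the first continuation token in the sequence."""
    for i, tok in enumerate(token_strings):
        if tok.strip() in continuation_tokens:
            return i
    # Assumed fallback: last token, mirroring the choice_token strategy.
    return len(token_strings) - 1
```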

Usage Examples

# Multiple choice with choice token targeting (default)
python -m wisent_guard.cli tasks mmlu \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --layer 15 --limit 10

# Role-playing with continuation token targeting
python -m wisent_guard.cli tasks truthfulqa \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --layer -1 --limit 10 \
  --prompt-construction-strategy role_playing \
  --token-targeting-strategy continuation_token

# Direct completion with last token targeting
python -m wisent_guard.cli tasks mmlu \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --layer 15 --limit 10 \
  --prompt-construction-strategy direct_completion \
  --token-targeting-strategy last_token

Implementation Details

For a complete understanding of how activation collection methods work in Wisent-Guard, including the full implementation of collection strategies, statistical methods, and optimization techniques, explore the source code:

View activation_collection_method.py on GitHub