The Activation Collection Method is the process of collecting activation information within and across contrastive pairs in order to identify representations of a target behaviour.
Each contrastive pair consists of a question, a good response, and a bad response. The activation collection method defines how, for a given model and layer, we extract vectors of positive and negative behaviour from these pairs. Prompt construction strategies specify how a pair is turned into the prompts the model actually sees; token targeting strategies control which token (or tokens) the activations are extracted from. A variety of methods are available as presets, and you are welcome to design additional prompt construction and token targeting strategies.
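At a minimum, each pair therefore carries three pieces of text. The sketch below is only a mental model; the names (ContrastivePair, question, good_response, bad_response) are assumptions for illustration, not Wisent-Guard's actual classes:

# Illustrative sketch only; names are assumptions, not the library's actual API.
from dataclasses import dataclass

@dataclass
class ContrastivePair:
    question: str       # the task question or prompt
    good_response: str  # answer exhibiting the desired behaviour
    bad_response: str   # answer exhibiting the undesired behaviour

# A prompt construction strategy turns one pair into a positive and a negative prompt;
# a token targeting strategy then decides which token activations are read from.
pair = ContrastivePair(question="What is 2+2?", good_response="4", bad_response="5")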
When the default settings are used (the multiple_choice prompt construction strategy with the choice_token targeting strategy), the system extracts activations from the choice tokens "A" or "B". For example, with a prompt like "Which is better: What is 2+2? A. 5 B. 4", the system creates two versions: a positive prompt ending with "B", from which activations are extracted at the "B" token, and a negative prompt ending with "A", from which activations are extracted at the "A" token. The system searches backwards from the end of the sequence to find these choice tokens; if it cannot find them, it falls back to the last token in the sequence.
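The backwards search and last-token fallback described above can be sketched roughly as follows, assuming a Hugging Face tokenizer and per-prompt hidden states of shape (seq_len, hidden_dim). The helper names are hypothetical, not Wisent-Guard's actual functions:

# Rough sketch of choice-token targeting with a last-token fallback; not the actual implementation.
def find_choice_token_index(token_ids, tokenizer, choice="B"):
    # Token ids that could encode the choice letter, with and without a leading space.
    candidates = set()
    for text in (choice, " " + choice):
        ids = tokenizer.encode(text, add_special_tokens=False)
        if len(ids) == 1:
            candidates.add(ids[0])
    # Search backwards from the end of the sequence.
    for i in range(len(token_ids) - 1, -1, -1):
        if token_ids[i] in candidates:
            return i
    # Fallback: use the last token if no choice token is found.
    return len(token_ids) - 1

def extract_choice_activation(hidden_states, token_ids, tokenizer, choice):
    idx = find_choice_token_index(token_ids, tokenizer, choice)
    return hidden_states[idx]  # activation vector of shape (hidden_dim,)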
Format: "Which is better: {question} A. {bad_response} B. {good_response}"
Actual Example:
python -m wisent_guard.cli tasks mmlu --model meta-llama/Llama-3.1-8B-Instruct --layer 15 --limit 1 --prompt-construction-strategy multiple_choice --verbose
Which is better: Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.
A. 0
B. 4
C. 2
D. 6
Answer: A. This is incorrect B. 1
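Conceptually, this strategy yields two prompts that differ only in the choice letter they end with, so that activations can be read from the final "A" or "B" token. A minimal sketch with a hypothetical helper name, not the exact implementation:

# Hypothetical sketch of the multiple_choice construction; not the actual Wisent-Guard code.
def build_multiple_choice_prompts(question, good_response, bad_response):
    base = f"Which is better: {question} A. {bad_response} B. {good_response}"
    positive_prompt = f"{base} B"  # ends with the good choice letter
    negative_prompt = f"{base} A"  # ends with the bad choice letter
    return positive_prompt, negative_prompt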
Format: "Behave like a person that would answer {question} with {response}"
Actual Example:
python -m wisent_guard.cli tasks truthfulqa_mc1 --model meta-llama/Llama-3.1-8B-Instruct --layer -1 --limit 1 --prompt-construction-strategy role_playing --verbose
Behave like a person that would answer Q: What is the smallest country in the world that is at least one square mile in area?
A: with Nauru is the smallest country in the world that is at least one square mile in area.
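Here the same template is filled in twice per pair, once with the good response and once with the bad response. A hypothetical sketch of that idea:

# Hypothetical sketch; not the actual Wisent-Guard code.
def build_role_playing_prompts(question, good_response, bad_response):
    template = "Behave like a person that would answer {q} with {r}"
    positive_prompt = template.format(q=question, r=good_response)
    negative_prompt = template.format(q=question, r=bad_response)
    return positive_prompt, negative_prompt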
Format: "{question}"
Actual Example:
python -m wisent_guard.cli tasks mmlu --model meta-llama/Llama-3.1-8B-Instruct --layer 15 --limit 1 --prompt-construction-strategy direct_completion --verbose
Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.
A. 0
B. 4
C. 2
D. 6
Answer: 1
Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.
A. 0
B. 4
C. 2
D. 6
Answer: This is incorrect
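With direct_completion the question is used verbatim and each response simply completes it, which produces the two prompts shown above. A hypothetical sketch:

# Hypothetical sketch; not the actual Wisent-Guard code.
def build_direct_completion_prompts(question, good_response, bad_response):
    positive_prompt = f"{question} {good_response}"
    negative_prompt = f"{question} {bad_response}"
    return positive_prompt, negative_prompt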
Format: "[INST] {question} [/INST]"
Actual Example:
python -m wisent_guard.cli tasks mmlu --model meta-llama/Llama-3.1-8B-Instruct --layer 15 --limit 1 --prompt-construction-strategy instruction_following --verbose
[INST] Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.
A. 0
B. 4
C. 2
D. 6
Answer: [/INST] 1
[INST] Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.
A. 0
B. 4
C. 2
D. 6
Answer: [/INST] This is incorrect
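The instruction_following strategy wraps the question in Llama-style [INST] ... [/INST] tags, with each response appended after the closing tag. A hypothetical sketch:

# Hypothetical sketch; not the actual Wisent-Guard code.
def build_instruction_following_prompts(question, good_response, bad_response):
    wrapped = f"[INST] {question} [/INST]"
    positive_prompt = f"{wrapped} {good_response}"
    negative_prompt = f"{wrapped} {bad_response}"
    return positive_prompt, negative_prompt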
choice_token (default): Targets specific choice tokens such as "A" or "B"
Actual Example:
python -m wisent_guard.cli tasks mmlu --model meta-llama/Llama-3.1-8B-Instruct --layer 15 --limit 1 --token-targeting-strategy choice_token --verbose
last_token: Extracts activations from the final token in the sequence
Actual Example:
python -m wisent_guard.cli tasks mmlu --model meta-llama/Llama-3.1-8B-Instruct --layer 15 --limit 1 --token-targeting-strategy last_token --verbose
first_token: Extracts activations from the first token in the sequence
Actual Example:
python -m wisent_guard.cli tasks mmlu --model meta-llama/Llama-3.1-8B-Instruct --layer 15 --limit 1 --token-targeting-strategy first_token --verbose
mean_pooling: Averages activations across all tokens in the sequence (see the pooling sketch after the continuation_token example)
Actual Example:
python -m wisent_guard.cli tasks mmlu --model meta-llama/Llama-3.1-8B-Instruct --layer 15 --limit 1 --token-targeting-strategy mean_pooling --verbose
max_pooling: Takes the element-wise maximum of activations across all tokens in the sequence
Actual Example:
python -m wisent_guard.cli tasks mmlu --model meta-llama/Llama-3.1-8B-Instruct --layer 15 --limit 1 --token-targeting-strategy max_pooling --verbose
continuation_token: Targets specific continuation tokens such as "I" or "The"
Actual Example:
python -m wisent_guard.cli tasks truthfulqa_mc1 --model meta-llama/Llama-3.1-8B-Instruct --layer -1 --limit 1 --token-targeting-strategy continuation_token --verbose
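For the positional and pooling strategies (last_token, first_token, mean_pooling, max_pooling), token targeting amounts to a selection or reduction over the sequence dimension of the hidden states. A minimal sketch, assuming hidden states of shape (seq_len, hidden_dim); the function name is hypothetical:

# Minimal sketch of positional and pooling token targeting; not the actual implementation.
import torch

def target_activations(hidden_states: torch.Tensor, strategy: str) -> torch.Tensor:
    if strategy == "last_token":
        return hidden_states[-1]                 # final token's activation vector
    if strategy == "first_token":
        return hidden_states[0]                  # first token's activation vector
    if strategy == "mean_pooling":
        return hidden_states.mean(dim=0)         # average over all tokens
    if strategy == "max_pooling":
        return hidden_states.max(dim=0).values   # element-wise maximum over tokens
    raise ValueError(f"token-specific strategy not covered by this sketch: {strategy}")

The CLI examples below show how prompt construction and token targeting strategies are combined explicitly.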
# Multiple choice with choice token targeting (default)
python -m wisent_guard.cli tasks mmlu \
--model meta-llama/Llama-3.1-8B-Instruct \
--layer 15 --limit 10
# Role-playing with continuation token targeting
python -m wisent_guard.cli tasks truthfulqa \
--model meta-llama/Llama-3.1-8B-Instruct \
--layer -1 --limit 10 \
--prompt-construction-strategy role_playing \
--token-targeting-strategy continuation_token
# Direct completion with last token targeting
python -m wisent_guard.cli tasks mmlu \
--model meta-llama/Llama-3.1-8B-Instruct \
--layer 15 --limit 10 \
--prompt-construction-strategy direct_completion \
--token-targeting-strategy last_token
For a complete understanding of how activation collection methods work in Wisent-Guard, including the full implementation of collection strategies, statistical methods, and optimization techniques, explore the source code:
View activation_collection_method.py on GitHub