Contrastive Pair

A Contrastive Pair is a set of two strings where one corresponds to a positive instance of a trait and the other to a negative one.

A contrastive pair consists of a question and two responses that should have opposite meanings. We use them to extract a particular trait from the internal thinking of the model (a model is a set of weights used to generate responses; at the moment, Wisent only works with open source models). Ideally, the two responses should be the same sentence, minimally changed to reflect the presence of a particular trait and the lack thereof. For example, if you want to identify hallucinations, the contrastive pair should include a hallucinated response and a truthful response to the same question. A good question would be: What is the capital of Japan? The answers would be "The capital of Japan is Paris" as the hallucinated response (the positive instance of the trait) and "The capital of Japan is Tokyo" as the truthful response (the negative instance).

Contrastive pairs can be used to illustrate a variety of traits, not just hallucination. We can use them to create representations of traits such as being good at coding, or being British or French. A representation is a high-level concept embedded within the weights of the neural network. The exact definition is hard to pin down: it can be broad, like hallucination or good coding ability, or narrow, like knowledge of a particular historical fact or the ability to perform a particular task. Representations are acquired in training through a process known as representation learning, whereas representation engineering focuses on observing and changing representations at runtime. Through this, you can create and audit models for particular capabilities or personal traits. Just use the contrastive pairs corresponding to these traits and see for yourself!
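One common way to turn contrastive pairs into a trait direction is to take the difference between the mean activation of the positive responses and the mean activation of the negative responses, which is the idea behind CAA-style steering. The sketch below assumes the activations have already been extracted from some layer and uses made-up 3-dimensional vectors; the function names are hypothetical and this is not Wisent-Guard's implementation.

```python
from statistics import fmean


def mean_difference(positive_acts, negative_acts):
    """Return mean(positive) - mean(negative), computed per dimension."""
    dims = len(positive_acts[0])
    return [
        fmean(row[d] for row in positive_acts)
        - fmean(row[d] for row in negative_acts)
        for d in range(dims)
    ]


def projection(activation, direction):
    """Dot product of one activation with the trait direction."""
    return sum(a * b for a, b in zip(activation, direction))


# Toy activations, invented for illustration only.
positive_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]  # responses showing the trait
negative_acts = [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]]  # responses lacking the trait

direction = mean_difference(positive_acts, negative_acts)
print(direction)  # [2.0, 0.0, 0.0]
```

A larger projection onto `direction` suggests that an activation looks more like the positive examples, which is why minimally different pairs help: anything the two responses share cancels out in the subtraction.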

To save you time defining those contrastive pairs yourself, Wisent-Guard supports over 6000 benchmarks that you can use to automatically create contrastive pairs. Every benchmark that is available in lm-harness can be used with Wisent-Guard.

For example, if you want to focus on hallucinations, you can pass truthfulqa to the "tasks" argument of our CLI command instead of manually defining pairs yourself. Since benchmarks are created to capture a particular trait (like how good the model is at coding) and contain information representative of this trait, they are a perfect source of contrastive pairs. You can use them automatically within Wisent-Guard to create robust contrastive pair sets for speaking particular languages, mathematical ability, hallucinations or coding.
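To make the benchmark-to-pairs idea concrete, here is a minimal sketch of how a question-answering record with known correct and incorrect answers could be turned into contrastive pairs for the hallucination trait. The record layout (`question`, `correct_answers`, `incorrect_answers`) is an assumption modeled on TruthfulQA-style data, the record itself is a toy example, and `record_to_pairs` is a hypothetical helper rather than part of Wisent-Guard.

```python
def record_to_pairs(record):
    """Pair each incorrect answer with the first correct answer.

    For the hallucination trait, the incorrect answer is the positive
    instance and the correct answer is the negative instance.
    """
    truthful = record["correct_answers"][0]
    return [
        {"question": record["question"], "positive": wrong, "negative": truthful}
        for wrong in record["incorrect_answers"]
    ]


# Toy record, assumed layout; not loaded from a real benchmark.
record = {
    "question": "What is the capital of Japan?",
    "correct_answers": ["The capital of Japan is Tokyo"],
    "incorrect_answers": [
        "The capital of Japan is Paris",
        "The capital of Japan is Kyoto",
    ],
}

benchmark_pairs = record_to_pairs(record)
for pair in benchmark_pairs:
    print(pair["positive"], "|", pair["negative"])
```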

Hallucination

Question:

"What is the capital of Japan?"

Positive Instance

"The capital of Japan is Paris"

Negative Instance

"The capital of Japan is Tokyo"

Poor Coding

Question:

"Write a Python function to reverse a string"

Positive Instance

"def reverse(s): return s.backwards()"

Negative Instance

"def reverse(s): return s[::-1]"

Rudeness

Question:

"How would you ask someone to close the door?"

Positive Instance

"Close the door!"

Negative Instance

"Could you please close the door?"

Uncertainty

Question:

"What is the weather like tomorrow?"

Positive Instance

"I'm not sure about tomorrow's weather"

Negative Instance

"Tomorrow will definitely be sunny"
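The four examples above can be written down as plain data. The sketch below uses a hypothetical `ContrastivePair` dataclass to show the shape of a pair (trait, question, positive instance, negative instance); the class and field names are assumptions for illustration and do not mirror the classes inside Wisent-Guard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContrastivePair:
    """Hypothetical container for one contrastive pair."""
    trait: str
    question: str
    positive: str  # response that exhibits the trait
    negative: str  # response that lacks the trait


PAIRS = [
    ContrastivePair("hallucination", "What is the capital of Japan?",
                    "The capital of Japan is Paris",
                    "The capital of Japan is Tokyo"),
    ContrastivePair("poor coding", "Write a Python function to reverse a string",
                    "def reverse(s): return s.backwards()",
                    "def reverse(s): return s[::-1]"),
    ContrastivePair("rudeness", "How would you ask someone to close the door?",
                    "Close the door!",
                    "Could you please close the door?"),
    ContrastivePair("uncertainty", "What is the weather like tomorrow?",
                    "I'm not sure about tomorrow's weather",
                    "Tomorrow will definitely be sunny"),
]


def pairs_for(trait):
    """Return every pair that targets the given trait."""
    return [pair for pair in PAIRS if pair.trait == trait]


print(pairs_for("rudeness")[0].positive)  # Close the door!
```

Note that "positive" always means "shows the trait", even when the trait is undesirable, as with hallucination or rudeness.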

You don't have to define contrastive pairs yourself. Wisent-Guard supports creating contrastive pairs from a simple description. Instead of identifying the right benchmark or writing the pairs by hand, you can use the built-in tools to create varied synthetic pairs and save them for reuse, or dynamically generate a contrastive pair set from a description for your steering and classifier training.

When generating pairs synthetically, it is important to specify the number of contrastive pairs to generate and the similarity threshold they should maintain. We want the data to be diverse, so Wisent-Guard eliminates generated pairs that are too similar or repetitive. You don't have to use the synthetically generated pairs immediately: you can save them for later, or load pairs you generated earlier.
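As an illustration of what similarity-based filtering can look like, the sketch below greedily keeps a candidate only if it is less similar than the threshold to everything already kept. It uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure; the metric Wisent-Guard actually uses is not described here, so treat both the metric and the helper names as assumptions.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Character-level similarity in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def filter_similar(candidates, threshold=0.9, num_pairs=None):
    """Keep candidates that stay below the threshold against all kept items."""
    kept = []
    for text in candidates:
        if num_pairs is not None and len(kept) >= num_pairs:
            break  # enough diverse items collected
        if all(similarity(text, other) < threshold for other in kept):
            kept.append(text)
    return kept


candidates = [
    "How do I reset my password?",
    "How do I reset my password??",  # near-duplicate of the first item
    "What is the boiling point of water?",
]

diverse = filter_similar(candidates, threshold=0.9)
print(diverse)
```

A lower threshold removes more candidates and yields a more diverse but smaller set, so it may take more generation attempts to reach the requested number of pairs.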

Examples

Basic Generation

python -m wisent_guard generate-pairs \
    --trait "refuse harmful requests politely" \
    --output pairs.json

Advanced Generation

python -m wisent_guard generate-pairs \
    --trait "The model should be helpful while maintaining appropriate boundaries" \
    --num-pairs 50 \
    --output boundary_pairs.json \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --device cuda \
    --similarity-threshold 0.9 \
    --verbose

Full Pipeline (Generate + Train + Test)

python -m wisent_guard synthetic \
    --trait "be truthful and avoid misinformation" \
    --num-pairs 40 \
    --save-pairs truthful_pairs.json \
    --layer 15 \
    --steering-method CAA \
    --steering-strength 1.5 \
    --test-questions 10 \
    --output ./results

Loading Generated Pairs

from wisent_guard.core.contrastive_pairs import load_synthetic_pairs_cli

# Load pairs for further processing
pair_set = load_synthetic_pairs_cli("pairs.json")
print(f"Loaded {len(pair_set.pairs)} contrastive pairs")

# Access individual pairs
for pair in pair_set.pairs:
    print(f"Scenario: {pair.scenario}")
    print(f"Positive: {pair.positive_response.text}")
    print(f"Negative: {pair.negative_response.text}")

Implementation Details

For a complete understanding of how contrastive pairs work in Wisent-Guard, including the full implementation of pair creation, validation, and processing logic, explore the source code: