train-unified-goodness

Train a single "goodness" steering vector from pooled multi-task benchmark data. Pooling contrastive examples from several benchmarks yields one unified vector intended to encode good behavior that is common across tasks and domains.

Basic Usage
python -m wisent train-unified-goodness --model MODEL [OPTIONS]

Examples

Basic Training
python -m wisent train-unified-goodness \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --layer 15 \
  --output ./vectors/unified_goodness.pt

With Specific Benchmarks
python -m wisent train-unified-goodness \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --layer 15 \
  --benchmarks truthfulqa_mc1 mmlu hellaswag \
  --samples-per-benchmark 200 \
  --output ./vectors/unified_goodness.pt

With Weighted Benchmarks
python -m wisent train-unified-goodness \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --layer 15 \
  --benchmark-weights truthfulqa_mc1:2.0 mmlu:1.0 hellaswag:1.5 \
  --output ./vectors/unified_goodness.pt

Full Training with All Options
python -m wisent train-unified-goodness \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --layer 15 \
  --samples-per-benchmark 500 \
  --steering-method CAA \
  --test-after-training \
  --test-prompts 10 \
  --output ./vectors/unified_goodness.pt \
  --verbose

Arguments

Model & Training

Argument           Default    Description
--model            required   Model name or path
--layer            15         Layer for activation extraction
--steering-method  CAA        Steering method to use
--device           auto       Device to run on

Benchmark Selection

Argument                 Default        Description
--benchmarks             all available  Specific benchmarks to use
--samples-per-benchmark  100            Number of samples per benchmark
--benchmark-weights      equal          Custom weights for benchmarks (format: name:weight)
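
To make the name:weight format concrete, here is a minimal Python sketch of how such values could be parsed and turned into relative shares of the pooled data. The helper names parse_benchmark_weights and normalized_shares are hypothetical and not part of wisent, and this page does not state how the tool applies weights internally; the normalization is only one plausible reading.

```python
def parse_benchmark_weights(specs):
    """Parse strings like 'mmlu:1.0' into a {name: weight} dict.

    Hypothetical helper for illustration; not wisent's own parser.
    """
    weights = {}
    for spec in specs:
        # Split on the last colon so the weight is always the final field.
        name, sep, value = spec.rpartition(":")
        if not sep or not name:
            raise ValueError(f"expected name:weight, got {spec!r}")
        weights[name] = float(value)
    return weights


def normalized_shares(weights):
    """Scale weights so they sum to 1 (assumed interpretation of the weights)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return {name: w / total for name, w in weights.items()}


weights = parse_benchmark_weights(
    ["truthfulqa_mc1:2.0", "mmlu:1.0", "hellaswag:1.5"]
)
shares = normalized_shares(weights)
for name, share in shares.items():
    print(f"{name}: weight={weights[name]} share={share:.3f}")
```

Under this reading, and with an equal number of samples per benchmark, truthfulqa_mc1 would contribute twice as much to the pooled direction as mmlu.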

Output & Testing

Argument               Description
--output               Output path for the unified vector
--test-after-training  Run test prompts after training
--test-prompts         Number of test prompts to run
--verbose              Enable verbose output

How It Works

  1. Collect Data - Gathers contrastive pairs from multiple benchmarks
  2. Pool Data - Combines all pairs into a unified training set
  3. Weight Samples - Optionally applies custom weights to different benchmarks
  4. Extract Activations - Gets model activations for all pairs
  5. Compute Direction - Calculates the unified steering direction
  6. Test Vector - Optionally evaluates on test prompts
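
Steps 2 through 5 can be sketched with NumPy. This is an illustration under stated assumptions, not wisent's implementation: it assumes activations have already been extracted as arrays, and it computes the direction as a weighted mean of the differences between positive and negative activations, which is the usual way a CAA-style vector is described. The function name unified_direction and the input layout are hypothetical.

```python
import numpy as np


def unified_direction(pairs_by_benchmark, weights=None):
    """Weighted mean-difference direction over pooled contrastive pairs.

    pairs_by_benchmark: {name: (pos, neg)}, each array of shape (n_pairs, hidden_dim)
    weights: optional {name: weight}; benchmarks not listed default to 1.0
    """
    weights = weights or {}
    diffs, sample_weights = [], []
    for name, (pos, neg) in pairs_by_benchmark.items():
        pos = np.asarray(pos, dtype=float)
        neg = np.asarray(neg, dtype=float)
        # Each pair contributes one difference vector (positive minus negative).
        diffs.append(pos - neg)
        # Every sample inherits the weight of the benchmark it came from.
        sample_weights.append(np.full(len(pos), weights.get(name, 1.0)))
    pooled = np.concatenate(diffs, axis=0)        # pool all benchmarks
    pooled_w = np.concatenate(sample_weights)
    return np.average(pooled, axis=0, weights=pooled_w)


# Toy activations with hidden_dim = 2, standing in for real model activations.
pairs = {
    "a": (np.array([[1.0, 0.0], [1.0, 0.0]]), np.zeros((2, 2))),
    "b": (np.array([[0.0, 1.0]]), np.zeros((1, 2))),
}
direction = unified_direction(pairs, weights={"b": 2.0})
print(direction)  # direction is [0.5, 0.5]
```

In the toy example, benchmark "b" has half as many pairs as "a" but twice the weight, so both benchmarks pull equally on the resulting direction.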

Use Cases

  • General Improvement - Create a single vector that improves model behavior across tasks
  • Multi-Domain Steering - Steer behavior consistently across different domains
  • Transfer Learning - Use unified vector as starting point for specialized steering
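
For orientation, steering with a trained vector is commonly described as adding a scaled copy of the vector to the hidden states at the chosen layer. The sketch below shows that arithmetic with NumPy only. It is an assumption about the general technique rather than a description of how wisent applies vectors, and apply_steering and its strength parameter are hypothetical names.

```python
import numpy as np


def apply_steering(hidden, vector, strength=1.0):
    """Shift hidden states along a unit-normalized steering direction.

    hidden: array of shape (seq_len, hidden_dim)
    vector: array of shape (hidden_dim,)
    """
    hidden = np.asarray(hidden, dtype=float)
    vector = np.asarray(vector, dtype=float)
    norm = np.linalg.norm(vector)
    if norm == 0:
        # A zero vector carries no direction, so leave the states unchanged.
        return hidden.copy()
    # Broadcasting adds the same shift to every token position.
    return hidden + strength * (vector / norm)


hidden = np.zeros((3, 2))            # toy hidden states for three tokens
vector = np.array([3.0, 4.0])        # toy steering vector with norm 5
steered = apply_steering(hidden, vector, strength=2.0)
print(steered)  # every row is [1.2, 1.6]
```

Normalizing the vector first makes strength directly comparable across vectors of different magnitude; whether a given tool normalizes is a design choice to check against its own documentation.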
