verify-steering

Verify that a steered model's activations align with the intended steering direction at inference time. This command compares activations from the base model and the steered model to confirm that steering has the intended effect.

Basic Usage
python -m wisent verify-steering MODEL_PATH [OPTIONS]

Examples

Basic Verification
python -m wisent verify-steering ./steered_model/

With Custom Prompts
python -m wisent verify-steering ./steered_model/ \
  --prompts "Is the Earth flat?" "What is 2+2?" \
  --verbose

From Prompts File
python -m wisent verify-steering ./steered_model/ \
  --prompts-file ./test_prompts.json \
  --output ./verification_results.json

Check Specific Layers
python -m wisent verify-steering ./steered_model/ \
  --layers "10,15,20" \
  --alignment-threshold 0.5 \
  --verbose

TITAN Model with Gate Check
python -m wisent verify-steering ./titan_steered_model/ \
  --check-gate \
  --check-intensity \
  --verbose

Arguments

Required Arguments

Argument      Description
model_path    Path to the steered model (TITAN, PULSE, or CAA)

Model Options

Argument        Default    Description
--base-model    auto       Path or name of base model for comparison (auto-detected from config)
--device        auto       Device to use: auto, cuda, mps, cpu

Prompt Options

Argument          Description
--prompts         Test prompts to verify steering on (space-separated)
--prompts-file    JSON file containing test prompts
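
The layout of the prompts file is not described here. The sketch below assumes the simplest plausible shape, a JSON array of prompt strings; that schema and the helper name write_prompts_file are assumptions for illustration, not documented behavior.

```python
import json
from pathlib import Path


def write_prompts_file(prompts, path):
    """Write prompts as a JSON array of strings.

    ASSUMPTION: the schema expected by --prompts-file is not documented
    in this page; a flat list of strings is a guess.
    """
    path = Path(path)
    path.write_text(json.dumps(list(prompts), indent=2), encoding="utf-8")
    return path


if __name__ == "__main__":
    # Same prompts as the "With Custom Prompts" example.
    write_prompts_file(["Is the Earth flat?", "What is 2+2?"], "test_prompts.json")
```

If the command rejects a file in this shape, the expected schema is different and the file should be adapted accordingly.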

Verification Options

Argument                 Default    Description
--layers                 all        Comma-separated layer indices to check
--alignment-threshold    0.3        Minimum alignment score to consider steering successful
--check-gate             True       Check gate network discrimination (TITAN/PULSE)
--check-intensity        True       Check intensity network predictions (TITAN)

Output Options

Argument     Description
--output     Output file for detailed results (JSON format)
--verbose    Print detailed per-layer diagnostics

Understanding Results

Alignment Scores

  • > 0.5 - Steering is working correctly
  • 0 to 0.5 - Steering is weak and may need adjustment
  • < 0 - Steering is going in the WRONG direction (critical issue)
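
To make the bands above concrete, here is a minimal sketch that treats the alignment score as the cosine similarity between the activation shift (steered minus base) and the steering direction. That definition is an assumption; the exact metric the command uses is not stated here, and the function names are hypothetical.

```python
import math


def cosine_alignment(base_act, steered_act, direction):
    """Cosine similarity between (steered - base) and the steering direction.

    ASSUMPTION: the real alignment metric may be computed differently.
    """
    shift = [s - b for s, b in zip(steered_act, base_act)]
    dot = sum(x * d for x, d in zip(shift, direction))
    norm = math.sqrt(sum(x * x for x in shift)) * math.sqrt(sum(d * d for d in direction))
    return dot / norm if norm else 0.0


def interpret_alignment(score):
    """Map a score onto the three bands listed above."""
    if score < 0:
        return "wrong direction"
    if score <= 0.5:
        return "weak"
    return "working"


if __name__ == "__main__":
    score = cosine_alignment([0.0, 0.0], [1.0, 0.2], [1.0, 0.0])
    print(round(score, 3), interpret_alignment(score))
```

Note that a score can clear the default --alignment-threshold of 0.3 and still fall in the weak band, so a passing run is worth reading alongside the per-layer output from --verbose.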

Gate Network (TITAN/PULSE)

  • Gate ~1.0 for harmful prompts - Gate correctly activating
  • Gate ~0.0 for safe prompts - Gate correctly deactivating
  • Gate ~0.5 for all prompts - Gate is not discriminating (needs more training data)
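
One way to summarize these cases is the gap between the mean gate value on harmful prompts and on safe prompts. The sketch below is illustrative only: the function names and the min_gap cutoff of 0.5 are assumptions, not values taken from the tool.

```python
from statistics import mean


def gate_gap(harmful_gates, safe_gates):
    """Mean gate on harmful prompts minus mean gate on safe prompts."""
    return mean(harmful_gates) - mean(safe_gates)


def gate_discriminates(harmful_gates, safe_gates, min_gap=0.5):
    """True when the gate separates the two prompt groups.

    ASSUMPTION: min_gap=0.5 is a hypothetical cutoff for illustration.
    """
    return gate_gap(harmful_gates, safe_gates) >= min_gap


if __name__ == "__main__":
    # Well-separated gate: close to 1.0 on harmful, close to 0.0 on safe.
    print(gate_discriminates([0.9, 1.0], [0.1, 0.0]))
    # Flat gate: about 0.5 everywhere, so there is no separation.
    print(gate_discriminates([0.5, 0.5], [0.5, 0.5]))
```

A gap near 1.0 corresponds to the first two cases above, and a gap near 0 corresponds to the non-discriminating case.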

Exit Codes

  • 0 - Verification passed (alignment >= threshold)
  • 1 - Verification failed (alignment < threshold or error)
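
Because the result is reported through the exit code, the command can gate a script or CI job. The following sketch wraps the documented invocation with Python's subprocess module; the helper names build_command and steering_verified are hypothetical, and only flags listed on this page are used.

```python
import subprocess
import sys


def build_command(model_path, threshold=None, layers=None):
    """Assemble the verify-steering invocation as an argument list."""
    cmd = [sys.executable, "-m", "wisent", "verify-steering", str(model_path)]
    if threshold is not None:
        cmd += ["--alignment-threshold", str(threshold)]
    if layers:
        # --layers takes comma-separated indices, e.g. "10,15,20".
        cmd += ["--layers", ",".join(str(i) for i in layers)]
    return cmd


def steering_verified(model_path, **kwargs):
    """Run the command; exit code 0 means verification passed."""
    result = subprocess.run(build_command(model_path, **kwargs))
    return result.returncode == 0


if __name__ == "__main__":
    ok = steering_verified("./steered_model/", threshold=0.5, layers=[10, 15, 20])
    sys.exit(0 if ok else 1)
```

Exit code 1 covers both a low alignment score and a runtime error, so a failing run needs the console output or the --output file to tell the two apart.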

Supported Steering Types

  • TITAN - Full adaptive steering with gate and intensity networks
  • PULSE - Conditional gating with sensor layer detection
  • CAA - Simple contrastive activation addition

The steering type is automatically detected from config files in the model directory (titan_config.json, pulse_config.json, or caa_config.json).
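
The detection step can be pictured as a lookup over the three config file names mentioned above. This is a sketch rather than the tool's actual logic: the function name and the precedence order (TITAN, then PULSE, then CAA) when several files are present are assumptions.

```python
from pathlib import Path

# File names come from the text above; the order is an assumed precedence.
CONFIG_FILES = {
    "titan_config.json": "TITAN",
    "pulse_config.json": "PULSE",
    "caa_config.json": "CAA",
}


def detect_steering_type(model_dir):
    """Return the steering type implied by the config files, or None."""
    model_dir = Path(model_dir)
    for filename, steering_type in CONFIG_FILES.items():
        if (model_dir / filename).is_file():
            return steering_type
    return None


if __name__ == "__main__":
    print(detect_steering_type("./steered_model/"))
```

A return value of None here would mean the directory contains none of the three config files.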
