verify-steering

Verify that a steered model's activations align with the intended steering direction at inference time. This command compares activations from the base model and the steered model to confirm that steering has the intended effect.
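To make the idea of "alignment" concrete, here is a minimal sketch. It assumes the score is the cosine similarity between the activation shift (steered minus base) and the steering direction. That definition is an assumption made for illustration; this page does not say how wisent computes the score, and the function names below are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def alignment_score(base_act, steered_act, direction):
    """ASSUMED definition: cosine between the activation shift and the steering direction."""
    shift = [s - b for s, b in zip(steered_act, base_act)]
    return cosine(shift, direction)
```

Under this assumed definition, a positive score means the steered activations moved along the intended direction, and a negative score means they moved against it.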

Basic Usage

python -m wisent verify-steering MODEL_PATH [OPTIONS]

Examples

Basic Verification

python -m wisent verify-steering ./steered_model/

With Custom Prompts

python -m wisent verify-steering ./steered_model/ \
  --prompts "Is the Earth flat?" "What is 2+2?" \
  --verbose

From Prompts File

python -m wisent verify-steering ./steered_model/ \
  --prompts-file ./test_prompts.json \
  --output ./verification_results.json

Check Specific Layers

python -m wisent verify-steering ./steered_model/ \
  --layers "10,15,20" \
  --alignment-threshold 0.5 \
  --verbose

TITAN Model with Gate Check

python -m wisent verify-steering ./titan_steered_model/ \
  --check-gate \
  --check-intensity \
  --verbose

Arguments

Required Arguments

Argument	Description
model_path	Path to the steered model (TITAN, PULSE, or CAA)

Model Options

Argument	Default	Description
--base-model	auto	Path or name of base model for comparison (auto-detected from config)
--device	auto	Device to use: auto, cuda, mps, cpu

Prompt Options

Argument	Description
--prompts	Test prompts to verify steering on (space-separated)
--prompts-file	JSON file containing test prompts

Verification Options

Argument	Default	Description
--layers	all	Comma-separated layer indices to check
--alignment-threshold	0.3	Minimum alignment score to consider steering successful
--check-gate	True	Check gate network discrimination (TITAN/PULSE)
--check-intensity	True	Check intensity network predictions (TITAN)
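As an illustration of the --layers value format, the sketch below turns a string such as "10,15,20" into a list of integer indices and treats the default "all" as "no filter". The helper name and the use of None to mean "all layers" are hypothetical and are not taken from the wisent source.

```python
def parse_layers(value):
    """Parse a comma-separated layer string; return None when every layer is requested."""
    if value.strip().lower() == "all":
        return None  # hypothetical convention: None means "check all layers"
    # Ignore empty pieces so a trailing comma does not raise an error.
    return [int(part) for part in value.split(",") if part.strip()]
```

For example, parse_layers("10,15,20") gives the three indices used in the "Check Specific Layers" example.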

Output Options

Argument	Description
--output	Output file for detailed results (JSON format)
--verbose	Print detailed per-layer diagnostics

Understanding Results

Alignment Scores

> 0.5 - Steering working correctly
0 to 0.5 - Steering is weak and may need adjustment (at the default --alignment-threshold of 0.3, scores from 0.3 upward still pass)
< 0 - Steering is going in the WRONG direction (critical issue)
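The bands above and the pass/fail threshold are separate checks, which the following sketch makes explicit. The function names are hypothetical. The band boundaries follow the list above, and the pass rule follows the documented exit-code condition (alignment >= threshold) with the default threshold of 0.3.

```python
def classify_alignment(score):
    """Map an alignment score to the interpretation bands listed above."""
    if score > 0.5:
        return "working"
    if score >= 0:
        return "weak"
    return "wrong direction"

def passes(score, threshold=0.3):
    """Pass/fail rule as documented: alignment >= threshold."""
    return score >= threshold
```

A score of 0.4, for instance, is classified as weak but still passes at the default threshold; raising --alignment-threshold to 0.5 would make it fail.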

Gate Network (TITAN/PULSE)

Gate ~1.0 for harmful prompts - Gate correctly activating
Gate ~0.0 for safe prompts - Gate correctly deactivating
Gate ~0.5 for all - Gate not discriminating (needs more training data)
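One way to read gate values is to compare their averages over harmful and safe prompts. The sketch below is illustrative only: the function name and the returned keys are hypothetical, and the gate values are assumed to have been collected already (for example, from the JSON output, whose schema is not described here).

```python
def gate_summary(harmful_gates, safe_gates):
    """Summarize gate discrimination as the gap between the two group means."""
    harmful_mean = sum(harmful_gates) / len(harmful_gates)
    safe_mean = sum(safe_gates) / len(safe_gates)
    return {
        "harmful_mean": harmful_mean,
        "safe_mean": safe_mean,
        # A gap near 1.0 means the gate separates the two groups well;
        # a gap near 0.0 means it does not discriminate.
        "gap": harmful_mean - safe_mean,
    }
```

A gate that sits near 0.5 for every prompt yields a gap close to zero, which corresponds to the "not discriminating" case above.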

Exit Codes

0 - Verification passed (alignment >= threshold)
1 - Verification failed (alignment < threshold or error)
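Because the result is reported through the exit code, the command can be used as a gate in scripts or CI. The sketch below is a hypothetical wrapper, not part of wisent. It uses only flags documented on this page and assumes wisent is installed in the same Python environment.

```python
import subprocess
import sys

def build_command(model_path, threshold=None, output=None):
    """Assemble the verify-steering command line from documented flags."""
    cmd = [sys.executable, "-m", "wisent", "verify-steering", model_path]
    if threshold is not None:
        cmd += ["--alignment-threshold", str(threshold)]
    if output is not None:
        cmd += ["--output", output]
    return cmd

def verification_passed(model_path, **kwargs):
    """Run the command; exit code 0 means verification passed."""
    result = subprocess.run(build_command(model_path, **kwargs))
    return result.returncode == 0
```

Note that exit code 1 covers both a low alignment score and a runtime error, so a caller that needs to tell the two apart would have to inspect the command's output as well.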

Supported Steering Types

TITAN - Full adaptive steering with gate and intensity networks
PULSE - Conditional gating with sensor layer detection
CAA - Simple contrastive activation addition

The steering type is automatically detected from config files in the model directory (titan_config.json, pulse_config.json, or caa_config.json).
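The detection described above can be pictured as a lookup of known config filenames in the model directory. This is a minimal sketch, not wisent's actual implementation: the function name is hypothetical, and the precedence used when more than one config file is present is an assumption.

```python
from pathlib import Path

# Config filenames as listed above, mapped to their steering type.
CONFIG_FILES = {
    "titan_config.json": "TITAN",
    "pulse_config.json": "PULSE",
    "caa_config.json": "CAA",
}

def detect_steering_type(model_dir):
    """Return the steering type whose config file exists, or None if none is found."""
    root = Path(model_dir)
    # ASSUMPTION: check in TITAN, PULSE, CAA order if several files are present.
    for filename, kind in CONFIG_FILES.items():
        if (root / filename).is_file():
            return kind
    return None
```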

Related Commands

create-steering-vector - Create steering objects
optimize-steering - Optimize steering parameters
Weight Modification - Bake steering into model weights
