PRISM

PRISM is a gradient-optimized, multi-directional steering method. It learns several independent directions per layer and steers by projecting representations onto the manifold those directions span.

How PRISM Works

PRISM uses gradient-based optimization with representational independence constraints to discover multiple directions per layer. This is in sharp contrast to CAA, which derives a single direction by computing a mean activation difference. Where CAA computes one direction toward the target behavior, PRISM identifies k directions and combines them into a single steering manifold, keeping the directions from overlapping and limiting side effects.
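To make the contrast concrete, the sketch below computes a CAA-style direction as a normalized difference of mean activations. It uses synthetic NumPy arrays in place of real model activations; the function name `caa_direction` and the array shapes are illustrative assumptions, not part of the Wisent API.

```python
import numpy as np

def caa_direction(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Single CAA-style direction: mean(positive) - mean(negative), L2-normalized."""
    diff = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

rng = np.random.default_rng(0)
hidden = 16

# Synthetic stand-ins for layer activations (assumption: shape is (n_examples, hidden)).
shift = np.zeros(hidden)
shift[0] = 3.0
pos = rng.normal(size=(32, hidden)) + shift
neg = rng.normal(size=(32, hidden))

v = caa_direction(pos, neg)  # one vector of shape (hidden,)
print(v.shape)
```

A multi-directional method would instead hold a matrix of shape (k, hidden), one row per direction, and optimize the rows jointly.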

This method draws on recent research, in particular "The Geometry of Refusal in Large Language Models" by Wollschläger et al.

PRISM optimizes a composite objective with four components: a separation loss (ensuring positives and negatives are clearly distinguished), an independence loss (preventing interference between directions), similarity bounds (keeping directions distinct without requiring them to be orthogonal), and a retain loss (minimizing changes to negative examples).
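The four terms described above can be sketched as a single weighted sum. This is only one plausible formulation, written with NumPy on synthetic data: the exact form of each term, the function name `composite_loss`, and the way the terms are combined are assumptions made for illustration. Only the default weights and cosine bounds are taken from the parameter list on this page.

```python
import numpy as np

def composite_loss(dirs, pos, neg, retain_w=0.1, indep_w=0.05,
                   min_cos=0.3, max_cos=0.95):
    """Illustrative four-term objective; the terms used in Wisent may differ."""
    d = dirs / np.linalg.norm(dirs, axis=1, keepdims=True)  # unit rows, (k, hidden)
    k = d.shape[0]
    off = ~np.eye(k, dtype=bool)           # mask selecting pairs i != j
    cos = d @ d.T                          # pairwise cosine similarities
    pos_proj, neg_proj = pos @ d.T, neg @ d.T

    # 1. Separation: positives should project higher than negatives.
    separation = float(neg_proj.mean() - pos_proj.mean())

    # 2. Independence (assumed form): removing direction j changes the
    #    projection on direction i by (x . d_j) * cos_ij; penalize that leak.
    all_proj = np.vstack([pos_proj, neg_proj])          # (n, k)
    leak = all_proj[:, None, :] * cos[None, :, :]       # (n, k, k)
    independence = float(np.mean(leak[:, off] ** 2)) if k > 1 else 0.0

    # 3. Similarity bounds: hinge penalty for cosines outside [min_cos, max_cos].
    pair_cos = cos[off]
    bounds = float(np.sum(np.clip(min_cos - pair_cos, 0, None)
                          + np.clip(pair_cos - max_cos, 0, None)))

    # 4. Retain (assumed form): negatives should have small projections,
    #    so removing the directions barely changes them.
    retain = float(np.mean(neg_proj ** 2))

    total = separation + indep_w * independence + bounds + retain_w * retain
    return {"separation": separation, "independence": independence,
            "bounds": bounds, "retain": retain, "total": total}

rng = np.random.default_rng(0)
shift = np.zeros(8)
shift[0] = 3.0
pos = rng.normal(size=(32, 8)) + shift
neg = rng.normal(size=(32, 8))

# Two directions that both point toward the positives, with cosine 0.6.
dirs_good = np.array([[1.0, 0.5, 0, 0, 0, 0, 0, 0],
                      [1.0, -0.5, 0, 0, 0, 0, 0, 0]])
good = composite_loss(dirs_good, pos, neg)
bad = composite_loss(-dirs_good, pos, neg)   # same geometry, wrong sign
print(good["total"] < bad["total"])
```

In this formulation the independence term pulls pairwise cosines toward zero while the lower similarity bound pushes them up, so the two weights set the balance between them.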

When to Use PRISM

  • Multi-directional Behaviors: When the behavior is mediated by multiple directions (like refusal)
  • Manifold Structure: When you expect a cone or manifold geometry rather than a single linear direction
  • Ablation Tasks: When you want to remove rather than add a behavior
  • Minimal Side Effects: When preserving model behavior on non-target inputs is important
  • CAA Insufficient: When a single direction doesn't capture the full behavior

Figure: PRISM works with multi-layer separation. When multiple layers (L1, L2, L3) each show separation, PRISM combines the vectors Σvᵢ across layers.

Figure: PRISM fails when no layer shows separation. In that case, conflicting vᵢ directions cancel out.

CLI Examples

Basic PRISM training with 3 directions:
python -m wisent.cli tasks refusal_pairs.json --from-json --steering-mode --steering-method PRISM --layer 15 --prism-num-directions 3 --save-steering-vector refusal_prism.pt

PRISM with auto direction count:
python -m wisent.cli tasks honesty_pairs.json --from-json --steering-mode --steering-method PRISM --layer 15 --prism-num-directions auto --prism-variance-threshold 0.85 --save-steering-vector honesty_prism.pt

PRISM with custom loss weights:
python -m wisent.cli tasks safety_pairs.json --from-json --steering-mode --steering-method PRISM --layer 18 --prism-num-directions 5 --prism-retain-weight 0.2 --prism-independence-weight 0.1 --prism-optimization-steps 150 --save-steering-vector safety_prism.pt

PRISM with CAA initialization:
python -m wisent.cli tasks bias_pairs.json --from-json --steering-mode --steering-method PRISM --layer 12 --prism-num-directions 4 --prism-use-caa-init --prism-cone-constraint --save-steering-vector bias_prism.pt
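When these commands are launched from a script, it can help to assemble the argument list programmatically. The helper below is a hypothetical convenience wrapper, not part of Wisent. It only emits flags that appear in the examples on this page, and it assumes every keyword option maps to a `--prism-` flag with underscores replaced by hyphens.

```python
import shlex

def build_prism_command(pairs_file, output_file, layer, **prism_options):
    """Hypothetical helper: build the argv list for the CLI calls shown above."""
    cmd = ["python", "-m", "wisent.cli", "tasks", pairs_file, "--from-json",
           "--steering-mode", "--steering-method", "PRISM", "--layer", str(layer)]
    for name, value in prism_options.items():
        flag = "--prism-" + name.replace("_", "-")
        if value is True:
            cmd.append(flag)                  # boolean switch, no argument
        elif value is not False and value is not None:
            cmd.extend([flag, str(value)])    # flag followed by its value
    cmd.extend(["--save-steering-vector", output_file])
    return cmd

cmd = build_prism_command("refusal_pairs.json", "refusal_prism.pt",
                          layer=15, num_directions=3)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues.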

Parameters

Manifold Configuration

--prism-num-directions
Number of directions per layer (default: 3, or 'auto')
--prism-variance-threshold
Target cumulative variance for auto num_directions (default: 0.80)
--prism-max-directions
Maximum directions when using auto (default: 10)
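How 'auto' picks the direction count is not spelled out here. One plausible reading is sketched below: take the smallest k whose leading singular directions of the positive-minus-negative difference vectors reach the cumulative variance threshold, capped at the maximum. The function name `auto_num_directions`, the use of SVD, and the choice not to center the differences are all assumptions.

```python
import numpy as np

def auto_num_directions(diffs, variance_threshold=0.80, max_directions=10):
    """Smallest k whose top-k singular directions reach the variance threshold."""
    s = np.linalg.svd(diffs, compute_uv=False)      # singular values, descending
    explained = s ** 2 / np.sum(s ** 2)             # share of squared magnitude
    cumulative = np.cumsum(explained)
    # Index of the first cumulative share >= threshold, converted to a count.
    k = int(np.searchsorted(cumulative, variance_threshold)) + 1
    return min(k, max_directions, len(s))

# Toy difference vectors with three orthogonal components of decreasing strength.
diffs = np.array([[3.0, 0, 0, 0, 0], [-3.0, 0, 0, 0, 0],
                  [0, 2.0, 0, 0, 0], [0, -2.0, 0, 0, 0],
                  [0, 0, 1.0, 0, 0], [0, 0, -1.0, 0, 0]])
print(auto_num_directions(diffs))
```

Raising the threshold toward 1.0 admits weaker components, and `max_directions` keeps the count bounded on noisy data.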

Optimization Parameters

--prism-optimization-steps
Number of gradient descent steps (default: 100)
--prism-learning-rate
Learning rate for direction optimization (default: 0.01)
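As a rough illustration of what the step count and learning rate control, the sketch below runs projected gradient descent on a single unit-norm direction using only a separation term. It is a simplified stand-in on synthetic data: the real objective has more terms, and an implementation would typically rely on automatic differentiation rather than the hand-written gradient used here. The function name `optimize_direction` is hypothetical.

```python
import numpy as np

def optimize_direction(pos, neg, steps=100, learning_rate=0.01):
    """Minimize loss(v) = -(mean_gap . v) subject to ||v|| = 1."""
    mean_gap = pos.mean(axis=0) - neg.mean(axis=0)
    v = np.zeros(pos.shape[1])
    v[-1] = 1.0                          # arbitrary unit-norm starting point
    for _ in range(steps):
        grad = -mean_gap                 # analytic gradient of the loss w.r.t. v
        v = v - learning_rate * grad     # gradient descent step
        v = v / np.linalg.norm(v)        # project back onto the unit sphere
    return v

rng = np.random.default_rng(1)
shift = np.zeros(8)
shift[0] = 2.0
pos = rng.normal(size=(24, 8)) + shift
neg = rng.normal(size=(24, 8))

target = pos.mean(axis=0) - neg.mean(axis=0)
target /= np.linalg.norm(target)
v = optimize_direction(pos, neg)
print(float(v @ target))   # cosine between the learned direction and the mean gap
```

With more steps or a larger learning rate the direction moves closer to the normalized mean gap. Too few steps leave it near its starting point.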

Loss Weights

--prism-retain-weight
Weight for retain loss / side effect minimization (default: 0.1)
--prism-independence-weight
Weight for representational independence between directions (default: 0.05)
--prism-ablation-weight
Weight for ablation effectiveness loss (default: 1.0)
--prism-addition-weight
Weight for addition effectiveness loss (default: 1.0)

Constraints

--prism-use-caa-init
Initialize first direction using CAA (default: true)
--prism-cone-constraint
Constrain directions to form a polyhedral cone (default: true)
--prism-min-cosine-similarity
Minimum cosine similarity between directions (default: 0.3)
--prism-max-cosine-similarity
Maximum cosine similarity to avoid redundancy (default: 0.95)
--normalize
L2-normalize the resulting vectors (default: true)
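The cosine bounds and normalization can be checked after training with a few lines of NumPy. The sketch below assumes the directions are available as a (k, hidden) array; the function name `check_cosine_bounds` is hypothetical, and the defaults mirror the values listed above.

```python
import numpy as np

def check_cosine_bounds(directions, min_cos=0.3, max_cos=0.95):
    """L2-normalize each row, then test all pairwise cosines against the bounds."""
    unit = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    cos = unit @ unit.T
    i, j = np.triu_indices(len(unit), k=1)     # each unordered pair once
    pair_cos = cos[i, j]
    ok = bool(np.all((pair_cos >= min_cos) & (pair_cos <= max_cos)))
    return unit, pair_cos, ok

cone = np.array([[1.0, 0.5, 0.0],
                 [1.0, -0.5, 0.0],
                 [1.0, 0.0, 0.5]])
unit, pair_cos, ok = check_cosine_bounds(cone)
print(pair_cos, ok)
```

Orthogonal directions fail the lower bound and near-duplicate directions fail the upper bound.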

For the complete implementation of the PRISM steering method in Wisent, see:
