MLP steering follows the gradient of a trained classifier: the classifier separates the two classes of activations effectively, and its gradients yield linear steering vectors that are practical to apply.
The method trains a multilayer perceptron (MLP) classifier to distinguish high from low levels of activity. Hyperplane methods, by contrast, rely on linear classifiers; an MLP learns nonlinear decision boundaries and can therefore pick up finer distinctions within the range of activations.
The key insight is that even when the decision boundary is not linear, meaningful steering directions still exist. The gradient of the classifier score with respect to the activations gives the direction that most increases the score; averaging these gradients, weighted over the negative examples, gives a robust direction toward a high probability of the positive class.
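The gradient-averaging step described above can be sketched in a few lines. This is a minimal illustration, not Wisent's implementation: the two-unit tanh classifier, its hand-set weights, the uniform default weighting, and the function names `score`, `score_gradient`, and `steering_vector` are all assumptions chosen to keep the example self-contained.

```python
import math

# Hypothetical toy classifier: 2-D activations, one hidden layer of 2 tanh units.
# Weights are set by hand for illustration; a real classifier would be trained.
W1 = [[1.0, 0.0], [0.0, 1.0]]  # hidden-layer weights (hidden x input)
b1 = [0.0, 0.0]                # hidden-layer biases
w2 = [1.0, 1.0]                # output weights
b2 = 0.0                       # output bias

def score(a):
    """Classifier logit for activation vector a."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, a)) + b)
              for row, b in zip(W1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2

def score_gradient(a):
    """Gradient of the logit with respect to the input activation a."""
    grad = [0.0] * len(a)
    for row, b, w_out in zip(W1, b1, w2):
        pre = sum(w * x for w, x in zip(row, a)) + b
        d_pre = w_out * (1.0 - math.tanh(pre) ** 2)  # d logit / d pre-activation
        for j, w in enumerate(row):
            grad[j] += d_pre * w
    return grad

def steering_vector(negatives, weights=None):
    """Weighted mean of input gradients over negative activations, unit length.

    Uniform weights are an assumption; assumes the mean gradient is nonzero.
    """
    if weights is None:
        weights = [1.0] * len(negatives)
    total = sum(weights)
    dim = len(negatives[0])
    mean = [0.0] * dim
    for a, w in zip(negatives, weights):
        g = score_gradient(a)
        for j in range(dim):
            mean[j] += w * g[j] / total
    norm = math.sqrt(sum(m * m for m in mean))
    return [m / norm for m in mean]

# Usage: two negative activations; v points toward a higher classifier score.
negatives = [[-1.0, -0.5], [-0.5, -1.0]]
v = steering_vector(negatives)
```

Although the classifier itself is nonlinear, the result is a single vector, which is why it can be applied like any other linear steering direction.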
Training is regularized with layer normalization and uses GELU activations. Optimization uses AdamW with a cosine learning-rate schedule, and early stopping prevents overfitting.
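To make the schedule and stopping rule concrete, here is a small sketch of cosine learning-rate annealing and patience-based early stopping. The helper names `cosine_lr` and `EarlyStopping`, the `patience` and `min_delta` parameters, and the zero minimum learning rate are assumptions for illustration; the exact settings Wisent uses are not stated here.

```python
import math

def cosine_lr(epoch, total_epochs, base_lr, min_lr=0.0):
    """Cosine-annealed learning rate: base_lr at epoch 0, min_lr at total_epochs."""
    progress = min(epoch, total_epochs) / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

class EarlyStopping:
    """Signal a stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience    # hypothetical default, not a Wisent setting
        self.min_delta = min_delta  # minimum decrease that counts as improvement
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Usage with illustrative values: 200 epochs, base learning rate 0.0005.
schedule = [cosine_lr(e, 200, 0.0005) for e in range(201)]
```

The learning rate falls smoothly from its base value to the minimum over the run, and the stopper halts training early if validation loss stalls before the schedule finishes.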

Works: MLP learns nonlinear decision boundaries, handling cases where a⁻ is surrounded by a⁺.

Fails: When activations have no separable structure, no classifier can learn a meaningful boundary.
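The "Works" case above, where a⁻ is surrounded by a⁺, can be illustrated with toy data. In this sketch the ring-and-center layout and all variable names are invented for illustration; it shows that the difference of class means (a common linear direction) vanishes, even though a simple nonlinear feature separates the classes.

```python
import math

# Toy 2-D data: positives on a ring of radius 2, negatives clustered at the center.
positives = [[2 * math.cos(2 * math.pi * k / 8), 2 * math.sin(2 * math.pi * k / 8)]
             for k in range(8)]
negatives = [[0.1, 0.0], [-0.1, 0.0], [0.0, 0.1], [0.0, -0.1]]

def mean(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]

def radius(a):
    """Distance of a 2-D point from the origin: a simple nonlinear feature."""
    return math.hypot(a[0], a[1])

# Difference of class means: the direction a purely linear method would use.
diff = [p - n for p, n in zip(mean(positives), mean(negatives))]
diff_norm = math.hypot(diff[0], diff[1])

# The radial feature separates the classes perfectly, which is the kind of
# structure a nonlinear classifier such as an MLP can learn.
separable = max(radius(a) for a in negatives) < min(radius(a) for a in positives)
```

Here `diff_norm` is numerically zero while `separable` is true, so a linear direction carries no signal but a nonlinear boundary exists.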
python -m wisent.cli tasks honesty_pairs.json --from-json --steering-mode --steering-method MLP --layer 15 --save-steering-vector honesty_mlp.pt
python -m wisent.cli tasks safety_pairs.json --from-json --steering-mode --steering-method MLP --layer 12 --mlp-hidden-dim 512 --mlp-num-layers 3 --mlp-dropout 0.2 --save-steering-vector safety_mlp.pt
python -m wisent.cli tasks refusal_pairs.json --from-json --steering-mode --steering-method MLP --layer 18 --mlp-epochs 200 --mlp-learning-rate 0.0005 --mlp-weight-decay 0.02 --save-steering-vector refusal_mlp.pt
python -m wisent.cli tasks test_questions.json --from-json --steering-mode --steering-method MLP --layer 15 --load-steering-vector honesty_mlp.pt --steering-strength 1.5
For the complete implementation of the MLP steering method in Wisent, see: