Steering a model and identifying its representations carries a risk of degrading, or 'lobotomizing', the model. Even advanced methods do not always guide interventions accurately, and as the intervention strength increases, the model can lose coherence and produce absurd outputs. For instance, when steered strongly along a vector for Britain, the model may repeat a single token such as "mate" over and over. Poorly steered activations can also produce strings that resemble no real language, such as "ashsajdhja."

We aim to guard against nonsensical outputs caused by configuration choices, so the nonsense response detection component blocks these responses before they reach users.
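To make the idea concrete, here is a minimal sketch of what such a check could look like. It is not Wisent's implementation. The function names (`repetition_ratio`, `looks_like_gibberish`, `is_nonsense`), the thresholds, and the heuristics themselves are all assumptions chosen for illustration: one test catches a single word dominating the response, the other catches words with long consonant runs that rarely occur in real language.

```python
import re
from collections import Counter

# Hypothetical heuristics for illustration only; names and thresholds are assumptions.
WORD_RE = re.compile(r"[a-z']+")
CONSONANT_RUN_RE = re.compile(r"[bcdfghjklmnpqrstvwxz]{4,}")


def repetition_ratio(words: list[str]) -> float:
    """Share of the response taken up by its single most frequent word."""
    if not words:
        return 0.0
    most_common_count = Counter(words).most_common(1)[0][1]
    return most_common_count / len(words)


def looks_like_gibberish(word: str) -> bool:
    """Crude proxy for a non-word: four or more consonants in a row.

    Real words such as "thoughts" also match, so this is only used
    as a fraction over the whole response, never on a single word.
    """
    return CONSONANT_RUN_RE.search(word.lower()) is not None


def is_nonsense(
    text: str,
    max_repetition: float = 0.5,
    max_gibberish: float = 0.3,
    min_words_for_repetition: int = 4,
) -> bool:
    """Return True if the response should be blocked as likely nonsense."""
    words = WORD_RE.findall(text.lower())
    if not words:
        return False  # nothing to judge in this sketch

    # Repetition check, skipped for very short responses.
    if len(words) >= min_words_for_repetition:
        if repetition_ratio(words) > max_repetition:
            return True

    # Gibberish check: fraction of words that look like non-words.
    gibberish_fraction = sum(looks_like_gibberish(w) for w in words) / len(words)
    return gibberish_fraction > max_gibberish


if __name__ == "__main__":
    for response in [
        "mate mate mate mate mate mate",
        "ashsajdhja",
        "London is the capital of the United Kingdom.",
    ]:
        status = "blocked" if is_nonsense(response) else "allowed"
        print(f"{status}: {response}")
```

In this sketch the two failure modes described above (the repeated "mate" and the string "ashsajdhja") are blocked, while an ordinary sentence passes. A production detector would likely need stronger signals, such as a dictionary lookup or a language-model perplexity score, since consonant runs alone misfire on legitimate words.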

For the full implementation of nonsense response detection and blocking, see the source code.

View CLI examples on GitHub
