Nonsensical Response Blocking

Nonsensical Response Blocking detects and prevents the generation of incoherent, illogical, or meaningless responses.

As you apply steering and identify representations, the risk of lobotomising the model increases. Although our methods are state-of-the-art, they do not guarantee perfect steering every time. As the intervention size increases, the model may lose coherence and generate nonsensical responses. For example, if you steer too strongly towards the British vector, the model will output repetitive tokens associated with Britishness, such as "mate mate mate". If the activation steering is misconfigured, the resulting token bias can make the model generate tokens that normally have a low probability of occurring and that do not form human-readable words, such as "ashsajdjah".

Even if the configuration you choose produces such output, we want to protect your model from emitting it. The nonsensical response component therefore blocks these responses before they are sent to the end user.
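To make the idea concrete, here is a minimal sketch of how such a check could work. It is not the logic in stop_nonsense.py. Every name in it (`is_nonsensical`, `block_nonsense`, `FALLBACK_MESSAGE`), the thresholds, and the caller-supplied vocabulary are assumptions made for illustration. It covers the two failure modes described above: one word dominating the response, and words that do not appear in a known vocabulary.

```python
import re
from collections import Counter
from typing import Optional, Set

# Hypothetical fallback text; the real component may behave differently.
FALLBACK_MESSAGE = "Sorry, I could not produce a coherent answer."


def _words(text: str) -> list:
    """Lowercase alphabetic words, allowing one inner apostrophe."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def repetition_ratio(text: str) -> float:
    """Share of all words taken up by the single most frequent word."""
    words = _words(text)
    if not words:
        return 0.0
    return Counter(words).most_common(1)[0][1] / len(words)


def unknown_word_ratio(text: str, vocabulary: Set[str]) -> float:
    """Share of words that are missing from the supplied vocabulary."""
    words = _words(text)
    if not words:
        return 0.0
    return sum(w not in vocabulary for w in words) / len(words)


def is_nonsensical(
    text: str,
    vocabulary: Optional[Set[str]] = None,
    max_repetition: float = 0.5,  # assumed threshold
    max_unknown: float = 0.5,     # assumed threshold
    min_words: int = 3,
) -> bool:
    """Heuristic check for repetitive or unreadable output."""
    words = _words(text)
    if not words:
        return False  # nothing to judge, e.g. a purely numeric answer
    # Repetition is only meaningful once there are a few words to compare.
    if len(words) >= min_words and repetition_ratio(text) > max_repetition:
        return True
    # The vocabulary check runs only when the caller provides a word list.
    if vocabulary is not None and unknown_word_ratio(text, vocabulary) > max_unknown:
        return True
    return False


def block_nonsense(
    response: str,
    vocabulary: Optional[Set[str]] = None,
    fallback: str = FALLBACK_MESSAGE,
) -> str:
    """Return the response unchanged, or the fallback if it looks like nonsense."""
    return fallback if is_nonsensical(response, vocabulary) else response
```

With these assumed thresholds, `block_nonsense("mate mate mate")` returns the fallback message, while an ordinary sentence passes through unchanged. Word-level heuristics like these are cheap but coarse. A production check would more likely rely on a full dictionary or a language-model score, so treat the sketch only as an illustration of where the blocking step sits.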

Implementation Details

For the complete implementation of nonsensical response detection and blocking, explore the source code:

View stop_nonsense.py on GitHub