Frequently Asked Questions

Why would I use this instead of traditional guardrails?

With traditional guardrails, you have to specify filters for your particular use case and then essentially hope that the model does what you want. You cannot track what is happening inside your AI's brain. The fact that it says "I will reject the request." right now tells you nothing about how close it came to saying "Sure, here is the harmful stuff you requested." The safety that traditional safeguards provide is also brittle: if you rely on regexes, it is hard to catch out-of-distribution behaviour. How do you encode the concept of a hallucination in a traditional safeguard? Or harmfulness across languages? If you would rather not beg your LLM to behave through prompt engineering alone, consider intervening directly on the AI's brain.

Show me the results!

This approach outperforms the alternatives we have tested, reducing the hallucination rate for Llama 3.1 8B on TruthfulQA by 43 percent, and it works even better for harmfulness evaluation. You can test it yourself using the code in the evaluation folder. Just run something like this:

python evaluation/evaluate_llama_truthfulqa_classifier.py --train-classifier --classifier-path ./models/hallucination_classifier.joblib --classifier-model logistic --use-classifier

and analyze the results in the guard_results.csv file with a human evaluator. Claude as an automated evaluator does not fare well on the hallucination task, so we recommend going through the file by hand. You should get something like this: results spreadsheet. For out-of-distribution hallucinations that the classifier has not been trained on, the guard catches 12 out of 28 hallucinations, i.e. 43%. This comes at the cost of 5 false positives, so without further filtering the net impact is 7 additional correctly handled responses.
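If you prefer to script the tally rather than count by hand, a minimal sketch like the one below works. It assumes the CSV contains boolean columns for the ground-truth label and the guard's decision; the column names here are placeholders, so check them against the actual file:

import pandas as pd

# Tally guard decisions from the evaluation output. The column names
# "is_hallucination" and "guard_flagged" are assumptions -- adjust them
# to whatever guard_results.csv actually contains.
df = pd.read_csv("guard_results.csv")

caught = (df["is_hallucination"] & df["guard_flagged"]).sum()
false_positives = (~df["is_hallucination"] & df["guard_flagged"]).sum()
total = df["is_hallucination"].sum()

print(f"caught {caught}/{total} hallucinations ({caught / total:.0%}), "
      f"{false_positives} false positives, net impact {caught - false_positives}")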

How is it different from existing approaches?

Wisent-guard uses a unique approach to identifying unwelcome representations in the activation space. It differs from both circuit breakers and SAE-based mechanistic interpretability in how it balances accuracy and speed. You can read about SAE-based approaches, which achieve roughly a 6 percent hallucination reduction, here: LessWrong post.

It does not work for my use case, why?

Wisent-guard is an experimental technology. Representation engineering requires a careful choice of hyperparameters: for every model, you need to set up the right tokens, activation reading patterns, and layers to read the activations from. You can read more about it here: arxiv paper, or talk to me directly and I can help you set up the guard for your specific use case: book a call. If you are struggling with latency or compute, we can help!
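To make that setup concrete, here is a purely illustrative sketch of the kind of hyperparameters involved; the names below are hypothetical and not the actual wisent-guard API:

# Hypothetical per-model configuration; every value here needs to be tuned.
guard_config = {
    "model_name": "meta-llama/Llama-3.1-8B-Instruct",  # which model to monitor
    "monitored_layers": [15],      # which hidden layers to read activations from
    "token_position": -1,          # read the activation at the final answer token
    "detection_threshold": 0.7,    # cosine-similarity / classifier-probability cutoff
}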

Tell me in depth how it works!

Our guardrails add an additional layer on top of the safety-assurance methods you already use. We build representation-engineering powered tools that track the thoughts of your model and block the harmful ones.

It starts with activation reading: determining which activation patterns correspond to a particular harmful behaviour. Think of this as tracking the thoughts of the LLM, which tells you much more than scoring a benchmark or evaluating only the final response.

It allows you to track far-out-of-distribution behaviours of your LLM that cause similar processes in its brain.
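A minimal sketch of what activation reading looks like in practice with Hugging Face transformers; the model name and layer index are illustrative:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # illustrative choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)

inputs = tokenizer("Which one of those is better: A. good answer B. bad answer", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states holds one tensor per layer (plus the embeddings),
# each of shape (batch, sequence_length, hidden_size).
layer_activations = outputs.hidden_states[15]
last_token_activation = layer_activations[0, -1, :]  # the "thought" at the final token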

The activation fingerprinting we chose follows the CAA (Contrastive Activation Addition) method. We transform pairs of good and bad behaviour into pairs where the activation information can be read from a single token. For example, if you submit pairs like:

Good:

"I am sorry I have to decline this request."

Bad:

"Sure, here is the recipe for the bomb."

Then it is tricky to determine which token should carry the difference in harmfulness. If instead you convert it to the following:

Good:

"[instructor tag] Which one of those is better: A. good answer B. bad answer [user] A"

Bad:

"[instructor tag] Which one of those is better: A. good answer B. bad answer [user] B"

It becomes immediately clear where the activation should be read from: the difference in activations between the A and B tokens is theoretically the best place to look for activation pattern differences. This gives us a set of layers from which we can extract that difference. We can either rely on prior studies to guide the choice of layer, or perform a brute-force hyperparameter search across all layers to find the best-performing one.
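Sketched in code, the pairing and per-layer extraction look roughly like this; the template and function names are illustrative, not the actual wisent-guard API:

import torch

def make_contrastive_pair(good_answer, bad_answer):
    # Wrap both behaviours into a single prompt so the only difference is one token.
    template = ("[instructor tag] Which one of those is better: "
                f"A. {good_answer} B. {bad_answer} [user] ")
    return template + "A", template + "B"

def last_token_activations(model, tokenizer, text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # One vector per layer, taken at the final (answer) token.
    return [h[0, -1, :] for h in out.hidden_states]

def per_layer_differences(model, tokenizer, good_answer, bad_answer):
    good_prompt, bad_prompt = make_contrastive_pair(good_answer, bad_answer)
    good = last_token_activations(model, tokenizer, good_prompt)
    bad = last_token_activations(model, tokenizer, bad_prompt)
    return [g - b for g, b in zip(good, bad)]  # one candidate direction per layer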

Once we have extracted those activations, we face the first big choice of what to do with the data. One option is to compute a vector from the activation means, normalize it, and at inference time compare its cosine similarity with the activations of generated tokens. The other, which we recommend, is to train a classifier directly on the activation data and monitor activations with it. Which one to pick depends on whether you want to prioritize speed or have a particular use case where one method performs better than the other. You can also run optimization algorithms to check that everything works as you intend.
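Both options sketched with placeholder names, assuming good_acts and bad_acts are arrays of per-example activations at the chosen layer; the classifier path mirrors the --classifier-model logistic flag from the evaluation command above:

import numpy as np
from sklearn.linear_model import LogisticRegression

# good_acts / bad_acts: arrays of shape (n_examples, hidden_size)
# collected at the chosen layer, as in the extraction sketch above.

# Option 1: normalized mean-difference vector + cosine similarity at inference time.
direction = good_acts.mean(axis=0) - bad_acts.mean(axis=0)
direction /= np.linalg.norm(direction)

def cosine_score(activation):
    return float(np.dot(activation, direction) / (np.linalg.norm(activation) + 1e-8))

# Option 2: a classifier trained directly on the activations.
X = np.concatenate([good_acts, bad_acts])
y = np.concatenate([np.ones(len(good_acts)), np.zeros(len(bad_acts))])
classifier = LogisticRegression(max_iter=1000).fit(X, y)

def classifier_score(activation):
    return float(classifier.predict_proba(activation.reshape(1, -1))[0, 1])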

We then monitor the model's generations for their similarity to the harmful examples you provide, so the guard fits your specific needs.
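At inference time the same scoring can run token by token, stopping the generation when the score crosses a threshold. A sketch, reusing classifier_score from above with an illustrative layer and cutoff:

import torch

THRESHOLD = 0.7   # illustrative cutoff; tune it for your model and use case
LAYER = 15        # illustrative layer choice

def generate_with_guard(model, tokenizer, prompt, max_new_tokens=128):
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        with torch.no_grad():
            out = model(input_ids, output_hidden_states=True)
        activation = out.hidden_states[LAYER][0, -1, :].numpy()
        if classifier_score(activation) > THRESHOLD:
            return tokenizer.decode(input_ids[0]), "blocked"
        next_id = out.logits[0, -1].argmax().view(1, 1)
        input_ids = torch.cat([input_ids, next_id], dim=-1)
    return tokenizer.decode(input_ids[0]), "ok"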

Still have questions?

We're here to help! Schedule a call with our team or reach out through our community channels.