This page covers your most common questions about Wisent technology.
Traditional guardrails require you to specify filters for your particular use case, essentially hoping the AI does exactly as intended. You cannot see what happens inside the model. When it replies "Reject this request", you do not know whether it came close to saying "Here is the content you asked for, even though I deem it harmful". Traditional safety protections are also fragile: regex detection tends to miss anything outside the expected distribution, and it is unclear how notions such as hallucination, or harmfulness across different languages, fit into these mechanisms at all. If you would rather intervene at the level of the AI's brain itself than engineer prompts, Wisent may be worth considering.
This method beats every alternative we tested, cutting hallucination rates by 43% on Llama 3.1 8B as measured on TruthfulQA. The results are even stronger for harmfulness evaluation. Feel free to verify this yourself with the code in the evaluation folder. Here is an example command:
python -m wisent.cli tasks truthfulqa --model meta-llama/Llama-3.1-8B-Instruct --layer 15 --classifier-type logistic --save-classifier ./models/hallucination_classifier.pt --verbose
...and analyze the results in the guard_results.csv file with a human evaluator. A Claude evaluator does not perform well on the hallucination task, so we recommend going through the results by hand. You should get something like this: results spreadsheet. For out-of-distribution hallucinations that the classifier has not been trained on, the guard catches 12 out of 28 hallucinations, i.e. 43%. This comes at a cost of 5 false positives, so without further detection the net impact is 7 additional correct responses.
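The arithmetic behind those numbers can be reproduced in a few lines (the variable names below are illustrative, not part of the Wisent codebase):

```python
# Reproduce the reported guard metrics on out-of-distribution hallucinations.
caught = 12          # hallucinations flagged by the guard
total = 28           # hallucinations present in the evaluation set
false_positives = 5  # correct responses wrongly flagged

catch_rate = caught / total                 # 12 / 28 ~= 0.4286
net_improvement = caught - false_positives  # flagged hallucinations minus collateral blocks

print(f"catch rate: {catch_rate:.0%}")        # -> catch rate: 43%
print(f"net improvement: {net_improvement}")  # -> net improvement: 7
```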
Wisent uses a unique approach to identifying unwelcome representations in the activation space, distinct from both circuit breakers and SAE-based mechanistic interpretability, in order to balance accuracy and speed. You can read about SAE-based approaches, which achieve roughly a 6% hallucination reduction, here: LessWrong post.
Wisent is an experimental technology. Representation engineering requires a careful choice of hyperparameters: for every model, you need to set up the right tokens, activation reading patterns, and layers to read activations from. You can read more about it here: arxiv paper, or talk to me directly and I can help you set up Wisent for your specific use case: book a call. If you are struggling with latency or compute, we can help!
Our guardrails provide another level of safety assurance alongside the methods you typically employ; we develop tools powered by representation engineering for tracking thoughts inside models and blocking the harmful ones.
It begins with activation readings that identify the activation patterns corresponding to a specific harmful behavior; think of these as monitoring the LLM's thought process. This is preferable to relying on a benchmark score or evaluating only the final output.
This lets you monitor unusual behaviors in your LLM that trigger related patterns in its internal processing.
Our method for activation fingerprinting follows that of CAA: we convert pairs exhibiting positive and negative behavior into pairs from which activation information can be extracted through a single token. For instance, suppose you submit pairings such as...
With the raw pairs it is difficult to decide which token corresponds to the differing levels of harm; once each pair is converted so that the contrast is carried by a single token, it quickly becomes evident which activations to read. The difference in activations between the A and B sides of a pair should highlight the distinction in activation patterns, which lets us identify the specific layers where that difference appears. We can either rely on results from other research to select the layer that provides the relevant information, or conduct an exhaustive search over layers and choose the highest performer.
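The exhaustive layer search can be sketched as follows. This is a minimal illustration, not Wisent's actual implementation: the function name, array shapes, and the norm-of-mean-difference scoring rule are assumptions for the sake of the example.

```python
import numpy as np

def pick_layer(acts_pos, acts_neg):
    """Pick the layer whose mean A-vs-B activation difference is largest.

    acts_pos / acts_neg: arrays of shape (n_layers, n_pairs, hidden_dim)
    holding the single-token activations for the positive (A) and
    negative (B) side of each contrastive pair.
    """
    diffs = acts_pos - acts_neg                 # per-pair difference, per layer
    mean_diff = diffs.mean(axis=1)              # (n_layers, hidden_dim)
    scores = np.linalg.norm(mean_diff, axis=1)  # how separable each layer looks
    return int(scores.argmax())

# Toy example: only layer 1 carries a consistent offset between A and B.
rng = np.random.default_rng(0)
pos = rng.normal(size=(3, 16, 8))
neg = pos.copy()
neg[1] -= 2.0  # inject a large, consistent difference at layer 1
print(pick_layer(pos, neg))  # -> 1
```

Averaging the per-pair differences before taking the norm rewards layers where the difference points in a consistent direction, rather than layers that are merely noisy.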
Once these activations are extracted, the first important decision is how to use them. One option is to compute a vector from the average activation difference, normalize it, and compare cosine similarity against token activations at inference time. Another is to train a classifier directly on the activation data and use it to monitor activations. The choice between these alternatives depends on whether you prioritize speed or the specific cases where one method outperforms the other. Optimization algorithms can be used to ensure all features are aligned correctly.
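The first option, a normalized mean-difference vector scored by cosine similarity, can be sketched like this. Function names and thresholds are illustrative assumptions, not Wisent's API:

```python
import numpy as np

def build_direction(acts_pos, acts_neg):
    """Average activation difference between paired examples, normalized."""
    v = (acts_pos - acts_neg).mean(axis=0)  # (hidden_dim,)
    return v / np.linalg.norm(v)

def score(token_activation, direction):
    """Cosine similarity between one token's activation and the direction."""
    a = token_activation
    return float(a @ direction / (np.linalg.norm(a) * np.linalg.norm(direction)))

# Toy data: the two behaviors are offset in opposite directions.
rng = np.random.default_rng(1)
pos = rng.normal(size=(32, 8)) + 1.5  # activations on the flagged side
neg = rng.normal(size=(32, 8)) - 1.5
d = build_direction(pos, neg)

# At inference time, a high score would suggest the flagged behavior.
print(score(pos.mean(axis=0), d) > score(neg.mean(axis=0), d))  # -> True
```

A trained classifier (e.g. the logistic classifier selected by --classifier-type in the command above) trades some of this speed for flexibility, since it can learn decision boundaries that a single direction cannot capture.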
We can also track successive iterations of our model against harmful examples that we receive from you, so that the guard more closely matches your requirements.
Schedule a call with our team or reach out through our community channels.