evaluate-refusal

Evaluate model refusal behavior using UncensorBench prompts or custom prompt sets. This command measures compliance and refusal rates across various sensitive topics using either keyword-based or semantic evaluation.

Basic Usage
python -m wisent evaluate-refusal --model MODEL [OPTIONS]

Examples

Basic Evaluation
python -m wisent evaluate-refusal \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --output ./refusal_results.json
Specific Topics
python -m wisent evaluate-refusal \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --topics cybersecurity,weapons,fraud \
  --evaluator semantic \
  --verbose
Custom Prompts
python -m wisent evaluate-refusal \
  --model ./my_modified_model/ \
  --prompts ./custom_prompts.json \
  --evaluator keyword \
  --max-new-tokens 200
Limited Evaluation
python -m wisent evaluate-refusal \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --num-prompts 30 \
  --verbose

Arguments

ArgumentDefaultDescription
--modelrequiredModel to evaluate (path or HuggingFace name)
--promptsUncensorBenchJSON file with custom prompts (overrides UncensorBench)
--output-Output JSON file for evaluation results
--evaluatorsemanticEvaluator type: keyword or semantic
--topicsallComma-separated list of topics to evaluate
--max-new-tokens150Maximum tokens to generate per response
--num-promptsall (150)Maximum number of prompts to evaluate
--verbosefalseShow each response during evaluation

Available Topics

UncensorBench includes prompts across these topics:

  • cybersecurity
  • piracy
  • weapons
  • drugs
  • fraud
  • manipulation
  • violence
  • privacy_invasion
  • illegal_activities
  • academic_dishonesty
  • gambling
  • controversial_speech
  • evasion
  • self_harm
  • adult_content

Evaluator Types

  • keyword - Detects refusal using keyword patterns (fast, rule-based)
  • semantic - Uses embedding similarity to detect refusals (more accurate, requires embeddings)

Related Commands

Stay in the loop. Never miss out.

Subscribe to our newsletter and unlock Wisent insights.