Benchmark Your Prompts Against Known Hallucinations
Run any LLM prompt against our curated dataset of hallucination triggers. Get a reliability score, catch failures before production, and ship AI features with confidence.
500+
Hallucination Test Cases
5
LLM Providers Supported
Real-time
Reliability Scoring
Simple Pricing
Pro
$25
per month
- ✓ Unlimited prompt benchmarks
- ✓ 500+ curated hallucination test cases
- ✓ GPT-4, Claude, Gemini, Mistral & more
- ✓ Reliability score & detailed analytics
- ✓ Export results as CSV or JSON
- ✓ Priority email support
Cancel anytime. No contracts.
Frequently Asked Questions
What is a hallucination test case?
A hallucination test case is a prompt known to cause LLMs to generate false, fabricated, or confidently wrong information. Our dataset covers factual errors, fake citations, date confusion, and more.
Which LLM providers are supported?
We support OpenAI (GPT-4, GPT-3.5), Anthropic (Claude 3), Google (Gemini Pro), Mistral, and Cohere. You bring your own API keys, and we never store them.
How is the reliability score calculated?
Each response is evaluated against expected outputs using semantic similarity and factual checks. Scores range from 0–100, with breakdowns by category so you know exactly where your prompt fails.
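To make the idea concrete, here is a minimal sketch of how per-case results could roll up into a 0–100 score with a category breakdown. It is an illustration only, not the scoring code behind the product: the `CaseResult` fields, the `reliability_scores` function, the category labels, and the equal weighting of similarity and factual checks are all assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CaseResult:
    category: str      # hypothetical label, e.g. "fake citations"
    similarity: float  # semantic similarity to the expected output, in [0, 1]
    facts_ok: bool     # whether the factual check passed


def reliability_scores(results, similarity_weight=0.5):
    """Return (overall, per_category) scores on a 0-100 scale.

    Assumption: each case scores a weighted mix of its similarity and its
    factual check. The actual weighting is not specified on this page.
    """
    if not results:
        raise ValueError("no results to score")

    by_category = defaultdict(list)
    for r in results:
        case = (similarity_weight * r.similarity
                + (1 - similarity_weight) * float(r.facts_ok))
        by_category[r.category].append(100 * case)

    # Average within each category, then across all cases for the overall score.
    per_category = {c: sum(v) / len(v) for c, v in by_category.items()}
    all_scores = [s for v in by_category.values() for s in v]
    overall = sum(all_scores) / len(all_scores)
    return overall, per_category


if __name__ == "__main__":
    demo = [
        CaseResult("factual errors", 1.0, True),
        CaseResult("fake citations", 0.5, False),
        CaseResult("fake citations", 1.0, True),
    ]
    overall, breakdown = reliability_scores(demo)
    print(round(overall, 1), breakdown)
```

In this sketch the overall score averages over cases rather than over categories, so a category with many test cases carries more weight. Averaging the category scores instead would be an equally reasonable choice.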