
HalBench ranks frontier models on sycophancy and hallucination resistance

A new open benchmark tests how readily Claude, Grok, GPT, and Gemini comply with false premises, revealing significant gaps in resistance to social pressure and fabrication.


A researcher has released HalBench, an open benchmark designed to measure how readily large language models agree with false premises and hallucinate supporting content under social pressure. The benchmark tested 3,200 false-premise prompts across four frontier models—Claude Sonnet 4.6, Grok 4.3, GP...


Method & sources
Source type: Primary publication (lab/vendor blog), with our analysis and implications
Source link: r/localllama
Published: UTC
Byline: By the gotcontext.ai team (editorial standards)
Corrections: corrections@gotcontext.ai