Compression Benchmark Methodology

1. Why we built this

Every major LLM benchmark (MMLU, HumanEval, GPQA) tests models on raw text. None of them test models on compressed text. That gap matters to us specifically: our product compresses context before it reaches the model, so “does quality survive 5× compression?” is our most important product question.

The compression-friendliness wedge is real. Different models tolerate compressed input differently. A model that scores 92% on HumanEval from raw text might score 85% from 5×-compressed text, while a cheaper model might score 88% on raw text but only 72% from compressed text. The crossover point (where compression breaks the cheaper model’s cost advantage) is what this benchmark is designed to find.
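To make the crossover idea concrete, here is a minimal sketch using the illustrative scores from the paragraph above. The prices are hypothetical placeholders, and "price divided by score" is only an assumed definition of cost per quality point, not the leaderboard's formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPoint:
    name: str
    raw_score: float         # benchmark score on raw text, in percent
    compressed_score: float  # same benchmark on 5x-compressed text, in percent
    price_per_mtok: float    # HYPOTHETICAL price in USD per million input tokens


def retention(m: ModelPoint) -> float:
    """Fraction of raw-text quality that survives compression."""
    return m.compressed_score / m.raw_score


def cost_per_quality_point(m: ModelPoint, compressed: bool) -> float:
    """Price divided by score (an assumed definition, for illustration only)."""
    score = m.compressed_score if compressed else m.raw_score
    return m.price_per_mtok / score


# Scores are the illustrative figures from the text; prices are made up.
premium = ModelPoint("premium", 92.0, 85.0, price_per_mtok=10.0)
budget = ModelPoint("budget", 88.0, 72.0, price_per_mtok=9.0)

for m in (premium, budget):
    print(
        m.name,
        round(retention(m), 3),
        round(cost_per_quality_point(m, compressed=False), 4),
        round(cost_per_quality_point(m, compressed=True), 4),
    )
```

With these placeholder prices the cheaper model wins on raw text but loses once both models read compressed input, which is the crossover the benchmark is meant to locate.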

No third party runs this benchmark. We do. That means we are both the benchmark operator and a party with commercial interest in the results. We are disclosing our methodology in full so you can reproduce any row, verify our judge prompts, and flag any bias we missed.

2. The corpora and their licenses

We use three evaluation corpora. Full attribution and reproducibility commands are in corpora/CITATIONS.md.

Corpus	Source	License	Slice
humaneval_v0.jsonl	openai/human-eval	MIT	First 30 of 164 (HumanEval/0 to 29)
ruler_qa_v0.jsonl	NVIDIA/RULER	Apache-2.0	QA subset @ 16k seq, 20 samples (mock until Wave 6)
code_review_v0.jsonl	Hand-curated public-repo snippets	CC-BY-4.0	10 samples (hand-authored)

Note on the RULER corpus. RULER generates QA synthetically from its own scripts; no pre-built JSONL is available in the upstream repo. The v0 corpus uses 20 deterministic long-context QA pairs matching the RULER schema. Wave 6 (the live harness run) will replace these with pairs generated by running RULER’s generation script directly.

3. The judge prompt

We use a fixed prompt committed to the repository at benchmarks/compression-leaderboard/src/judge_prompts.py. Any change to this prompt constitutes a methodology change and is recorded in the revisions log below.

The system prompt tells the judge to score 0 to 100 on four dimensions and return JSON only. The user template injects the task instruction and model response.

JUDGE_SYSTEM (verbatim)

You are an expert code reviewer. Score a model's response 0-100 on:
- Correctness (40 pts): does the response solve the task?
- Completeness (30 pts): does it address all parts?
- Clarity (15 pts): is it readable and well-structured?
- Safety (15 pts): does it avoid security/correctness pitfalls?

Output JSON: {"score": <0-100>, "reasoning": "<one sentence>"}.
Do not include any other text.

JUDGE_USER_TEMPLATE (verbatim)

Task: {task_instruction}

Model response:
{model_output}

Score the response.

When judging at 5× compression, {model_output} is replaced with the model’s response to the compressed version of the task context. The task instruction itself is not compressed.
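As a rough illustration of how the two prompts fit together, the sketch below fills the user template and validates the judge's JSON-only reply. The function names are hypothetical and are not taken from judge_prompts.py; only the template text comes from the section above.

```python
import json

# Template text copied from the JUDGE_USER_TEMPLATE shown above.
JUDGE_USER_TEMPLATE = (
    "Task: {task_instruction}\n\n"
    "Model response:\n{model_output}\n\n"
    "Score the response."
)


def build_judge_user_message(task_instruction: str, model_output: str) -> str:
    """Fill the user template; the task instruction is passed uncompressed."""
    return JUDGE_USER_TEMPLATE.format(
        task_instruction=task_instruction, model_output=model_output
    )


def parse_judge_reply(reply: str) -> tuple[float, str]:
    """Parse the JSON-only reply and check that the score is within 0-100."""
    data = json.loads(reply)
    score = data["score"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError(f"score is not a number: {score!r}")
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    return float(score), str(data.get("reasoning", ""))
```

Rejecting malformed or out-of-range replies, rather than clamping them, keeps a misbehaving judge from silently contributing to the average.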

4. Cross-validation rule

Each (compressed response, task) pair is judged by two independent LLMs from different provider families: Claude Opus 4.7 (Anthropic) and GPT-5.5 (OpenAI). Using two judges from different training pipelines and RLHF objectives reduces the risk that a shared training bias inflates or deflates scores systematically.

The per-model quality score, llm_judge_avg_at_5x, is the mean of both judges’ scores across all pairs for that model that are not excluded by the rules in this section and §5.

Threshold	Effect on leaderboard
Aggregate disagreement ≤ 15%	Ranked scores shown; cost-per-quality-point column live.
Aggregate disagreement > 15%	Methodology-only mode. No ranked scores; models listed alphabetically with raw compression ratio only.
Per-pair disagreement > 25%	Pair excluded from aggregate; preserved in raw `judge_pairs` field for inspection.
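The thresholds in the table can be sketched as follows. This is not the harness code: the disagreement formula (absolute gap divided by the pair mean, with a denominator floor) and the floor value are assumptions made for illustration, and the aggregate is assumed to be the mean disagreement over retained pairs.

```python
from statistics import mean

PAIR_EXCLUDE = 0.25  # per-pair threshold from the table above
AGG_MAX = 0.15       # aggregate threshold from the table above
DENOM_FLOOR = 10.0   # ASSUMED denominator floor, in score points


def pair_disagreement(a: float, b: float) -> float:
    """Relative gap between two judge scores (an assumed definition)."""
    return abs(a - b) / max((a + b) / 2, DENOM_FLOOR)


def aggregate(judge_pairs: list[tuple[float, float]]) -> dict:
    """Apply the per-pair exclusion first, then the aggregate threshold."""
    kept = [p for p in judge_pairs if pair_disagreement(*p) <= PAIR_EXCLUDE]
    excluded = len(judge_pairs) - len(kept)
    if not kept:
        return {"mode": "methodology-only", "llm_judge_avg_at_5x": None,
                "aggregate_disagreement": None, "excluded": excluded}
    agg = mean(pair_disagreement(*p) for p in kept)
    ranked = agg <= AGG_MAX
    return {
        "mode": "ranked" if ranked else "methodology-only",
        # Mean of both judges' scores across all retained pairs.
        "llm_judge_avg_at_5x": mean(s for p in kept for s in p) if ranked else None,
        "aggregate_disagreement": agg,
        "excluded": excluded,
    }
```

For example, `aggregate([(80.0, 84.0), (70.0, 75.0), (90.0, 40.0)])` drops the third pair under the per-pair rule and reports a ranked score from the remaining two.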

5. The pre-ship stability gate

Before any quality score appears in the leaderboard, we run the stability gate:

Stability gate

python -m src.stability --pilot output/benchmark_seed_runs.json --max-disagreement 0.15

Exit code	Effect
0	Aggregate disagreement ≤ 15%. Ranked scores published.
1	Gate failed. Leaderboard switches to methodology-only mode; no `cost_per_quality_point` column shown.
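A publishing script could consume the gate's exit code along these lines. This is a sketch under assumptions: the helper names are hypothetical, and any non-zero exit code is treated as a failure, which is a conservative reading of the table rather than documented behavior.

```python
import subprocess

# Command copied from the stability gate block above.
GATE_CMD = [
    "python", "-m", "src.stability",
    "--pilot", "output/benchmark_seed_runs.json",
    "--max-disagreement", "0.15",
]


def leaderboard_mode(exit_code: int) -> str:
    """Map the gate's exit code to a leaderboard mode (0 means passed)."""
    return "ranked" if exit_code == 0 else "methodology-only"


def run_gate(cmd: list[str]) -> str:
    """Run the gate command without raising on failure and return the mode."""
    result = subprocess.run(cmd, check=False)
    return leaderboard_mode(result.returncode)
```

Calling `run_gate(GATE_CMD)` from the repository root would then return the mode to publish.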

The v0 leaderboard shows “—” (em-dash placeholder) for all quality columns. Pricing data is real (hand-verified from official provider pages as of 2026-05-09). The quality scores will populate once the live harness run passes the stability gate.

Swap-consistency check. Each judge pair is scored in both orders: compressed-first and uncompressed-first. If the judge flips its verdict between the two orderings, the pair is marked swap_consistent=False and excluded from the aggregate score. Pairs that are not swap-consistent are counted toward the stability gate’s disagreement rate.
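The swap-consistency rule might look like the sketch below. It assumes a "verdict" means which of the two responses received the higher score in a given ordering; the class and function names are hypothetical, and only the swap_consistent flag name comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderedJudgment:
    compressed_score: float    # score given to the compressed-input response
    uncompressed_score: float  # score given to the uncompressed-input response


def verdict(j: OrderedJudgment, tie_margin: float = 0.0) -> str:
    """Which response the judge preferred; tie_margin is a hypothetical knob."""
    gap = j.compressed_score - j.uncompressed_score
    if abs(gap) <= tie_margin:
        return "tie"
    return "compressed" if gap > 0 else "uncompressed"


def swap_consistent(compressed_first: OrderedJudgment,
                    uncompressed_first: OrderedJudgment) -> bool:
    """True when the judge reaches the same verdict in both orderings."""
    return verdict(compressed_first) == verdict(uncompressed_first)
```

A pair for which `swap_consistent(...)` returns False would be dropped from the aggregate score and still counted in the gate's disagreement rate, as described above.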

6. Reproducibility

Every row in the leaderboard can be reproduced in under 5 minutes for under $1 in API costs.

Prerequisites: Python 3.11+, uv installed, repo cloned at benchmarks/compression-leaderboard, and API keys exported (gotcontext, Anthropic, OpenAI, Google).

Run a single model

cd benchmarks/compression-leaderboard
uv pip install -e .
GOTCONTEXT_KEY=gc_... ANTHROPIC_API_KEY=... OPENAI_API_KEY=... GOOGLE_API_KEY=... \
  python -m src.aggregate --model claude-opus-4-7 --output output/single.json

Run all 13 models (full seed run)

python -m src.aggregate --all --output output/benchmark_seed_runs.json

Re-run the stability gate

python -m src.stability --pilot output/benchmark_seed_runs.json --max-disagreement 0.15

Reproduce the corpus slices

# HumanEval first 30
curl -fsSL https://raw.githubusercontent.com/openai/human-eval/master/data/HumanEval.jsonl.gz \
  | gunzip | head -n 30

# RULER QA (requires RULER repo + dependencies)
python scripts/data/synthetic/qa.py --seq_length 16000 --num_samples 20
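
For readers without curl and gunzip, the HumanEval slice above can also be taken in Python. This is a minimal sketch using only the standard library; the function names are hypothetical and not part of the harness.

```python
import gzip
import io
import json
from itertools import islice
from urllib.request import urlopen

# Same URL as the curl command above.
HUMANEVAL_URL = (
    "https://raw.githubusercontent.com/openai/human-eval/"
    "master/data/HumanEval.jsonl.gz"
)


def first_n_jsonl(gz_bytes: bytes, n: int = 30) -> list[dict]:
    """Decompress a .jsonl.gz payload and parse its first n records."""
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as fh:
        return [json.loads(line) for line in islice(fh, n)]


def download_humaneval_slice(n: int = 30) -> list[dict]:
    """Fetch the upstream archive and keep the first n tasks."""
    with urlopen(HUMANEVAL_URL) as resp:
        return first_n_jsonl(resp.read(), n)
```

Calling `download_humaneval_slice()` needs network access; `first_n_jsonl` can be exercised offline on any gzipped JSONL payload.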

7. What we don’t measure yet

The v0 leaderboard measures quality at exactly 5× compression. Several important dimensions are deferred to v1:

Higher compression ratios (10×, 20×). Quality degradation at higher compression ratios is non-linear and model-specific. A model that holds up well at 5× may collapse at 10×.
Hardware × model interaction. Inference latency at 5× compression varies by hardware (H100 vs A100 vs consumer GPUs) and serving framework (vLLM, TensorRT-LLM, TokenSpeed). We measure quality only; latency is out of scope for v0.
Domain-specific corpora. The v0 corpora are code-heavy. Legal, medical, and scientific document compression may show different model rankings.
Multilingual evaluation. All v0 corpora are English-only. Compression quality varies significantly across languages, especially for non-Latin scripts.
Multi-turn agentic tasks. Quality at single-turn completion is not the same as quality at multi-turn agent sessions where compressed context accumulates across rounds.

8. Anti-compression bias: known limitation and mitigations

Disclosure. LLM judges have a documented perplexity-driven self-preference bias. Compressed text has higher perplexity than natural text, which may cause judges to systematically score compressed-input responses lower than equivalent-quality natural-input responses. This is a known limitation of LLM-as-judge methodology that our design partially mitigates but does not fully eliminate.

The underlying mechanism was characterized by Wataoka et al. (ICLR 2025) in their study of LLM self-preference. Their finding is that GPT-4’s self-preference is driven by perplexity: the model assigns higher quality scores to text that resembles its own training distribution (lower perplexity) and lower scores to text that departs from it (higher perplexity). Compressed text tends to have higher perplexity because compression removes redundant tokens, reducing the local predictability that LLM judges reward.

The practical implication: our judges may systematically underrate compressed-input responses relative to the ground truth. If this bias applies equally to every model being judged, it leaves the relative ranking between models unchanged, but it may cause the absolute llm_judge_avg_at_5x scores to run below what human evaluators would assign.

How we partially mitigate this:

Two-judge cross-family ensemble. Using Claude Opus 4.7 (Anthropic) and GPT-5.5 (OpenAI) in parallel means neither judge can dominate the score on its own. Disagreement above 15% triggers the stability gate and withholds ranked scores. This does not eliminate perplexity bias (both judges are LLMs and both share the underlying mechanism), but it catches cases where one judge’s bias is unusually severe.
Swap-consistency protocol. Every pair is judged in both orders (compressed-first and uncompressed-first). A verdict that flips between orderings indicates position bias rather than quality assessment. Swap-inconsistent pairs are excluded from the aggregate score.
Style normalization (see §9). Stripping markdown formatting before judging removes one surface where compressed text looks structurally different from natural text, reducing but not eliminating the perplexity gap.

We do not claim these mitigations fully eliminate anti-compression bias. If you have a stronger mitigation to propose (for example, using a Gemini judge as a third cross-family vote when the two-judge ensemble disagrees by more than 10%), open an issue on GitHub using the methodology template.

8.5. v0 measurement constraint: why this page ships methodology-only

Disclosure (2026-05-10). The leaderboard at /benchmarks/compression ships with placeholder rows pending a v1 harness redesign. v0 piloted the synthetic harness at 5× compression against claude-haiku-4-5 via a chat-completion API and discovered an architectural mismatch between the gotcontext compression mode and one-shot chat-completion benchmarking.

The compression engine produces a semantic skeleton output: === SEMANTIC SKELETON === [HIDDEN nodes]. High-importance summaries are visible; lower-importance content is replaced with [HIDDEN] markers that an MCP-tool-calling agent can drill into via the modulate_region tool. A one-shot chat-completion call cannot drill in: it sees the skeleton and honestly reports that it cannot answer the question. That is the correct model behavior, not a quality regression. Measuring “compression-friendliness” against this mode requires either an MCP harness or a different compression mode that emits self-contained summaries.

v1 of this leaderboard will address the constraint along one or more axes:

MCP-tool-calling harness. Agents that can call modulate_region on demand. Closer to the production usage pattern of MCP-aware models like Claude Code.
Lower compression target for short / single-pass tasks. “Compress or Route?” (OpenReview 2026) identifies a compression-rate threshold of r ≥ 0.6 (≤ 1.67×) for code-task quality preservation. v1 surfaces a multi-ratio curve (1.5× / 2× / 3× / 5×) rather than a single 5× point.
Long-context corpora better suited to 5× compression. LongBench (GovReport, NarrativeQA) inputs of 8K+ tokens compress to 1.6K with answer-bearing content directly readable. RULER’s needle-in-haystack design at 16K assumes the haystack stays accessible, which is incompatible with our skeleton mode.
Judge calibration redesign. v0 pilot data showed both judges floor-clamped at near-zero on garbage inputs, with our denominator-floor in the stability gate masking what was actually a 100% inter-judge calibration gap. v1 prompts will produce meaningful spread on bad outputs as well as good ones.
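
The rate and ratio figures quoted in the list above can be checked with a few lines of arithmetic. The sketch assumes r is the fraction of tokens retained, so the compression ratio is 1/r; the function names are illustrative.

```python
def rate_to_ratio(rate: float) -> float:
    """Convert a retained-token rate r (0 < r <= 1) into a compression ratio."""
    if not 0 < rate <= 1:
        raise ValueError("rate must be in (0, 1]")
    return 1.0 / rate


def compressed_tokens(input_tokens: int, ratio: float) -> int:
    """Token count left after compressing by the given ratio."""
    return round(input_tokens / ratio)


threshold_ratio = rate_to_ratio(0.6)       # r >= 0.6 corresponds to at most ~1.67x
print(round(threshold_ratio, 2))           # 1.67
print(compressed_tokens(8000, 5.0))        # 1600, i.e. 8K tokens down to 1.6K

# Which points on the planned v1 curve stay within the cited threshold?
for ratio in (1.5, 2.0, 3.0, 5.0):
    print(f"{ratio}x", "within" if ratio <= threshold_ratio else "beyond")
```

Under this reading, only the 1.5× point of the planned curve falls inside the cited threshold; 2×, 3×, and 5× all compress harder than r = 0.6 allows.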

We are publishing this constraint up front rather than backfilling fabricated numbers. The methodology, harness code, and reproducibility scripts are real and on GitHub. The numerical leaderboard is staged for v1.

9. Style normalization before judging

A separate bias source is formatting style. An OpenReview 2025 debiasing study found that style variation (markdown fences, whitespace runs, header levels) drives scoring bias with severity 0.76 to 0.92 across all judge models tested. A response with tidy markdown and bullet points scores higher than an equivalent response in plain prose, regardless of correctness.

Before every judge call, we run the response through benchmarks/compression-leaderboard/src/judge_normalize.py, which:

Strips markdown fences (triple-backtick code blocks are replaced with their content, language tag removed).
Collapses whitespace runs (multiple spaces → single space; multiple blank lines → one blank line).
Strips heading markers (#, ##, ###).
Preserves code content intact (only the fence markers are removed).

This normalization is applied to both the compressed and uncompressed response before judging, so neither gets a style advantage. It does not change correctness: it only reduces the signal available to the judge for style-based discrimination.

The normalization is applied identically in both swap orderings, so it does not interact with the swap-consistency check.
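
The four normalization steps could be implemented roughly as below. This is a sketch, not the contents of judge_normalize.py: the function name is hypothetical, and it assumes that whitespace collapsing and heading stripping apply only outside code fences so that code indentation and comments survive. Blank-line collapsing is applied to the whole text, including code, as a simplification.

```python
import re

FENCE = re.compile(r"^\s*`{3}")        # a line that opens or closes a code fence
HEADING = re.compile(r"^\s*#{1,6}\s+")  # leading markdown heading markers
SPACE_RUN = re.compile(r"[ \t]{2,}")    # two or more spaces or tabs in a row
BLANK_RUN = re.compile(r"\n{3,}")       # more than one consecutive blank line


def normalize_for_judge(text: str) -> str:
    """Strip fences and heading markers, collapse whitespace, keep code lines."""
    out: list[str] = []
    in_code = False
    for line in text.splitlines():
        if FENCE.match(line):
            in_code = not in_code  # drop the fence line and its language tag
            continue
        if in_code:
            out.append(line)       # code lines pass through unchanged
            continue
        line = HEADING.sub("", line)
        out.append(SPACE_RUN.sub(" ", line).rstrip())
    return BLANK_RUN.sub("\n\n", "\n".join(out)).strip()
```

Tracking fence state line by line is what keeps a Python comment such as `# note` inside a code block from being mistaken for a heading.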

10. How to submit a benchmark result

Community submissions are planned for a future release (Task 16 in the harness roadmap). When the submission form launches, it will be linked here.

In the meantime, use the prefilled issue templates below (they route to the right label and reviewer):

Submit a result

Model name, provider, pricing source, any quality scores you ran.

Propose a mitigation

Anti-bias improvements, judge ensemble changes, calibration ideas.

Report a harness bug

Reproducer command, expected output, actual output.

Revisions

§3 commits us to dated entries on any methodology change. The full log:

Version	Date	Summary
v0.3	2026-05-14	Container rewrite: TOC, anchored sections, copy buttons, source-pinned file links, real lists (no em-dash pseudo-bullets), exit-code + threshold tables, status callout promoted to top of post, submit-result CTA cards. Content unchanged.
v0.2	2026-05-10	Added §8.5 (v0 measurement constraint). Discovered the gotcontext skeleton mode vs one-shot chat-completion mismatch. Leaderboard switched to methodology-only mode pending v1 harness redesign.
v0.1	2026-05-09	Initial publish.

Cite this

Researchers, analysts, or journalists referencing this methodology in writeups can use either:

BibTeX

@misc{gotcontext-compression-benchmark-methodology-2026,
  title  = {How we measure quality at 5× compression},
  author = {Hollingsworth, James},
  year   = {2026},
  month  = {May},
  url    = {https://gotcontext.ai/blog/compression-benchmark-methodology},
  note   = {gotcontext.ai engineering blog. Methodology v0.3 (updated 2026-05-14).},
}

APA

Hollingsworth, J. (2026, May 9). How we measure quality at 5× compression. gotcontext.ai. Retrieved from https://gotcontext.ai/blog/compression-benchmark-methodology (Methodology v0.3, updated 2026-05-14).
