RTX 5080 16GB runs Qwen 35B at 56 tok/s on 128k context—but MTP slows it down

A LocalLLaMA benchmark shows Multi-Token Prediction hurts MoE model speed on consumer GPUs because the compute buffer forces expert layers to CPU. The 35B Q4_K_XL without MTP reaches 97 tok/s.

2026-05-22 · 1 min read

Source: Reddit · reddit-localllama

A detailed benchmark of Qwen 3.6 on an RTX 5080 reveals that Multi-Token Prediction, which just merged into mainline llama.cpp, actually degrades throughput for mixture-of-experts models on consumer hardware. The post from the LocalLLaMA community tested three configurations at realistic 128k contex...
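The offloading effect described in the summary can be illustrated with a rough VRAM budget. This is a minimal sketch, not the benchmark's method: the function name `expert_layers_on_gpu` and every size below are hypothetical placeholders chosen for round arithmetic, not measurements from the post. It only shows the general idea that a larger compute buffer leaves less room for expert layers, so more of them spill to CPU.

```python
def expert_layers_on_gpu(vram_mib, resident_mib, compute_buffer_mib,
                         expert_layer_mib, n_layers):
    """Count how many expert layers fit in VRAM after fixed allocations.

    All arguments are integer sizes in MiB and are illustrative assumptions:
      vram_mib            total GPU memory
      resident_mib        weights and KV cache that must stay on the GPU
      compute_buffer_mib  scratch space reserved for the forward pass
      expert_layer_mib    size of one layer's expert weights
      n_layers            total number of expert layers in the model
    """
    free_mib = vram_mib - resident_mib - compute_buffer_mib
    if free_mib <= 0:
        return 0  # nothing left over: every expert layer runs from CPU memory
    return min(n_layers, free_mib // expert_layer_mib)


# Hypothetical 16 GiB card; only the compute buffer differs between the two runs.
small_buffer = expert_layers_on_gpu(16384, 6144, 1024, 400, 40)
large_buffer = expert_layers_on_gpu(16384, 6144, 3072, 400, 40)

print(f"small compute buffer: {small_buffer} of 40 expert layers on GPU")
print(f"large compute buffer: {large_buffer} of 40 expert layers on GPU")
```

With these placeholder sizes, the larger buffer leaves fewer expert layers resident on the GPU. Real numbers depend on the model, quantization, context length, and runtime settings.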


Method & sources

Source type: Community signal (Reddit) — our summary + analysis
Source link: Reddit · reddit-localllama
Published: 2026-05-22 13:36:21 UTC
Byline: By the gotcontext.ai team (editorial standards)
Correction?: corrections@gotcontext.ai
