RTX 5080 16GB runs Qwen 35B at 56 tok/s on 128k context—but MTP slows it down
A LocalLLaMA benchmark shows Multi-Token Prediction hurts MoE model speed on consumer GPUs because the compute buffer forces expert layers to CPU. The 35B Q4_K_XL without MTP reaches 97 tok/s.
A detailed benchmark of Qwen 3.6 on an RTX 5080 reveals that Multi-Token Prediction, which just merged into mainline llama.cpp, actually degrades throughput for mixture-of-experts models on consumer hardware. The post from the LocalLLaMA community tested three configurations at realistic 128k contex...
Method & sources
- Source type: Community signal (Reddit), our summary + analysis
- Source link: Reddit · reddit-localllama
- Published: UTC
- Byline: By the gotcontext.ai team (editorial standards)
- Correction?: corrections@gotcontext.ai