Tooling

ik_llama.cpp achieves 110 tok/s on 12GB GPU with Qwen 35B

A developer benchmarked ik_llama.cpp against standard llama.cpp on an RTX 4070 Super and found speculative decoding throughput jumped from 89.76 tok/s to over 110 tok/s using the same Qwen3.6-35B model.
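To put the two figures in proportion, the relative gain can be worked out directly. The sketch below is only arithmetic on the numbers quoted above; the helper name `relative_speedup` is hypothetical, and 110 tok/s is treated as a lower bound because the report says "over 110".

```python
def relative_speedup(baseline_tok_s: float, new_tok_s: float) -> float:
    """Return the fractional throughput gain of new_tok_s over baseline_tok_s."""
    if baseline_tok_s <= 0:
        raise ValueError("baseline throughput must be positive")
    return new_tok_s / baseline_tok_s - 1.0


# Figures quoted in the summary above.
baseline = 89.76   # tok/s reported for standard llama.cpp
improved = 110.0   # tok/s, lower bound ("over 110") reported for ik_llama.cpp

gain = relative_speedup(baseline, improved)
print(f"Throughput gain: at least {gain:.1%}")  # Throughput gain: at least 22.5%
```

Since the reported figure is "over 110", the true gain would be somewhat above this floor.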


A developer running local inference on an RTX 4070 Super 12GB GPU reported that switching from llama.cpp to ik_llama.cpp for speculative decoding increased throughput to 110 tokens per second on...


Method & sources
Source type: Primary publication (lab/vendor blog), with our analysis and implication
Source link: r/localllama
Published: UTC
Updated: UTC
Byline: By the gotcontext.ai team (editorial standards)
Corrections: corrections@gotcontext.ai