
Qwen3.6 27B achieves 45+ tokens/sec on consumer GPUs with llama.cpp

Alibaba's Qwen3.6 27B, running on two AMD RX 9070 XT GPUs via llama.cpp, delivers 45–51 tokens per second with speculative decoding, evidence that dense open models remain viable for local agentic workloads.

Deployed on consumer-grade AMD GPUs via llama.cpp, the model is delivering throughput and reasoning quality that challenge the assumption that dense open models need cloud infrastructure for practical use.

A developer running the model on two RX 9070 XTs (PCIe 5.0 x8/x8, powe...

Method & sources
Source type: Primary publication (lab/vendor blog), with our analysis and implications
Source link: r/localllama
Published: UTC
Byline: the gotcontext.ai team (editorial standards)
Corrections: corrections@gotcontext.ai