Question 1

How is tokens-per-second measured in these benchmarks?

Accepted Answer

Every result is community-submitted. The convention is decode throughput: tokens generated divided by (completion time minus time-to-first-token), averaged over the full generation. That isolates decode speed from prefill. Dividing by total wall-clock time instead understates the rate on long prompts, because it folds the one-time prefill cost into the per-token figure.
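To make the arithmetic concrete, here is a minimal sketch of the two ways of computing the rate. The function names and the sample timings are made up for illustration and are not taken from any submitted result.

```python
def decode_tokens_per_second(tokens_generated, completion_time_s, ttft_s):
    """Decode throughput: tokens / (completion time - time to first token)."""
    decode_time_s = completion_time_s - ttft_s
    if decode_time_s <= 0:
        raise ValueError("completion time must be greater than TTFT")
    return tokens_generated / decode_time_s


def wall_clock_tokens_per_second(tokens_generated, completion_time_s):
    """Naive rate: prefill time is folded into the per-token figure."""
    return tokens_generated / completion_time_s


# Hypothetical run: 512 tokens, 2.0 s of prefill, 12.0 s end to end.
decode_tps = decode_tokens_per_second(512, 12.0, 2.0)  # 512 / 10.0 = 51.2
naive_tps = wall_clock_tokens_per_second(512, 12.0)    # 512 / 12.0, about 42.7
```

The longer the prompt, the larger the prefill share of the total, and the further the wall-clock figure falls below the decode figure.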

Question 2

What is TTFT (time to first token)?

Accepted Answer

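TTFT is the latency from sending the request to receiving the first output token, which is dominated by the prefill phase. It is reported separately from decode tokens-per-second because the two measure different things: TTFT is how long you wait before output starts, and decode throughput is how fast output arrives after that. Below roughly 500 ms feels interactive; TTFT grows with prompt length and with batch size under load.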
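One way to capture both numbers from a single streamed response is sketched below. It assumes `token_stream` is a lazy iterator that sends the request when iteration begins and yields one item per output token; `time_stream` and its `clock` parameter are hypothetical names, not part of any runtime's API.

```python
import time


def time_stream(token_stream, clock=time.perf_counter):
    """Return (ttft_seconds, decode_tokens_per_second) for a token iterator."""
    start = clock()
    first_token_at = None
    count = 0
    for _ in token_stream:
        if first_token_at is None:
            first_token_at = clock()  # first output token has arrived
        count += 1
    end = clock()

    if first_token_at is None:
        raise ValueError("stream produced no tokens")

    ttft = first_token_at - start
    decode_time = end - first_token_at
    # Same convention as the benchmarks: tokens / (completion time - TTFT).
    decode_tps = count / decode_time if decode_time > 0 else float("nan")
    return ttft, decode_tps
```

Passing the clock in as a parameter keeps the function testable with fixed timestamps; in real use the default `time.perf_counter` is enough.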

Question 3

Which inference runtimes and hardware can I benchmark?

Accepted Answer

The submission form covers the common local-inference runtimes — vLLM, Ollama, llama.cpp, LM Studio, SGLang, TensorRT-LLM, and MLX — across NVIDIA, AMD, Apple Silicon, and Intel. Quantizations include GGUF, AWQ, and GPTQ, and multi-GPU plus Mixture-of-Experts configurations are supported with their own fields (GPU count, tensor-parallel split, active vs total parameters).

Question 4

How do I post my own benchmark result?

Accepted Answer

Paste your llama-bench or nvidia-smi output and the form auto-fills the speed and VRAM fields; then add your model, quantization, and hardware and publish. Posting and commenting need a free account — reading every result is fully public, no sign-up required.
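The site's own parser is not shown here, so the following is only a sketch of how pasted output could be turned into fields. It assumes the pasted text is a pipe-delimited table with `test` and `t/s` columns, in the style of llama-bench's default output; the sample rows are invented placeholders, not measurements.

```python
def parse_pipe_table(text):
    """Parse a pipe-delimited table into a list of dicts keyed by header."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells
            continue
        if all(set(cell) <= set("-: ") for cell in cells):
            continue  # separator row such as | ---- | ---: |
        rows.append(dict(zip(header, cells)))
    return rows


def mean_throughput(row, column="t/s"):
    """Extract the mean from a 'mean ± stddev' cell."""
    return float(row[column].split("±")[0].strip())


# Placeholder input (assumed layout, invented numbers).
SAMPLE = """
| model         | backend | test  | t/s          |
| ------------- | ------- | ----: | -----------: |
| example-model | CUDA    | pp512 | 1000.0 ± 5.0 |
| example-model | CUDA    | tg128 | 50.0 ± 0.5   |
"""

rows = parse_pipe_table(SAMPLE)
generation = [r for r in rows if r["test"].startswith("tg")]
decode_tps = mean_throughput(generation[0])
```

In llama-bench output the `tg` rows report text generation and the `pp` rows report prompt processing, so the `tg` figure is the one that corresponds to decode tokens-per-second.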

Question 5

What makes two benchmark results directly comparable?

Accepted Answer

Four things must match: the model (family, parameter count, and instruct-vs-base variant), the GPU SKU (an RTX 3090 and a 3090 Ti are different rows), the quantization bit-width and family (GGUF Q4_K_M is not interchangeable with AWQ-int4 or GPTQ-int4), and the inference runtime. Differ on any one of those and the numbers describe different configurations, not a head-to-head.
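The four-way match can be expressed as a small check. This is an illustrative sketch: the `BenchmarkResult` record, its field names, and the sample values are assumptions, not the site's actual schema or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkResult:
    model: str          # family, parameter count, instruct-vs-base variant
    gpu_sku: str        # exact SKU, e.g. "RTX 3090" vs "RTX 3090 Ti"
    quantization: str   # family and bit-width, e.g. "GGUF Q4_K_M"
    runtime: str        # e.g. "llama.cpp"
    tokens_per_second: float


COMPARABILITY_KEYS = ("model", "gpu_sku", "quantization", "runtime")


def mismatched_fields(a, b):
    """List the comparability fields on which two results differ."""
    return [key for key in COMPARABILITY_KEYS if getattr(a, key) != getattr(b, key)]


def directly_comparable(a, b):
    """True only when all four comparability fields match."""
    return not mismatched_fields(a, b)


# Hypothetical records with invented throughput values.
on_3090 = BenchmarkResult("example-8b-instruct", "RTX 3090", "GGUF Q4_K_M", "llama.cpp", 100.0)
on_3090_ti = BenchmarkResult("example-8b-instruct", "RTX 3090 Ti", "GGUF Q4_K_M", "llama.cpp", 110.0)

differences = mismatched_fields(on_3090, on_3090_ti)  # ["gpu_sku"]
```

Returning the list of mismatched fields, instead of only a boolean, makes it clear why two rows are not a head-to-head.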

Hardware (all rows): RTX 4070 12GB (Ada sm_89) + RTX 5070 12GB (Blackwell sm_120), layer-split via llama.cpp on Ryzen 5800XT Zen 3, 128 GB DDR4, Docker Desktop WSL2

#	Model	Quant	Backend	Context	Batch	Tokens / sec	TTFT	Status	Submitter
1	unsloth/Qwen3.5-4B-MTP-GGUF :: Qwen3.5-4B-UD-Q4_K_XL.gguf	UD-Q4_K_XL	cuda	131K	512	93.9	n/a	Unverified	anonymous
2	lmstudio-community/gpt-oss-20b-GGUF :: gpt-oss-20b-MXFP4.gguf	MXFP4	cuda	131K	512	45	n/a	Unverified	anonymous
3	unsloth/Qwen3.6-27B-MTP-GGUF :: Qwen3.6-27B-UD-Q4_K_XL.gguf	UD-Q4_K_XL	cuda	66K	256	37	n/a	Unverified	anonymous
4	lmstudio-community/gemma-4-26B-A4B-it-GGUF :: gemma-4-26B-A4B-it-Q4_K_M.gguf	Q4_K_M	cuda	150K	512	35	n/a	Unverified	anonymous
5	unsloth/gemma-4-26B-A4B-it-GGUF :: gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf	UD-Q4_K_XL	cuda	150K	512	26.8	n/a	Unverified	anonymous
6	unsloth/granite-4.1-30b-GGUF :: granite-4.1-30b-UD-Q4_K_XL.gguf	UD-Q4_K_XL	cuda	66K	512	26.4	n/a	Unverified	anonymous
7	unsloth/gemma-4-31B-it-GGUF :: gemma-4-31B-it-Q3_K_M.gguf	Q3_K_M	cuda	60K	512	17.5	n/a	Unverified	anonymous
8	bartowski/nvidia_Nemotron-Cascade-2-30B-A3B-GGUF :: nvidia_Nemotron-Cascade-2-30B-A3B-Q4_K_M.gguf	Q4_K_M	cuda	66K	4096	10.5	n/a	Unverified	anonymous
9	unsloth/Qwen3.6-35B-A3B-MTP-GGUF :: Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf	UD-Q4_K_XL	cuda	66K	1024	9.6	n/a	Unverified	anonymous
10	unsloth/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning-GGUF :: NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning-UD-Q4_K_XL.gguf	UD-Q4_K_XL	cuda	131K	4096	6.3	n/a	Unverified	anonymous
11	unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF :: Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf	UD-Q4_K_XL	cuda	262K	4096	5.7	n/a	Unverified	anonymous
12	ggml-org/gpt-oss-120b-GGUF :: gpt-oss-120b-mxfp4-00001-of-00003.gguf	MXFP4	cuda	66K	4096	5.3	n/a	Unverified	anonymous
13	bartowski/Qwen_Qwen3-30B-A3B-Instruct-2507-GGUF :: Qwen_Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf	Q4_K_M	cuda	262K	4096	4.5	n/a	Unverified	anonymous

AI inference benchmarks: real rigs, real numbers.
