Fine-tuned Gemma 4 26B shows 3–5 second E2E latency despite low TTFT

A fine-tuned Gemma 4 26B model on H100 hardware exhibits end-to-end latency of 3–5 seconds despite time-to-first-token performance of 100–300 ms, highlighting a common gap between prompt and generation speed in quantized

2026-05-26 · 1 min read

Source: r/machinelearning

A machine learning engineer reported high end-to-end latency on a fine-tuned Gemma 4 26B model despite achieving reasonable time-to-first-token (TTFT) performance on H100 hardware. The mod...
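As a rough illustration of the gap between TTFT and end-to-end latency, the sketch below backs out the average per-token decode time from the reported figures. It assumes the common decomposition E2E ≈ TTFT + (n − 1) × time per output token. The response length `OUTPUT_TOKENS` and the helper name are hypothetical; the source does not state how many tokens were generated.

```python
def decode_time_per_token(e2e_s: float, ttft_s: float, output_tokens: int) -> float:
    """Average seconds per output token after the first one."""
    if output_tokens < 2:
        raise ValueError("need at least two output tokens to estimate decode time")
    return (e2e_s - ttft_s) / (output_tokens - 1)


# Reported ranges: E2E 3-5 s, TTFT 100-300 ms.
# OUTPUT_TOKENS is an assumed response length, not a figure from the source.
OUTPUT_TOKENS = 201

# Shortest E2E paired with slowest TTFT, and longest E2E with fastest TTFT.
best = decode_time_per_token(3.0, 0.3, OUTPUT_TOKENS)
worst = decode_time_per_token(5.0, 0.1, OUTPUT_TOKENS)

# Fraction of the slowest request spent after the first token arrived.
decode_share = (5.0 - 0.1) / 5.0

print(f"per-token decode time: {best * 1000:.1f} to {worst * 1000:.1f} ms")
print(f"decode share of slowest request: {decode_share:.0%}")
```

Under these assumptions nearly all of the wall-clock time falls in the generation phase, so a low TTFT on its own says little about total response time.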

Method & sources

Source type: Primary publication (lab/vendor blog) — our analysis + implication
Source link: r/machinelearning
Published: 2026-05-26 21:53:12 UTC
Byline: By the gotcontext.ai team (editorial standards)
Correction?: corrections@gotcontext.ai
