Vision Tokens Are Expensive and Nobody Reads the Pricing Page

You added image support to your app. You sent a 1,000×1,000 pixel screenshot to Claude. You were charged for 1,334 tokens.

You didn't write 1,334 tokens. You sent one image. But the model doesn't see images the way a human does. It converts them into token representations first, and that conversion is expensive in ways that the words "vision support" don't communicate.

How Anthropic Calculates Image Tokens ¶

Claude's vision pricing follows a specific formula documented in Anthropic's API reference:

``tokens = (width × height) / 750``

This is applied after the image is resized to fit within the model's maximum dimensions. For standard Claude models (Claude 3.5, Claude 3.7), the maximum is approximately 1,568 pixels on the long edge before resizing kicks in. For Claude Opus 4.7, the limit is higher. The model supports images up to approximately 4,784 tokens per image.

The official Anthropic documentation provides a reference table:

Image size	Token count
200×200 px	~53 tokens
1000×1000 px	~1,334 tokens
1092×1092 px	~1,590 tokens

For Claude Opus 4.7 specifically, larger images are supported without the standard resize ceiling:

1920×1080 px → ~2,765 tokens

2000×1500 px → ~4,000 tokens

These are not edge cases. A standard 1080p screenshot (the kind a browser automation agent might capture to verify a UI state) costs nearly 2,800 tokens on Opus 4.7. At Claude's current output pricing, that's the token equivalent of a paragraph of reasoning, paid just for the image ingestion.

Where Vision Token Costs Compound ¶

Single images at small scale are not the issue. The issue is systems that use vision as a routine part of their workflow:

Browser automation agents that take screenshots to verify navigation steps. A 20-step workflow with one screenshot per step sends 20 images. At 1,334 tokens each for 1000×1000 images, that's 26,680 tokens per run, before any text context.

Document processing pipelines that convert PDFs to images before sending to the model. A 10-page PDF rendered at standard resolution can easily exceed 15,000 vision tokens, more than many systems' entire text context budget.

UI testing systems that use vision models to verify component rendering. Continuous integration systems running 50 test cases per commit, each with 3 to 5 screenshots, accumulate vision token costs that dwarf the text token costs in the same pipeline.

Multi-modal RAG systems that index product images alongside text. Retrieval returns N images plus text chunks. Each image in the retrieved set costs 1,000 to 1,500 tokens before the model reads the actual query.

Strategies for Reducing Vision Token Spend ¶

Resize before sending. The formula is linear in pixel count. A 1000×1000 image at 1,334 tokens becomes ~334 tokens at 500×500. If your task doesn't require fine-grained detail (verifying that a button exists, checking that a form rendered, confirming a layout didn't break), resizing to 500×500 or smaller cuts costs by 75% with minimal accuracy impact.

Crop to the region of interest. Sending a full-page screenshot when you care about a 200×300 pixel UI component wastes everything outside that region. Cropping to the component before sending reduces vision tokens proportionally.

Use text extraction as a pre-filter. Many vision tasks are actually text extraction tasks. If your image contains structured text (a table, a form, a code block), extracting the text first (via OCR or a lighter vision call) and sending the extracted text to the main model is dramatically cheaper than sending the image directly.

Cache vision representations for repeated images. Anthropic's prompt caching applies to image tokens the same way it applies to text tokens. If your system sends the same base screenshot repeatedly with different questions, prompt caching eliminates the repeated image token cost after the first call.

Compress your text context to leave room. Vision tokens and text tokens share the same context budget. If your text context is bloated (large system prompts, accumulated conversation history, verbose few-shot examples), you're competing for budget against your images. Compressing text context gives vision tokens more headroom without hitting the model's limits.

The Math on a Typical Agent Workflow ¶

Assume a browser agent running a 15-step task with:

One 1920×1080 screenshot per step: 15 × 2,765 = 41,475 vision tokens

A 5,000-token system prompt per call: 15 × 5,000 = 75,000 text tokens

2,000 tokens of conversation history per call (growing): ~15,000 to 30,000 text tokens

Total per run: roughly 130,000 to 150,000 tokens. Vision is accounting for ~30% of that.

Shrink each screenshot to 800×600 and apply gotcontext.ai compression to the system prompt and conversation history:

Vision: 15 × (800×600/750) = 15 × 640 = 9,600 tokens (down from 41,475)

System prompt compressed 10×: 500 tokens (down from 5,000)

History compressed 5×: 3,000 to 6,000 tokens

New total: ~13,100 to 16,100 tokens per run. That's an 89% reduction. Same task, same model, no logic changes.

Vision tokens aren't optional if your application uses images. But how many you spend per image, and how much text context competes for the same budget, is entirely within your control.

Compress your text context and give your vision budget room to breathe →

Try it on your own context

You just read the writeup. Now run the thing. Paste a doc or some verbose tool output and watch it shrink — free, no signup.

Your text

# Service Operations Runbook: Payments API

## Purpose and scope

This runbook covers the payments-api service: what it does, how it is deployed, what its dependencies are, and what to do when it misbehaves. It is written for the on-call engineer. Every procedure here assumes you have production read access and the ability to trigger a deploy through the standard pipeline. Nothing in this document requires direct database write access, and no procedure here should be improvised under pressure: if the situation is not covered, page the service owner rather than inventing a fix at 3am.

The payments-api accepts charge requests from the checkout frontend, validates them against the pricing catalog, forwards them to the payment processor, and records the outcome in the orders database. It is the only service permitted to talk to the processor. Average traffic is steady during business hours with a daily peak around 19:00 UTC and a weekly peak on Friday evenings.

## Architecture and dependencies

The service runs as three replicas behind the regional load balancer. Each replica is stateless; all persistent state lives in the orders database and the idempotency-key store. The service depends on four things: the orders database (primary and one read replica), the idempotency-key store, the pricing catalog service, and the external payment processor. Of these, only the processor is outside our control.

Dependency failure behavior is deliberate and asymmetric. If the pricing catalog is unreachable, the service serves prices from its local cache for up to ten minutes and emits a degraded-mode metric. If the idempotency store is unreachable, the service refuses new charges entirely, because accepting a charge without idempotency protection risks double-billing, and double-billing is strictly worse than downtime. If the processor times out, the charge is recorded as pending and a reconciliation job resolves it within the hour.

## Deployment

Deploys go through the standard pipeline: merge to main, automated tests, staging deploy, a thirty-minute soak with synthetic checkout traffic, then production rollout one replica at a time. The pipeline aborts automatically if the error rate on the new replica exceeds the old baseline. A full rollout takes about twenty minutes. Rollback is the same pipeline in reverse and takes about six minutes; the on-call engineer can trigger it without approval.

## Monitoring and alerts

Three alerts page the on-call engineer. High charge failure rate fires when more than two percent of charge attempts fail over five minutes; the usual causes are a processor incident or a bad deploy, in that order. Idempotency store unavailable fires immediately on connection failure. Reconciliation backlog fires when pending charges older than ninety minutes accumulate, which usually means the reconciliation job is stuck rather than the processor being slow.

2,912/12,000 chars

Compressed

Compressed text will appear here…

Cite this¶

Researchers, analysts, or journalists referencing this post can use either format below — both are copyable.

BibTeXbibtex

@misc{vision-tokens-hidden-cost-multimodal-2026,
  title  = {Vision Tokens Are Expensive and Nobody Reads the Pricing Page},
  author = {James Hollingsworth},
  year   = {2026},
  month  = {May},
  url    = {https://gotcontext.ai/blog/vision-tokens-hidden-cost-multimodal},
  note   = {gotcontext.ai engineering blog.},
}

APAtext

James Hollingsworth. (2026, May 8). Vision Tokens Are Expensive and Nobody Reads the Pricing Page. gotcontext.ai. Retrieved from https://gotcontext.ai/blog/vision-tokens-hidden-cost-multimodal.

Contribute¶

Suggest an edit

Spotted a typo, a stale benchmark, or a missing nuance? Open a GitHub issue.

Discuss this post

Counterexamples, follow-up questions, and adjacent research welcome.

Email us

Bigger story? Hit us directly at hello@gotcontext.ai.

← Back to all posts