Vision Tokens Are Expensive and Nobody Reads the Pricing Page
Claude charges (width x height) / 750 tokens per image. A 1920x1080 screenshot costs ~2,765 tokens on Opus 4.7. Here's what that means for agents that use screenshots routinely.
You added image support to your app. You sent a 1,000×1,000 pixel screenshot to Claude. You were charged for 1,334 tokens.
You didn't write 1,334 tokens. You sent one image. But the model doesn't see images the way a human does. It converts them into token representations first, and that conversion is expensive in ways that the words "vision support" don't communicate.
How Anthropic Calculates Image Tokens ¶
Claude's vision pricing follows a specific formula documented in Anthropic's API reference:
``
tokens = (width × height) / 750
``
This is applied after the image is resized to fit within the model's maximum dimensions. For standard Claude models (Claude 3.5, Claude 3.7), the maximum is approximately 1,568 pixels on the long edge before resizing kicks in. For Claude Opus 4.7, the limit is higher. The model supports images up to approximately 4,784 tokens per image.
The official Anthropic documentation provides a reference table:
| Image size | Token count |
|---|---|
| 200×200 px | ~53 tokens |
| 1000×1000 px | ~1,334 tokens |
| 1092×1092 px | ~1,590 tokens |
These are not edge cases. A standard 1080p screenshot (the kind a browser automation agent might capture to verify a UI state) costs nearly 2,800 tokens on Opus 4.7. At Claude's current output pricing, that's the token equivalent of a paragraph of reasoning, paid just for the image ingestion.
Where Vision Token Costs Compound ¶
Single images at small scale are not the issue. The issue is systems that use vision as a routine part of their workflow:
Browser automation agents that take screenshots to verify navigation steps. A 20-step workflow with one screenshot per step sends 20 images. At 1,334 tokens each for 1000×1000 images, that's 26,680 tokens per run, before any text context.
Document processing pipelines that convert PDFs to images before sending to the model. A 10-page PDF rendered at standard resolution can easily exceed 15,000 vision tokens, more than many systems' entire text context budget.
UI testing systems that use vision models to verify component rendering. Continuous integration systems running 50 test cases per commit, each with 3–5 screenshots, accumulate vision token costs that dwarf the text token costs in the same pipeline.
Multi-modal RAG systems that index product images alongside text. Retrieval returns N images plus text chunks. Each image in the retrieved set costs 1,000–1,500 tokens before the model reads the actual query.
Strategies for Reducing Vision Token Spend ¶
Resize before sending. The formula is linear in pixel count. A 1000×1000 image at 1,334 tokens becomes ~334 tokens at 500×500. If your task doesn't require fine-grained detail (verifying that a button exists, checking that a form rendered, confirming a layout didn't break), resizing to 500×500 or smaller cuts costs by 75% with minimal accuracy impact.
Crop to the region of interest. Sending a full-page screenshot when you care about a 200×300 pixel UI component wastes everything outside that region. Cropping to the component before sending reduces vision tokens proportionally.
Use text extraction as a pre-filter. Many vision tasks are actually text extraction tasks. If your image contains structured text (a table, a form, a code block), extracting the text first (via OCR or a lighter vision call) and sending the extracted text to the main model is dramatically cheaper than sending the image directly.
Cache vision representations for repeated images. Anthropic's prompt caching applies to image tokens the same way it applies to text tokens. If your system sends the same base screenshot repeatedly with different questions, prompt caching eliminates the repeated image token cost after the first call.
Compress your text context to leave room. Vision tokens and text tokens share the same context budget. If your text context is bloated (large system prompts, accumulated conversation history, verbose few-shot examples), you're competing for budget against your images. Compressing text context gives vision tokens more headroom without hitting the model's limits.
The Math on a Typical Agent Workflow ¶
Assume a browser agent running a 15-step task with:
Total per run: roughly 130,000–150,000 tokens. Vision is accounting for ~30% of that.
Shrink each screenshot to 800×600 and apply gotcontext.ai compression to the system prompt and conversation history:
New total: ~13,100–16,100 tokens per run. That's an 89% reduction. Same task, same model, no logic changes.
Vision tokens aren't optional if your application uses images. But how many you spend per image, and how much text context competes for the same budget, is entirely within your control.
Compress your text context and give your vision budget room to breathe →
Cite this¶
Researchers, analysts, or journalists referencing this post can use either format below — both are copyable.
@misc{vision-tokens-hidden-cost-multimodal-2026,
title = {Vision Tokens Are Expensive and Nobody Reads the Pricing Page},
author = {James Hollingsworth},
year = {2026},
month = {May},
url = {https://www.gotcontext.ai/blog/vision-tokens-hidden-cost-multimodal},
note = {gotcontext.ai engineering blog.},
}James Hollingsworth. (2026, May 8). Vision Tokens Are Expensive and Nobody Reads the Pricing Page. gotcontext.ai. Retrieved from https://www.gotcontext.ai/blog/vision-tokens-hidden-cost-multimodal.