The Batch API Playbook: 50% Off for Workloads That Can Wait

The cheapest optimization you are not using ¶

OpenAI's Batch API charges 50% of standard pricing for requests processed within 24 hours. Not 5% cheaper. Not 15% for high-volume customers. Half price, available to every account, today.

The engineering cost to adopt it: two API calls and a .jsonl file.

If you are spending $10,000/month on GPT-4o for document classification, nightly report generation, or bulk data enrichment, the path to $5,000/month is a half-day of work.

What qualifies ¶

Batch works for any workload where you can tolerate up to 24 hours of latency. The practical categories:

Nightly reports. Summarize the day's activity, generate the weekly digest, produce the Monday standup brief.

Document indexing. Extract entities, classify documents, generate embeddings; all of this is batch-safe.

Evaluation runs. LLM-as-judge evals on your test set do not need real-time responses.

Data enrichment. Product description generation, SEO metadata, schema extraction from raw documents.

Offline analysis. Sentiment analysis on customer support tickets, classification of inbound emails, categorization of log messages.

The disqualifying criterion is simple: the user is waiting. If a human expects a response in under a minute, it is not a batch workload.

How it works ¶

Three steps: build a .jsonl file, submit it as a batch, then poll for completion and download the results.

Step 1: Build your request file

```python
import json

# `documents` is assumed to be a list of strings loaded earlier.
requests = [
    {
        "custom_id": f"req-{i}",  # used later to match results to inputs
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o",
            "messages": [
                {"role": "system", "content": "Classify the sentiment: positive, neutral, or negative."},
                {"role": "user", "content": document},
            ],
            "max_tokens": 10,
        },
    }
    for i, document in enumerate(documents)
]

# One JSON object per line, as the .jsonl format requires.
with open("batch_requests.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")
```

Step 2: Submit the batch

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the request file; the context manager closes the handle afterwards.
with open("batch_requests.jsonl", "rb") as f:
    batch_input_file = client.files.create(file=f, purpose="batch")

batch = client.batches.create(
    input_file_id=batch_input_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

print(f"Batch ID: {batch.id}")
```

Step 3: Retrieve results

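```python
import json
import time

# Statuses after which the batch makes no further progress.
TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}

while True:
    batch = client.batches.retrieve(batch.id)
    if batch.status in TERMINAL_STATUSES:
        break
    time.sleep(60)

# Stop here rather than loop forever on a batch that will never complete.
if batch.status != "completed":
    raise RuntimeError(f"Batch ended with status: {batch.status}")

content = client.files.content(batch.output_file_id)
results = [json.loads(line) for line in content.text.strip().split("\n")]
```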

Each result maps back to your custom_id. Failed requests are in a separate error file; you can resubmit only the failures.
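
The mapping and the resubmit step can be sketched without touching the network. This is a minimal sketch, assuming each output or error line is a JSON object with `custom_id`, `response` (carrying `status_code` and `body`), and `error` fields; check that shape against your own files before relying on it. The helper names are hypothetical, not part of the OpenAI SDK.

```python
import json


def index_by_custom_id(jsonl_text):
    """Map custom_id -> parsed record for every non-empty line."""
    records = {}
    for line in jsonl_text.splitlines():
        if line.strip():
            record = json.loads(line)
            records[record["custom_id"]] = record
    return records


def failed_ids(records):
    """Return custom_ids whose record has an error or a non-200 status."""
    failed = []
    for custom_id, record in records.items():
        response = record.get("response") or {}
        if record.get("error") or response.get("status_code") != 200:
            failed.append(custom_id)
    return failed


def build_retry_lines(original_requests, retry_ids):
    """Serialize only the original requests that need to be resubmitted."""
    wanted = set(retry_ids)
    return [json.dumps(req) for req in original_requests if req["custom_id"] in wanted]
```

Write the lines from `build_retry_lines` to a new .jsonl file and submit it exactly as in Step 2. Keeping the original `custom_id` values means the retried results merge into the same lookup table.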

The math ¶

Say you run nightly document classification on 10,000 documents, averaging 500 input tokens and 10 output tokens each.

Standard pricing (gpt-4o as of May 2026):

Input: 10,000 x 500 = 5M tokens at $2.50/M = $12.50

Output: 10,000 x 10 = 100K tokens at $10/M = $1.00

Nightly cost: $13.50

Batch pricing (50% off):

Nightly cost: $6.75

Annual savings: $6.75 x 365 = ~$2,464

For 100,000 documents/night, that is roughly $24,640/year saved for adopting an asynchronous queue you already effectively have.

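
To rerun the arithmetic with your own volumes, here is a minimal sketch. The prices are the gpt-4o figures quoted above and the function names are hypothetical; swap in your own model's rates. It assumes the batch discount applies to the whole bill.

```python
def token_cost_usd(tokens, price_per_million):
    """Cost in USD for a token count at a per-million-token price."""
    return tokens / 1_000_000 * price_per_million


def nightly_cost(docs, input_tokens, output_tokens,
                 input_price=2.50, output_price=10.00, discount=0.0):
    """Nightly spend for `docs` requests, with an optional fractional discount."""
    full_price = (
        token_cost_usd(docs * input_tokens, input_price)
        + token_cost_usd(docs * output_tokens, output_price)
    )
    return full_price * (1.0 - discount)


standard = nightly_cost(10_000, 500, 10)
batch = nightly_cost(10_000, 500, 10, discount=0.5)
annual_savings = (standard - batch) * 365

print(f"standard: ${standard:.2f}/night")
print(f"batch:    ${batch:.2f}/night")
print(f"savings:  ${annual_savings:,.2f}/year")
```

Because cost scales linearly with document count, changing the first argument to 100_000 multiplies every figure by ten.
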

Compound it with context compression ¶

Batch discount and context compression are orthogonal. A 500-token document that compresses to 150 tokens before it hits the API drops your input cost by 70%; output tokens are unaffected. For the token mix in the example above, combining the two gives:

Uncompressed, real-time: 100% of cost

Compressed, real-time: ~35% of cost

Uncompressed, batch: 50% of cost

Compressed, batch: ~18% of cost

For offline workloads, compressed batch processing costs roughly one-sixth of naive real-time inference. The engineering effort is a .jsonl formatter and a compression call.
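
The combined effect depends on how much of your bill is input tokens, so it is worth computing for your own mix. This is a minimal sketch, assuming compression shrinks only input tokens and the batch discount applies to the whole bill; `relative_cost` is a hypothetical helper. With the 500-input / 10-output mix from the math section it gives about 35% and 18%, and a workload with a smaller output share lands nearer 30% and 15%.

```python
def relative_cost(input_share, compression_ratio=1.0, batch_discount=0.0):
    """Cost as a fraction of uncompressed real-time spend.

    input_share: fraction of the baseline bill that comes from input tokens.
    compression_ratio: compressed input size divided by original input size.
    batch_discount: fraction taken off the whole bill by batch pricing.
    """
    output_share = 1.0 - input_share
    return (input_share * compression_ratio + output_share) * (1.0 - batch_discount)


# Token mix from the worked example: $12.50 input and $1.00 output per night.
input_share = 12.50 / 13.50
ratio = 150 / 500  # 500 tokens compressed to 150

scenarios = {
    "uncompressed, real-time": relative_cost(input_share),
    "compressed, real-time": relative_cost(input_share, compression_ratio=ratio),
    "uncompressed, batch": relative_cost(input_share, batch_discount=0.5),
    "compressed, batch": relative_cost(input_share, compression_ratio=ratio, batch_discount=0.5),
}

for name, share in scenarios.items():
    print(f"{name}: {share:.1%}")
```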

What to watch for ¶

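Size limits apply per batch. A single batch accepts at most 50,000 requests and an input file of up to 200 MB, so larger workloads need to be split across several batches. The API rejects an oversized batch with an error that states the limit.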

Batch quotas are set per organization. New accounts start with a lower limit on queued work, and that limit rises with usage history.

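The discount covers both input and output tokens. Batch pricing halves the per-token rate on each, which is why the nightly cost above drops from $13.50 to exactly $6.75. Still verify your token mix, because it determines how much any further input-side optimization can save.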

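The 24-hour window is a ceiling, not the typical latency. Most batches complete in 1-4 hours, but that is not guaranteed. If a downstream job depends on the results, schedule it against the full 24 hours and handle batches that expire before finishing.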
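
For the batch-size caveat above, splitting can happen while you write the request files. This is a minimal sketch, assuming the 50,000-request limit quoted in this post; the file naming scheme and helper names are hypothetical. Each resulting file is uploaded and submitted as its own batch.

```python
import json
from itertools import islice

MAX_REQUESTS_PER_BATCH = 50_000  # limit quoted in the text above


def chunked(items, size=MAX_REQUESTS_PER_BATCH):
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while chunk := list(islice(iterator, size)):
        yield chunk


def write_batch_files(requests, prefix="batch_requests", size=MAX_REQUESTS_PER_BATCH):
    """Write one .jsonl file per chunk and return the file paths."""
    paths = []
    for index, chunk in enumerate(chunked(requests, size)):
        path = f"{prefix}_{index:03d}.jsonl"
        with open(path, "w") as f:
            for req in chunk:
                f.write(json.dumps(req) + "\n")
        paths.append(path)
    return paths
```

Keep `custom_id` values unique across all chunks, not just within one file, so results from different batches can be merged without collisions.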


Cite this ¶

Researchers, analysts, or journalists referencing this post can use either format below.

BibTeX

@misc{batch-api-50-percent-off-async-workloads-2026,
  title  = {The Batch API Playbook: 50% Off for Workloads That Can Wait},
  author = {James Hollingsworth},
  year   = {2026},
  month  = {May},
  url    = {https://gotcontext.ai/blog/batch-api-50-percent-off-async-workloads},
  note   = {gotcontext.ai engineering blog.},
}

APA

James Hollingsworth. (2026, May 8). The Batch API Playbook: 50% Off for Workloads That Can Wait. gotcontext.ai. Retrieved from https://gotcontext.ai/blog/batch-api-50-percent-off-async-workloads.
