
The Batch API Playbook: 50% Off for Workloads That Can Wait

OpenAI charges half price for requests processed within 24 hours. Nightly reports, bulk classification, and offline enrichment all qualify. The engineering cost: two API calls and a .jsonl file. Stack batch pricing with context compression and you are at roughly one-sixth of naive real-time cost.

James Hollingsworth (Contributor) · 5 min read · ~511 words

The cheapest optimization you are not using

OpenAI's Batch API charges 50% of standard pricing for requests processed within 24 hours. Not 5% cheaper. Not 15% for high-volume customers. Half price, available to every account, today.

The engineering cost to adopt it: two API calls and a .jsonl file.

If you are paying $10,000/month on GPT-4o for document classification, nightly report generation, or bulk data enrichment, the path to $5,000/month is about half a day of work.

What qualifies

Batch works for any workload where you can tolerate up to 24 hours of latency. The practical categories:

  • Nightly reports. Summarize the day's activity, generate the weekly digest, produce the Monday standup brief.
  • Document indexing. Extract entities, classify documents, generate embeddings; all of this is batch-safe.
  • Evaluation runs. LLM-as-judge evals on your test set do not need real-time responses.
  • Data enrichment. Product description generation, SEO metadata, schema extraction from raw documents.
  • Offline analysis. Sentiment analysis on customer support tickets, classification of inbound emails, categorization of log messages.

The only disqualifying criterion: a user is waiting. If a human expects a response in under a minute, it is not a batch workload.

How it works

Three steps: build a .jsonl file, upload it, poll for completion.

Step 1: Build your request file

```python
import json

# `documents` is assumed to be a list of strings you have already loaded.
requests = [
    {
        "custom_id": f"req-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o",
            "messages": [
                {"role": "system", "content": "Classify the sentiment: positive, neutral, or negative."},
                {"role": "user", "content": document},
            ],
            "max_tokens": 10,
        },
    }
    for i, document in enumerate(documents)
]

# One JSON object per line.
with open("batch_requests.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")
```

Step 2: Submit the batch

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the request file; the with-block closes the handle afterwards.
with open("batch_requests.jsonl", "rb") as f:
    batch_input_file = client.files.create(file=f, purpose="batch")

batch = client.batches.create(
    input_file_id=batch_input_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

print(f"Batch ID: {batch.id}")
```

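Step 3: Retrieve results

```python
import json
import time

# Poll until the batch reaches any terminal state, not only "completed",
# so a failed or expired batch does not leave the loop running forever.
terminal_states = {"completed", "failed", "expired", "cancelled"}
while True:
    batch = client.batches.retrieve(batch.id)
    if batch.status in terminal_states:
        break
    time.sleep(60)

if batch.status != "completed":
    raise RuntimeError(f"Batch ended with status: {batch.status}")

content = client.files.content(batch.output_file_id)
results = [json.loads(line) for line in content.text.strip().split("\n")]
```

Each result line carries the custom_id of its request, so you match results by ID rather than by position. Failed requests are written to a separate error file (batch.error_file_id), so you can resubmit only the failures.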
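
As a minimal sketch of that matching step, assuming each output line has a custom_id, an error field, and a response object holding status_code and a chat-completion body, a helper like the one below separates usable replies from requests worth resubmitting. The name index_results and the sample lines are illustrative only; they are not part of the OpenAI SDK, and the samples are hand-written rather than real API output.

```python
import json


def index_results(output_jsonl: str) -> tuple[dict[str, str], list[str]]:
    """Return ({custom_id: reply text}, [custom_ids that failed])."""
    replies: dict[str, str] = {}
    failed: list[str] = []
    for line in output_jsonl.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        response = record.get("response") or {}
        # Treat an explicit error or a non-200 status as a failure.
        if record.get("error") or response.get("status_code") != 200:
            failed.append(record["custom_id"])
            continue
        message = response["body"]["choices"][0]["message"]
        replies[record["custom_id"]] = message["content"]
    return replies, failed


# Hand-written sample lines in the assumed shape (not real API output).
sample = "\n".join([
    json.dumps({
        "custom_id": "req-1",
        "error": None,
        "response": {
            "status_code": 200,
            "body": {"choices": [{"message": {"role": "assistant", "content": "positive"}}]},
        },
    }),
    json.dumps({
        "custom_id": "req-0",
        "error": None,
        "response": {"status_code": 429, "body": {}},
    }),
])

replies, failed = index_results(sample)
print(replies)  # {'req-1': 'positive'}
print(failed)   # ['req-0']
```

The failed list can then be used to filter the original request list before writing a smaller .jsonl file for resubmission.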

The math

Say you run nightly document classification on 10,000 documents, averaging 500 input tokens and 10 output tokens each.

Standard pricing (gpt-4o as of May 2026):

  • Input: 10,000 x 500 = 5M tokens at $2.50/M = $12.50
  • Output: 10,000 x 10 = 100K tokens at $10/M = $1.00
  • Nightly cost: $13.50

Batch pricing (50% off):

  • Nightly cost: $6.75
  • Annual savings: $6.75 x 365 = ~$2,464

For 100,000 documents/night, that is about $24,600/year for adopting an asynchronous queue you already effectively have.
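
To rerun this arithmetic with your own volumes, a small calculator is enough. This is a sketch: the function name nightly_cost and its parameters are made up for illustration, and the prices are the ones quoted in this post, so substitute current rates before relying on the output.

```python
def nightly_cost(docs, in_tokens, out_tokens, in_price_per_m, out_price_per_m, discount=0.0):
    """Cost in dollars for one nightly run; prices are per million tokens."""
    input_cost = docs * in_tokens / 1_000_000 * in_price_per_m
    output_cost = docs * out_tokens / 1_000_000 * out_price_per_m
    return (input_cost + output_cost) * (1 - discount)


# Figures from the example above: 10,000 documents, 500 input and 10 output tokens each.
standard = nightly_cost(10_000, 500, 10, 2.50, 10.00)
batch = nightly_cost(10_000, 500, 10, 2.50, 10.00, discount=0.5)
annual_savings = (standard - batch) * 365

print(f"standard: ${standard:.2f}")              # standard: $13.50
print(f"batch: ${batch:.2f}")                    # batch: $6.75
print(f"annual savings: ${annual_savings:,.2f}") # annual savings: $2,463.75
```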

Compound it with context compression

Batch discount and context compression are orthogonal. A 500-token document that compresses to 150 tokens before it hits the API drops your input cost by 70%. Combine the two:

  • Uncompressed, real-time: 100% of cost
  • Compressed, real-time: ~32% of cost
  • Uncompressed, batch: ~50% of cost
  • Compressed, batch: ~16% of cost

For offline workloads, compressed batch processing costs roughly one-sixth of naive real-time inference. The engineering effort is a .jsonl formatter and a compression call.
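
The exact percentages depend on how much of your spend goes to output tokens, since compression only shrinks the input. The sketch below makes that explicit under three assumptions that are mine rather than the post's: compression leaves output tokens untouched, the batch discount applies to the whole bill, and output_share is the fraction of uncompressed real-time spend that goes to output. With an output share of about 3% it gives roughly the 32% and 16% figures above; for the classification example earlier (output is $1.00 of $13.50) it gives about 35% and 18%.

```python
def relative_cost(output_share, compression_ratio=0.30, batch_discount=0.50):
    """Cost as a fraction of uncompressed real-time cost.

    output_share: fraction of baseline spend on output tokens (assumed input).
    compression_ratio: compressed input size / original size (150 / 500 = 0.30).
    """
    # Input spend shrinks with compression; output spend is unchanged.
    compressed = (1 - output_share) * compression_ratio + output_share
    return {
        "compressed_realtime": compressed,
        "uncompressed_batch": 1 - batch_discount,
        "compressed_batch": compressed * (1 - batch_discount),
    }


# 3% is a hypothetical output share; 1.00 / 13.50 is the example workload's share.
for share in (0.03, 1.00 / 13.50):
    costs = relative_cost(share)
    summary = ", ".join(f"{name}={value:.1%}" for name, value in costs.items())
    print(f"output share {share:.1%}: {summary}")
```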

What to watch for

  • Per-batch limits still apply. Very large batches (>50K requests) need to be split; the API will return an error that states the limit.
  • Batch quotas exist per organization. Your first batches will hit a lower limit that increases with usage history.
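  • The discount covers output tokens too. Batch pricing halves the per-token rate for both input and output, which is why the nightly cost above drops from $13.50 to $6.75. Verify your token mix against the current batch price list before you forecast.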
  • The 24-hour window is a ceiling, not a typical completion time. Most batches complete in 1-4 hours, but that is not guaranteed. If a downstream job depends on the results, schedule it against the full 24 hours.
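
For the per-batch limit, one way to stay under it is to slice the request list before writing each .jsonl file. This is a sketch only: split_into_batches is a hypothetical helper, and the 50,000 cap is the figure from the list above, so confirm it (and any file-size limit) against your own account before depending on it.

```python
from typing import Iterator

# Figure taken from the list above; treat it as an assumption to verify.
MAX_REQUESTS_PER_BATCH = 50_000


def split_into_batches(
    requests: list[dict], max_requests: int = MAX_REQUESTS_PER_BATCH
) -> Iterator[list[dict]]:
    """Yield consecutive slices of `requests`, each small enough for one batch."""
    if max_requests < 1:
        raise ValueError("max_requests must be at least 1")
    for start in range(0, len(requests), max_requests):
        yield requests[start:start + max_requests]


# Example: 120,000 placeholder requests become three batches.
placeholder_requests = [{"custom_id": f"req-{i}"} for i in range(120_000)]
sizes = [len(chunk) for chunk in split_into_batches(placeholder_requests)]
print(sizes)  # [50000, 50000, 20000]
```

Each slice would then go through the same write, upload, and submit steps as a single batch, with custom_id values kept unique across slices so results can be merged afterwards.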

Cite this

Researchers, analysts, or journalists referencing this post can use either format below.

BibTeX

```bibtex
@misc{batch-api-50-percent-off-async-workloads-2026,
  title  = {The Batch API Playbook: 50\% Off for Workloads That Can Wait},
  author = {James Hollingsworth},
  year   = {2026},
  month  = {May},
  url    = {https://www.gotcontext.ai/blog/batch-api-50-percent-off-async-workloads},
  note   = {gotcontext.ai engineering blog.},
}
```

APA

```text
James Hollingsworth. (2026, May 8). The Batch API Playbook: 50% Off for Workloads That Can Wait. gotcontext.ai. Retrieved from https://www.gotcontext.ai/blog/batch-api-50-percent-off-async-workloads.
```
