Guide

How LLM API pricing actually works (and where it bites you)

Output tokens cost 4-5x input. Cached prompts cost 10x less. Most billing surprises come from misunderstanding these two numbers.

Every major LLM API — OpenAI, Anthropic, Google, Meta-via- cloud — charges by the token. The marketing pages quote prices like “$5 per million input tokens, $15 per million output tokens.” The math looks simple. Five places where the actual bill diverges from the simple estimate:

1. Output costs 4-5× input

Every modern frontier model charges meaningfully more for output than input. Typical ratios at the time of writing: OpenAI GPT-4 family ~5x, Claude family ~3-5x, Gemini family ~4x. The economics is straightforward: input tokens are consumed by the model’s context-processing pass once; output tokens are generated one at a time through dozens or hundreds of forward passes.

Practical implication: long-context retrieval-augmented applications (where you stuff a lot of context in and ask for a short answer) are cheaper per useful answer than long-generation applications (where the model writes pages). If your bill is high and you’re generating little output, the input bloat is the culprit. If you’re generating a lot of output, focus on shorter outputs first.

2. Cached prompts are radically cheaper

OpenAI and Anthropic both offer prompt caching: input tokens that match a recently-seen prefix bill at 10-90% off regular input pricing. The cache typically lives 5-10 minutes. Cache hit rates depend on how predictable your prompts are.

Practical implication: design prompts so the prefix is stable across calls. Put the system instructions and any static context at the top; put the user’s per-request variation at the bottom. A chatbot with a consistent system prompt can see input bills drop 70-90% from cache hits across a multi-turn conversation.

3. Batch APIs are 50% off

OpenAI’s batch endpoint and Anthropic’s message- batching API both offer 50% off list pricing in exchange for async delivery (typically within 24 hours). For workloads that don’t need immediate responses — overnight data processing, content generation pipelines, embedding backfills — switching to batch is free 50% savings.

4. Tier-down models on retrieval steps

A common pattern in production AI: a chain of model calls where the first step is “decide what to retrieve” and the second step is “answer using what was retrieved.” The decision step rarely needs the smartest available model — GPT-4o-mini or Claude Haiku is usually plenty. Reserving the frontier-tier model for the final answer step typically cuts pipeline cost 80-90% with minimal quality impact.

5. Estimate output length aggressively

The single biggest source of billing surprises: you assume the model will produce a short answer; it produces a long one. A “max_tokens: 4096” safety limit means you might pay for 4096 output tokens per call. Most APIs bill what was generated, not what was requested, but the habit of allowing 4096 sets the budget assumption wrong.

Practical: set max_tokensto roughly 1.5× the length you actually expect, not the maximum you’d tolerate. Lower max_tokens limits also push the model to produce shorter responses (it adapts based on the budget signal). The savings compound.

The estimation tool

Our AI token counter estimates input tokens and computes per-call cost across the major model families. It uses character-ratio heuristics (within ~10% accuracy for English; less accurate for code and non-Latin scripts) so the estimate is rough but useful for sizing decisions. For exact cost forecasting, use the vendor’s official tokeniser library.

The honest summary

At small scale (a few thousand calls a month) LLM pricing is so cheap nothing here matters. At medium-to-large scale, the gap between the naive cost estimate and the actual bill can easily be 5-10× when you account for output bloat, cache misses, and unnecessarily-using-the-frontier-model. Each of the five levers above can independently save 50-90% on specific call patterns. Audit your prompt patterns once, set up caching where the structure allows, and the bill becomes predictable.

Published May 14, 2026