Guide
How LLM API pricing actually works (and where it bites you)
Output tokens cost 4-5x input. Cached prompts cost 10x less. Most billing surprises come from misunderstanding these two numbers.
By Buğra SözeriPublished Updated
Every major LLM API — OpenAI, Anthropic, Google, Meta-via- cloud — charges by the token. The marketing pages quote prices like “$5 per million input tokens, $15 per million output tokens.” The math looks simple. Five places where the actual bill diverges from the simple estimate:
1. Output costs 4-5× input
Every modern frontier model charges meaningfully more for output than input. Typical ratios at the time of writing: OpenAI GPT-4 family ~5x, Claude family ~3-5x, Gemini family ~4x. The economics is straightforward: input tokens are consumed by the model’s context-processing pass once; output tokens are generated one at a time through dozens or hundreds of forward passes.
Practical implication: long-context retrieval-augmented applications (where you stuff a lot of context in and ask for a short answer) are cheaper per useful answer than long-generation applications (where the model writes pages). If your bill is high and you’re generating little output, the input bloat is the culprit. If you’re generating a lot of output, focus on shorter outputs first.
2. Cached prompts are radically cheaper
OpenAI and Anthropic both offer prompt caching: input tokens that match a recently-seen prefix bill at 10-90% off regular input pricing. The cache typically lives 5-10 minutes. Cache hit rates depend on how predictable your prompts are.
Practical implication: design prompts so the prefix is stable across calls. Put the system instructions and any static context at the top; put the user’s per-request variation at the bottom. A chatbot with a consistent system prompt can see input bills drop 70-90% from cache hits across a multi-turn conversation.
3. Batch APIs are 50% off
OpenAI’s batch endpoint and Anthropic’s message- batching API both offer 50% off list pricing in exchange for async delivery (typically within 24 hours). For workloads that don’t need immediate responses — overnight data processing, content generation pipelines, embedding backfills — switching to batch is free 50% savings.
4. Tier-down models on retrieval steps
A common pattern in production AI: a chain of model calls where the first step is “decide what to retrieve” and the second step is “answer using what was retrieved.” The decision step rarely needs the smartest available model — GPT-4o-mini or Claude Haiku is usually plenty. Reserving the frontier-tier model for the final answer step typically cuts pipeline cost 80-90% with minimal quality impact.
5. Estimate output length aggressively
The single biggest source of billing surprises: you assume the model will produce a short answer; it produces a long one. A “max_tokens: 4096” safety limit means you might pay for 4096 output tokens per call. Most APIs bill what was generated, not what was requested, but the habit of allowing 4096 sets the budget assumption wrong.
Practical: set max_tokensto roughly 1.5× the length you actually expect, not the maximum you’d tolerate. Lower max_tokens limits also push the model to produce shorter responses (it adapts based on the budget signal). The savings compound.
The estimation tool
Our AI token counter estimates input tokens and computes per-call cost across the major model families. It uses character-ratio heuristics (within ~10% accuracy for English; less accurate for code and non-Latin scripts) so the estimate is rough but useful for sizing decisions. For exact cost forecasting, use the vendor’s official tokeniser library.
Worked example: a customer-support chatbot at 100K conversations/month
Concrete pipeline. Each user turn includes a 3,500-token system prompt (product docs, tone guidelines, refusal rules), an average 200-token user message, and an average 400-token model response. Conversations average 4 turns. Per conversation:
- Input per turn: 3,500 (system) + accumulated history + 200 (new user) ≈ 3,700 first turn, growing to ~5,800 by turn 4. Average per turn ~4,750.
- Total input/conversation: 4 × 4,750 = 19,000 input tokens
- Total output/conversation: 4 × 400 = 1,600 output tokens
Naive cost with Claude Sonnet 4 ($3 per million input, $15 per million output) at 100K conversations:
- Input: 100,000 × 19,000 × $3 / 1M = $5,700
- Output: 100,000 × 1,600 × $15 / 1M = $2,400
- Total: $8,100/month
Now apply prompt caching. The 3,500-token system prompt is identical across all 100K × 4 = 400K turns. With Anthropic’s cache (cache reads at $0.30/M, 90% discount on cached input), only the user messages and growing history pay full price. Cached portion: 400K × 3,500 × $0.30 / 1M = $420. Uncached: 400K × ~1,250 × $3 / 1M = $1,500.
- New input cost: $420 + $1,500 = $1,920 (down from $5,700)
- Output unchanged: $2,400
- New total: $4,320/month
47% reduction with one config change. Switch the easy 30% of conversations (those that don’t need the full model) to Haiku 4.5 at $1/$5 per million, and the bill drops another ~$1,000 to roughly $3,300. The total saving — 59% — comes from caching and tiering, neither of which is automatic.
Common mistakes that inflate the bill
- Putting the user message at the top of the prompt. Cache keys hash from the prefix. If your prompt structure is
[user variation] [static system]the cache never hits. Always put the static parts first. - Setting
max_tokensto the model ceiling. Most APIs bill the actual generation, not the limit — but the model uses the limit as a length signal. Settingmax_tokens: 4096when you wanted a 200-token answer produces longer answers and a bigger bill. - Embedding every document repeatedly. Retrieval pipelines that re-embed the same corpus on every query are paying for embeddings they already have. Cache embeddings in your vector store; bill should be near-zero after the initial backfill.
- Using GPT-4 / Opus / Gemini Pro for classification. A 5-class intent classifier almost never needs a frontier model. Haiku, GPT-4o-mini, or Gemini Flash run 10-30× cheaper and match accuracy on tasks under ~10 output tokens.
- Streaming when you don’t need to. Streaming is free of additional charge, but each token is paid the moment it’s generated. If you abort mid-stream due to a downstream timeout, you still owe for what was produced. Set hard per-request timeouts in your client.
When this guide does NOT apply
- Self-hosted / open-weights models. Llama, Mistral, Qwen on your own GPUs convert per-token API cost into GPU-hour cost. The economics is dominated by utilisation (a $4/hr H100 wasted on idle time still bills) and not by tokens. The right cost model is GPU hours × duty cycle, not tokens × rate.
- Fine-tuned and dedicated-capacity deployments.OpenAI’s Provisioned Throughput Units, Anthropic’s reserved capacity, and Google’s “Provisioned Throughput” all bill flat per-month for guaranteed capacity. At high QPS this is cheaper than per-token; at low QPS much more expensive. Break-even is roughly the point where your per-token bill would exceed 60% of the reserved-capacity SKU.
- Embedding-only workloads.Embedding models are 100-1000× cheaper than chat completion (typically $0.02-0.13 per million tokens for text-embedding-3-small or voyage-3). The five levers above mostly don’t apply; the bill is dominated by corpus size and embedding frequency.
For working definitions of the units underneath the billing, see our GPT token glossary entry and the context window entry. For a concrete cost-by-model comparison, the LLM cost calculator handles per-vendor rate sheets and the OpenAI prompt-caching docs spell out the cache-eligibility rules for the OpenAI side.
The honest summary
At small scale (a few thousand calls a month) LLM pricing is so cheap nothing here matters. At medium-to-large scale, the gap between the naive cost estimate and the actual bill can easily be 5-10× when you account for output bloat, cache misses, and unnecessarily-using-the-frontier-model. Each of the five levers above can independently save 50-90% on specific call patterns. Audit your prompt patterns once, set up caching where the structure allows, and the bill becomes predictable.
Token-counting tools and their accuracy
Each major vendor uses a different tokeniser. Counting tokens in advance — for budgeting or quota management — requires the matching library:
- OpenAI tiktoken. The canonical tokeniser for GPT-4 and earlier OpenAI models. BPE variant, ~4 chars/token for English, ~2 chars/token for code. Available as a Python and JS library.
- Anthropic tokeniser.Claude uses a proprietary BPE tokeniser. Counts are exposed through the API’s response metadata; the SDK now ships a client-side counter for budgeting.
- Google sentencepiece (Gemini). Different BPE variant. Counts via the Gemini API’s
count_tokensendpoint. - Character-ratio heuristics. For rough estimation across vendors, English text averages 4 chars/token; code averages 2.5; non-Latin scripts (Chinese, Japanese, Arabic) average 1-2. Use ÷4 for a quick estimate, then verify with the vendor tokeniser before billing.
Per-million-token rate sheet (early 2026)
Vendor pricing as of writing. Rates change frequently; always confirm against the vendor’s pricing page before committing to a budget.
| Model | Input ($/M tok) | Output ($/M tok) | Cached input |
|---|---|---|---|
| OpenAI GPT-4.1 | $2.00 | $8.00 | $0.50 (75% off) |
| OpenAI GPT-4.1 mini | $0.40 | $1.60 | $0.10 |
| OpenAI o1 | $15.00 | $60.00 | $7.50 |
| Anthropic Claude Opus 4 | $15.00 | $75.00 | $1.50 (90% off) |
| Anthropic Claude Sonnet 4 | $3.00 | $15.00 | $0.30 |
| Anthropic Claude Haiku 4.5 | $1.00 | $5.00 | $0.10 |
| Google Gemini 2.5 Pro | $1.25 | $10.00 | $0.31 |
| Google Gemini 2.5 Flash | $0.30 | $2.50 | $0.075 |
Three patterns to notice. First, the output-to-input ratio is 4-5× across nearly every vendor — this is a market-wide architectural cost, not a vendor strategy. Second, the gap between the cheapest and most expensive frontier model is 12-15× — for tasks where Haiku or Gemini Flash work, the savings vs Opus are dramatic. Third, cached input typically costs 10-25% of regular input — caching is the single highest-leverage optimisation on input-heavy workloads.
For a worked rate-comparison across your actual prompt shapes, the LLM cost calculator takes your input/output token estimates and computes monthly bills across all major vendors at once.
Frequently asked questions
- Why do LLM APIs charge more for output tokens than input tokens?
- Input tokens are processed in a single parallel pass through the model; output tokens are generated one at a time through dozens or hundreds of sequential forward passes. The compute cost per output token is 4–5× higher, which is reflected in pricing across OpenAI, Anthropic, and Google.
- What is prompt caching and how much can it reduce my LLM API bill?
- Prompt caching stores the KV cache for a repeated prompt prefix and charges 10–25% of normal input rates on cache hits. A chatbot with a 3,500-token system prompt repeated across 400,000 turns can reduce input costs by 70–90% — the single highest-leverage optimization for input-heavy workloads.
- How much does the OpenAI or Anthropic batch API discount?
- Both OpenAI's batch endpoint and Anthropic's message-batching API offer 50% off list pricing in exchange for async delivery within 24 hours. For data-processing pipelines and content-generation jobs that don't need immediate responses, this is free cost savings.
- What is the rough cost of running a customer-support chatbot on Claude Sonnet 4 at 100,000 conversations per month?
- Without optimization: approximately $8,100/month. With prompt caching on the static system prompt: approximately $4,320/month (47% reduction). Adding model tiering (routing simpler conversations to Haiku) reduces the bill further to around $3,300/month — a 59% total saving.
- How many tokens are in a typical English word?
- Roughly 1.3 tokens per word (about 4 characters per token) for English prose. Code averages about 2.5 characters per token. Non-Latin scripts like Chinese and Japanese average 1–2 characters per token and are proportionally more expensive to process.
Sources & references
Authoritative references cited by this piece. Verified by Buğra Sözeri on the dates shown and re-checked at every deploy.
- OpenAI — API pricing — Authoritative per-million-token rates for GPT-4o, GPT-4o-mini, o1, and embeddings used in the cost models(as of )
- Anthropic — Claude pricing — Reference for Claude model rates including prompt-caching discounts discussed in the article(as of )
- Google — Gemini API pricing — Reference for Google Gemini per-token pricing cited in the cross-vendor comparison(as of )
- OpenAI — tiktoken tokenizer — Canonical reference for the BPE token-counting model underlying every dollar figure(as of )
Related
Published May 14, 2026 · Last reviewed May 31, 2026