How GPT tokenization actually works (and why your bill depends on it)
Common words = 1 token. Rare words = 2-5 tokens. Emoji, code symbols, and non-English text burn through tokens fast.
Every API call to a large language model is metered in tokens, not characters or words. Tokens are the unit the model actually consumes after text is encoded by a byte-pair encoding (BPE) tokenizer. Understanding how tokenization works matters because (a) it determines your bill, and (b) it’s why “a 1,000-word document” doesn’t map cleanly to “a 1,000-token cost estimate.”
How BPE tokenizers work
The algorithm in a few lines (a toy training sketch follows the list):
- Start with a vocabulary of single bytes (256 entries).
- Find the most-frequent adjacent pair of vocabulary entries in a large training corpus.
- Add that pair as a new vocabulary entry.
- Repeat until vocabulary reaches the target size (50k-100k entries for modern models).
- To tokenize new text, split it into bytes and apply the learned merges in the order they were learned, until no merge rule applies.
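As a toy illustration, here is a minimal merge-learning loop in Python. It is a sketch of the training step only: it works on characters rather than raw bytes, uses an invented tiny corpus, and is nothing like a production tokenizer.

```python
# Toy BPE merge learning. Real tokenizers (tiktoken, SentencePiece) start
# from 256 raw bytes and train on gigabytes of text; this starts from
# characters and a one-line corpus, purely to show the mechanics.
from collections import Counter

def learn_merges(corpus: str, num_merges: int) -> list[tuple[str, str]]:
    seq = list(corpus)  # start: one symbol per character
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))  # count adjacent pairs
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]   # most frequent adjacent pair
        merges.append(best)
        # Replace every occurrence of the pair with the merged symbol.
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                merged.append(seq[i] + seq[i + 1])
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return merges

print(learn_merges("the cat and the hat and the bat", 5))
# Frequent pairs like ('t', 'h') and then ('th', 'e') get merged first.
```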
Result: common English words like “the”, “and”, “understanding” become a single token each. Rare words like “rambunctious” get split into 2-4 tokens (e.g., “ram”+“bunct”+“ious”). Emoji and non-English characters often take 2-6 tokens each.
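You can check the splits directly with OpenAI’s tiktoken library (`pip install tiktoken`). Exact splits vary with leading whitespace and tokenizer version, so treat the output as illustrative rather than canonical:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 / GPT-3.5 tokenizer

for text in ["the", "understanding", "rambunctious", "✨"]:
    ids = enc.encode(text)
    pieces = [enc.decode_single_token_bytes(i) for i in ids]
    print(f"{text!r}: {len(ids)} token(s) -> {pieces}")
```

The emoji line prints raw byte fragments, which is the point: characters outside the dense part of the vocabulary fall back to multiple byte-level tokens.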
Token counts for common content
Approximate token counts for GPT-4 (cl100k_base tokenizer):
| Content | Approx. token density |
|---|---|
| English prose (this page) | ~1.3 per word |
| News articles | ~1.3 per word |
| Technical / scientific writing | ~1.5 per word |
| Programming code (Python) | ~2 per word |
| JSON / XML (lots of punctuation) | ~2.5 per word |
| Spanish / French / German | ~1.6 per word |
| Russian / Greek (Cyrillic / Greek script) | ~3-4 per word |
| Chinese (simplified) | ~1.5 per character |
| Japanese / Korean | ~1-2 per character |
| Emoji ✨ | ~2-3 each |
The reference figure for English is ~750 words per 1,000 tokens. Non-Latin scripts cost significantly more tokens per character because BPE merges are learned from corpus frequency: scripts underrepresented in the training corpus earn fewer multi-character vocabulary entries, so their characters fall back to one or more byte-level tokens each.
Why the cost gap matters
Per-token pricing means non-English content costs 2-4× more for the same idea. A 1,000-word document costs:
- English: ~1,300 tokens → $0.013 at an illustrative input price of $10 per million tokens.
- Russian: ~3,500 tokens → $0.035 (2.7× more for the same content).
- Chinese: ~1,500 tokens → $0.015 (each character costs ~1.5 tokens, but dense scripts carry more meaning per character, which partly compensates).
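A back-of-envelope check in Python, assuming the same illustrative $10-per-million input rate (substitute your provider’s current price; the file path is hypothetical):

```python
import tiktoken

PRICE_PER_MTOK = 10.00  # illustrative input rate, USD per million tokens
enc = tiktoken.get_encoding("cl100k_base")

def input_cost(text: str) -> float:
    """Count tokens with the real tokenizer, then price them."""
    return len(enc.encode(text)) / 1_000_000 * PRICE_PER_MTOK

# e.g. price a document loaded from disk (hypothetical path):
doc = open("report.txt", encoding="utf-8").read()
print(f"{len(enc.encode(doc))} tokens, ${input_cost(doc):.4f}")
```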
For a translation business or a multilingual support system, the per-language cost asymmetry compounds quickly. Anthropic, OpenAI, and Google publish per-model token prices, but the actual cost of a piece of content depends on its language and format.
Tokenizer differences across models
Each model family has its own tokenizer:
- OpenAI cl100k_base (GPT-3.5, GPT-4): ~100,000-token vocabulary. The reference modern English tokenizer.
- OpenAI o200k_base (GPT-4o, o-series): 200,000-token vocabulary. Better at non-English and code. A given document needs ~10-15% fewer tokens than cl100k.
- Anthropic Claude tokenizer: proprietary. Approximately similar density to cl100k for English; differs measurably for code and non-English. Anthropic publishes a token counting endpoint to estimate before submitting.
- Google Gemini: uses SentencePiece. Roughly comparable density to cl100k.
Implication: the same prompt sent to GPT-4o vs Claude vs Gemini doesn’t produce identical token counts. Budgeting across providers requires per-provider token estimation, not a single “1 word ≈ 1.3 tokens” rule.
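A quick illustration using the two OpenAI encodings that tiktoken ships (Claude and Gemini counts require their providers’ own counting endpoints; the sample sentence is arbitrary):

```python
import tiktoken

text = "Internationalization is expensive in tokens. 🌍 国际化"

for name in ["cl100k_base", "o200k_base"]:
    enc = tiktoken.get_encoding(name)
    print(f"{name}: {len(enc.encode(text))} tokens")
```

The o200k_base count typically comes out lower, especially for the non-English and emoji portions, reflecting its larger vocabulary.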
Where tokenization affects prompt design
- Long-context costs. A 100k-token context window holding all your documentation is great until you realise that, at the illustrative $10-per-million rate above, every call that fills it costs about $1. Token counts also compound across multi-turn conversations, because each turn resends the full history.
- JSON vs natural language. Asking for JSON output costs ~30-50% more tokens than asking for comparable plain prose: every brace, quote, and repeated key consumes tokens of its own (see the quick comparison after this list).
- Code tasks. Code is roughly 2× denser in tokens than prose. A 200-line file might be 2,000-3,000 tokens. Tooling that includes your whole repo as context adds up fast.
- Non-English languages. 2-4× more tokens for the same content. For multilingual products this is a first-order cost driver.
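To make the JSON point concrete, a minimal comparison (the sample strings are invented; exact counts will vary, but the punctuation overhead is the pattern to expect):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

prose = "Alice is 30 years old and lives in Paris."
as_json = '{"name": "Alice", "age": 30, "city": "Paris"}'

print("prose:", len(enc.encode(prose)), "tokens")
print("json: ", len(enc.encode(as_json)), "tokens")
```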
How to estimate tokens before you pay
- Use a token-count tool. Our AI token counter implements multiple tokenizers and reports the exact count for your input.
- Use the official tokenizer library. OpenAI’s tiktoken (Python), Anthropic’s token-counting API, or hosted token counters. These are the ground truth for billing.
- Rule of thumb. For English prose: 1 word ≈ 1.3 tokens. For code: 1 line ≈ 8-15 tokens. For Chinese: 1 character ≈ 1.5 tokens.
- Budget the output too. Many providers charge more for output than input (typically 3-5× per token). A 2000-token output is more expensive than a 2000-token input.
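Putting input and output pricing together, a small helper along these lines (both rates are illustrative placeholders; check your provider’s price sheet):

```python
def call_cost(input_tokens: int, output_tokens: int,
              in_price: float = 10.0, out_price: float = 30.0) -> float:
    """Cost of one call in USD. Prices are per million tokens; the
    default output rate is 3x the input rate, in line with typical ratios."""
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# A 1,300-token prompt with a 2,000-token reply:
print(f"${call_cost(1_300, 2_000):.3f}")  # -> $0.073
```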
The deeper structural reason for BPE
Modern LLMs see tokens, not characters. The model’s embeddings, attention, and output are all defined over a finite token vocabulary. Character-level models exist but are slower (each character is one input position) and harder to train. Word-level models can’t handle unseen words (out-of-vocabulary problem). BPE is the compromise that won.
For deeper background, see our GPT token glossary entry and the how token pricing works guide.
Sources: Sennrich, Haddow & Birch, “Neural Machine Translation of Rare Words with Subword Units” (ACL 2016, the foundational BPE paper); OpenAI tiktoken repository (2024); Anthropic developer documentation on tokens and context (2024).
Published May 16, 2026