Guide
How GPT tokenization actually works (and why your bill depends on it)
Common words = 1 token. Rare words = 2-5 tokens. Emoji, code symbols, and non-English text spend tokens fast.
By Buğra Sözeri
Every API call to a large language model is metered in tokens, not characters or words. Tokens are the unit the model actually consumes after text is encoded by a byte-pair encoding (BPE) tokenizer. Understanding how tokenization works matters because (a) it determines your bill, and (b) it’s why “a 1000-word document” doesn’t map cleanly to “a 1000-token cost estimate.”
How BPE tokenizers work
The algorithm in a few lines:
- Start with a vocabulary of single bytes (256 entries).
- Find the most-frequent adjacent pair of vocabulary entries in a large training corpus.
- Add that pair as a new vocabulary entry.
- Repeat until vocabulary reaches the target size (50k-100k entries for modern models).
- To tokenize new text, split it into bytes and apply the learned merges in the order they were learned, so the earliest (most frequent) merges take priority.
Result: common English words like “the”, “and”, “understanding” become a single token each. Rare words like “rambunctious” get split into 2-4 tokens (e.g., “ram”+“bunct”+“ious”). Emoji and non-English characters often take 2-6 tokens each.
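The steps above can be sketched in a few dozen lines of Python. This is a toy illustration, not the production tokenizer: the function names (`train_bpe`, `apply_merge`, `encode`) and the tiny corpus are made up for this example, and real tokenizers such as tiktoken also pre-split text with a regular expression before merging, which is left out here.

```python
from collections import Counter


def apply_merge(seq, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged entry."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_bpe(corpus: bytes, num_merges: int):
    """Learn up to `num_merges` merge rules, most frequent adjacent pair first."""
    seq = [bytes([b]) for b in corpus]        # start from single bytes
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))    # count adjacent pairs
        if not pairs:
            break
        best, freq = pairs.most_common(1)[0]
        if freq < 2:                          # nothing left worth merging
            break
        merges.append(best)
        seq = apply_merge(seq, best)
    return merges


def encode(text: str, merges):
    """Tokenize by replaying the merges in the order they were learned."""
    seq = [bytes([b]) for b in text.encode("utf-8")]
    for pair in merges:
        seq = apply_merge(seq, pair)
    return seq


# Hypothetical toy corpus: frequent words end up as few tokens.
corpus = ("the cat and the hat and the bat " * 50).encode("utf-8")
merges = train_bpe(corpus, num_merges=30)
print(encode("the cat", merges))   # a handful of multi-byte pieces
print(encode("zqx", merges))       # unseen bytes stay as single-byte tokens
```

Text that resembles the training corpus compresses into a few tokens, while anything the corpus never contained falls back to one token per byte. That is the same effect, at toy scale, that makes rare words and unusual scripts expensive.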
Token counts for common content
Approximate token counts for GPT-4 (cl100k_base tokenizer):
| Content | Approx. tokens per unit |
|---|---|
| English prose (this page) | ~1.3 per word |
| News articles | ~1.3 per word |
| Technical / scientific writing | ~1.5 per word |
| Programming code (Python) | ~2.0 per word |
| JSON / XML (lots of punctuation) | ~2.5 per word |
| Spanish / French / German | ~1.6 per word |
| Russian / Greek (Cyrillic / Greek script) | ~3-4 per word |
| Chinese (simplified) | ~1.5 per character |
| Japanese / Korean | ~1-2 per character |
| Emoji ✨ | ~2-3 per emoji |
The reference figure for English is ~750 words per 1,000 tokens. Non-Latin scripts cost significantly more tokens per character because they weren’t represented as densely in the training corpus.
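As a rough sketch, the per-word ratios from the table can be turned into a back-of-the-envelope estimator. The dictionary keys and function names below are invented for this example, and the ratios are the approximate figures quoted above, so treat the output as a planning number rather than a billing figure.

```python
# Approximate tokens-per-word ratios copied from the table above (cl100k_base).
TOKENS_PER_WORD = {
    "english_prose": 1.3,
    "technical": 1.5,
    "python_code": 2.0,
    "json_xml": 2.5,
    "spanish_french_german": 1.6,
}


def estimate_tokens(word_count: int, content_type: str = "english_prose") -> int:
    """Rough token estimate from a word count; not a substitute for a real tokenizer."""
    return round(word_count * TOKENS_PER_WORD[content_type])


def words_per_thousand_tokens(content_type: str = "english_prose") -> int:
    """Inverse view: how many words fit in 1,000 tokens at the given ratio."""
    return round(1000 / TOKENS_PER_WORD[content_type])


print(estimate_tokens(1000))                # 1300
print(estimate_tokens(1000, "json_xml"))    # 2500
print(words_per_thousand_tokens())          # 769, close to the ~750 reference figure
```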
Why the cost gap matters
Per-token pricing means non-English content can cost up to 2-4× more for the same idea. A 1,000-word document costs roughly:
- English: ~1,300 tokens → $0.013 at an illustrative input price of $10 per million tokens.
- Russian: ~3,500 tokens → $0.035 (2.7× more for the same content).
- Chinese: ~1,500 tokens → $0.015 (each character costs more tokens, but the script needs fewer characters for the same content, which partly compensates).
For a translation business or a multilingual support system, the per-language cost asymmetry compounds quickly. Anthropic, OpenAI, and Google publish per-model token costs; the actual content cost depends on what language and format you’re paying for.
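The arithmetic behind those three figures is simple enough to script. The sketch below reuses the article's token counts and its $10 per million tokens example price; the function name is hypothetical and the price should be replaced with whatever your provider currently lists.

```python
def input_cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    """Cost of sending `tokens` input tokens at a given per-million price."""
    return tokens * usd_per_million_tokens / 1_000_000


EXAMPLE_PRICE = 10.0  # USD per million input tokens (the article's example figure)

# Token counts for the same 1,000-word document, taken from the list above.
document_tokens = {"English": 1300, "Russian": 3500, "Chinese": 1500}

for language, tokens in document_tokens.items():
    cost = input_cost_usd(tokens, EXAMPLE_PRICE)
    ratio = tokens / document_tokens["English"]
    print(f"{language}: {tokens} tokens -> ${cost:.3f} ({ratio:.1f}x English)")
```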
Tokenizer differences across models
Each model family has its own tokenizer:
- OpenAI cl100k_base (GPT-3.5, GPT-4): ~100,000-token vocabulary. The reference modern English tokenizer.
- OpenAI o200k_base (GPT-4o, o-series): 200,000-token vocabulary. Better at non-English and code. A given document needs ~10-15% fewer tokens than cl100k.
- Anthropic Claude tokenizer: proprietary. Approximately similar density to cl100k for English; differs measurably for code and non-English. Anthropic publishes a token counting endpoint to estimate before submitting.
- Google Gemini: uses SentencePiece. Roughly comparable density to cl100k.
Implication: the same prompt sent to GPT-4o vs Claude vs Gemini doesn’t produce identical token counts. Budgeting across providers requires per-provider token estimation, not a single “1 word ≈ 1.3 tokens” rule.
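If all you have is a cl100k_base count, the "~10-15% fewer tokens" figure for o200k_base gives a rough planning range. This is only a sketch built on that quoted percentage; the function name is made up, and the range is no replacement for running the target model's own tokenizer.

```python
def o200k_range_from_cl100k(cl100k_tokens: int) -> tuple[int, int]:
    """Rough (low, high) o200k_base estimate, assuming 10-15% fewer tokens than cl100k_base."""
    low = round(cl100k_tokens * 0.85)   # 15% fewer
    high = round(cl100k_tokens * 0.90)  # 10% fewer
    return low, high


print(o200k_range_from_cl100k(1000))  # (850, 900)
```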
Where tokenization affects prompt design
- Long-context costs. A 100k-token context window holding all your documentation is great until you realise the per-call cost is $1+ for typical usage. Token counts compound across multi-turn conversations.
- JSON vs natural language. Asking for JSON output costs ~30-50% more tokens than asking for comparable plain prose. JSON’s punctuation gets tokenized aggressively.
- Code tasks. Code is roughly 2× denser in tokens than prose. A 200-line file might be 2,000-3,000 tokens. Tooling that includes your whole repo as context adds up fast.
- Non-English languages. 2-4× more tokens per character. For multilingual products this is a first-order cost.
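To see how token counts compound across a multi-turn conversation, here is a minimal sketch. It assumes every call re-sends the system prompt plus the full history so far; the function name and the example numbers are hypothetical.

```python
def cumulative_input_tokens(system_tokens: int, tokens_added_per_turn: list[int]) -> int:
    """Total input tokens billed across a conversation.

    Assumes each call re-sends the system prompt and the entire history so far.
    """
    total = 0
    history = 0
    for added in tokens_added_per_turn:
        history += added                   # history grows every round
        total += system_tokens + history   # and is paid for again on each call
    return total


# Hypothetical example: 2,000-token system prompt, 5 rounds adding 300 tokens each.
print(cumulative_input_tokens(2000, [300] * 5))  # 14500
```

The conversation itself only contains 2,000 + 1,500 = 3,500 distinct tokens, but the billed input is 14,500 because the earlier content is paid for on every round.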
How to estimate tokens before you pay
- Use a token-count tool. Our AI token counter implements multiple tokenizers and reports the exact count for your input.
- Use the official tokenizer library. OpenAI’s tiktoken (Python), Anthropic’s tokenizer API, or hosted token counters. These are the ground truth for billing.
- Rule of thumb. For English prose: 1 word ≈ 1.3 tokens. For code: 1 line ≈ 8-15 tokens. For Chinese: 1 character ≈ 1.5 tokens.
- Budget the output too. Many providers charge more for output than input (typically 3-5× per token). A 2000-token output is more expensive than a 2000-token input.
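A minimal sketch of the "official library first, rule of thumb second" approach is shown below. It uses tiktoken's `get_encoding` and `encode` calls when the package and its vocabulary files are available, and otherwise falls back to the 1.3 tokens-per-word estimate. The `count_tokens` wrapper and its return shape are assumptions made for this example.

```python
def count_tokens(text: str, encoding_name: str = "cl100k_base") -> tuple[int, str]:
    """Return (token_count, method).

    Uses tiktoken when it can be imported and loaded; otherwise falls back to
    the ~1.3 tokens-per-word rule of thumb for English prose.
    """
    try:
        import tiktoken  # third-party package: pip install tiktoken

        encoding = tiktoken.get_encoding(encoding_name)
        return len(encoding.encode(text)), "tiktoken"
    except Exception:
        # Library missing or vocabulary download unavailable: rough estimate only.
        words = len(text.split())
        return max(1, round(words * 1.3)), "rule-of-thumb"


count, method = count_tokens("Tokens are the unit the model actually consumes.")
print(count, method)
```

Checking the `method` value before trusting the number is worthwhile: only the tiktoken path reflects what an OpenAI model would actually be billed for.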
The deeper structural reason for BPE
Modern LLMs see tokens, not characters. The model’s embeddings, attention, and output are all defined over a finite token vocabulary. Character-level models exist but are slower (each character is one input position) and harder to train. Word-level models can’t handle unseen words (out-of-vocabulary problem). BPE is the compromise that won.
For deeper background, see our GPT token glossary entry and the how token pricing works guide.
Walkthrough: tokenizing a single sentence
Sentence: “The rambunctious cat’s purr 😺 was unmistakable.” (6 words, an emoji, and a full stop; 47 characters excluding the quotation marks, counting the emoji as one.)
Under cl100k_base (GPT-4):
- The → 1 token (very common word).
- rambunctious → 3 tokens (ram + bunct + ious).
- cat → 1 token.
- ’s → 1 token (the apostrophe-s contraction is a single merge).
- purr → 1 token.
- 😺 → 3 tokens (the emoji’s UTF-8 bytes split across multiple BPE pieces).
- was → 1 token.
- unmistakable → 2 tokens (unm + istakable).
- . → 1 token.
Total: 14 tokens across the 9 pieces above, a ratio of about 1.56 tokens per piece, driven up by rambunctious (3 tokens) and the emoji (3 tokens). Replacing both with common alternatives drops the cost: “The loud cat’s purr was unmistakable” runs ~9 tokens for the same idea. For high-volume API usage, this kind of vocabulary engineering compounds.
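The tally can be checked with a few lines of Python. The per-piece counts below are copied from the walkthrough rather than produced by running a tokenizer, so this only verifies the arithmetic, not the split itself.

```python
# Per-piece token counts as listed in the walkthrough (not computed here).
pieces = [
    ("The", 1),
    ("rambunctious", 3),
    ("cat", 1),
    ("’s", 1),
    ("purr", 1),
    ("😺", 3),
    ("was", 1),
    ("unmistakable", 2),
    (".", 1),
]

total_tokens = sum(count for _, count in pieces)
ratio = total_tokens / len(pieces)

print(total_tokens)       # 14
print(round(ratio, 2))    # 1.56
```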
Common mistakes
- Estimating tokens from character count. The “1 token ≈ 4 chars” rule of thumb is wildly off for code, JSON, and non-English. A 1000-char JSON blob can be 400-800 tokens depending on key names and nesting.
- Forgetting system-prompt tokens. A 2000-token system prompt is included in every request and billed every call. Multi-turn agents with growing chat history pay for the entire prior conversation on each round, not just the newest message.
- Caching benefits depend on prefix stability. Prompt caching (when available) only kicks in when the token sequence is byte-identical at the prefix. A dynamically-inserted timestamp at position 50 invalidates cache for every following token. Put dynamic content at the end, not the middle.
- Using the wrong tokenizer for cost estimation. cl100k_base and o200k_base produce token counts that differ by ~10-15% for the same input. If you model GPT-4o or o-series costs with cl100k_base, the estimate comes out high. Use the tokenizer matching the target model.
- Stripping whitespace aggressively. Many tokens begin with a leading space. Removing all spaces and concatenating words can produce more tokens, not fewer, because the tokenizer can’t use its common “ word” merges and falls back to smaller, less common pieces.
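The prefix-stability point can be illustrated with a small sketch. It uses whitespace-split words as stand-ins for real tokens, and the prompt layout, function names, and example strings are all hypothetical; actual cache behaviour depends on the provider.

```python
def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Number of leading items that are identical in both sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Stand-in "tokens": whitespace-split words (an assumption, not a real tokenizer).
instructions = ("You are a support assistant . Answer briefly . " * 20).split()


def build_prompt(timestamp: str, question: str, timestamp_first: bool) -> list[str]:
    stamp = ["time:", timestamp]
    body = question.split()
    if timestamp_first:
        return stamp + instructions + body
    return instructions + body + stamp


early = shared_prefix_len(
    build_prompt("09:00", "Where is my order ?", timestamp_first=True),
    build_prompt("09:01", "How do I reset my password ?", timestamp_first=True),
)
late = shared_prefix_len(
    build_prompt("09:00", "Where is my order ?", timestamp_first=False),
    build_prompt("09:01", "How do I reset my password ?", timestamp_first=False),
)

print(early)  # 1: only the "time:" label matches before the timestamps diverge
print(late)   # 180: the whole instruction block is a reusable prefix
```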
Sources: Sennrich, Haddow & Birch, “Neural Machine Translation of Rare Words with Subword Units” (ACL 2016, the foundational BPE paper); OpenAI tiktoken repository (2024); Anthropic developer documentation on tokens and context (2024); Karpathy A, “Let’s build the GPT Tokenizer” lecture (2024).
Frequently asked questions
- What is a token in the context of GPT and LLMs?
- A token is the basic unit of text that a language model processes — neither a character nor a full word. Common English words like 'the' or 'cat' are single tokens; less frequent words are split into 2–5 subword pieces. One token corresponds to approximately 4 characters or 0.75 words on average for English text.
- How does byte-pair encoding (BPE) tokenization work?
- BPE starts with individual bytes as the vocabulary, then iteratively merges the most frequent adjacent pair into a new token. Each merge adds one vocabulary entry, so after tens to hundreds of thousands of merges on a training corpus, the resulting vocabulary captures common words and subword fragments efficiently.
- Why does non-English text use more tokens than English?
- GPT tokenizers are trained predominantly on English text, so the vocabulary contains few long merges for other scripts. Characters in scripts such as Chinese, Arabic, or Korean are often split into individual characters or even into 1–3 byte-level pieces rather than being covered by whole-word tokens. Depending on the language, the same content can use up to 2–4× as many tokens as the equivalent English.
- How many tokens does a typical page of text contain?
- A 500-word page of plain English text contains approximately 650–700 tokens, since short words and punctuation each consume tokens. Code, JSON, and technical text with unusual symbols run higher: roughly 1.5–2.5 tokens per word versus about 1.3 for prose.
- Does an emoji always count as one token?
- No — a single emoji often spans 2–8 tokens because complex emoji (especially skin-tone modifiers and ZWJ sequences) are broken into multiple UTF-8 bytes, each potentially tokenized separately. A family emoji with skin tone can use 6–10 tokens.
- Why does tokenization affect the cost of using LLM APIs?
- LLM APIs like OpenAI and Anthropic charge per token for both input (prompt) and output (completion). A prompt written in inefficient language (many rare words, code, non-English text) can cost 2–3× more than a semantically equivalent prompt in common English phrasing.
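The emoji answer above is easier to see by looking at the raw UTF-8 bytes. In a byte-level BPE tokenizer every token covers at least one byte, so the byte length is an upper bound on the token count; the actual count depends on which merges the vocabulary learned. The sample labels below are just for this illustration.

```python
# Emoji written as escapes so the code points are explicit.
samples = {
    "cat face": "\U0001F63A",
    "sparkles": "\u2728",
    "thumbs up + skin tone": "\U0001F44D\U0001F3FD",
    "family (ZWJ sequence)": "\U0001F468\u200D\U0001F469\u200D\U0001F467",
}

byte_lengths = {}
for label, text in samples.items():
    code_points = len(text)                 # Unicode code points in the sequence
    utf8_bytes = len(text.encode("utf-8"))  # bytes the tokenizer starts from
    byte_lengths[label] = utf8_bytes
    print(f"{label}: {code_points} code point(s), {utf8_bytes} UTF-8 bytes")
```

A single-code-point emoji is 3 or 4 bytes, while a three-person family joined by zero-width joiners is 18 bytes, which is why composite emoji sit at the expensive end of the range.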
Sources & references
Authoritative references cited by this piece. Verified by Buğra Sözeri on the dates shown and re-checked at every deploy.
- OpenAI, tiktoken tokenizer. Official OpenAI BPE tokenizer; canonical reference for the GPT-3.5, GPT-4, GPT-4o token mappings analysed.
- Sennrich R, Haddow B, Birch A, “Neural Machine Translation of Rare Words with Subword Units” (ACL 2016). Original BPE paper that established the subword-tokenisation approach every modern LLM uses.
- Kudo T, Richardson J, “SentencePiece” (EMNLP 2018). Reference for the SentencePiece tokenizer used by some open-weight model families.
- Anthropic, Token counting for Claude. Reference for the cross-vendor tokenisation comparison made in the article.
- Hugging Face, Tokenizers library documentation. Open-source reference implementation cross-checked for the open-weight model tokenizer behaviours discussed.
- Karpathy A, “Let's build the GPT Tokenizer” (lecture, 2024). Pedagogical reference for the BPE merge-rule walkthrough underlying the algorithm explanation.
Published May 16, 2026 · Last reviewed May 31, 2026