

How GPT tokenization actually works (and why your bill depends on it)

Common words = 1 token. Rare words = 2-5 tokens. Emoji, code symbols, and non-English text spend tokens fast.


Every API call to a large language model is metered in tokens, not characters or words. Tokens are the units the model actually consumes after text is encoded by a byte-pair encoding (BPE) tokenizer. Understanding how tokenization works matters because (a) it determines your bill, and (b) it’s why “a 1000-word document” doesn’t map cleanly to “a 1000-token cost estimate.”

How BPE tokenizers work

The algorithm in a few lines:

  1. Start with a vocabulary of single bytes (256 entries).
  2. Find the most-frequent adjacent pair of vocabulary entries in a large training corpus.
  3. Add that pair as a new vocabulary entry.
  4. Repeat until the vocabulary reaches the target size (roughly 50k-200k entries for modern models).
  5. To tokenize new text, split it into bytes and apply the learned merges in the order they were learned, so the earliest (most frequent) merges take priority.

Result: common English words like “the”, “and”, “understanding” become a single token each. Rare words like “rambunctious” get split into 2-4 tokens (e.g., “ram”+“bunct”+“ious”). Emoji and non-English characters often take 2-6 tokens each.
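The steps above can be sketched as a toy trainer and encoder. This is a minimal illustration, not how any production tokenizer is implemented: the function names (`train_bpe`, `merge_pair`, `encode`) are made up for this example, and splitting the corpus on whitespace is a simplifying assumption (real tokenizers use a regex pre-tokenizer and keep leading spaces).

```python
from collections import Counter


def merge_pair(symbols: tuple, pair: tuple) -> tuple:
    """Replace every adjacent occurrence of `pair` in `symbols` with the joined symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus: str, num_merges: int) -> list:
    """Learn up to `num_merges` merge rules (toy version, whitespace pre-split)."""
    # Step 1: every word starts as a sequence of single-byte symbols.
    words = Counter(
        tuple(bytes([b]) for b in word.encode("utf-8")) for word in corpus.split()
    )
    merges = []
    for _ in range(num_merges):
        # Step 2: count adjacent pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Ties are broken by the pair itself so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        # Step 3: record the merge and apply it everywhere.
        merges.append(best)
        merged_words = Counter()
        for symbols, freq in words.items():
            merged_words[merge_pair(symbols, best)] += freq
        words = merged_words
    return merges


def encode(word: str, merges: list) -> list:
    """Tokenize one word by replaying the merges in the order they were learned."""
    symbols = tuple(bytes([b]) for b in word.encode("utf-8"))
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = train_bpe("low low low lower lowest", num_merges=2)
print(encode("low", merges))   # frequent word collapses to one token
print(encode("slow", merges))  # unseen word reuses the learned "low" piece
```

With this toy corpus, two merges are enough for “low” to become a single token, while the unseen word “slow” splits into “s” plus “low”. Real vocabularies apply the same idea with tens of thousands of merges.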

Token counts for common content

Approximate token counts for GPT-4 (cl100k_base tokenizer):

| Content | Tokens | Tokens / word |
|---|---|---|
| English prose (this page) | ~1.3 / word | 1.3 |
| News articles | ~1.3 / word | 1.3 |
| Technical / scientific writing | ~1.5 / word | 1.5 |
| Programming code (Python) | ~2 / word | 2.0 |
| JSON / XML (lots of punctuation) | ~2.5 / word | 2.5 |
| Spanish / French / German | ~1.6 / word | 1.6 |
| Russian / Greek (Cyrillic / Greek script) | ~3-4 / word | 3-4 |
| Chinese (simplified) | ~1.5 / character | 1.5/char |
| Japanese / Korean | ~1-2 / character | 1-2/char |
| Emoji ✨ | ~2-3 each | |

The reference figure for English is ~750 words per 1,000 tokens. Non-Latin scripts cost significantly more tokens per character because they weren’t represented as densely in the training corpus.

Why the cost gap matters

Per-token pricing means non-English content can cost 2-4× more for the same idea. At an illustrative input price of $10 per million tokens, a 1,000-word document costs roughly:

  • English: ~1,300 tokens → $0.013.
  • Russian: ~3,500 tokens → $0.035 (about 2.7× the English cost for the same content).
  • Chinese: ~1,500 tokens → $0.015 (Chinese is counted per character; dense scripts compensate slightly because each character carries more meaning).
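The arithmetic behind these figures can be reproduced in a few lines. This is a sketch under stated assumptions: the token counts are the approximate ones from the bullets above, the $10 per million tokens price is the one used in those bullets (not a quoted price sheet), and `input_cost_usd` is a hypothetical helper name.

```python
# Approximate token counts for the same 1,000-word document, taken from the bullets above.
TOKENS_PER_DOCUMENT = {"English": 1_300, "Russian": 3_500, "Chinese": 1_500}

# Assumed input price in USD per million tokens; check your provider's current pricing.
PRICE_PER_MILLION_USD = 10.0


def input_cost_usd(tokens: int, price_per_million: float = PRICE_PER_MILLION_USD) -> float:
    """Cost of sending `tokens` input tokens at a per-million-token price."""
    return tokens * price_per_million / 1_000_000


baseline = input_cost_usd(TOKENS_PER_DOCUMENT["English"])
for language, tokens in TOKENS_PER_DOCUMENT.items():
    cost = input_cost_usd(tokens)
    print(f"{language}: {tokens} tokens, ${cost:.3f}, {cost / baseline:.2f}x English")
```

Swapping in measured token counts and your provider's actual prices turns this into a per-language budget line.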

For a translation business or a multilingual support system, the per-language cost asymmetry compounds quickly. Anthropic, OpenAI, and Google publish per-model token costs; the actual content cost depends on what language and format you’re paying for.

Tokenizer differences across models

Each model family has its own tokenizer:

  • OpenAI cl100k_base (GPT-3.5, GPT-4): ~100,000-token vocabulary. The reference modern English tokenizer.
  • OpenAI o200k_base (GPT-4o, o-series): 200,000-token vocabulary. Better at non-English and code. A given document needs ~10-15% fewer tokens than cl100k.
  • Anthropic Claude tokenizer: proprietary. Approximately similar density to cl100k for English; differs measurably for code and non-English. Anthropic publishes a token counting endpoint to estimate before submitting.
  • Google Gemini: uses SentencePiece. Roughly comparable density to cl100k.

Implication: the same prompt sent to GPT-4o vs Claude vs Gemini doesn’t produce identical token counts. Budgeting across providers requires per-provider token estimation, not a single “1 word ≈ 1.3 tokens” rule.
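If you only have a cl100k_base count on hand, the “~10-15% fewer tokens” figure above gives a rough range for o200k_base. The helper below is a back-of-the-envelope sketch with a hypothetical name; the 10-15% band comes from this guide, not from a measurement, and counts for Claude or Gemini cannot be derived this way at all.

```python
def o200k_range_from_cl100k(cl100k_tokens: int) -> tuple:
    """Rough (low, high) o200k_base estimate, assuming 10-15% fewer tokens than cl100k_base."""
    low = round(cl100k_tokens * 0.85)   # 15% fewer tokens
    high = round(cl100k_tokens * 0.90)  # 10% fewer tokens
    return low, high


low, high = o200k_range_from_cl100k(1_300)
print(f"cl100k_base: 1300 tokens, o200k_base: roughly {low}-{high} tokens")
```

Treat the range as a planning figure only; for billing-grade numbers, count with the tokenizer that matches the target model.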

Where tokenization affects prompt design

  1. Long-context costs. A 100k-token context window holding all your documentation is great until you realise the per-call cost is $1+ for typical usage. Token counts compound across multi-turn conversations.
  2. JSON vs natural language. Asking for JSON output costs ~30-50% more tokens than asking for comparable plain prose. JSON’s punctuation gets tokenized aggressively.
  3. Code tasks. Code is roughly 2× denser in tokens than prose. A 200-line file might be 2,000-3,000 tokens. Tooling that includes your whole repo as context adds up fast.
  4. Non-English languages. 2-4× more tokens per character. For multilingual products this is a first-order cost.
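The compounding mentioned in item 1 is easy to underestimate, so here is a small sketch of it. It assumes the simplest setup: every call resends the system prompt plus the whole conversation so far, with no prompt caching and no history truncation. The function name and the token figures are illustrative.

```python
def input_tokens_per_call(system_tokens: int, turns: list) -> list:
    """Input tokens billed on each call of a multi-turn chat.

    `turns` is a list of (user_tokens, assistant_tokens) pairs, one per round.
    """
    billed = []
    history = system_tokens
    for user_tokens, assistant_tokens in turns:
        history += user_tokens
        billed.append(history)       # the full history is sent as input
        history += assistant_tokens  # the reply becomes input for the next call
    return billed


# A 2,000-token system prompt and three rounds of 100 tokens in, 300 tokens out.
per_call = input_tokens_per_call(2_000, [(100, 300)] * 3)
print(per_call, "total input tokens:", sum(per_call))
```

Only 300 tokens of new user text were typed across the three rounds, yet the billed input is 7,500 tokens, because the system prompt and history are paid for again on every call.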

How to estimate tokens before you pay

  1. Use a token-count tool. Our AI token counter implements multiple tokenizers and reports the exact count for your input.
  2. Use the official tokenizer library. OpenAI’s tiktoken (Python), Anthropic’s tokenizer API, or hosted token counters. These are the ground truth for billing.
  3. Rule of thumb. For English prose: 1 word ≈ 1.3 tokens. For code: 1 line ≈ 8-15 tokens. For Chinese: 1 character ≈ 1.5 tokens.
  4. Budget the output too. Many providers charge more for output than input (typically 3-5× per token). A 2000-token output is more expensive than a 2000-token input.
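Items 2 to 4 can be combined into one small estimator. The sketch below uses tiktoken's `get_encoding` and `encode` when the library is installed and its vocabulary can be loaded, and otherwise falls back to the English rule of thumb of 1.3 tokens per word. The helper names and the prices in the usage lines are assumptions for illustration, not quoted rates.

```python
try:
    import tiktoken  # optional dependency: OpenAI's tokenizer library
except ImportError:
    tiktoken = None


def count_tokens(text: str, encoding_name: str = "cl100k_base") -> tuple:
    """Return (token_count, is_exact). Exact only when tiktoken did the counting."""
    if tiktoken is not None:
        try:
            encoding = tiktoken.get_encoding(encoding_name)
            return len(encoding.encode(text)), True
        except Exception:
            pass  # e.g. the vocabulary file could not be loaded; use the fallback
    # Fallback: rule of thumb for English prose, 1 word is about 1.3 tokens.
    return round(len(text.split()) * 1.3), False


def request_cost_usd(input_tokens: int, output_tokens: int,
                     input_price_per_m: float, output_price_per_m: float) -> float:
    """Total cost of one call, with separate per-million-token prices for input and output."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


prompt_tokens, exact = count_tokens("Summarize the attached report in three bullet points.")
# Hypothetical prices: $10/M input, $30/M output (output at 3x input).
cost = request_cost_usd(prompt_tokens, 2_000, 10.0, 30.0)
print(prompt_tokens, "exact" if exact else "estimated", f"${cost:.4f}")
```

Note how the 2,000 output tokens dominate the cost here: at these assumed prices a 2,000-token reply costs three times what a 2,000-token prompt would.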

The deeper structural reason for BPE

Modern LLMs see tokens, not characters. The model’s embeddings, attention, and output are all defined over a finite token vocabulary. Character-level models exist but are slower (each character is one input position) and harder to train. Word-level models can’t handle unseen words (out-of-vocabulary problem). BPE is the compromise that won.
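The two failure modes named above are easy to see on a tiny input. This sketch uses made-up helper names and a three-word sentence: the character-level view produces one position per character, and the word-level view loses any word missing from its (deliberately tiny) vocabulary.

```python
def char_level(text: str) -> list:
    """Character-level tokenization: one input position per character."""
    return list(text)


def word_level(text: str, vocab: set) -> list:
    """Word-level tokenization: unknown words collapse to a single <unk> symbol."""
    return [word if word in vocab else "<unk>" for word in text.split()]


text = "the cat purrs"
vocab = {"the", "cat"}  # toy vocabulary that has never seen "purrs"

chars = char_level(text)
words = word_level(text, vocab)
print(len(chars), "character positions")
print(words)
```

Thirteen positions for three words shows why character-level sequences get long, and the `<unk>` shows the out-of-vocabulary problem. A BPE tokenizer sits between the two: it would keep “the” and “cat” whole and split “purrs” into known subword pieces.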

For deeper background, see our GPT token glossary entry and the how token pricing works guide.

Walkthrough: tokenizing a single sentence

Sentence: “The rambunctious cat’s purr 😺 was unmistakable.” (6 words plus an emoji and a period; 47 characters counting the emoji as one.)

Under cl100k_base (GPT-4), an approximate breakdown into 9 pieces:

  • The → 1 token (very common word; a separate leading-space variant also exists).
  • rambunctious → 3 tokens (ram + bunct + ious).
  • cat → 1 token.
  • ’s → 1 token (the apostrophe-s contraction is a single merge).
  • purr → 1 token.
  • 😺 → 3 tokens (the emoji’s UTF-8 bytes split across multiple BPE pieces).
  • was → 1 token.
  • unmistakable → 2 tokens (unm + istakable).
  • . → 1 token.

Total: 14 tokens for those 9 pieces, about 1.56 tokens per piece, driven up by rambunctious (3 tokens) and the emoji (3 tokens). Replacing both with common alternatives drops the cost: “The loud cat’s purr was unmistakable” runs ~9 tokens for the same idea. For high-volume API usage, this kind of vocabulary engineering compounds.
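The tally above can be checked with a few lines of arithmetic. The per-item token counts in this sketch are copied from the breakdown in this walkthrough (they are the article's approximate figures, not output from a tokenizer); the only thing computed from first principles is the emoji's UTF-8 length, which explains why it cannot fit in a single byte-level piece.

```python
# (item, token count) pairs copied from the breakdown above.
BREAKDOWN = [
    ("The", 1), ("rambunctious", 3), ("cat", 1), ("\u2019s", 1), ("purr", 1),
    ("\U0001F63A", 3),  # the cat emoji from the example sentence
    ("was", 1), ("unmistakable", 2), (".", 1),
]

total_tokens = sum(count for _, count in BREAKDOWN)
tokens_per_item = total_tokens / len(BREAKDOWN)

# A single emoji is one code point but several bytes once encoded as UTF-8.
emoji_bytes = "\U0001F63A".encode("utf-8")

print(total_tokens, "tokens,", round(tokens_per_item, 2), "per listed item")
print(len(emoji_bytes), "UTF-8 bytes in the emoji:", emoji_bytes.hex(" "))
```

The emoji occupies four bytes, and a byte-level BPE tokenizer has to cover those bytes with whatever merges it learned, which is how one visible character ends up as multiple tokens.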

Common mistakes

  • Estimating tokens from character count. The “1 token ≈ 4 chars” rule of thumb is wildly off for code, JSON, and non-English. A 1000-char JSON blob can be 400-800 tokens depending on key names and nesting.
  • Forgetting system-prompt tokens. A 2000-token system prompt is included in every request and billed every call. Multi-turn agents with growing chat history pay for the entire prior conversation on each round, not just the newest message.
  • Caching benefits depend on prefix stability. Prompt caching (when available) only kicks in when the token sequence is byte-identical at the prefix. A dynamically-inserted timestamp at position 50 invalidates cache for every following token. Put dynamic content at the end, not the middle.
  • Using the wrong tokenizer for cost estimation. cl100k_base and o200k_base produce token counts that differ by ~10-15% for the same input. If you model costs for GPT-4o or the o-series with cl100k_base, the estimate comes out high. Use the tokenizer matching the target model.
  • Stripping whitespace aggressively. Many tokens begin with a leading space. Removing all spaces and concatenating words can produce more tokens, not fewer, because the tokenizer can’t use its common “ word” merges and falls back to shorter, less common subword pieces.
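The prefix-stability point above is the one that most often surprises people, so here is a small sketch of it. It assumes the cache reuses only the longest shared prefix between requests and uses whitespace-split words as stand-ins for real tokens; the prompt text and helper names are invented for the example.

```python
STATIC = "You are a support assistant . Answer using the product manual .".split()


def shared_prefix_len(a: list, b: list) -> int:
    """Number of leading positions at which two token sequences are identical."""
    length = 0
    for x, y in zip(a, b):
        if x != y:
            break
        length += 1
    return length


def timestamp_early(ts: str) -> list:
    """Dynamic timestamp placed before the static instructions."""
    return ["Now:", ts] + STATIC


def timestamp_late(ts: str) -> list:
    """Dynamic timestamp placed after the static instructions."""
    return STATIC + ["Now:", ts]


early = shared_prefix_len(timestamp_early("12:00"), timestamp_early("12:01"))
late = shared_prefix_len(timestamp_late("12:00"), timestamp_late("12:01"))
print("timestamp first:", early, "reusable positions")
print("timestamp last:", late, "reusable positions")
```

With the timestamp up front, only one position is shared between the two requests; moving it to the end makes the entire static block reusable.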


Sources: Sennrich, Haddow & Birch, “Neural Machine Translation of Rare Words with Subword Units” (ACL 2016, the foundational BPE paper); OpenAI tiktoken repository (2024); Anthropic developer documentation on tokens and context (2024); Karpathy A, “Let’s build the GPT Tokenizer” lecture (2024).

Frequently asked questions

What is a token in the context of GPT and LLMs?
A token is the basic unit of text that a language model processes — neither a character nor a full word. Common English words like 'the' or 'cat' are single tokens; less frequent words are split into 2–5 subword pieces. One token corresponds to approximately 4 characters or 0.75 words on average for English text.
How does byte-pair encoding (BPE) tokenization work?
BPE starts with individual bytes as the vocabulary, then iteratively merges the most frequent adjacent pair into a new token. After roughly 50,000 to 200,000 merges on a training corpus, the resulting vocabulary captures common words and subword fragments efficiently.
Why does non-English text use more tokens than English?
GPT tokenizers are trained predominantly on English text, so text in non-Latin scripts (Chinese, Arabic, Korean) often maps to individual characters or raw UTF-8 byte pieces rather than whole-word tokens. The same sentence can use 2–4× as many tokens as its English equivalent.
How many tokens does a typical page of text contain?
A 500-word page of plain English text contains approximately 650–700 tokens, since short words and punctuation each consume tokens. Technical text, code, and JSON run higher: roughly 15% to 90% more tokens per word than prose, going by the ratios in the table above.
Does an emoji always count as one token?
No. A single emoji often spans 2–8 tokens, because complex emoji (especially those with skin-tone modifiers and ZWJ sequences) consist of several code points whose UTF-8 bytes are split across multiple tokens. A family emoji with skin tone can use 6–10 tokens.
Why does tokenization affect the cost of using LLM APIs?
LLM API providers such as OpenAI and Anthropic charge per token for both input (prompt) and output (completion). A prompt written in token-inefficient language (many rare words, code, non-English text) can cost 2–3× more than a semantically equivalent prompt in common English phrasing.

Sources & references

Authoritative references cited by this piece. Verified by Buğra Sözeri on the dates shown and re-checked at every deploy.


Published May 16, 2026 · Last reviewed May 31, 2026