
How GPT tokenization actually works (and why your bill depends on it)

Common words = 1 token. Rare words = 2-5 tokens. Emoji, code symbols, and non-English text spend tokens fast.

Every API call to a large language model is metered in tokens, not characters or words. Tokens are the unit the model actually consumes after text is encoded by a byte-pair encoding (BPE) tokenizer. Understanding how tokenization works matters because (a) it determines your bill, and (b) it’s why “a 1,000-word document” doesn’t map cleanly to “a 1,000-token cost estimate.”

How BPE tokenizers work

The algorithm in a few lines:

  1. Start with a vocabulary of single bytes (256 entries).
  2. Find the most-frequent adjacent pair of vocabulary entries in a large training corpus.
  3. Add that pair as a new vocabulary entry.
  4. Repeat until the vocabulary reaches the target size (roughly 50k-200k entries for modern models).
  5. To tokenize new text, apply the learned merges in the order they were learned, so frequent character sequences collapse into single tokens.

Result: common English words like “the”, “and”, “understanding” become a single token each. Rare words like “rambunctious” get split into 2-4 tokens (e.g., “ram”+“bunct”+“ious”). Emoji and non-English characters often take 2-6 tokens each.
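To make the merge loop concrete, here is a minimal toy sketch in Python. It is not the production implementation (real tokenizers like cl100k_base operate on bytes and store ranked merge tables), but it shows how the most frequent adjacent pair keeps getting promoted to a new symbol; the tiny corpus and merge count are made up for illustration.

```python
from collections import Counter

def train_toy_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Toy BPE trainer: words start as character sequences, and the most
    frequent adjacent pair is merged into a new symbol on each iteration."""
    words = [list(w) for w in corpus]          # each word as a list of symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]      # most frequent adjacent pair
        merges.append(best)
        merged = best[0] + best[1]
        # Replace every occurrence of the pair with the merged symbol.
        new_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(out)
        words = new_words
    return merges

print(train_toy_bpe(["the", "then", "there", "other"], num_merges=3))
# e.g. [('t', 'h'), ('th', 'e'), ...] -- frequent pairs become single symbols
```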

Token counts for common content

Approximate token counts for GPT-4 (cl100k_base tokenizer):

  • English prose (this page): ~1.3 tokens / word
  • News articles: ~1.3 tokens / word
  • Technical / scientific writing: ~1.5 tokens / word
  • Programming code (Python): ~2 tokens / word
  • JSON / XML (lots of punctuation): ~2.5 tokens / word
  • Spanish / French / German: ~1.6 tokens / word
  • Russian / Greek (Cyrillic / Greek script): ~3-4 tokens / word
  • Chinese (simplified): ~1.5 tokens / character
  • Japanese / Korean: ~1-2 tokens / character
  • Emoji ✨: ~2-3 tokens each

The reference figure for English is ~750 words per 1,000 tokens. Non-Latin scripts cost significantly more tokens per character because they weren’t represented as densely in the training corpus.
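These ratios are easy to check on your own text with OpenAI’s tiktoken library (pip install tiktoken). A small sketch, with sample strings that are purely illustrative; your ratios will drift with real content.

```python
# Quick check of tokens-per-word density under the cl100k_base encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English prose": "The quick brown fox jumps over the lazy dog near the river.",
    "Python code":   "def add(a, b):\n    return a + b\n",
    "JSON":          '{"name": "fox", "speed": "quick", "colour": "brown"}',
}

for label, text in samples.items():
    tokens = enc.encode(text)
    words = max(len(text.split()), 1)
    print(f"{label}: {len(tokens)} tokens / {words} words "
          f"= {len(tokens) / words:.2f} tokens per word")
```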

Why the cost gap matters

Per-token pricing means non-English content costs 2-4× more for the same idea. A 1,000-word document costs:

  • English: ~1,300 tokens → $0.013 at an illustrative input price of ~$10 per million tokens.
  • Russian: ~3,500 tokens → $0.035 (2.7× more for the same content).
  • Chinese: ~1,500 tokens → $0.015 (each character takes more tokens, but Chinese packs more meaning per character, which partly offsets the cost).

For a translation business or a multilingual support system, the per-language cost asymmetry compounds quickly. Anthropic, OpenAI, and Google publish per-model token prices; what a given piece of content actually costs depends on its language and format.
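A back-of-the-envelope estimator makes the asymmetry explicit. The densities and the $10-per-million input price below are the illustrative figures from this section, not live provider prices, so treat the function as a sketch and plug in your provider’s current per-model rates.

```python
# Rough input-cost estimate from word (or character) counts.
# Densities and price are the illustrative figures from this guide.
DENSITY = {"english": 1.3, "russian": 3.5, "chinese_chars": 1.5}

def estimate_input_cost(units: int, language: str,
                        price_per_million_tokens: float = 10.0) -> float:
    """Estimate input cost for `units` words (or characters, for Chinese)."""
    tokens = units * DENSITY[language]
    return tokens / 1_000_000 * price_per_million_tokens

for lang in ("english", "russian", "chinese_chars"):
    print(f"{lang}: ${estimate_input_cost(1000, lang):.4f}")
# english: $0.0130, russian: $0.0350, chinese_chars: $0.0150 (per 1,000 characters)
```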

Tokenizer differences across models

Each model family has its own tokenizer:

  • OpenAI cl100k_base (GPT-3.5, GPT-4): ~100,000-token vocabulary. The reference modern English tokenizer.
  • OpenAI o200k_base (GPT-4o, o-series): 200,000-token vocabulary. Better at non-English and code. A given document needs ~10-15% fewer tokens than cl100k.
  • Anthropic Claude tokenizer: proprietary. Approximately similar density to cl100k for English; differs measurably for code and non-English text. Anthropic publishes a token-counting endpoint so you can estimate counts before submitting a request.
  • Google Gemini: uses SentencePiece. Roughly comparable density to cl100k.

Implication: the same prompt sent to GPT-4o vs Claude vs Gemini doesn’t produce identical token counts. Budgeting across providers requires per-provider token estimation, not a single “1 word ≈ 1.3 tokens” rule.
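You can see the drift between OpenAI’s own encodings directly with tiktoken; other providers’ tokenizers need their own counters (Anthropic’s token-counting endpoint, for example). A minimal comparison, using an arbitrary sample sentence:

```python
# Same text, two OpenAI encodings: the newer o200k_base vocabulary usually
# needs somewhat fewer tokens. Non-OpenAI models need their own counters.
import tiktoken

text = "Budgeting across providers requires per-provider token estimation."

for name in ("cl100k_base", "o200k_base"):
    enc = tiktoken.get_encoding(name)
    print(f"{name}: {len(enc.encode(text))} tokens")
```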

Where tokenization affects prompt design

  1. Long-context costs. A 100k-token context window holding all your documentation is great until you realise that filling it costs on the order of $1 per call in input tokens alone (at ~$10 per million tokens). Token counts also compound across multi-turn conversations.
  2. JSON vs natural language. Asking for JSON output costs ~30-50% more tokens than comparable plain prose, because JSON’s punctuation and quoting get tokenized aggressively (a quick measurement sketch follows this list).
  3. Code tasks. Code is roughly 2× denser in tokens than prose. A 200-line file might be 2,000-3,000 tokens. Tooling that includes your whole repo as context adds up fast.
  4. Non-English languages. The same content can need 2-4× more tokens, especially in non-Latin scripts. For multilingual products this is a first-order cost driver.
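The JSON-versus-prose gap from point 2 is easy to measure. The sketch below counts the same facts expressed as a JSON record and as a plain sentence; the record itself is invented for illustration, and your own data will land somewhere else.

```python
# Compare the token cost of the same facts as JSON vs plain prose (cl100k_base).
import json
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

record = {"customer": "Ada Lovelace", "plan": "pro", "seats": 12, "region": "eu-west"}
as_json = json.dumps(record, indent=2)
as_prose = "Ada Lovelace is on the pro plan with 12 seats in the eu-west region."

print("JSON :", len(enc.encode(as_json)), "tokens")
print("Prose:", len(enc.encode(as_prose)), "tokens")
```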

How to estimate tokens before you pay

  1. Use a token-count tool. Our AI token counter implements multiple tokenizers and reports the exact count for your input.
  2. Use the official tokenizer library. OpenAI’s tiktoken (Python), Anthropic’s token-counting API, or hosted token counters. These are the ground truth for billing (a short example follows this list).
  3. Rule of thumb. For English prose: 1 word ≈ 1.3 tokens. For code: 1 line ≈ 8-15 tokens. For Chinese: 1 character ≈ 1.5 tokens.
  4. Budget the output too. Many providers charge more for output than input (typically 3-5× per token). A 2000-token output is more expensive than a 2000-token input.
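Putting points 2 and 4 together: count the input with the official tokenizer, then budget the output at its (usually higher) per-token price. The prices and the expected output length below are placeholders, and the count covers only the raw prompt text, not the few extra tokens of chat-message formatting a real request adds.

```python
# Ground-truth input count with tiktoken, plus a rough total-cost budget
# that includes the (usually pricier) output side. Prices are placeholders.
import tiktoken

prompt = "Summarise the attached incident report in three bullet points."
enc = tiktoken.encoding_for_model("gpt-4o")   # maps the model name to its encoding
input_tokens = len(enc.encode(prompt))

expected_output_tokens = 300                  # your own estimate of reply length
input_price_per_m, output_price_per_m = 10.0, 30.0   # placeholder $ per 1M tokens

cost = (input_tokens * input_price_per_m
        + expected_output_tokens * output_price_per_m) / 1_000_000
print(f"{input_tokens} input tokens, ~{expected_output_tokens} output tokens "
      f"-> ~${cost:.4f} per call")
```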

The deeper structural reason for BPE

Modern LLMs see tokens, not characters. The model’s embeddings, attention, and output are all defined over a finite token vocabulary. Character-level models exist but are slower (each character is one input position) and harder to train. Word-level models can’t handle unseen words (out-of-vocabulary problem). BPE is the compromise that won.

For deeper background, see our GPT token glossary entry and the how token pricing works guide.

Sources: Sennrich, Haddow & Birch, “Neural Machine Translation of Rare Words with Subword Units” (ACL 2016, the foundational BPE paper); OpenAI tiktoken repository (2024); Anthropic developer documentation on tokens and context (2024).


Published May 16, 2026