GPT token

The atomic unit of LLM input and output

A GPT token (more generally, a token) is the smallest unit of text a large language model processes. Models don’t see characters or words directly: text is first tokenised into a sequence of integer IDs drawn from a fixed vocabulary, typically 50,000-200,000 entries.

OpenAI’s GPT-3, GPT-4, and GPT-5 use byte pair encoding (BPE) tokenisers. Common English words are usually a single token (“the” → 1 token, “and” → 1 token); longer or rarer words split into several (“tokenization” → 2-3 tokens, depending on the vocabulary); code splits much more heavily, since identifiers, brackets, and indentation often become tokens of their own.
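
To see this concretely, here is a minimal sketch using OpenAI’s open-source tiktoken library (o200k_base is the GPT-4o vocabulary; swap in the encoding that matches your model):

    # pip install tiktoken
    import tiktoken

    # Load a BPE vocabulary; o200k_base is the ~200,000-entry GPT-4o encoding.
    enc = tiktoken.get_encoding("o200k_base")

    for text in ["the", "tokenization", "def add(a, b): return a + b"]:
        ids = enc.encode(text)                    # text -> list of integer IDs
        pieces = [enc.decode([i]) for i in ids]   # each ID maps back to a string piece
        print(f"{text!r}: {len(ids)} token(s) -> {pieces}")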

Practical ratios:

  • English prose: ~4 characters per token, ~0.75 words per token
  • Code: ~2-3 characters per token (heavier splitting)
  • Non-Latin scripts (Chinese, Japanese, Arabic): often one token per character, sometimes more
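
When an exact tokeniser isn’t to hand, these ratios support a quick back-of-envelope estimate. A minimal sketch assuming the characters-per-token figures above (the estimate_tokens name and category keys are illustrative, not a standard API):

    def estimate_tokens(text: str, kind: str = "prose") -> int:
        """Rough token estimate from the characters-per-token ratios above."""
        chars_per_token = {
            "prose": 4.0,  # English prose: ~4 characters per token
            "code": 2.5,   # code: ~2-3 characters per token
            "cjk": 1.0,    # non-Latin scripts: ~1 character per token
        }[kind]
        return max(1, round(len(text) / chars_per_token))

    print(estimate_tokens("The quick brown fox jumps over the lazy dog."))  # ~11

Treat the result as a budget estimate only; for billing-accurate counts, run the model’s real tokeniser.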

Both input and output tokens are billed. Output tokens typically cost 3-5× as much as input tokens. Use our token counter for live estimates across GPT, Claude, Gemini, and Llama models.
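
The billing arithmetic itself is simple. A worked sketch with hypothetical per-million-token rates (the $2.50 and $10.00 figures are placeholders, not any provider’s actual prices):

    # Hypothetical rates in dollars per 1M tokens; check your provider's price sheet.
    INPUT_RATE = 2.50    # placeholder input price
    OUTPUT_RATE = 10.00  # placeholder output price (4x input)

    def request_cost(input_tokens: int, output_tokens: int) -> float:
        """Dollar cost of one API call; both directions are billed."""
        return (input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE) / 1_000_000

    # A 2,000-token prompt producing an 800-token reply:
    print(f"${request_cost(2_000, 800):.4f}")  # $0.0130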

Published May 14, 2026