
Context window

The hard limit on what an LLM can read at once

The context window of an LLM is the maximum number of tokens it can process in a single inference call. The window covers input and output combined: tokens spent on the prompt are tokens the response cannot use, so a prompt that fills the window leaves no room for the model to respond.
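Because input and output share one window, callers typically count prompt tokens up front and reserve a budget for the reply. Here is a minimal sketch using OpenAI's tiktoken tokenizer; the 128,000-token window and 4,096-token reply budget are illustrative numbers, not tied to any particular model:

```python
import tiktoken  # pip install tiktoken

CONTEXT_WINDOW = 128_000  # illustrative window size, in tokens
REPLY_BUDGET = 4_096      # tokens we want to keep free for the response

def check_prompt(prompt: str, model: str = "gpt-4") -> int:
    """Count prompt tokens and return how many are left for the reply."""
    enc = tiktoken.encoding_for_model(model)
    used = len(enc.encode(prompt))
    # Chat APIs add a few tokens of per-message framing on top of the
    # raw encoding, so treat this count as a lower bound.
    remaining = CONTEXT_WINDOW - used
    if remaining < REPLY_BUDGET:
        raise ValueError(
            f"Prompt uses {used:,} tokens, leaving only {remaining:,}; "
            f"wanted {REPLY_BUDGET:,} for the response."
        )
    return remaining

print(check_prompt("Summarize the following document: ..."))
```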

Context windows have grown dramatically:

  • GPT-3 (2020): 2,048 tokens
  • GPT-3.5 (2022): 4,096 → 16,384 tokens
  • GPT-4 (2023): 8,192 → 32,768 → 128,000 tokens
  • Claude 3 (2024): 200,000 tokens (~150,000 words; conversion sketched below)
  • Gemini 1.5 Pro (2024): 1,000,000 tokens (~750,000 words — a long novel)
  • Frontier models (2026): 1-2 million tokens common
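The word estimates above come from a rough heuristic for English prose of about 0.75 words per token (roughly four characters per token); that ratio is the only assumption in this quick check:

```python
WORDS_PER_TOKEN = 0.75  # rough heuristic for English prose

def tokens_to_words(tokens: int) -> int:
    return round(tokens * WORDS_PER_TOKEN)

for name, window in [("Claude 3", 200_000), ("Gemini 1.5 Pro", 1_000_000)]:
    print(f"{name}: {window:,} tokens ≈ {tokens_to_words(window):,} words")
# Claude 3: 200,000 tokens ≈ 150,000 words
# Gemini 1.5 Pro: 1,000,000 tokens ≈ 750,000 words
```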

Larger windows make it possible to put whole books, codebases, or long conversation histories into a single prompt. Practical limits remain: throughput drops at long context lengths, cost scales linearly with input token count (cached tokens are cheaper, but the scaling holds either way), and model attention degrades at very long contexts in well-documented ways (“needle in a haystack” benchmarks probe exactly this).
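The linear cost scaling is easy to see with per-token pricing. The prices below are hypothetical placeholders, not any provider's actual rates:

```python
# Hypothetical prices, in dollars per token; real pricing varies by
# provider and model, and cached input tokens are usually discounted.
INPUT_PRICE = 3.00 / 1_000_000    # $3 per million input tokens
OUTPUT_PRICE = 15.00 / 1_000_000  # $15 per million output tokens

def call_cost(input_tokens: int, output_tokens: int) -> float:
    # Input cost grows linearly with context size: 10x the prompt,
    # ~10x the input bill, regardless of how long the reply is.
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

print(f"${call_cost(100_000, 1_000):.2f}")    # ≈ $0.32: large prompt
print(f"${call_cost(1_000_000, 1_000):.2f}")  # ≈ $3.02: 10x the prompt
```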

Published May 14, 2026