Glossary
Context window
The hard limit on what an LLM can read at once
The context window of an LLM is the maximum number of tokens it can process in a single inference call. The window covers input and output combined: if you fill the input to the brim, there is no room left for the model to respond.
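To see how this plays out in practice, here is a minimal sketch that checks whether a prompt leaves room for a reply, using OpenAI's tiktoken library to count tokens. The window size, output reservation, and model name are illustrative defaults, not fixed values.

```python
import tiktoken

def fits_in_window(prompt: str,
                   context_window: int = 128_000,   # illustrative window size
                   reserved_for_output: int = 4_096, # room saved for the reply
                   model: str = "gpt-4o") -> bool:
    """Return True if the prompt leaves enough room for the model's output.

    The window covers input and output combined, so we reserve part of it
    for the response before accepting the prompt.
    """
    enc = tiktoken.encoding_for_model(model)  # model must be known to tiktoken
    input_tokens = len(enc.encode(prompt))
    return input_tokens + reserved_for_output <= context_window
```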
Context windows have grown dramatically:
- GPT-3 (2020): 2,048 tokens
- GPT-3.5 (2022): 4,096 → 16,384 tokens
- GPT-4 (2023): 8,192 → 32,768 → 128,000 tokens
- Claude 3 (2024): 200,000 tokens (~150,000 words)
- Gemini 1.5 Pro (2024): 1,000,000 tokens (~750,000 words — a long novel)
- Frontier models (2026): 1-2 million tokens common
Larger windows make it possible to put whole books, codebases, or long conversation histories into a single prompt. Practical limits remain: throughput drops at longer context lengths; cost scales linearly with input tokens (prompt caching discounts repeated prefixes but does not change the scaling); and retrieval accuracy degrades at very long contexts, as "needle in a haystack" benchmarks document.
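One common way to work within these limits is to trim conversation history to a token budget, dropping the oldest messages first. Below is a minimal sketch, again using tiktoken; the budget and model name are assumptions for illustration.

```python
import tiktoken

def trim_history(messages: list[str],
                 budget: int = 8_000,      # illustrative token budget
                 model: str = "gpt-4o") -> list[str]:
    """Keep the most recent messages whose combined token count fits
    within `budget`, discarding the oldest first."""
    enc = tiktoken.encoding_for_model(model)
    kept: list[str] = []
    used = 0
    for msg in reversed(messages):         # walk newest to oldest
        n = len(enc.encode(msg))
        if used + n > budget:
            break                          # oldest remaining messages dropped
        kept.append(msg)
        used += n
    return list(reversed(kept))            # restore chronological order
```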
Published May 14, 2026