Glossary

Context window

The hard limit on what an LLM can read at once

The context window of an LLM is the maximum number of tokens it can process in a single inference call. The window covers input and output combined: if the input fills the window completely, the model has no room left to respond.
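As a rough illustration of this shared budget, the sketch below computes how many tokens remain for the response once the prompt is counted. The function name `remaining_output_tokens` and the `reserve` parameter are hypothetical and not part of any vendor API. Real token counts would come from the model's own tokenizer.

```python
def remaining_output_tokens(context_window: int, prompt_tokens: int, reserve: int = 0) -> int:
    """Return how many tokens are left for the model's response.

    Assumes input and output share a single window, as described above.
    `reserve` is a hypothetical safety margin held back from the budget.
    """
    if context_window <= 0:
        raise ValueError("context_window must be positive")
    if prompt_tokens < 0 or reserve < 0:
        raise ValueError("prompt_tokens and reserve must be non-negative")
    # Clamp at zero: an over-long prompt leaves no room rather than a negative budget.
    return max(0, context_window - prompt_tokens - reserve)


# An 8,192-token window with an 8,000-token prompt leaves 192 tokens for output.
print(remaining_output_tokens(8_192, 8_000))
```

A prompt that exactly fills the window returns 0, which is the "no room to respond" case.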

Context windows have grown dramatically:

GPT-3 (2020): 2,048 tokens
GPT-3.5 (2022): 4,096 → 16,384 tokens
GPT-4 (2023): 8,192 → 32,768 → 128,000 tokens
Claude 3 (2024): 200,000 tokens (~150,000 words)
Gemini 1.5 Pro (2024): 1,000,000 tokens (~750,000 words, several novels' worth of text)
Frontier models (2026): 1-2 million tokens common
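The word counts in the list follow a rule of thumb of about 0.75 words per token, which is the ratio implied by the Claude 3 and Gemini 1.5 Pro figures. A minimal sketch of that conversion follows. The constant and function name are illustrative, and the ratio is only an approximation for English prose, since actual ratios vary by tokenizer and language.

```python
# Assumption: ~0.75 words per token, the ratio implied by the figures above.
WORDS_PER_TOKEN = 0.75


def approx_words(tokens: int) -> int:
    """Estimate how many English words fit in a given number of tokens."""
    if tokens < 0:
        raise ValueError("tokens must be non-negative")
    return round(tokens * WORDS_PER_TOKEN)


for name, window in [("GPT-3", 2_048), ("Claude 3", 200_000), ("Gemini 1.5 Pro", 1_000_000)]:
    print(f"{name}: {window:,} tokens is roughly {approx_words(window):,} words")
```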

Larger windows make it possible to put whole books, codebases, or long conversation histories into a single prompt. Practical limits remain: throughput drops at longer context lengths; cost scales linearly with input tokens, whether or not they are cached (caching typically lowers the per-token price but does not change the linear scaling); and models retrieve information less reliably at very long contexts, a degradation documented by “needle in a haystack” benchmarks.
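The linear cost scaling can be sketched with simple arithmetic. The price used below is a made-up placeholder rather than any provider's actual rate, and `input_cost` is a hypothetical helper.

```python
def input_cost(input_tokens: int, price_per_million_tokens: float) -> float:
    """Estimate input cost, assuming a flat per-token price (linear scaling)."""
    if input_tokens < 0 or price_per_million_tokens < 0:
        raise ValueError("arguments must be non-negative")
    return input_tokens / 1_000_000 * price_per_million_tokens


# Hypothetical price: 3.00 currency units per million input tokens.
PRICE = 3.00

for tokens in (100_000, 200_000, 1_000_000):
    print(f"{tokens:>9,} tokens -> {input_cost(tokens, PRICE):.2f}")
```

Under this assumption, doubling the prompt length doubles the input cost, so filling a 1,000,000-token window costs ten times as much as a 100,000-token prompt at the same rate.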

Published May 14, 2026
