
Glossary

Context window

The hard limit on what an LLM can read at once


The context window of an LLM is the maximum number of tokens it can process in a single inference call. The window covers input and output combined — if you fill the input to the brim there’s no room for the model to respond.
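The shared budget can be checked before a call is made. The following is a minimal sketch, assuming the input token count is already known; the function name and parameters are illustrative and do not correspond to any provider's API.

```python
def max_output_tokens(context_window: int, input_tokens: int) -> int:
    """Return how many tokens remain for the response once the input is counted.

    Hypothetical helper: both arguments are plain token counts.
    """
    if context_window <= 0 or input_tokens < 0:
        raise ValueError("context_window must be positive and input_tokens non-negative")
    # Input and output share one window, so the output gets whatever is left.
    return max(context_window - input_tokens, 0)


print(max_output_tokens(128_000, 100_000))  # 28000
print(max_output_tokens(4_096, 4_096))      # 0: the input fills the window
```

A result of zero is the "filled to the brim" case: the prompt has to be shortened before the model can respond at all.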

Context windows have grown dramatically:

  • GPT-3 (2020): 2,048 tokens
  • GPT-3.5 (2022): 4,096 → 16,384 tokens
  • GPT-4 (2023): 8,192 → 32,768 → 128,000 tokens
  • Claude 3 (2024): 200,000 tokens (~150,000 words)
  • Gemini 1.5 Pro (2024): 1,000,000 tokens (~750,000 words, roughly ten novels of 75,000 words each)
  • Frontier models (2026): 1-2 million tokens common

Larger windows make it possible to put whole books, codebases, or long conversation histories into a single prompt. Practical limits remain: throughput drops at higher context lengths, cost scales linearly with input tokens (prompt caching lowers the per-token rate for repeated prefixes but does not change the scaling), and recall degrades at very long contexts, as documented by long-context benchmarks such as “needle in a haystack” tests and their harder multi-fact variants.

Worked example

You want to summarise a 250-page novel (~75,000 words). With OpenAI’s cl100k_base tokenizer, that text comes to roughly 100,000 tokens. On GPT-3 (2k context), the novel cannot fit at all: you’d have to chunk it into about 50 pieces and run a recursive summarisation tree. On GPT-3.5 16k, you’d need ~7 chunks. On GPT-4 128k, the whole novel fits with 28k tokens to spare for instructions and output. On Claude 3 (200k), the same holds with even more headroom. On Gemini 1.5 Pro (1M), you could fit the entire novel plus the previous eight books in the series (about 900k tokens in total, assuming similar lengths) and still have room for instructions and output. The cost picture also shifts: at $3/M input tokens, the 100k-token summarisation costs $0.30 in input alone. That is cheap per request, but a thousand such requests cost $300, which is why batch APIs and prompt caching have become economic necessities.
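The arithmetic in this example can be reproduced in a few lines. This is a sketch under the article's own figures (100,000 tokens for the novel, $3 per million input tokens); the helper names and the `reserved` parameter are illustrative assumptions.

```python
import math

NOVEL_TOKENS = 100_000  # the article's estimate for a 75,000-word novel


def chunks_needed(doc_tokens: int, window: int, reserved: int = 0) -> int:
    """Chunks required when each call keeps `reserved` tokens for instructions and output."""
    usable = window - reserved
    if usable <= 0:
        raise ValueError("reserved tokens leave no room for input")
    return math.ceil(doc_tokens / usable)


def input_cost_usd(tokens: int, usd_per_million: float) -> float:
    """Input-side cost only; output tokens are billed separately."""
    return tokens * usd_per_million / 1_000_000


for name, window in [("GPT-3", 2_048), ("GPT-3.5 16k", 16_384), ("GPT-4 128k", 128_000)]:
    print(name, chunks_needed(NOVEL_TOKENS, window))
# GPT-3 49
# GPT-3.5 16k 7
# GPT-4 128k 1

print(input_cost_usd(NOVEL_TOKENS, 3.0))  # 0.3
```

With nothing reserved, the GPT-3 case comes to 49 chunks; reserving even 48 tokens per call for instructions and output brings it to the round figure of 50, and a realistic reserve pushes it higher.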

When and why it matters

The context window matters whenever an LLM workflow involves more input than a typical chat: legal-document review, codebase-wide refactoring, research synthesis across multiple papers, customer-support conversations with long history, and agent loops that accumulate tool outputs. The mistake to avoid is assuming “bigger window = better answers”: the “Lost in the Middle” effect (Liu et al., 2023) shows that information placed in the middle of a long context is recalled less reliably than information at the start or end. The practical engineering pattern is to (a) put the most critical instructions and constraints at the start, (b) put the immediate user query at the end, and (c) treat the middle as “reference material the model may consult but should not be required to use.” For retrieval-augmented generation, smaller context windows with precise retrieval often outperform larger windows with everything dumped in. Reference: OpenAI Models documentation, context window limits.
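The start/middle/end pattern can be expressed as a small prompt-assembly helper. This is a sketch only: the function name, the `[Document N]` labels, and the blank-line separators are illustrative choices, not a format any model requires.

```python
def build_prompt(instructions: str, reference_docs: list[str], query: str) -> str:
    """Assemble a long prompt: instructions first, reference material in the middle, query last."""
    # Label each reference document so the model can cite it; the label format is arbitrary.
    middle = "\n\n".join(
        f"[Document {i}]\n{doc}" for i, doc in enumerate(reference_docs, start=1)
    )
    # Skip empty parts so a prompt without reference material has no stray separators.
    parts = (instructions.strip(), middle, query.strip())
    return "\n\n".join(part for part in parts if part)


prompt = build_prompt(
    "Answer only from the documents below. Say so if the answer is not there.",
    ["Clause 4.2: notice period is 30 days.", "Clause 9.1: governing law is Ireland."],
    "What is the notice period?",
)
print(prompt)
```

Keeping the query last means the instruction and the question sit at the two positions that are recalled most reliably, while the bulk of the tokens occupy the middle.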

The attention-cost problem behind the scenes: the original transformer attention mechanism is O(n²) in the sequence length, so doubling the context length quadruples the attention compute in a forward pass. Frontier 1M-token models work because of architectural tricks: FlashAttention (Dao et al., 2022) and FlashAttention-2 (2023) restructure the operation to be IO-aware and reduce memory traffic; sparse-attention variants (sliding-window, dilated) drop the global quadratic term by restricting which tokens attend to each other; and ring/sequence-parallel attention shards the sequence across GPUs. None of these tricks makes full attention cheap: FlashAttention and sequence parallelism still perform the quadratic compute, and sparse variants avoid it only by giving up full pairwise attention. They push the wall further out rather than removing it.
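The scaling claim can be illustrated by counting query-key score entries. This is a sketch, not a FLOP model: it ignores heads, layers, and hidden size, and the sliding-window count assumes a causal window in which each token sees itself and the tokens immediately before it.

```python
def full_attention_pairs(n: int) -> int:
    """Score entries for full (non-causal) self-attention over n tokens."""
    return n * n


def sliding_window_pairs(n: int, window: int) -> int:
    """Score entries when each token attends to itself and up to window - 1 previous tokens."""
    w = min(window, n)
    # The first w tokens see 1, 2, ..., w keys; the remaining n - w tokens see w keys each.
    return w * (w + 1) // 2 + (n - w) * w


n = 100_000
print(full_attention_pairs(2 * n) // full_attention_pairs(n))  # 4: doubling n quadruples the work
print(sliding_window_pairs(2 * n, 4_096) / sliding_window_pairs(n, 4_096))  # close to 2: roughly linear
```

The sliding-window count grows roughly linearly in n for a fixed window, which is the sense in which sparse variants drop the quadratic term. The price is that distant tokens no longer attend to each other directly.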

Why “effective context” ≠ advertised context: the “needle in a haystack” benchmark inserts a unique fact at a known position in a long context and asks the model to retrieve it. Frontier models score near 100% on this benchmark up to their advertised window. The harder benchmarks — multi-fact retrieval, multi-hop reasoning across the long context, summarisation that synthesises across the whole input — show meaningfully lower scores past ~50-100K tokens, even on 1M-token models. The practical rule: a 1M-token window is reliable for “look up specific things in this big document” tasks, but reasoning quality typically degrades past the first ~100K. Compare provider claims against your specific workload. Related: GPT token, LLM. Reference: Liu N et al. — Lost in the Middle (2023).
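A single-needle test case of the kind described above can be built as follows. This is a sketch with hypothetical helper names: the filler text, the needle, and the substring-based scoring are placeholders, and the call to the model under test is deliberately left out.

```python
def insert_needle(filler_sentences: list[str], needle: str, depth: float) -> str:
    """Place `needle` at a relative depth (0.0 = start, 1.0 = end) among filler sentences."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(depth * len(filler_sentences))
    sentences = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(sentences)


def recalled(model_answer: str, expected: str) -> bool:
    """Crude scoring: does the answer contain the expected fact (case-insensitive)?"""
    return expected.lower() in model_answer.lower()


filler = [f"Filler sentence number {i}." for i in range(1_000)]
haystack = insert_needle(filler, "The vault code is 7421.", depth=0.5)
# Send `haystack` plus the question "What is the vault code?" to the model under test,
# then score its reply with recalled(reply, "7421").
print(recalled("I believe the vault code is 7421.", "7421"))  # True
```

Sweeping `depth` from 0.0 to 1.0 at several haystack lengths gives the usual recall grid. The harder benchmarks mentioned above would need multiple needles or questions that combine them, which this sketch does not cover.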

Frequently asked questions

What is a context window?
A context window is the maximum number of tokens an LLM can process in a single inference call — both the input (prompt + conversation history) and the output combined. Models with a 200,000-token context window can process roughly 150,000 words at once.
How does the context window affect LLM usage in practice?
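Suppose you need to summarise a 500-page legal document. At roughly 400 tokens per page (the rate implied by a 250-page novel coming to about 100,000 tokens), it runs to about 200,000 tokens. That exceeds GPT-4’s 128k window, so a developer must split the document into chunks. It would also fill a 200k-token window such as Claude 3.5’s almost completely, leaving little or no room for instructions and output, so a single-call summary needs either a shorter document or a larger window.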
What is the difference between context window and memory?
The context window holds all tokens currently in the active conversation — it is cleared between sessions. Memory (in multi-session agents) is a separate retrieval system that stores and fetches relevant past interactions. Context is fast and precise; memory is persistent but approximate.
Does a larger context window mean slower responses?
Longer prompts do: attention in transformers scales as O(n²) with sequence length, so doubling the context in use roughly quadruples attention computation. Models with very large context windows use optimised attention (e.g., FlashAttention) to reduce this cost, but longer contexts still increase latency and API cost. A large window that is mostly empty does not slow a short request by itself.


Published May 14, 2026 · Last reviewed May 31, 2026