Glossary
LLM
Large Language Model
By Buğra SözeriPublished Updated
LLM (Large Language Model) is a neural network trained on vast amounts of text — typically hundreds of billions of words — to predict the next token in a sequence given preceding context. The “large” refers to the parameter count: modern frontier LLMs range from 100 billion to 2+ trillion parameters.
Underlying architecture: transformer (Vaswani et al., 2017), with variations on the original encoder-decoder split. GPT family is decoder-only; original BERT was encoder-only; T5 retains both. Frontier models since 2020 are overwhelmingly decoder-only.
Training pipeline: pre-training on a broad text corpus to learn language statistics, followed by instruction tuning and reinforcement learning from human feedback (RLHF) or AI feedback (RLAIF) to make the model follow instructions usefully.
Major LLM families as of 2026: OpenAI’s GPT (3.5, 4, 4o, 5), Anthropic’s Claude (3.5 Sonnet, 4, 4.6, 4.7), Google’s Gemini (1.5, 2, 2.5), Meta’s Llama (2, 3, 4), and several open-weight alternatives (Mistral, Qwen, DeepSeek). Compare API pricing in our token counter.
What LLMs are and are not, mechanically: at inference time, an LLM is a function from a token sequence to a probability distribution over the next token. Generation samples from that distribution (with temperature, top-p, and top-k controls), appends the chosen token, and repeats. There is no “reasoning module” in the classical sense — every output, whether a maths proof or a poem, comes from the same next-token loop. Chain-of-thought prompting works because writing the reasoning into the context lets the model condition later tokens on its own intermediate steps, not because it triggers a different inference mode. The illusion of reasoning is a side-effect of training on an enormous distribution of human text that already contains reasoning.
Why context window and tokenisation matter for cost: every API charge is per token in and per token out, and a model with a 200 K-token context window charges for whatever fraction of it you actually fill. A 50-page PDF dumped into the prompt may cost a few cents to read and a few cents to generate a one-paragraph summary — the bulk of the bill is the input. Tokenisation is provider-specific: GPT’s BPE, Claude’s SentencePiece, and Gemini’s tokeniser produce different token counts for the same text, so the cheapest model on a $-per-token basis is not necessarily the cheapest in practice. Use our token counter to compare actual token counts across providers before committing. Related: GPT token, context window.
Worked example
You want to summarise a 40-page legal contract (~25,000 words ≈ 33,000 tokens) using a frontier model priced at $3 per million input tokens and $15 per million output tokens, asking for a 500-token summary. Input cost: 33,000 / 1,000,000 × $3 = $0.099. Output cost: 500 / 1,000,000 × $15 = $0.0075. Total: ~$0.107 per summary. Now imagine doing this for 10,000 contracts: $1,070 — and that’s before any retries, batching savings, or prompt-caching discounts. If you instead use a cheaper model at $0.25/$1.25 per million, the per-document cost drops to roughly $0.0095, total ~$95 for the same job. The arithmetic explains why production LLM systems route easy tasks to small models and reserve the frontier model for the hardest 5%.
When and why it matters
Knowing how LLMs work prevents the most common production failures. They have no memory between API calls — every request must carry the relevant history in the context window or use a separate retrieval system. They confabulate plausibly-formatted but false facts, particularly for recent events, named-entity attributes, and citations; the standard mitigations are retrieval-augmented generation (RAG), tool use, and per-claim grounding checks. They are sensitive to prompt phrasing in non-obvious ways — “think step by step” meaningfully changes accuracy on arithmetic and logic tasks, and few-shot examples can swing answers more than model choice. Reference: Vaswani et al. — Attention Is All You Need (the transformer paper).
Frequently asked questions
- What is a large language model (LLM)?
- An LLM is a neural network trained on large quantities of text to predict and generate language. Models like GPT-4, Claude, and Gemini have billions of parameters and can answer questions, write code, summarise documents, and perform many language tasks.
- How does an LLM generate text?
- An LLM produces text one token at a time by sampling from a probability distribution over its vocabulary, conditioned on all previous tokens in the conversation. This autoregressive process continues until an end-of-sequence token is produced or a length limit is reached.
- What is the difference between an LLM and a chatbot?
- An LLM is the underlying model; a chatbot is a product built on top of one. The same LLM can power multiple interfaces — chat, API, IDE plugin — each with different system prompts, safety layers, and UX, while sharing the same base model weights.
- What limits how much context an LLM can process?
- The context window — measured in tokens — defines the maximum combined length of input and output the model can handle in one inference call. Longer contexts increase memory and compute cost quadratically for attention-based models, which is why context window size is a key specification.
Related
Published May 14, 2026 · Last reviewed May 31, 2026