Glossary

Regression

Fitting predictors to outcomes

Regression is a statistical method for modelling the relationship between a dependent variable (the outcome you want to predict) and one or more independent variables (the predictors). The output is a function — typically with parameters fit to historical data — that lets you estimate the outcome from new predictor values.

The simplest form is linear regression: y = β₀ + β₁x + ε. The algorithm finds the β coefficients that minimise the sum of squared residuals (the “errors”). For a dataset of (height, weight) pairs, linear regression produces the best-fit line through the points, which lets you estimate weight from any new height.
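For a single predictor, the least-squares coefficients have a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the point of means. The sketch below illustrates this with made-up (height, weight) pairs; the data and the name `fit_line` are assumptions for illustration, not values from any real study.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1*x; returns (b0, b1)."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of cross-products and sum of squares around the means.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx            # slope
    b0 = y_bar - b1 * x_bar   # intercept: the line passes through (x_bar, y_bar)
    return b0, b1

# Hypothetical (height in cm, weight in kg) pairs, for illustration only.
heights = [150, 160, 170, 180, 190]
weights = [52, 58, 67, 74, 84]

b0, b1 = fit_line(heights, weights)
predicted_175 = b0 + b1 * 175   # estimated weight for a new height of 175 cm
print(f"weight = {b0:.1f} + {b1:.2f} * height; at 175 cm: {predicted_175:.1f} kg")
```

With these invented numbers the fitted slope is 0.8 kg per cm and the estimate at 175 cm is 71 kg. Real data would of course give different coefficients.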

Standard varieties:

  • Multiple linear regression: several predictors. y = β₀ + β₁x₁ + β₂x₂ + ... + ε.
  • Polynomial regression: the predictors include powers of x. y = β₀ + β₁x + β₂x² + .... Fits curved relationships.
  • Logistic regression: the outcome is binary (0/1). The model outputs a probability via the logistic function.
  • Ridge / lasso / elastic-net: linear regression with a penalty for large coefficients. Used when there are many predictors and you want to avoid overfitting.
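Two of the varieties above can be sketched in a few lines. The first function shows how a ridge penalty shrinks the slope of a one-predictor model toward zero (the intercept is left unpenalised, a common convention assumed here). The second shows the logistic function that turns a linear predictor into a probability. The function names, data, and coefficient values are all hypothetical.

```python
import math
from statistics import mean

def ridge_line(xs, ys, penalty):
    """Fit y = b0 + b1*x while penalising b1**2; the intercept is not penalised."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / (sxx + penalty)   # penalty = 0 recovers ordinary least squares
    b0 = y_bar - b1 * x_bar
    return b0, b1

def logistic_probability(b0, b1, x):
    """Map the linear predictor b0 + b1*x to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

# Hypothetical data, for illustration only.
xs = [150, 160, 170, 180, 190]
ys = [52, 58, 67, 74, 84]

_, ols_slope = ridge_line(xs, ys, penalty=0)
_, ridge_slope = ridge_line(xs, ys, penalty=1000)
print(ols_slope, ridge_slope)   # the penalised slope is smaller in magnitude

# Made-up coefficients chosen so the linear predictor is exactly 0 at x = 2.
p_mid = logistic_probability(b0=-3.0, b1=1.5, x=2.0)
print(p_mid)
```

A linear predictor of zero corresponds to a probability of 0.5; large positive or negative values push the probability toward 1 or 0.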

The key sanity checks for any regression are: how well it fits the training data (R², residual plots); how well it generalises to new data (cross-validation, a holdout test set); and whether the residuals look random or show patterns the model missed.
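A minimal sketch of the first two checks, assuming a one-predictor model and a small invented dataset split by hand into training and holdout points. The helper names `fit_line` and `r_squared` are illustrative; in practice a library routine and a proper cross-validation split would replace them.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1*x; returns (b0, b1)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def r_squared(xs, ys, b0, b1):
    """1 - SS_res / SS_tot, evaluated on whichever data set is passed in."""
    y_bar = mean(ys)
    ss_res = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical data: the model is fit on the training points only.
train_x = [1, 2, 3, 4, 5, 6]
train_y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
test_x = [7, 8, 9]
test_y = [14.2, 15.8, 18.1]

b0, b1 = fit_line(train_x, train_y)
r2_train = r_squared(train_x, train_y, b0, b1)   # fit to the training data
r2_test = r_squared(test_x, test_y, b0, b1)      # generalisation to held-out data

# Residuals on the training set; plot these against x to look for patterns.
train_residuals = [y - (b0 + b1 * x) for x, y in zip(train_x, train_y)]
print(f"R2 train = {r2_train:.3f}, R2 holdout = {r2_test:.3f}")
```

A holdout R² far below the training R² is the usual sign of overfitting. Training residuals that curve or fan out when plotted suggest a missed non-linearity or unequal variance.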

Regression is the workhorse of empirical science. Correlation tells you how strongly two variables move together; regression gives you the equation for estimating one from the other.

The classical assumptions and where they break: linear regression’s standard inferential machinery (p-values on coefficients, confidence intervals, F-tests) depends on four assumptions, namely linearity, independent residuals, equal-variance residuals (homoscedasticity), and normal residuals. Real-world data violates one or more of these regularly: time-series data violates independence; financial returns violate homoscedasticity; skewed or heavy-tailed outcomes violate normality, which matters most in small samples, where the central limit theorem cannot be relied on to compensate. Modern statisticians either correct the standard errors (heteroscedasticity-robust “sandwich” estimators, clustered SEs) or skip the analytical formulas and use bootstrap resampling to estimate confidence intervals empirically. The coefficient point estimates themselves remain unbiased under much weaker conditions (the linear form is correct and the errors average to zero at every value of the predictors), so it is usually the uncertainty estimates, not the coefficients, that need rescuing.
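The bootstrap idea can be sketched as follows: resample the (x, y) pairs with replacement, refit the slope each time, and read a confidence interval off the percentiles of the refitted slopes. This is a minimal percentile-bootstrap sketch on simulated data; the function names, the number of resamples, and the simulated true slope of 2 are assumptions for illustration.

```python
import random
from statistics import mean

def slope(xs, ys):
    """Least-squares slope, or None if every x in the sample is identical."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        return None  # degenerate resample: slope undefined
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx

def bootstrap_slope_ci(xs, ys, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the slope, resampling (x, y) pairs."""
    rng = random.Random(seed)
    n = len(xs)
    slopes = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]   # sample rows with replacement
        s = slope([xs[i] for i in idx], [ys[i] for i in idx])
        if s is not None:
            slopes.append(s)
    slopes.sort()
    tail = (1 - level) / 2
    lo = slopes[int(tail * len(slopes))]
    hi = slopes[int((1 - tail) * len(slopes)) - 1]
    return lo, hi

# Simulated data: y = 1 + 2x plus Gaussian noise (hypothetical, fixed seed).
data_rng = random.Random(1)
xs = list(range(20))
ys = [1 + 2 * x + data_rng.gauss(0, 1) for x in xs]

point = slope(xs, ys)
lo, hi = bootstrap_slope_ci(xs, ys)
print(f"slope = {point:.3f}, 95% bootstrap CI = ({lo:.3f}, {hi:.3f})")
```

Resampling whole pairs rather than residuals is a deliberate choice here: it does not assume equal-variance errors. It does still assume the observations are independent, so clustered or time-series data would need a block or cluster bootstrap instead.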

Where regression silently fails (the “regression to the mean” trap): the technique gets its name from Francis Galton’s 1886 observation that tall parents tend to have children who are, on average, somewhat shorter than they are, and short parents children who are somewhat taller; both groups move toward the population mean. Naively extrapolating “the trend” from a regression of children on parents would suggest that the population converges to identical heights over generations. That doesn’t happen, because the children of average-height parents scatter in both directions and keep the overall spread roughly constant. The phenomenon is purely statistical: select cases with extreme values on one measurement, and whenever the second measurement is only imperfectly correlated with the first, the second will on average land closer to the mean. The textbook trap is to mistake this for a real causal effect. Sports performance, customer satisfaction, and medical outcomes all show it; any “intervention that helped people at the extreme” needs a control group to distinguish a real effect from mean reversion. Reference: NIST/SEMATECH e-Handbook, “Linear Regression”.
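A small simulation makes the effect concrete. In the sketch below, every individual has a fixed underlying level and is measured twice with independent noise; nothing is done to anyone between the two measurements. The distributions, sample size, and 10% cutoff are assumptions chosen for illustration.

```python
import random
from statistics import mean

rng = random.Random(42)
n = 10_000

# Each individual has a stable underlying level plus independent measurement noise.
skill = [rng.gauss(0, 1) for _ in range(n)]
first = [s + rng.gauss(0, 1) for s in skill]
second = [s + rng.gauss(0, 1) for s in skill]

# Select the top 10% on the first measurement only.
cutoff = sorted(first)[int(0.9 * n)]
top = [i for i in range(n) if first[i] >= cutoff]

mean_first_top = mean([first[i] for i in top])
mean_second_top = mean([second[i] for i in top])
population_mean = mean(first + second)

print(f"top group, first measurement:  {mean_first_top:.2f}")
print(f"top group, second measurement: {mean_second_top:.2f}")
print(f"population mean:               {population_mean:.2f}")
```

The selected group scores lower on the second measurement even though no intervention took place, yet it stays above the population mean because part of its first score reflected a genuinely high underlying level. An untreated control group chosen by the same cutoff would show the same drop, which is why it is needed to isolate a real effect.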

Frequently asked questions

What is regression in statistics?
Regression is a statistical method for modelling the relationship between one or more predictor variables and an outcome variable, most often a continuous one. Linear regression fits a straight line (or, with several predictors, a plane) that minimises the sum of squared residuals between predicted and observed values.
How is regression used in practice?
A retailer uses linear regression to predict sales from advertising spend and seasonality. A doctor uses logistic regression to estimate the probability of a patient developing diabetes from clinical markers. Both use the model to make quantitative predictions from new inputs.
What is the difference between linear regression and logistic regression?
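Linear regression predicts a continuous numeric outcome (e.g. house price). Logistic regression predicts the probability of a binary outcome (e.g. loan default yes/no), using a sigmoid function to constrain the output to the range 0 to 1. The two also differ in fitting and interpretation: linear regression is usually fit by least squares, and each coefficient is the change in the outcome per unit change in a predictor; logistic regression is fit by maximum likelihood, and each coefficient is the change in the log-odds of the outcome.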

Published May 16, 2026 · Last reviewed May 31, 2026