Glossary

Correlation

How tightly two variables move together

Correlation measures the degree to which two variables move together. The standard measure is Pearson’s r: a single number from −1 to +1, where +1 means a perfect positive linear relationship, 0 means no linear relationship, and −1 means a perfect negative linear relationship.

Practical interpretation (a common rule of thumb; cutoffs vary by field):

  • |r| < 0.3 — weak
  • 0.3 ≤ |r| < 0.7 — moderate
  • |r| ≥ 0.7 — strong

Three things every reader of correlation numbers should know:

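  1. Pearson’s r only captures linear relationships. Two variables related by a perfect quadratic (y = x²) can have r ≈ 0 if x ranges symmetrically over positive and negative values. For relationships that are non-linear but monotonic (always rising or always falling), Spearman’s rho is the more robust alternative; it does not detect a U-shaped relationship any better than Pearson’s r does.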
  2. Correlation is not causation. Two variables can correlate strongly because A causes B, because B causes A, because both are caused by a third variable, or by pure coincidence (especially in small samples or when many pairs are compared).
  3. Outliers distort r dramatically. A single outlier in a small dataset can flip the sign of the correlation. Always plot the data before trusting the number.
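Points 1 and 3 can be checked numerically. The following is a minimal sketch in plain Python; the helper name `pearson_r` and the small datasets are made up for illustration and are not taken from any library.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r for two equal-length sequences of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Point 1: a perfect quadratic over a symmetric x range has no linear trend.
xs = [-3, -2, -1, 0, 1, 2, 3]
r_quadratic = pearson_r(xs, [x ** 2 for x in xs])

# Point 3: a single extreme point flips the sign in a small sample.
base_x = [1, 2, 3, 4, 5]
base_y = [5, 4, 3, 2, 1]  # perfectly decreasing
r_clean = pearson_r(base_x, base_y)
r_with_outlier = pearson_r(base_x + [20], base_y + [20])

print(f"quadratic:    r = {r_quadratic:.3f}")
print(f"clean:        r = {r_clean:.3f}")
print(f"with outlier: r = {r_with_outlier:.3f}")
```

In this illustrative data the quadratic yields r = 0 despite a perfect deterministic relationship, and adding the single point (20, 20) turns a perfectly negative correlation into a strongly positive one.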

For ordinal (rank-ordered) data, use Spearman’s rank correlation instead of Pearson. For two binary variables, use the phi coefficient. For nominal categorical data with more than two levels, use Cramér’s V.
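As an illustration of the binary case, the phi coefficient can be computed directly from the four cell counts of a 2×2 table. This is a minimal sketch: the function name `phi_coefficient` and the counts are hypothetical.

```python
import math


def phi_coefficient(a, b, c, d):
    """Phi for a 2x2 table of counts laid out as [[a, b], [c, d]]."""
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denominator


# Hypothetical counts: rows = group A / group B, columns = yes / no.
phi = phi_coefficient(30, 10, 15, 25)
print(f"phi = {phi:.3f}")
```

Phi is numerically the same as Pearson’s r computed on the two variables coded as 0 and 1, so it reads on the same −1 to +1 scale.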

Anscombe’s quartet, the famous illustration: in 1973, statistician Francis Anscombe constructed four small datasets that share nearly identical means, variances, correlation coefficient (r ≈ 0.816), and linear-regression line, yet look completely different when plotted. One is a roughly linear trend; one is a smooth curve; one is a tight line with a single outlier; one is a vertical line of points at a single x value plus one rogue point. The quartet is still cited as the canonical case for “always plot the data first.” The Datasaurus Dozen (Matejka & Fitzmaurice, 2017) extends the same idea: a dinosaur-shaped scatter plot plus twelve other datasets that match its summary statistics to two decimal places. Both make the same point: a correlation number on its own is never sufficient. Reference: NIST/SEMATECH e-Handbook, Linear Correlation.
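The quartet’s shared correlation can be reproduced in a few lines. The sketch below assumes the values as commonly reproduced from Anscombe (1973); the helper `pearson_r` is an illustrative name, not a library function.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r for two equal-length sequences of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Datasets I-III share the same x values; dataset IV has its own.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]

quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

correlations = {name: pearson_r(xs, ys) for name, (xs, ys) in quartet.items()}
for name, r in correlations.items():
    print(f"Dataset {name}: r = {r:.3f}")
```

All four values should come out close to 0.816, which is exactly why the number alone cannot distinguish the four shapes; only a plot can.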

Worked example

Five data points: (1,2), (2,4), (3,5), (4,4), (5,5). Means: x̄ = 3, ȳ = 4. Deviations x − x̄: −2, −1, 0, 1, 2. Deviations y − ȳ: −2, 0, 1, 0, 1. Sum of cross-products: Σ(xᵢ − x̄)(yᵢ − ȳ) = 4 + 0 + 0 + 0 + 2 = 6. Sum of squared x deviations: 10; of squared y deviations: 6. Pearson r = 6 / √(10 × 6) = 6 / 7.746 ≈ 0.775, a strong positive linear relationship. A scatter plot would show that the interpretation holds. If the third point were (3, 50) instead of (3, 5), the cross-product sum would stay at 6 (that point sits at x̄, so its x deviation is 0), but the sum of squared y deviations would jump to 1716 and r would collapse to 6 / √(10 × 1716) ≈ 0.046: a single outlier would erase an otherwise strong correlation.
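The arithmetic above can be reproduced step by step. This is a minimal sketch in plain Python; the function name `pearson_parts` is illustrative, and it returns the intermediate sums so they can be compared with the hand calculation.

```python
import math


def pearson_parts(points):
    """Return (sxy, sxx, syy, r) for a list of (x, y) pairs."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)  # cross-products
    sxx = sum((x - mean_x) ** 2 for x in xs)  # squared x deviations
    syy = sum((y - mean_y) ** 2 for y in ys)  # squared y deviations
    return sxy, sxx, syy, sxy / math.sqrt(sxx * syy)


original = [(1, 2), (2, 4), (3, 5), (4, 4), (5, 5)]
with_outlier = [(1, 2), (2, 4), (3, 50), (4, 4), (5, 5)]

sxy, sxx, syy, r = pearson_parts(original)
print(f"original:     sxy={sxy}, sxx={sxx}, syy={syy}, r={r:.3f}")

sxy_o, sxx_o, syy_o, r_o = pearson_parts(with_outlier)
print(f"with outlier: sxy={sxy_o}, sxx={sxx_o}, syy={syy_o}, r={r_o:.3f}")
```

The outlier leaves the cross-product sum untouched because it sits exactly at the mean of x, so all of the damage comes from the inflated y variance in the denominator.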

When correlation drives decisions

Portfolio diversification: assets with low pairwise correlation reduce overall portfolio variance even when their individual volatilities are high. The 2008 financial crisis showed the catastrophic counterexample: equities, corporate bonds, REITs, and at times even gold moved together when liquidity dried up, and correlation matrices estimated from calm markets understated tail risk.

ML feature engineering: two features with |r| > 0.95 are effectively redundant; dropping one rarely degrades model accuracy and speeds up training.

Experimentation: testing many metrics at once inflates the false-positive rate, and correlated metrics that move together are not independent confirmations of an effect. Apply a Bonferroni or Benjamini-Hochberg correction; Bonferroni remains valid under any dependence between metrics but becomes conservative when they are strongly correlated.

Related: regression, variance. Background: Pearson correlation coefficient (Wikipedia).
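The diversification effect follows from the two-asset variance formula σₚ² = w₁²σ₁² + w₂²σ₂² + 2w₁w₂ρσ₁σ₂. The sketch below uses hypothetical inputs (two assets at 20% volatility each, held 50/50); the function name `portfolio_volatility` is illustrative, and the two correlation values are the ones used in the FAQ below.

```python
import math


def portfolio_volatility(weight_1, sigma_1, sigma_2, rho):
    """Volatility of a two-asset portfolio; the second weight is 1 - weight_1."""
    weight_2 = 1.0 - weight_1
    variance = (
        (weight_1 * sigma_1) ** 2
        + (weight_2 * sigma_2) ** 2
        + 2 * weight_1 * weight_2 * rho * sigma_1 * sigma_2
    )
    return math.sqrt(variance)


# Hypothetical inputs: both assets have 20% volatility, equal weights.
vol_high_corr = portfolio_volatility(0.5, 0.20, 0.20, rho=0.85)
vol_low_corr = portfolio_volatility(0.5, 0.20, 0.20, rho=-0.3)

print(f"rho = 0.85: portfolio volatility = {vol_high_corr:.4f}")
print(f"rho = -0.3: portfolio volatility = {vol_low_corr:.4f}")
```

Under these assumptions the highly correlated pair barely improves on holding either asset alone, while the negatively correlated pair cuts volatility substantially. The same formula also shows why a correlation estimate that jumps toward 1 in a crisis removes most of the expected benefit.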

Frequently asked questions

What is correlation?
Correlation (Pearson's r) measures the linear relationship between two variables on a scale of −1 to +1. A value of +1 means a perfect positive linear relationship, −1 means a perfect negative linear relationship, and 0 means no linear relationship.
How is correlation used in practice?
A finance analyst finds that two stocks have a correlation of r = 0.85 — they move together strongly. Adding the second stock to a portfolio containing the first provides little diversification benefit; a stock with r = −0.3 would provide much more.
What is the difference between correlation and causation?
Correlation only measures statistical co-movement, not cause and effect. Ice cream sales and drowning rates are strongly correlated because both rise in summer; ice cream does not cause drowning. Establishing causation requires controlled experiments or causal inference methods.
What is the difference between Pearson and Spearman correlation?
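Pearson's r measures linear relationships between continuous variables. Computing it does not require normally distributed data, but its standard significance tests assume approximate normality, and the value itself is sensitive to outliers. Spearman's ρ (rho) ranks the data first and measures monotonic relationships, making it more robust to outliers and appropriate for ordinal data like survey ratings.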
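The Pearson and Spearman difference can be seen on a monotonic but non-linear relationship. This is a minimal sketch that computes Spearman’s ρ as Pearson’s r on average ranks; the helper names `pearson_r`, `average_ranks`, and `spearman_rho` and the data are illustrative, not a library API.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r for two equal-length sequences of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho: Pearson's r applied to the ranks of each variable."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


xs = [1, 2, 3, 4, 5, 6]
ys = [math.exp(x) for x in xs]  # always increasing, far from a straight line

r = pearson_r(xs, ys)
rho = spearman_rho(xs, ys)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

Because the ranks of x and y are identical here, ρ is 1 even though r falls well short of it; the gap between the two numbers is itself a hint that the relationship is monotonic but curved.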

Published May 16, 2026 · Last reviewed May 31, 2026