UTF-8

The web's character encoding

UTF-8 (Unicode Transformation Format, 8-bit) is a variable-width encoding for Unicode characters. Each code point encodes to 1-4 bytes depending on its value: ASCII characters (U+0000 to U+007F) take one byte; most other Latin, Greek, Cyrillic, Hebrew, and Arabic letters (U+0080 to U+07FF) two; the rest of the Basic Multilingual Plane, including CJK ideographs (U+0800 to U+FFFF), three; and everything above the BMP, including emoji and rare scripts (U+10000 to U+10FFFF), four.
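
These byte lengths can be checked directly; here is a minimal sketch in Python using the built-in str.encode (the sample characters are arbitrary):

    samples = {
        "A": "ASCII letter",        # U+0041, 1 byte
        "é": "Latin-1 Supplement",  # U+00E9, 2 bytes
        "Ж": "Cyrillic",            # U+0416, 2 bytes
        "漢": "CJK ideograph",       # U+6F22, 3 bytes
        "🙂": "emoji",               # U+1F642, 4 bytes
    }

    for ch, label in samples.items():
        encoded = ch.encode("utf-8")
        print(f"U+{ord(ch):04X} {label:<18} {len(encoded)} byte(s): {encoded.hex(' ')}")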

UTF-8 was designed by Ken Thompson and Rob Pike in 1992, originally for the Plan 9 operating system. Its key properties:

  • ASCII-compatible. Pure-ASCII text is valid UTF-8 byte-for-byte.
  • Self-synchronising. If a stream is corrupted mid-character, the decoder can find the next character boundary without rewinding (see the sketch after this list).
  • No byte-order issue. Bytes are processed left-to-right; there’s no big-endian vs little-endian distinction.
  • Compact for Latin scripts, less compact than UTF-16 for CJK-dominant content (3 bytes per character vs 2).
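
To make the self-synchronising property concrete, here is a minimal sketch in plain Python (the helper next_boundary is illustrative, not part of any standard library): continuation bytes always match the bit pattern 10xxxxxx, so a decoder that lands mid-character only needs to skip forward until it reaches a byte that does not match that pattern.

    def next_boundary(data: bytes, pos: int) -> int:
        """Return the index of the next character boundary at or after pos."""
        while pos < len(data) and (data[pos] & 0b1100_0000) == 0b1000_0000:
            pos += 1  # skip continuation bytes (10xxxxxx)
        return pos

    text = "a🙂b".encode("utf-8")   # bytes: 61 f0 9f 99 82 62
    # Pretend the decoder landed mid-character, two bytes into the emoji:
    print(next_boundary(text, 3))   # 5, the index where 'b' begins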

UTF-8 is the dominant encoding on the web, used by over 98% of pages as of 2024. It is the default for HTML5, effectively mandatory for JSON (RFC 8259 requires UTF-8 for any JSON exchanged outside a closed ecosystem), and assumed by virtually every modern protocol. Older encodings (Windows-1252, ISO-8859-1, Shift-JIS) survive in legacy systems but should be converted to UTF-8 at the boundary of any modern stack.
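
A sketch of that conversion step at the system boundary (assuming the incoming bytes are already known to be Windows-1252; detecting an unknown encoding is a separate problem):

    legacy_bytes = b"caf\xe9"                   # "café" encoded as Windows-1252
    text = legacy_bytes.decode("windows-1252")  # bytes -> str (Unicode code points)
    utf8_bytes = text.encode("utf-8")           # str -> UTF-8 bytes: b"caf\xc3\xa9"
    print(text, utf8_bytes)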

Published May 14, 2026