Glossary
UTF-8
The web's character encoding
UTF-8 (Unicode Transformation Format, 8-bit) is a variable-width encoding for Unicode characters. Each code point encodes to 1-4 bytes depending on its value: ASCII characters (U+0000 to U+007F) take one byte; Latin extensions, Greek, and Cyrillic take two; most CJK ideographs take three; emoji and other supplementary-plane characters take four.
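The four width tiers above can be checked directly; a minimal sketch using one sample character per tier (the specific characters are illustrative choices, not from the original):

```python
# Byte lengths of UTF-8 encodings at each width tier.
samples = [
    "A",   # U+0041, ASCII            -> 1 byte
    "é",   # U+00E9, Latin extension  -> 2 bytes
    "語",  # U+8A9E, CJK ideograph    -> 3 bytes
    "😀",  # U+1F600, emoji           -> 4 bytes
]
for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}: {len(encoded)} byte(s) -> {encoded.hex(' ')}")
```

Running this prints 1, 2, 3, and 4 bytes respectively, matching the ranges described above.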
UTF-8 was designed by Ken Thompson and Rob Pike in 1992, primarily for the Plan 9 operating system. Key properties:
- ASCII-compatible. Pure-ASCII text is valid UTF-8 byte-for-byte.
- Self-synchronising. If a stream is corrupted mid-character, the decoder can find the next character boundary without rewinding.
- No byte-order issue. Bytes are processed left-to-right; there’s no big-endian vs little-endian distinction.
- Compact for Latin scripts, less compact than UTF-16 for CJK-dominant content (3 bytes per character vs 2).
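The self-synchronisation property follows from the byte layout: every continuation byte has the bit pattern 10xxxxxx, so a decoder dropped mid-character can skip forward to the first non-continuation byte. A small sketch (the function name and sample string are illustrative assumptions):

```python
def next_boundary(data: bytes, pos: int) -> int:
    """Scan forward from pos to the next UTF-8 character boundary.

    Continuation bytes match 10xxxxxx (0x80-0xBF); any other byte
    starts a new character, so no rewinding is ever needed.
    """
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos

data = "héllo".encode("utf-8")   # b'h\xc3\xa9llo'
# Suppose we land mid-character at index 2 (the 0xA9 continuation
# byte of 'é'): the next boundary is index 3, the start of 'l'.
print(next_boundary(data, 2))    # -> 3
```

This is why a corrupted or truncated UTF-8 stream loses at most one character, unlike fixed-offset encodings where a single dropped byte can garble everything that follows.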
UTF-8 is the dominant encoding on the web — over 98% of pages as of 2024. It’s the default for HTML5, JSON (RFC 8259 requires UTF-8 for JSON text exchanged between systems), and virtually every modern protocol. Older encodings (Windows-1252, ISO-8859-1, Shift-JIS) survive in legacy systems but should be converted to UTF-8 on entry to any modern stack.
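Converting a legacy encoding at the system boundary is a decode-then-re-encode step; a sketch assuming Windows-1252 input (the sample string is illustrative):

```python
# Legacy Windows-1252 bytes: 'é' is the single byte 0xE9.
legacy = "café".encode("cp1252")              # b'caf\xe9'

# Decode with the declared legacy codec, re-encode as UTF-8 for
# everything downstream. 'é' becomes the two-byte sequence C3 A9.
utf8 = legacy.decode("cp1252").encode("utf-8")
print(utf8)                                   # -> b'caf\xc3\xa9'
```

Doing this once at ingestion means the rest of the stack handles exactly one encoding.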
Published May 14, 2026