Glossary
UTF-8
The web's character encoding
By Buğra Sözeri
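UTF-8 (Unicode Transformation Format, 8-bit) is a variable-width encoding for Unicode characters. Each code point encodes to 1 to 4 bytes depending on its value: ASCII characters (U+0000 to U+007F) take one byte; Latin extensions, Greek, Cyrillic, Hebrew, and Arabic (U+0080 to U+07FF) take two; the rest of the Basic Multilingual Plane, including most CJK ideographs (U+0800 to U+FFFF), takes three; emoji and other supplementary-plane characters (U+10000 to U+10FFFF) take four.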
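The length classes above can be checked directly. The following is a minimal sketch that relies only on Python's built-in str.encode; the helper name utf8_length and the sample characters are illustrative choices, not part of any standard.

```python
# One sample character per UTF-8 length class (escapes avoid any
# ambiguity about precomposed vs decomposed forms).
SAMPLES = {
    "A": "ASCII letter",
    "\u00e9": "Latin-1 Supplement (é)",
    "\u042f": "Cyrillic (Я)",
    "\u4e2d": "CJK ideograph (中)",
    "\U0001F389": "emoji outside the BMP (🎉)",
}

def utf8_length(ch: str) -> int:
    """Return how many bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))

for ch, label in SAMPLES.items():
    print(f"U+{ord(ch):04X} {label}: {utf8_length(ch)} byte(s)")
```

Running it prints one line per sample, with byte counts rising from 1 to 4 as the code point value grows.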
Ken Thompson and Rob Pike designed it in 1992, primarily for Plan 9. Key properties:
- ASCII-compatible. Pure-ASCII text is valid UTF-8 byte-for-byte.
- Self-synchronising. Continuation bytes always have the form 10xxxxxx and lead bytes never do, so if a stream is corrupted mid-character, the decoder can find the next character boundary without rewinding.
- No byte-order issue. Bytes are processed left-to-right; there’s no big-endian vs little-endian distinction.
- Compact for Latin scripts, less compact than UTF-16 for CJK-dominant content (3 bytes per character vs 2).
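The self-synchronising property in the list above can be sketched in a few lines. This is an illustration only, assuming the input is otherwise well-formed UTF-8; the function name next_boundary is hypothetical.

```python
def next_boundary(data: bytes, pos: int) -> int:
    """Return the index of the first byte at or after pos that starts a character.

    Continuation bytes always match 10xxxxxx, so any byte that does not
    match (an ASCII byte or a lead byte) marks a character boundary.
    """
    while pos < len(data) and (data[pos] & 0b1100_0000) == 0b1000_0000:
        pos += 1
    return pos

# "A€🎉" encodes to: 41 | E2 82 AC | F0 9F 8E 89
encoded = "A\u20ac\U0001F389".encode("utf-8")

# Pretend reading starts at index 2, in the middle of the euro sign.
start = next_boundary(encoded, 2)
print(start, encoded[start:].decode("utf-8"))
```

The scan only ever moves forward and inspects at most three continuation bytes before reaching a boundary, which is why no rewinding is needed.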
UTF-8 is the dominant encoding on the web, used by over 98% of pages as of 2024. It’s the default for HTML5, JSON (RFC 8259 requires UTF-8 for JSON text exchanged between systems that are not part of a closed ecosystem), and virtually every modern protocol. Older encodings (Windows-1252, ISO-8859-1, Shift_JIS) survive in legacy systems but should be converted on entry to any modern stack.
The BOM (Byte Order Mark) controversy: UTF-16 and UTF-32 use a 2- or 4-byte BOM at the start of a file to declare byte order, but UTF-8 has no byte order to mark. Microsoft tooling has historically prepended a 3-byte UTF-8 BOM (EF BB BF) anyway, while Unix tooling typically doesn’t. The result: shell scripts saved with a BOM break because the file no longer starts with #!, CSV imports show invisible junk in the first cell, and some YAML and JSON parsers reject the file. The Unicode standard tolerates but does not require a UTF-8 BOM; the modern recommendation is “don’t add one.” If you receive a file with one, strip it on read.
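Stripping a BOM on read can be done in more than one way. The sketch below shows two options from Python's standard library: a manual check against codecs.BOM_UTF8 and the utf-8-sig codec. The helper name strip_utf8_bom and the sample CSV header are made up for illustration.

```python
import codecs

def strip_utf8_bom(raw: bytes) -> bytes:
    """Drop a leading UTF-8 BOM (EF BB BF) if present; return other input unchanged."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw

# A CSV header as a BOM-writing editor might save it (sample data).
with_bom = codecs.BOM_UTF8 + b"name,qty\n"

print(strip_utf8_bom(with_bom))        # bytes without the BOM
print(with_bom.decode("utf-8-sig"))    # utf-8-sig skips a BOM while decoding
print(repr(with_bom.decode("utf-8")))  # plain utf-8 keeps it as U+FEFF
```

The utf-8-sig codec is usually the simpler choice when the goal is text, since it accepts input with or without a BOM.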
String length vs byte length, the always-relevant gotcha: in JavaScript, "hello".length returns 5 (UTF-16 code units) and Buffer.byteLength("hello", "utf8") in Node.js also returns 5 (bytes; the two are equal for ASCII). For "café", the character length is 4 but the byte length is 5 (the é is 2 bytes). For "🎉", the length in JavaScript is 2 (a surrogate pair) but the byte length is 4 and the grapheme length is 1. Truncating UTF-8 strings by byte count without grapheme awareness can cut a multi-byte sequence or a grapheme cluster in half, leaving invalid bytes that decode to replacement characters (U+FFFD) or a visibly broken emoji. The standard fix is to cut on grapheme boundaries using the Intl.Segmenter API (modern browsers and Node.js) or the graphemer npm package. Reference: RFC 3629, UTF-8, a transformation format of ISO 10646.
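The three ways of counting can be reproduced outside JavaScript. The sketch below uses Python, where len() counts code points rather than UTF-16 code units, so the UTF-16 figure is derived by encoding. The function names length_report and truncate_utf8 are hypothetical. The truncation helper only protects code point boundaries; it is not a grapheme-aware replacement for Intl.Segmenter.

```python
def length_report(text: str) -> dict:
    """Count the same string three ways."""
    return {
        "code_points": len(text),                           # Python's len()
        "utf16_units": len(text.encode("utf-16-le")) // 2,  # JavaScript's .length
        "utf8_bytes": len(text.encode("utf-8")),
    }

def truncate_utf8(text: str, max_bytes: int) -> str:
    """Cut text to at most max_bytes of UTF-8 without splitting a code point."""
    data = text.encode("utf-8")
    if max_bytes <= 0:
        return ""
    if len(data) <= max_bytes:
        return text
    end = max_bytes
    # Back up while the byte at the cut point is a continuation byte (10xxxxxx).
    while end > 0 and (data[end] & 0b1100_0000) == 0b1000_0000:
        end -= 1
    return data[:end].decode("utf-8")

print(length_report("caf\u00e9"))     # "café"
print(length_report("\U0001F389"))    # "🎉"
print(truncate_utf8("caf\u00e9", 4))  # cut falls inside é, so é is dropped
```

A multi-code-point grapheme, such as a flag or a family emoji, can still be split by this helper; handling those needs a Unicode segmentation library.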
Worked example
Encode the string “A€🎉”. Each character’s code point and UTF-8 byte sequence: A (U+0041) → 1 byte 0x41. € (U+20AC) → 3 bytes 0xE2 0x82 0xAC (the high bits of the first byte encode the length: 1110xxxx 10xxxxxx 10xxxxxx). 🎉 (U+1F389) → 4 bytes 0xF0 0x9F 0x8E 0x89 (11110xxx 10xxxxxx 10xxxxxx 10xxxxxx). Total: 8 bytes for 3 characters. In a database column declared VARCHAR(10) with character semantics (PostgreSQL with a UTF8 database, MySQL with utf8mb4), this fits comfortably, because the limit counts 3 characters rather than 8 bytes. In a column using MySQL’s legacy utf8 charset (an alias for utf8mb3, which stores at most 3 bytes per character), the emoji’s 4-byte encoding cannot be stored at all, whatever the column length; this is the canonical “cannot store emoji in MySQL” failure mode. The fix, available since MySQL 5.5.3, is to use utf8mb4 as the column and connection charset.
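The worked example can be reproduced with a hand-written encoder. This is a teaching sketch of the bit layouts defined in RFC 3629, not a replacement for a library encoder; the function name encode_code_point is an illustrative choice, and the result is compared against Python's built-in str.encode.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode scalar value to UTF-8 by hand."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp <= 0x7F:      # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])

text = "A\u20ac\U0001F389"  # "A€🎉"
encoded = b"".join(encode_code_point(ord(ch)) for ch in text)
print(encoded.hex(" "))     # → 41 e2 82 ac f0 9f 8e 89
print(encoded == text.encode("utf-8"))
```

Each branch copies the low 6 bits of the code point into a continuation byte and shifts the remainder toward the lead byte, which is where the 1, 3, and 4 byte counts in the example come from.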
When and why it matters
Every byte you process on the modern web is likely UTF-8, but several layers still get it wrong. HTTP messages without an explicit Content-Type: text/...; charset=utf-8 may default to ISO-8859-1 in older proxies; legacy mainframe ETL jobs deliver EBCDIC files that must be transcoded on ingest; Windows console output defaults to the system OEM code page (CP437 on most US-English installs; CP-1252 is the separate ANSI code page) and corrupts piped text unless chcp 65001 is run first. The defensive practice: declare UTF-8 explicitly everywhere (HTML <meta charset="utf-8">, HTTP headers, file encoding pragmas, database connection charsets), and validate any inbound bytes against the UTF-8 grammar, because invalid byte sequences are a reliable signal of a charset mismatch upstream. Reference: WHATWG Encoding Standard.
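Validating inbound bytes can be as simple as a strict decode. The sketch below assumes Python's built-in decoder, which rejects malformed sequences by default; the helper name decode_strict_utf8 and the sample inputs are illustrative.

```python
from typing import Optional

def decode_strict_utf8(raw: bytes) -> Optional[str]:
    """Return the decoded text, or None if raw is not well-formed UTF-8."""
    try:
        return raw.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return None

good = "caf\u00e9".encode("utf-8")   # 63 61 66 C3 A9
bad = "caf\u00e9".encode("cp1252")   # 63 61 66 E9: a lone lead byte

print(decode_strict_utf8(good))
print(decode_strict_utf8(bad))       # None signals a likely charset mismatch
```

Returning None instead of decoding with replacement characters keeps the mismatch visible, so the caller can decide whether to reject the input or retry with a declared legacy encoding.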
Frequently asked questions
- What is UTF-8?
- UTF-8 is a variable-width Unicode encoding that represents each code point using 1 to 4 bytes. ASCII characters (U+0000 to U+007F) use exactly 1 byte, making UTF-8 fully backward-compatible with ASCII. It is the dominant encoding on the web, used by over 98% of pages.
- How does UTF-8 encoding work in practice?
- A Latin letter A (U+0041) is stored as a single byte 0x41. The grinning face emoji (U+1F600) requires 4 bytes: 0xF0 0x9F 0x98 0x80. This design means English text stored as ASCII is also valid UTF-8, and multilingual text is handled by adding more bytes only when needed.
- What is the difference between UTF-8 and UTF-16?
- UTF-8 uses 1 to 4 bytes and is ASCII-compatible, making it ideal for web and file storage. UTF-16 uses 2 or 4 bytes and is common in Windows and Java internals. UTF-8 is more space-efficient for ASCII-heavy text; UTF-16 needs only 2 bytes for most common CJK characters, whereas UTF-8 needs 3.
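The size trade-off between the two encodings can be measured on sample text. This is a small sketch using Python's built-in codecs; the function name encoded_sizes and the sample strings are illustrative, and utf-16-le is used so that no BOM is counted.

```python
def encoded_sizes(text: str) -> tuple:
    """Return (utf8_bytes, utf16_bytes) for text, with no BOM included."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

print(encoded_sizes("hello"))               # ASCII: UTF-8 is half the size
print(encoded_sizes("\u65e5\u672c\u8a9e"))  # "日本語": UTF-16 is smaller
print(encoded_sizes("\U0001F389"))          # "🎉": 4 bytes in both
```

Real documents usually mix scripts with ASCII markup, whitespace, and digits, so the measured difference for CJK content is often smaller than the per-character figures suggest.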
Published May 14, 2026 · Last reviewed May 31, 2026