Glossary
Unicode
The character encoding for every language
By Buğra Sözeri
Unicode is the universal character encoding standard maintained by the Unicode Consortium. It assigns a unique number (a “code point”) to every character in every writing system that’s been formally proposed and accepted: around 150,000 characters across 161 scripts as of Version 15, including modern languages, historical scripts, mathematical symbols, technical notation, and emoji.
Code points are written as U+XXXX (4-6 hex digits). The letter A is U+0041; the Turkish dotless i is U+0131; the rocket emoji is U+1F680.
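As a small illustration of the notation, the sketch below converts between a character and its U+XXXX form. The helper names toCodePointNotation and fromCodePointNotation are made up for this example; only codePointAt and String.fromCodePoint are built-in.

```javascript
// Format the first code point of a string in U+XXXX notation.
// (Hypothetical helper name, used only for this illustration.)
function toCodePointNotation(ch) {
  const cp = ch.codePointAt(0);
  // Pad to at least four hex digits; supplementary code points need five or six.
  return "U+" + cp.toString(16).toUpperCase().padStart(4, "0");
}

// Parse a "U+XXXX" string back into the character it names.
function fromCodePointNotation(notation) {
  return String.fromCodePoint(parseInt(notation.slice(2), 16));
}

console.log(toCodePointNotation("A"));         // U+0041
console.log(toCodePointNotation("\u0131"));    // U+0131 (Turkish dotless i)
console.log(toCodePointNotation("\u{1F680}")); // U+1F680 (rocket)
```

codePointAt is used rather than charCodeAt because charCodeAt would return only the first UTF-16 code unit of the rocket emoji.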
Unicode itself doesn’t specify how code points are serialised to bytes — that’s the job of encodings. The three main encodings:
- UTF-8 — variable-width 1-4 bytes per code point. ASCII-compatible. The dominant web encoding.
- UTF-16 — 2 or 4 bytes per code point. Used internally by Java, JavaScript strings, and Windows.
- UTF-32 — exactly 4 bytes per code point. Simple but wasteful; rarely used for storage.
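The size differences between the three encodings can be checked with a short sketch. It assumes well-formed strings and ignores byte-order marks; encodedSizes is a hypothetical helper name, while TextEncoder is the standard API available in browsers and Node.

```javascript
// Byte length of a string in each of the three encodings (no byte-order mark).
// encodedSizes is an illustrative helper, not a built-in.
function encodedSizes(str) {
  const codePoints = Array.from(str).length; // the string iterator walks code points
  return {
    utf8: new TextEncoder().encode(str).length, // TextEncoder always produces UTF-8
    utf16: str.length * 2,                      // str.length counts 16-bit code units
    utf32: codePoints * 4,                      // fixed four bytes per code point
  };
}

console.log(encodedSizes("A"));         // { utf8: 1, utf16: 2, utf32: 4 }
console.log(encodedSizes("\u0131"));    // { utf8: 2, utf16: 2, utf32: 4 }
console.log(encodedSizes("\u{1F680}")); // { utf8: 4, utf16: 4, utf32: 4 }
```

For ASCII-heavy text UTF-8 is the smallest of the three, which is part of why it dominates on the web.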
Convertitive’s tools handle full Unicode where they should — the Base64 encoder handles UTF-8 correctly, the word counter counts code points and graphemes, the case converter respects Turkish-specific casing (the dotted vs dotless I problem).
Code point ≠ character ≠ grapheme, the distinction that trips up most string libraries: the letter “é” can be represented two ways in Unicode, either as the single code point U+00E9 (precomposed) or as U+0065 (e) followed by U+0301 (combining acute accent). Both render identically and both count as “one character” to a human, but one is one code point and the other is two. Emoji are worse: 👨‍👩‍👧‍👦 (family: man, woman, girl, boy) is a ZWJ sequence of seven code points (four emoji joined by three U+200D zero-width joiners) but one grapheme. Length counts in JavaScript (str.length) return UTF-16 code units, not graphemes, so this family emoji has a string length of 11. Use Intl.Segmenter (modern browsers, Node 16+) for accurate grapheme counts.
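The three ways of counting can be compared side by side. This is a minimal sketch, assuming a runtime with Intl.Segmenter; countUnits is a hypothetical helper name, and the family emoji used here is the four-person ZWJ sequence written out with escapes so the joiners are visible.

```javascript
// Count a string three ways: UTF-16 code units, code points, and graphemes.
// countUnits is an illustrative helper, not a built-in.
function countUnits(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return {
    codeUnits: str.length,                                  // what str.length reports
    codePoints: Array.from(str).length,                     // the string iterator yields code points
    graphemes: Array.from(segmenter.segment(str)).length,   // user-perceived characters
  };
}

const precomposed = "\u00E9";  // é as one code point
const decomposed = "e\u0301";  // e followed by a combining acute accent
// man + ZWJ + woman + ZWJ + girl + ZWJ + boy
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";

console.log(countUnits(precomposed)); // { codeUnits: 1, codePoints: 1, graphemes: 1 }
console.log(countUnits(decomposed));  // { codeUnits: 2, codePoints: 2, graphemes: 1 }
console.log(countUnits(family));      // { codeUnits: 11, codePoints: 7, graphemes: 1 }
```

Only the grapheme count matches what a reader would call the number of characters in all three cases.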
Unicode normalisation forms, or when text doesn’t match itself: two visually identical strings can have different code point sequences, and therefore different bytes, because of the precomposed-versus-decomposed split above. The Unicode Standard defines four normalisation forms (NFC, NFD, NFKC, NFKD) that canonicalise this. NFC (composed) is the form the W3C recommends for web content and the one most keyboards and input methods produce. Most file systems store names exactly as given without normalising, but macOS HFS+ stores them in a variant of NFD (decomposed), which is why file paths sometimes look identical but compare unequal across systems. Database string-equality checks and search-index lookups should normalise to NFC on write, and normalise queries the same way, to avoid the “identical query doesn’t match” bug. Reference: The Unicode Standard, Version 15.
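To see what the four forms do, the sketch below lists the code points produced by each. The helper names codePointsOf, normalisationTable and normaliseForStorage are assumptions made for this example; String.prototype.normalize is the built-in doing the work. The second input, the “ﬁ” ligature (U+FB01), shows how the compatibility forms (NFKC, NFKD) also fold compatibility characters, which the canonical forms leave alone.

```javascript
const FORMS = ["NFC", "NFD", "NFKC", "NFKD"];

// List the code points of a string in U+XXXX notation (illustrative helper).
function codePointsOf(str) {
  return Array.from(str, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"));
}

// Show what each normalisation form produces for the same input.
function normalisationTable(str) {
  const table = {};
  for (const form of FORMS) {
    table[form] = codePointsOf(str.normalize(form)).join(" ");
  }
  return table;
}

console.log(normalisationTable("\u00E9"));
// { NFC: 'U+00E9', NFD: 'U+0065 U+0301', NFKC: 'U+00E9', NFKD: 'U+0065 U+0301' }

console.log(normalisationTable("\uFB01"));
// { NFC: 'U+FB01', NFD: 'U+FB01', NFKC: 'U+0066 U+0069', NFKD: 'U+0066 U+0069' }

// Hypothetical write-path helper: store and look up text in one form only.
function normaliseForStorage(value) {
  return value.normalize("NFC");
}
```

Applying the same helper to both stored values and incoming queries is what makes equality checks reliable; which form is chosen matters less than using it consistently.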
Worked example
Take the string “café”. It has two equally valid Unicode representations: NFC = U+0063 U+0061 U+0066 U+00E9 (4 code points) and NFD = U+0063 U+0061 U+0066 U+0065 U+0301 (5 code points, “e” + combining acute). Both render identically, but they differ on every count. In UTF-8, NFC is 5 bytes (63 61 66 C3 A9) and NFD is 6 bytes (63 61 66 65 CC 81). In JavaScript, "café".length is 4 for NFC and 5 for NFD. Compare them without normalising and nfc === nfd is false. Run both through String.prototype.normalize("NFC") first and they become identical. Bug fixed. This is a common root cause of “I searched for José and the record exists but doesn’t come back” in databases storing user-submitted names from mixed Mac and Windows clients.
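The same walk-through can be reproduced in a few lines. This is a sketch using escapes so the two forms cannot be confused in an editor; utf8Hex is a hypothetical helper name.

```javascript
const nfc = "caf\u00E9";   // precomposed é
const nfd = "cafe\u0301";  // e + combining acute accent

// Hex dump of a string's UTF-8 bytes (illustrative helper).
const utf8Hex = (s) =>
  Array.from(new TextEncoder().encode(s), (b) =>
    b.toString(16).toUpperCase().padStart(2, "0")).join(" ");

console.log(utf8Hex(nfc));            // 63 61 66 C3 A9
console.log(utf8Hex(nfd));            // 63 61 66 65 CC 81
console.log(nfc.length, nfd.length);  // 4 5
console.log(nfc === nfd);             // false
console.log(nfc.normalize("NFC") === nfd.normalize("NFC")); // true
```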
When and why it matters
Every internationalised application hits Unicode edge cases at some point: usernames with combining marks, file paths copied between macOS (where HFS+ stores names decomposed) and Linux (where names are usually NFC because that is what input methods produce), Vietnamese with stacked diacritics, Turkish casing rules (in Turkish the lowercase of “I” is “ı”, not “i”, so locale-sensitive case conversion can break case-insensitive comparisons; this is the “Turkish-I problem”), Arabic and Hebrew right-to-left rendering, and ZWJ-based emoji, where new sequences ship faster than your phone’s font updates. The defensive rules: normalise to NFC on input, use Intl.Collator for locale-aware sorting (not Array.prototype.sort’s default UTF-16 code-unit order), use Intl.Segmenter for grapheme-aware truncation, and let CLDR (Unicode’s locale database) drive locale-specific behaviour rather than hard-coding rules. Reference: Unicode Standard Annex #15 (UAX #15), Unicode Normalization Forms.
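The defensive rules can be sketched as a handful of small helpers. The function names below are assumptions made for illustration, and the sketch assumes a runtime with full locale data (the default in current browsers and Node builds).

```javascript
// Rule 1: normalise to NFC on input.
function cleanInput(value) {
  return value.normalize("NFC");
}

// Rule 2: sort with Intl.Collator instead of the default code-unit order.
function sortForLocale(items, locale) {
  return [...items].sort(new Intl.Collator(locale).compare);
}

// Rule 3: truncate by grapheme so combining marks and ZWJ sequences stay intact.
function truncateGraphemes(str, max) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  let out = "";
  let count = 0;
  for (const { segment } of segmenter.segment(str)) {
    if (count === max) break;
    out += segment;
    count += 1;
  }
  return out;
}

const words = ["zebra", "\u00E9clair", "apple"];
console.log([...words].sort().join(", "));          // apple, zebra, éclair
console.log(sortForLocale(words, "fr").join(", ")); // apple, éclair, zebra

const text = "cafe\u0301!";
console.log(text.slice(0, 4) === "cafe");                   // true: the accent is cut off
console.log(truncateGraphemes(text, 4) === "cafe\u0301");   // true: the accent is kept

// Rule 4: let locale data drive casing rather than hard-coding it.
console.log("I".toLowerCase() === "i");                // true
console.log("I".toLocaleLowerCase("tr") === "\u0131"); // true: Turkish dotless ı
```

Passing the locale explicitly, as in the last line, keeps behaviour predictable on servers whose default locale differs from the user’s.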
Frequently asked questions
- What is Unicode?
- Unicode is the universal character encoding standard that assigns a unique code point to every character in every encoded writing system, modern and historical: about 150,000 characters across 161 scripts as of Version 15. It underpins text processing in virtually all modern software.
- How does Unicode work in practice?
- Each character has a code point (e.g. U+0041 for A, U+1F600 for the grinning face emoji). These are stored in memory using an encoding such as UTF-8, UTF-16, or UTF-32. UTF-8 is the dominant web encoding, representing ASCII characters in 1 byte and others in 2 to 4 bytes.
- What is the difference between Unicode and UTF-8?
- Unicode is the abstract standard mapping characters to code points. UTF-8 is one of several encodings that serialise those code points into bytes for storage and transmission. UTF-8 is the most popular encoding because it is ASCII-compatible and space-efficient for Latin-script text.
Published May 14, 2026 · Last reviewed May 31, 2026