Unicode

The character encoding for every language

By Buğra Sözeri · Published May 14, 2026 · Updated May 31, 2026

Unicode is the universal character encoding standard maintained by the Unicode Consortium. It assigns a unique number (a “code point”) to every character in every writing system that’s been formally proposed and accepted — currently around 150,000 characters across 161 scripts, including modern languages, historical scripts, mathematical symbols, technical notation, and emoji.

Code points are written as U+XXXX (4-6 hex digits). The letter A is U+0041; the Turkish dotless i is U+0131; the rocket emoji is U+1F680.
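As a small illustration of the notation, the following JavaScript sketch formats the first code point of a string in U+XXXX form. The helper name toUPlus is hypothetical and exists only for this example; codePointAt and padStart are standard String methods.

```javascript
// Format the first code point of a string in U+XXXX notation.
// `toUPlus` is a hypothetical helper name used only for this illustration.
function toUPlus(text) {
  // codePointAt reads a whole code point, including those above U+FFFF.
  const codePoint = text.codePointAt(0);
  if (codePoint === undefined) {
    throw new RangeError("expected a non-empty string");
  }
  // Upper-case hex, padded to at least four digits.
  return "U+" + codePoint.toString(16).toUpperCase().padStart(4, "0");
}

console.log(toUPlus("A"));         // U+0041
console.log(toUPlus("\u0131"));    // U+0131 (Turkish dotless i)
console.log(toUPlus("\u{1F680}")); // U+1F680 (rocket emoji)
```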

Unicode itself doesn’t specify how code points are serialised to bytes — that’s the job of encodings. The three main encodings:

UTF-8 — variable-width 1-4 bytes per code point. ASCII-compatible. The dominant web encoding.
UTF-16 — 2 or 4 bytes per code point. Used internally by Java, JavaScript strings, and Windows.
UTF-32 — exactly 4 bytes per code point. Simple but wasteful; rarely used for storage.
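The size differences between the three encodings can be checked with a short JavaScript sketch. The helper name encodedSizes is hypothetical. It assumes well-formed input (no lone surrogates) and counts no byte-order mark; TextEncoder is the standard API that always emits UTF-8.

```javascript
// Byte size of a string in each of the three encodings (no byte-order mark).
// `encodedSizes` is a hypothetical helper name; input is assumed well-formed.
function encodedSizes(text) {
  const codePoints = [...text].length; // string iteration yields whole code points
  return {
    utf8: new TextEncoder().encode(text).length, // TextEncoder always produces UTF-8
    utf16: text.length * 2,                      // JS strings are UTF-16: 2 bytes per code unit
    utf32: codePoints * 4,                       // fixed 4 bytes per code point
  };
}

console.log(encodedSizes("A"));         // { utf8: 1, utf16: 2, utf32: 4 }
console.log(encodedSizes("\u0131"));    // { utf8: 2, utf16: 2, utf32: 4 }  dotless i
console.log(encodedSizes("\u{1F680}")); // { utf8: 4, utf16: 4, utf32: 4 }  rocket
```

For ASCII-heavy text UTF-8 is the smallest of the three, which is one reason it dominates on the web.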

Convertitive’s tools handle full Unicode where they should — the Base64 encoder handles UTF-8 correctly, the word counter counts code points and graphemes, the case converter respects Turkish-specific casing (the dotted vs dotless I problem).

Code point ≠ character ≠ grapheme, the distinction that trips up many string libraries: the letter “é” can be represented two ways in Unicode, either as the single code point U+00E9 (precomposed) or as U+0065 (e) followed by U+0301 (combining acute accent). Both render identically and both count as “one character” to a human, but the first is one code point and the second is two. Emoji are worse: 👨‍👩‍👧 (family) is a ZWJ sequence of five code points (U+1F468 U+200D U+1F469 U+200D U+1F467, three emoji joined by two zero-width joiners) but one grapheme. Length counts in JavaScript (str.length) return UTF-16 code units, not graphemes: this family emoji has a string length of 8, and the four-person variant 👨‍👩‍👧‍👦 (seven code points) has a string length of 11. Use Intl.Segmenter (modern browsers, Node 16+) for accurate grapheme counts.
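The three ways of counting can be compared side by side. The sketch below is illustrative only; countUnits is a hypothetical helper name, and the emoji are written as escapes so the code points are explicit. It covers both the three-person family sequence (man, ZWJ, woman, ZWJ, girl) and the four-person one that adds a second ZWJ and a boy.

```javascript
// Count a string three ways: UTF-16 code units, code points, graphemes.
// `countUnits` is a hypothetical helper name used only for this illustration.
function countUnits(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return {
    codeUnits: text.length,                         // what str.length reports
    codePoints: [...text].length,                   // string iteration yields code points
    graphemes: [...segmenter.segment(text)].length, // user-perceived characters
  };
}

const family3 = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";                 // man, woman, girl
const family4 = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}"; // plus boy

console.log(countUnits("\u00E9"));  // { codeUnits: 1, codePoints: 1, graphemes: 1 }
console.log(countUnits("e\u0301")); // { codeUnits: 2, codePoints: 2, graphemes: 1 }
console.log(countUnits(family3));   // { codeUnits: 8, codePoints: 5, graphemes: 1 }
console.log(countUnits(family4));   // { codeUnits: 11, codePoints: 7, graphemes: 1 }
```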

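Unicode normalisation forms, or when text doesn’t match itself: two visually identical strings can have different byte representations because of the precomposed-vs-decomposed split above. The Unicode Standard defines four normalisation forms (NFC, NFD, NFKC, NFKD) that canonicalise this. NFC and NFD are the canonical composed and decomposed forms; NFKC and NFKD additionally fold compatibility characters, such as ligatures, into their plain equivalents. NFC is the form the W3C recommends for web content and the form most keyboards and input methods produce. Most file systems store names exactly as given without normalising them, whereas macOS HFS+ stores names in a variant of NFD, which is why file paths sometimes look identical but compare unequal across systems. Database string-equality and search-index lookups should normalise to NFC on write, and normalise queries the same way, to avoid the “identical query doesn’t match” bug. Reference: The Unicode Standard, Version 15.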
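To see what the four forms do, the following sketch lists the code points a string has under each one, using the standard String.prototype.normalize. The helper name normalisationForms is hypothetical. The ligature U+FB01 is included to show that only the compatibility forms (NFKC, NFKD) rewrite it.

```javascript
// List the code points of `text` under each of the four normalisation forms.
// `normalisationForms` is a hypothetical helper name used for illustration.
function normalisationForms(text) {
  const toHex = (s) =>
    [...s]
      .map((ch) => ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"))
      .join(" ");
  const result = {};
  for (const form of ["NFC", "NFD", "NFKC", "NFKD"]) {
    result[form] = toHex(text.normalize(form));
  }
  return result;
}

// Precomposed e-acute: the decomposed forms split it into e + combining acute.
console.log(normalisationForms("\u00E9"));
// { NFC: '00E9', NFD: '0065 0301', NFKC: '00E9', NFKD: '0065 0301' }

// The "fi" ligature: canonical forms keep it, compatibility forms expand it.
console.log(normalisationForms("\uFB01"));
// { NFC: 'FB01', NFD: 'FB01', NFKC: '0066 0069', NFKD: '0066 0069' }
```

Because NFKC and NFKD discard distinctions such as ligatures, they suit search keys better than stored display text.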

Worked example

Take the string “café”. It has two equally valid Unicode representations: NFC = U+0063 U+0061 U+0066 U+00E9 (4 code points) and NFD = U+0063 U+0061 U+0066 U+0065 U+0301 (5 code points, “e” + combining acute). Both render identically, but every count differs. In UTF-8, NFC is 5 bytes (63 61 66 C3 A9) and NFD is 6 bytes (63 61 66 65 CC 81). In JavaScript, "café".length is 4 for NFC and 5 for NFD. Compared without normalising, nfc === nfd is false. Run both through String.prototype.normalize("NFC") first and they become identical, which fixes the bug. This is a common root cause of “I searched for José and the record exists but doesn’t come back” in databases storing user-submitted names from mixed Mac and Windows clients.
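The worked example can be reproduced directly. In this sketch the two representations are written with escapes so they cannot be silently re-normalised by an editor; utf8Hex and sameText are hypothetical helper names.

```javascript
// The two representations of "café", written with explicit escapes.
const nfc = "caf\u00E9";  // precomposed e-acute
const nfd = "cafe\u0301"; // e + combining acute accent

// Hex dump of the UTF-8 bytes (`utf8Hex` is a hypothetical helper).
const utf8Hex = (s) =>
  Array.from(new TextEncoder().encode(s), (b) =>
    b.toString(16).toUpperCase().padStart(2, "0")
  ).join(" ");

console.log(utf8Hex(nfc));            // 63 61 66 C3 A9
console.log(utf8Hex(nfd));            // 63 61 66 65 CC 81
console.log(nfc.length, nfd.length);  // 4 5
console.log(nfc === nfd);             // false

// Compare after normalising both sides (`sameText` is a hypothetical helper).
const sameText = (a, b) => a.normalize("NFC") === b.normalize("NFC");
console.log(sameText(nfc, nfd));      // true
```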

When and why it matters

Every internationalised application hits Unicode edge cases at some point: usernames with combining marks, file paths copied between macOS (NFD on HFS+) and Linux (names stored as given, typically NFC), Vietnamese with stacked diacritics, Turkish casing rules (in Turkish the lowercase of “I” is “ı”, not “i”, the long-standing “Turkish-I problem” behind many locale-dependent bugs), Arabic and Hebrew right-to-left rendering, and ZWJ-based emoji sequences that are introduced faster than your phone’s font updates. The defensive rules: normalise to NFC on input, use Intl.Collator for locale-aware sorting (not Array.prototype.sort’s default UTF-16 code-unit order), use Intl.Segmenter for grapheme-aware truncation, and let CLDR (Unicode’s locale database) drive locale-specific behaviour rather than hard-coding rules. Reference: Unicode Standard Annex #15, Unicode Normalization Forms.
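The defensive rules can be sketched in a few lines of JavaScript. This is a minimal illustration, not a complete input pipeline: cleanInput and truncateGraphemes are hypothetical helper names, the sample names are made up, and the locale-dependent results assume a runtime with full ICU data (the default in current Node releases and browsers).

```javascript
// Rule 1: normalise to NFC on input (`cleanInput` is a hypothetical helper).
const cleanInput = (text) => text.normalize("NFC");

// Rule 2: locale-aware sorting instead of the default UTF-16 code-unit order.
const names = ["Zoe", "\u00C9mile", "Eve", "adam"]; // \u00C9 is E-acute
const byCodeUnit = [...names].sort();
const byLocale = [...names].sort(new Intl.Collator("en").compare);
console.log(byCodeUnit); // [ 'Eve', 'Zoe', 'adam', 'Émile' ]
console.log(byLocale);   // [ 'adam', 'Émile', 'Eve', 'Zoe' ]

// Rule 3: truncate by grapheme so combining marks and ZWJ sequences stay whole.
function truncateGraphemes(text, maxGraphemes) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  let result = "";
  let count = 0;
  for (const { segment } of segmenter.segment(text)) {
    if (count === maxGraphemes) break;
    result += segment;
    count += 1;
  }
  return result;
}
const decomposed = "cafe\u0301s";              // "cafés" with a combining accent
console.log(decomposed.slice(0, 4));           // "cafe" (the accent is cut off)
console.log(truncateGraphemes(decomposed, 4)); // "café" (e + accent kept together)

// Rule 4: let the runtime's locale data drive casing instead of hard-coding it.
console.log("I".toLowerCase());           // "i"
console.log("I".toLocaleLowerCase("tr")); // "ı" (U+0131, dotless i)
console.log("i".toLocaleUpperCase("tr")); // "İ" (U+0130, dotted capital I)
```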

Frequently asked questions

What is Unicode?: Unicode is the universal character encoding standard that assigns a unique code point to every character it encodes: currently about 150,000 characters across 161 scripts, covering nearly every writing system in active use. It is the foundation of text processing in modern software.
How does Unicode work in practice?: Each character has a code point (e.g. U+0041 for A, U+1F600 for the grinning face emoji). These are stored in memory using an encoding such as UTF-8, UTF-16, or UTF-32. UTF-8 is the dominant web encoding, representing ASCII characters in 1 byte and others in 2 to 4 bytes.
What is the difference between Unicode and UTF-8?: Unicode is the abstract standard mapping characters to code points. UTF-8 is one of several encodings that serialise those code points into bytes for storage and transmission. UTF-8 is the most popular encoding because it is ASCII-compatible and space-efficient for Latin-script text.

Published May 14, 2026 · Last reviewed May 31, 2026
