Glossary
Unicode
The character encoding for every language
Unicode is the universal character encoding standard maintained by the Unicode Consortium. It assigns a unique number (a “code point”) to every character that has been formally proposed and accepted into the standard. As of Unicode 15.1, that is about 150,000 characters across 161 scripts, covering modern languages, historical scripts, mathematical symbols, technical notation, and emoji.
Code points are written as U+ followed by four to six hexadecimal digits. The letter A is U+0041; the Turkish dotless i (ı) is U+0131; the rocket emoji is U+1F680.
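As a small illustration, the notation can be reproduced in Python with the built-in ord function. This is only a sketch; the helper name code_point_label is made up for this example.

```python
def code_point_label(ch: str) -> str:
    """Format one character as U+XXXX, padded to at least four hex digits.

    code_point_label is a hypothetical helper name used for illustration.
    """
    return f"U+{ord(ch):04X}"


# The three characters mentioned above.
for ch in ["A", "ı", "🚀"]:
    print(ch, code_point_label(ch))
```

Going the other way, chr(0x1F680) returns the rocket emoji.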
Unicode itself doesn’t specify how code points are serialised to bytes — that’s the job of encodings. The three main encodings:
- UTF-8 — variable-width 1-4 bytes per code point. ASCII-compatible. The dominant web encoding.
- UTF-16 — 2 or 4 bytes per code point. Used internally by Java, JavaScript strings, and Windows.
- UTF-32 — exactly 4 bytes per code point. Simple but wasteful; rarely used for storage.
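To make the size differences concrete, here is a minimal Python sketch that encodes the same text in all three encodings and counts the bytes. The function name encoded_sizes is an assumption for this example, and the little-endian codec variants are used so that no byte order mark is added to the count.

```python
def encoded_sizes(text: str) -> dict:
    """Return the byte length of text in UTF-8, UTF-16, and UTF-32.

    encoded_sizes is a hypothetical helper. The "-le" codecs are used
    because plain "utf-16" and "utf-32" prepend a byte order mark.
    """
    codecs = ("utf-8", "utf-16-le", "utf-32-le")
    return {name: len(text.encode(name)) for name in codecs}


for ch in ["A", "ı", "🚀"]:
    print(ch, encoded_sizes(ch))
```

Under these assumptions, A takes 1, 2, and 4 bytes respectively, ı takes 2, 2, and 4, and the rocket emoji takes 4 bytes in every encoding.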
Convertitive’s tools handle full Unicode where it matters: the Base64 encoder handles UTF-8 correctly, the word counter counts both code points and graphemes, and the case converter respects Turkish-specific casing (the dotted vs. dotless I problem, where lowercase i uppercases to İ and lowercase ı uppercases to I).
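The following Python sketch illustrates the three concerns in general terms. It is not Convertitive’s implementation, and every function name in it is hypothetical. The grapheme count is a deliberately rough approximation that only skips combining marks; full grapheme segmentation follows Unicode Standard Annex #29 and needs more than the standard library provides.

```python
import base64
import unicodedata


def base64_utf8(text: str) -> str:
    """Encode text to UTF-8 bytes first, then Base64 those bytes."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def approx_grapheme_count(text: str) -> int:
    """Rough estimate: count code points that are not combining marks.

    Real grapheme segmentation (UAX #29) also handles emoji sequences,
    Hangul jamo, and more, so treat this as an approximation only.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


def turkish_upper(text: str) -> str:
    """Uppercase using Turkish rules: i -> İ and ı -> I."""
    return text.replace("i", "İ").replace("ı", "I").upper()


def turkish_lower(text: str) -> str:
    """Lowercase using Turkish rules: İ -> i and I -> ı."""
    return text.replace("İ", "i").replace("I", "ı").lower()


accented = "e\u0301"  # "e" followed by a combining acute accent
print(len(accented), approx_grapheme_count(accented))  # code points vs. approximate graphemes
print(base64_utf8("ı"))
print("ilk".upper(), turkish_upper("ilk"))  # default casing vs. Turkish casing
```

Python’s built-in upper and lower are locale-independent, which is why the Turkish letters are swapped explicitly before the default casing runs.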
Published May 14, 2026