Glossary

Unicode

The character encoding for every language

Unicode is the universal character encoding standard maintained by the Unicode Consortium. It assigns a unique number (a “code point”) to every character that has been formally proposed and accepted into the standard: roughly 150,000 characters across 161 scripts as of Unicode 15.1, with later versions adding more. The repertoire covers modern languages, historical scripts, mathematical symbols, technical notation, and emoji.

Code points are written as U+XXXX (4-6 hex digits). The letter A is U+0041; the Turkish dotless i is U+0131; the rocket emoji is U+1F680.
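As a quick illustration of the U+XXXX notation, here is a minimal Python sketch. The helper name `code_point_label` is made up for this example; it relies only on the built-in `ord`, which returns a character's code point as an integer.

```python
def code_point_label(ch: str) -> str:
    """Format a single character as U+XXXX, padded to at least 4 hex digits."""
    if len(ch) != 1:
        raise ValueError("expected exactly one code point")
    return f"U+{ord(ch):04X}"


# The three characters mentioned above: A, Turkish dotless i, rocket emoji.
for ch in ["A", "\u0131", "\U0001F680"]:
    print(code_point_label(ch))
# U+0041
# U+0131
# U+1F680
```

The `04X` format pads short values to four digits and lets longer ones (such as the five-digit rocket emoji) grow as needed.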

A code point is an abstract number, not a sequence of bytes. Turning code points into bytes is the job of an encoding. The three main encodings:

  • UTF-8 — variable-width 1-4 bytes per code point. ASCII-compatible. The dominant web encoding.
  • UTF-16 — 2 or 4 bytes per code point. Used internally by Java, JavaScript strings, and Windows.
  • UTF-32 — exactly 4 bytes per code point. Simple but wasteful; rarely used for storage.
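To make the size differences concrete, the sketch below measures how many bytes each encoding needs for a given string. The function name `encoded_lengths` is an illustrative choice; the little-endian codec variants are used so that Python does not prepend a byte order mark, which would inflate the counts.

```python
def encoded_lengths(text: str) -> dict[str, int]:
    """Return the byte length of `text` in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


for ch in ["A", "\u0131", "\U0001F680"]:
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))
# U+0041 {'utf-8': 1, 'utf-16': 2, 'utf-32': 4}
# U+0131 {'utf-8': 2, 'utf-16': 2, 'utf-32': 4}
# U+1F680 {'utf-8': 4, 'utf-16': 4, 'utf-32': 4}
```

The rocket emoji lies outside the Basic Multilingual Plane, so UTF-16 represents it with a surrogate pair (two 2-byte units), which is why its UTF-16 length is 4 rather than 2.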

Convertitive’s tools are Unicode-aware where it matters: the Base64 encoder handles UTF-8 correctly, the word counter counts both code points and graphemes, and the case converter respects Turkish-specific casing (the dotted vs dotless I problem).
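The Python sketch below illustrates two of these concerns. It is not Convertitive’s implementation; the function names are hypothetical. Text is encoded as UTF-8 before Base64, since Base64 operates on bytes, and the Turkish helpers swap the i/İ and ı/I pairs by hand because Python’s default `str.upper()` and `str.lower()` are not locale-aware. Grapheme counting is left out because Python’s standard library does not provide grapheme segmentation.

```python
import base64


def b64_encode_text(text: str) -> str:
    """Encode text as UTF-8 bytes, then Base64 (returned as an ASCII string)."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def b64_decode_text(data: str) -> str:
    """Reverse of b64_encode_text: Base64 to bytes, then UTF-8 to text."""
    return base64.b64decode(data).decode("utf-8")


def turkish_upper(text: str) -> str:
    """Uppercase with Turkish rules: i -> İ (U+0130), ı (U+0131) -> I."""
    return text.replace("i", "\u0130").replace("\u0131", "I").upper()


def turkish_lower(text: str) -> str:
    """Lowercase with Turkish rules: İ (U+0130) -> i, I -> ı (U+0131)."""
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()


word = "diyarbak\u0131r"        # "diyarbakır"
print(word.upper())             # DIYARBAKIR (default casing loses the dot on İ)
print(turkish_upper(word))      # DİYARBAKIR
print(b64_decode_text(b64_encode_text(word)) == word)  # True
```

The order of the replacements matters: in `turkish_lower`, İ is mapped to i before I is mapped to ı, so the two pairs never interfere with each other.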

Published May 14, 2026