Glossary

ASCII

The original 7-bit character encoding

ASCII (American Standard Code for Information Interchange, pronounced “ass-key”) is a character encoding that maps 128 characters to the integers 0-127. It was first published in 1963 and standardised internationally as ISO/IEC 646 in 1972.

Coverage:

  • 0-31: control characters (NUL, line feed, carriage return, tab, escape, bell)
  • 32-47: space, punctuation and symbols (!, ", #, $, %, etc.)
  • 48-57: digits 0-9
  • 58-64: more punctuation (:, ;, <, =, >, ?, @)
  • 65-90: uppercase A-Z
  • 91-96: more punctuation ([, \, ], ^, _, `)
  • 97-122: lowercase a-z
  • 123-126: more punctuation ({, |, }, ~)
  • 127: DEL
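
The ranges above can be expressed as a small lookup. This is a minimal sketch rather than a standard API: the function name ascii_category and the category labels are made up for illustration, and every printable code that is not a letter or digit is lumped together as "punctuation".

```python
def ascii_category(code: int) -> str:
    """Return a rough category for an ASCII code (0-127).

    The labels are illustrative; they follow the coverage list, with all
    printable non-alphanumeric codes grouped as "punctuation".
    """
    if not 0 <= code <= 127:
        raise ValueError(f"{code} is outside the ASCII range 0-127")
    if code <= 31 or code == 127:
        return "control"
    if 48 <= code <= 57:
        return "digit"
    if 65 <= code <= 90:
        return "uppercase"
    if 97 <= code <= 122:
        return "lowercase"
    return "punctuation"  # includes the space character (32)


for ch in ("A", "z", "7", "~", "\t"):
    print(repr(ch), ord(ch), ascii_category(ord(ch)))
```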

ASCII fits in 7 bits, which is why older serial links and teleprinters often transmitted 7 data bits per character. When characters were carried in 8-bit units, the 8th bit was commonly used as a parity bit. Modern systems store each ASCII character in an 8-bit byte; the upper 128 values (128-255) are interpreted differently depending on the encoding chosen (Latin-1, Windows-1252, etc.).
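
As a rough illustration of the parity idea, the sketch below puts an even-parity bit in the top bit of a byte that carries a 7-bit ASCII code. The function names are hypothetical, and even parity is only one of the conventions serial links used (odd parity and no parity were also common).

```python
def with_even_parity(code: int) -> int:
    """Return an 8-bit value: the 7-bit code plus an even-parity bit in bit 7."""
    if not 0 <= code <= 0x7F:
        raise ValueError("not a 7-bit ASCII code")
    ones = bin(code).count("1")
    parity = ones % 2  # 1 when the code has an odd number of 1 bits
    return (parity << 7) | code


def has_even_parity(byte: int) -> bool:
    """Check that an 8-bit value contains an even number of 1 bits."""
    return bin(byte & 0xFF).count("1") % 2 == 0


for ch in "HiC":
    framed = with_even_parity(ord(ch))
    print(ch, format(ord(ch), "07b"), "->", format(framed, "08b"))
```

A single flipped bit in transit changes the count of 1 bits from even to odd, so has_even_parity returns False and the receiver knows the byte is damaged, although not which bit changed.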

UTF-8 is backward-compatible with ASCII — any ASCII text is a valid UTF-8 file with identical bytes. That compatibility is one of the main reasons UTF-8 won the Unicode encoding wars.

Worked example

The string “Hi!” encoded in ASCII is three bytes: H = 72 (0x48), i = 105 (0x69), ! = 33 (0x21). Written as binary: 01001000 01101001 00100001. Notice the highest bit of every byte is zero; that is the 7-bit guarantee. Compare with the same string in UTF-16: the code units 0048 0069 0021 take six bytes for the same content. Their byte order is either big-endian (00 48 00 69 00 21) or little-endian (48 00 69 00 21 00), and an optional byte order mark (BOM) at the start adds two more bytes to signal which. Compare again with UTF-8: identical to ASCII because every character is in the 0-127 range.

Now try “café”: ASCII can’t encode the é (Unicode U+00E9); the encoder either errors out or substitutes a question mark. UTF-8 encodes é as two bytes 0xC3 0xA9; Latin-1 encodes it as the single byte 0xE9. The same visible character, three different outcomes depending on encoding choice.
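
The byte sequences in the worked example can be reproduced with Python's built-in codecs. This is a small sketch; hex_bytes is a helper name invented here, and bytes.hex with a separator argument assumes Python 3.8 or later.

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Encode text and return its bytes as space-separated hex pairs."""
    return text.encode(encoding).hex(" ")


print(hex_bytes("Hi!", "ascii"))      # 48 69 21
print(hex_bytes("Hi!", "utf-8"))      # 48 69 21
print(hex_bytes("Hi!", "utf-16-be"))  # 00 48 00 69 00 21
print(hex_bytes("Hi!", "utf-16-le"))  # 48 00 69 00 21 00

print(hex_bytes("café", "utf-8"))     # 63 61 66 c3 a9
print(hex_bytes("café", "latin-1"))   # 63 61 66 e9

# ASCII has no code for U+00E9: substitute a question mark, or fail outright.
print("café".encode("ascii", errors="replace"))  # b'caf?'
try:
    "café".encode("ascii")
except UnicodeEncodeError as exc:
    print("strict ASCII encoding failed:", exc.reason)
```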

The complement to ASCII for non-English text is Unicode, which assigns each character a code point and uses encodings (UTF-8, UTF-16, UTF-32) to serialise them as bytes. UTF-8 was specifically designed by Ken Thompson and Rob Pike in 1992 to be ASCII-compatible: any file containing only ASCII characters is identical in ASCII and UTF-8, which is why almost every English-language wire protocol and config format was upgradable to Unicode without breaking changes.

When and why it matters

ASCII matters when debugging encoding bugs, designing wire protocols, or working with any system that processes text byte-by-byte. The bug that ships an é as “cé” or “cé” or “c?” is always an ASCII/UTF-8/Latin-1 mismatch — somewhere in the pipeline, bytes were decoded with the wrong assumption. Sorting algorithms that treat strings as bytes will sort uppercase before lowercase because of ASCII’s ordering — “Zebra” comes before “apple” if you sort raw bytes, surprising to users expecting case-insensitive lexicographic order. Identifier rules in programming languages (variable names, JSON keys) are usually restricted to ASCII for portability across systems whose default encoding cannot be assumed. Anyone designing a CSV format, log file format, or text-based wire protocol benefits from knowing what survives the ASCII pipeline unchanged versus what gets mangled. Reference: RFC 20 — ASCII format for Network Interchange.
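
Both failure modes are easy to reproduce. The sketch below decodes bytes with the wrong codec to produce typical mojibake, then compares Python's default sort with a case-folded one. The sample words are arbitrary. Python compares strings by code point, which for pure ASCII text gives the same order as comparing raw bytes.

```python
# UTF-8 bytes read as Latin-1: each byte of the two-byte é becomes its own character.
utf8_bytes = "café".encode("utf-8")
wrong_latin1 = utf8_bytes.decode("latin-1")
print(wrong_latin1)  # cafÃ©

# Latin-1 bytes read as UTF-8: 0xE9 is an incomplete sequence, replaced by U+FFFD.
latin1_bytes = "café".encode("latin-1")
wrong_utf8 = latin1_bytes.decode("utf-8", errors="replace")
print(wrong_utf8)  # caf followed by the replacement character

# Code-point order puts every uppercase ASCII letter before every lowercase one.
words = ["apple", "Zebra", "banana"]
print(sorted(words))                    # ['Zebra', 'apple', 'banana']
print(sorted(words, key=str.casefold))  # ['apple', 'banana', 'Zebra']
```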

Why the alphabet ordering matters: ASCII intentionally placed uppercase letters (65-90) before lowercase (97-122) with a fixed offset of 32 between them. Setting bit 5 of a letter byte (the 0x20 bit) makes it lowercase, clearing it makes it uppercase, and toggling it flips the case. This bit trick is the reason old C string libraries could implement tolower() as a range check plus a single arithmetic operation, and it underlies the ASCII case-insensitive matching used in many protocols (HTTP header names, DNS labels). The trick only works for the 52 ASCII letters and fails the moment Unicode arrives: Turkish dotted and dotless I are the canonical counter-example.
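
A minimal sketch of the bit trick, operating on raw bytes. The function names are illustrative and not taken from any library. The range checks matter: applying the 0x20 bit to a non-letter would corrupt it (for example, clearing it on "{" yields "[").

```python
CASE_BIT = 0x20  # bit 5: the only bit that differs between 'A' (0x41) and 'a' (0x61)


def ascii_lower(data: bytes) -> bytes:
    """Set the case bit on A-Z; leave every other byte untouched."""
    return bytes(b | CASE_BIT if 0x41 <= b <= 0x5A else b for b in data)


def ascii_upper(data: bytes) -> bytes:
    """Clear the case bit on a-z; leave every other byte untouched."""
    return bytes(b & ~CASE_BIT if 0x61 <= b <= 0x7A else b for b in data)


def ascii_swapcase(data: bytes) -> bytes:
    """Toggle the case bit on ASCII letters only."""
    def is_letter(b: int) -> bool:
        return 0x41 <= b <= 0x5A or 0x61 <= b <= 0x7A
    return bytes(b ^ CASE_BIT if is_letter(b) else b for b in data)


print(ascii_lower(b"Content-Type"))  # b'content-type'
print(ascii_upper(b"example.com"))   # b'EXAMPLE.COM'
print(ascii_swapcase(b"Hi!"))        # b'hI!'
```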

The control-character legacy nobody can throw away: the 0-31 range carries leftovers from teletype hardware, many of which no modern system uses but every parser still has to recognise. NUL (0) terminates C strings; LF (10) and CR (13) split lines (Unix uses LF, Windows uses CRLF, classic Mac OS used CR alone); HT (9) is the tab; ESC (27) starts ANSI terminal escape sequences. The bell character (7) once rang a physical bell on Teletype Model 33 terminals; today it typically makes a terminal emulator beep or flash. Nobody would design these codes this way today; they survive because half a century of file formats and wire protocols expect them. Reference: ISO/IEC 646, Information technology: ISO 7-bit coded character set for information interchange.
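
A short sketch of what "every parser has to recognise" means in practice: splitting on all three line-ending conventions and cutting a buffer at the first NUL the way a C string reader would. The helper names are hypothetical and the input bytes are invented for the example.

```python
def split_lines(data: bytes) -> list:
    """Split on CRLF (Windows), lone LF (Unix), or lone CR (classic Mac OS)."""
    # Replace CRLF first so it is not counted as two separate line breaks.
    normalised = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return normalised.split(b"\n")


def c_string(buffer: bytes) -> bytes:
    """Return the bytes before the first NUL, as a C string reader would see them."""
    return buffer.split(b"\x00", 1)[0]


mixed = b"unix\nwindows\r\nold mac\rend"
print(split_lines(mixed))        # [b'unix', b'windows', b'old mac', b'end']
print(c_string(b"abc\x00def"))   # b'abc'
print(ord("\t"), ord("\x1b"), ord("\a"))  # 9 27 7
```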

Frequently asked questions

What is ASCII?
ASCII (American Standard Code for Information Interchange) is a 7-bit character encoding that maps 128 values (0–127) to letters, digits, punctuation, and control characters. It was first published in 1963 and remains the foundation of most modern text encodings.
How is ASCII used in practice?
Any English-language text file, URL, or HTTP header that avoids extended characters is valid ASCII. Email protocols like SMTP were originally ASCII-only, which is why non-ASCII email subjects are traditionally encoded with Base64 or quoted-printable (the RFC 2047 encoded-word syntax).
What is the difference between ASCII and UTF-8?
UTF-8 is a superset of ASCII: the first 128 Unicode code points are identical to ASCII, and UTF-8 encodes each of them as the same single byte. Beyond 127, UTF-8 uses 2–4 bytes per character to cover the rest of Unicode's more than 1.1 million code points, whereas ASCII stops at 127.
Why does ASCII use 7 bits instead of 8?
The committee chose 7 bits because 128 codes were enough for the English alphabet, digits, punctuation, and control characters, and every extra bit raised transmission costs. On 8-bit channels the spare bit was commonly used for parity error-checking on serial lines. The 8th bit was later claimed by competing extended encodings like ISO-8859-1.
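
As an illustration of the email answer above, Python's standard email.header module can wrap a non-ASCII subject in an ASCII-only encoded word and decode it again. This is a sketch of the general mechanism only. The subject text is made up, and the library chooses between Base64 and quoted-printable itself, so the exact encoded form is not shown.

```python
from email.header import Header, decode_header

subject = "café"

# Header.encode() returns a str containing only ASCII characters.
encoded = Header(subject, "utf-8").encode()
print(encoded)
print(encoded.isascii())  # True

# decode_header returns (bytes, charset) pairs; decode them to recover the text.
raw, charset = decode_header(encoded)[0]
print(raw.decode(charset))  # café
```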

Published May 14, 2026 · Last reviewed May 31, 2026