Lossless compression
Compression that preserves every byte
By Buğra Sözeri
Lossless compression reduces file size while preserving every byte of the original. Decompressing produces output bit-identical to the input. The tradeoff is smaller savings than lossy compression: typically a 30-70% size reduction on mixed content, more on highly redundant text such as logs, and close to nothing on data that is already compressed.
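The round-trip guarantee can be checked directly. The following is a minimal sketch using Python's standard `zlib` module (a DEFLATE implementation); the sample log line and the `roundtrip` helper are made up for illustration.

```python
import zlib


def roundtrip(data: bytes) -> tuple[bytes, bytes]:
    """Compress with zlib (DEFLATE), then decompress the result again."""
    compressed = zlib.compress(data)
    restored = zlib.decompress(compressed)
    return compressed, restored


# Hypothetical, highly repetitive input: one log line repeated 1000 times.
sample = b'2026-05-15T12:00:00Z level=INFO msg="request served"\n' * 1000

compressed, restored = roundtrip(sample)
print(len(sample), "->", len(compressed), "bytes")
print("bit-identical:", restored == sample)
```

The equality check is the defining property: whatever the ratio turns out to be, the restored bytes must match the input exactly.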
How it works: lossless algorithms find statistical patterns (repeated substrings, predictable sequences) and encode them with shorter representations. Two classic families:
- Dictionary-based (LZ77, LZ78, LZW): build a dictionary of seen substrings and emit back-references. The basis of DEFLATE, gzip, ZIP.
- Entropy coding (Huffman, arithmetic coding, ANS): assign shorter binary codes to more frequent symbols. Typically combined with dictionary methods.
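To make the dictionary idea concrete, here is a toy LZ77-style coder. It is a sketch under simplifying assumptions, not the DEFLATE format: the function names, the token representation (plain Python ints and tuples), and the brute-force match search are all illustrative choices, and no entropy-coding stage follows.

```python
def lz77_encode(data: bytes, window: int = 255, min_match: int = 3) -> list:
    """Encode data as literals (ints) and back-references (distance, length)."""
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_dist = 0, 0
        # Brute-force search of the sliding window for the longest match.
        for j in range(max(0, i - window), i):
            length = 0
            # A match may overlap position i, which encodes runs compactly.
            while (i + length < len(data)
                   and length < 255
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_match:
            tokens.append((best_dist, best_len))
            i += best_len
        else:
            tokens.append(data[i])
            i += 1
    return tokens


def lz77_decode(tokens: list) -> bytes:
    """Rebuild the original bytes from literals and back-references."""
    out = bytearray()
    for tok in tokens:
        if isinstance(tok, tuple):
            dist, length = tok
            for _ in range(length):
                out.append(out[-dist])  # byte-by-byte copy handles overlap
        else:
            out.append(tok)
    return bytes(out)


text = b"abcabcabcabc"
tokens = lz77_encode(text)
print(tokens)  # [97, 98, 99, (3, 9)]
print(lz77_decode(tokens) == text)
```

Three literals followed by one back-reference ("go back 3 bytes, copy 9") replace twelve input bytes. A production codec would then pass these tokens through an entropy coder such as Huffman coding, which is the combination the list above describes.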
Common lossless formats:
- PNG — images (uses DEFLATE)
- FLAC — audio (preserves 16-24 bit PCM, typically 50-60% the size of WAV)
- ZIP, gzip, Brotli, Zstandard — general data
- WebP and AVIF — both support lossless modes
- Git pack files — source-code repository storage
Use lossless when you need bit-perfect reproduction, when the content will be edited further, or when the file is text or structured data, where any altered byte can change the meaning and lossy methods do not apply.
The information-theoretic ceiling: Claude Shannon’s 1948 paper established that lossless compression cannot drop below the source’s entropy — the average information per symbol. For random data (random bytes, encrypted ciphertext, already-compressed files), the entropy is maximal and lossless compression achieves essentially zero savings. This is why “gzip image.jpg” gains almost nothing; the JPEG bytes already look random to a compressor. The corollary: if your compression ratio is suspiciously good on data that should be high-entropy, you’ve probably found a bug.
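The entropy bound can be observed empirically. The sketch below computes an order-0 byte entropy (a rough estimate that ignores context between bytes, so it is not the true source entropy) and compares it with what `zlib` achieves. The helper names and sample inputs are assumptions made for illustration.

```python
import math
import os
import zlib
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Empirical order-0 entropy in bits per byte (between 0.0 and 8.0)."""
    if not data:
        return 0.0
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in Counter(data).values())


def compressed_fraction(data: bytes) -> float:
    """Size after zlib level 9, as a fraction of the input size."""
    return len(zlib.compress(data, 9)) / len(data)


random_bytes = os.urandom(1 << 20)                # 1 MiB of high-entropy data
repetitive = b"level=INFO msg=ok\n" * 50_000      # low-entropy, log-like data

for name, blob in [("random", random_bytes), ("repetitive", repetitive)]:
    print(f"{name}: {byte_entropy(blob):.2f} bits/byte, "
          f"compressed to {compressed_fraction(blob):.1%} of original")
```

The random input should report close to 8 bits per byte and a compressed size at or slightly above 100%, while the repetitive input collapses to a small fraction of its size.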
Lossless on lossy data, and when it pays: a common mistake is converting a 128 kbps MP3 to FLAC and expecting better audio quality. The MP3 encoder has already discarded information; FLAC only preserves the degraded version losslessly. For audio that originated as 16-bit PCM (CDs, studio masters), FLAC is the right archival choice. For audio that originated in a lossy format, transcoding to FLAC only inflates the file. The general rule: store the master in the highest-quality lossless format that the source supports; deliver via the best lossy format the consumer can play.
Related: DEFLATE, lossy, entropy.
Reference: Shannon CE, A Mathematical Theory of Communication (Bell Syst Tech J, 1948).
Worked example: compressing a 10 MB log file
A typical 10 MB application log (JSON lines with timestamps, level, message, and repeated field names) is highly redundant. Real-world numbers from a recent benchmark on the same input:
- gzip, default level: ≈ 1.6 MB (84% reduction, 0.2 s encode)
- Brotli, level 6: ≈ 1.1 MB (89%, 0.5 s)
- Zstandard, level 3: ≈ 1.3 MB (87%, 0.05 s)
- Zstandard, level 19: ≈ 0.9 MB (91%, 1.8 s)
Random bytes (10 MB from /dev/urandom) stay at roughly 10 MB under every algorithm, and usually come out slightly larger because each format adds its own framing overhead; high-entropy data is incompressible. Images already stored as PNG shrink by only another 1-3% under gzip -9, which is why HTTP servers typically skip Content-Encoding: gzip on PNG/JPEG/MP4 responses to save CPU.
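A comparison of this kind can be reproduced in miniature. The sketch below is not the benchmark quoted above: Brotli and Zstandard need third-party packages, so it substitutes the standard-library codecs `zlib`, `bz2`, and `lzma`, and it runs on a synthetic log produced by a hypothetical `make_log` helper. Sizes and timings will differ from the figures in the text.

```python
import bz2
import json
import lzma
import random
import time
import zlib


def make_log(lines: int = 5000, seed: int = 0) -> bytes:
    """Build a synthetic JSON-lines log (hypothetical field names)."""
    rng = random.Random(seed)
    levels = ["INFO", "WARN", "ERROR"]
    messages = ["request served", "cache miss", "retrying upstream"]
    rows = [
        json.dumps({
            "ts": 1_700_000_000 + i,
            "level": rng.choice(levels),
            "message": rng.choice(messages),
            "latency_ms": rng.randint(1, 500),
        })
        for i in range(lines)
    ]
    return ("\n".join(rows) + "\n").encode()


# Standard-library stand-ins for the codecs named in the text.
CODECS = {
    "zlib-6": lambda d: zlib.compress(d, 6),
    "bz2-9": lambda d: bz2.compress(d, 9),
    "xz-6": lambda d: lzma.compress(d, preset=6),
}


def benchmark(data: bytes) -> dict:
    """Return {codec: (compressed_size, reduction, seconds)}."""
    results = {}
    for name, compress in CODECS.items():
        start = time.perf_counter()
        out = compress(data)
        elapsed = time.perf_counter() - start
        results[name] = (len(out), 1 - len(out) / len(data), elapsed)
    return results


log = make_log()
for name, (size, reduction, seconds) in benchmark(log).items():
    print(f"{name}: {size} bytes ({reduction:.0%} reduction, {seconds:.3f} s)")
```

Measuring on a sample of the real workload matters more than any published table, since the ratio depends heavily on how repetitive the input is.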
Choosing an algorithm in 2026
For web delivery: Brotli at quality 5-6 for static assets (best ratio at acceptable encode time, supported in every modern browser since 2017), with gzip as the fallback for legacy clients. For internal storage and pipelines: Zstandard, which sits on or near the compression-ratio-versus-speed Pareto frontier at most levels and is supported by GNU tar (--zstd), Linux kernel module compression, and RocksDB. For archival of irreplaceable masters: use a wrapper that includes a checksum (xz with its SHA-256 integrity check, or zip with CRC plus an external SHA-256), because a compressor's job is to shrink data, not to guarantee that bitrot is detected.
Reference: RFC 8878, Zstandard Compression and the application/zstd Media Type.
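The archival advice can be sketched as a pair of helpers that store an external SHA-256 digest next to the compressed data and verify it on restore. The function names `pack` and `unpack_verified` are hypothetical, and the example uses Python's standard `lzma` (xz container) and `hashlib` modules rather than any particular archiving tool.

```python
import hashlib
import lzma


def pack(data: bytes) -> tuple[bytes, str]:
    """Compress to an xz stream and return it with a SHA-256 of the original."""
    return lzma.compress(data), hashlib.sha256(data).hexdigest()


def unpack_verified(blob: bytes, expected_sha256: str) -> bytes:
    """Decompress and refuse to return data that fails verification."""
    try:
        data = lzma.decompress(blob)
    except lzma.LZMAError as exc:
        raise ValueError("archive is corrupt") from exc
    if hashlib.sha256(data).hexdigest() != expected_sha256:
        raise ValueError("checksum mismatch")
    return data


payload = b"master recording bytes " * 4096   # stand-in for a real master file
blob, digest = pack(payload)
print(unpack_verified(blob, digest) == payload)

# Simulate bitrot by flipping every bit of one byte in the stored archive.
damaged = bytearray(blob)
damaged[len(damaged) // 2] ^= 0xFF
try:
    unpack_verified(bytes(damaged), digest)
except ValueError as err:
    print("detected:", err)
```

Keeping the digest outside the archive means the check still works if the container's own integrity field is damaged or absent.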
Frequently asked questions
- What is lossless compression?
- Lossless compression reduces file size using algorithms (like DEFLATE, LZ77, or Huffman coding) that encode redundancy, allowing the original data to be reconstructed exactly. No information is discarded.
- What are common examples of lossless formats?
- PNG, lossless WebP, and GIF (within its 256-colour palette) for images; FLAC and ALAC for audio; ZIP and gzip for general files. Decompressing them always yields data that is bit-for-bit identical to the original.
- What is the difference between lossless and lossy compression?
- Lossless compression preserves every bit; lossy compression discards information the encoder deems imperceptible (JPEG quantisation, MP3 frequency masking) to achieve higher compression ratios. Lossy files cannot be perfectly restored.
- When should I choose lossless over lossy?
- Use lossless for source assets, documents, code, and anything that will be edited or re-compressed — repeated lossy re-encoding accumulates artefacts. Use lossy for delivery formats (web images, streaming audio) where file size matters more than perfect fidelity.
Published May 15, 2026 · Last reviewed May 31, 2026