Compression

Compression means making data smaller.

A text file may contain repeated words. A log file may contain repeated timestamps. A binary file may contain repeated patterns. Compression algorithms find these patterns and store the same information using fewer bytes.

For example, this text has obvious repetition:

hello hello hello hello hello

A compressor can store it more compactly than the original bytes.
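
Run-length encoding is one of the simplest ways to exploit repetition: store each repeated byte once, together with a count. Real formats such as deflate and zstd are far more sophisticated, but a small self-contained sketch shows the idea:

const std = @import("std");

// Encode each run of identical bytes as a (count, byte) pair, so
// "aaaabbb" becomes { 4, 'a', 3, 'b' }. The caller frees the result.
fn runLengthEncode(allocator: std.mem.Allocator, input: []const u8) ![]u8 {
    // Worst case (no repetition at all) doubles the size; shrink afterwards.
    const out = try allocator.alloc(u8, input.len * 2);
    var out_len: usize = 0;
    var i: usize = 0;
    while (i < input.len) {
        const byte = input[i];
        var count: u8 = 1;
        while (i + count < input.len and input[i + count] == byte and count < 255) {
            count += 1;
        }
        out[out_len] = count;
        out[out_len + 1] = byte;
        out_len += 2;
        i += count;
    }
    return allocator.realloc(out, out_len);
}

test "runs collapse into (count, byte) pairs" {
    const encoded = try runLengthEncode(std.testing.allocator, "aaaabbb");
    defer std.testing.allocator.free(encoded);
    try std.testing.expectEqualSlices(u8, &.{ 4, 'a', 3, 'b' }, encoded);
}

A matching decoder reads the pairs back and expands them. Real compressors use dictionaries and entropy coding instead, but the goal is the same: describe the repetition rather than store it.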

Compression is useful for:

saving disk space

sending less data over a network

storing large logs

packaging files

working with archive formats

reducing bandwidth costs

But compression has a cost. The program must spend CPU time compressing and decompressing data.

Compression and Decompression

There are two directions.

Compression turns original data into compressed data.

plain bytes -> compressed bytes

Decompression turns compressed data back into the original data.

compressed bytes -> plain bytes

A correct decompressor should recover exactly the original bytes.

If the original data is:

abcabcabc

then after compression and decompression, the result should still be:

abcabcabc

Compression should not change the meaning of the data.
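
This round-trip property is easy to write down as a test. The compress and decompress calls below are placeholders for whichever lossless wrapper you end up using; only the shape of the check matters:

test "round trip restores the original bytes" {
    const allocator = std.testing.allocator;
    const original = "abcabcabc";

    // compress and decompress are hypothetical wrappers, not std APIs.
    const compressed = try compress(allocator, original);
    defer allocator.free(compressed);

    const restored = try decompress(allocator, compressed);
    defer allocator.free(restored);

    try std.testing.expectEqualSlices(u8, original, restored);
}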

Lossless and Lossy Compression

There are two broad kinds of compression.

Lossless compression preserves the exact original bytes.

Examples:

gzip
zlib
zstd
deflate
lzma

Lossy compression does not preserve the exact original data. It removes details to make the result smaller.

Examples:

JPEG
MP3
AAC
some video codecs

For source code, JSON, logs, databases, executables, and archives, you need lossless compression.

If one byte changes, the file may become invalid.

This section is about lossless compression.

Common Compression Formats

You will see several names often.

Name       Common use
gzip       .gz files, HTTP compression, logs
zlib       compressed data format used by many systems
deflate    compression algorithm used inside zlib and gzip
zstd       modern fast compression with good ratios
lzma       high compression ratio, often slower
tar        archive format, often combined with compression

Be careful with vocabulary.

tar is not compression by itself. It combines many files into one archive.

gzip compresses one byte stream.

So this file:

archive.tar.gz

usually means:

First, many files were packed into one .tar.

Then the .tar stream was compressed with gzip.

Working with Compressed Data

At a high level, compression code usually looks like this:

const compressed = try compress(allocator, original);
defer allocator.free(compressed);

const restored = try decompress(allocator, compressed);
defer allocator.free(restored);

The exact APIs depend on the compression format and Zig version.

But the resource rules are familiar:

compression may allocate

decompression may allocate

both can fail

allocated buffers must be freed

compressed data is still just bytes

Why Decompression Can Fail

Decompression can fail for many reasons.

The input may not be compressed data.

The input may use the wrong compression format.

The compressed data may be corrupted.

The output may be too large.

The allocator may fail.

The stream may end early.

So decompression returns errors.
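
In Zig, failures like these map naturally onto an error set. The names below are only illustrative; they are not taken from any particular library:

const DecompressError = error{
    NotCompressedData, // the input is not in the expected format
    UnsupportedFormat, // the data uses a different compression format
    CorruptInput, // the compressed stream is damaged
    UnexpectedEndOfStream, // the stream stops too early
    OutputTooLarge, // expansion exceeded an allowed limit
    OutOfMemory, // the allocator failed
};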

Do not treat compressed data as trusted input. A malformed compressed file can be accidental or malicious.

Compression Ratio

Compression ratio measures how much smaller the data becomes.

Suppose the original data is 1000 bytes.

The compressed data is 250 bytes.

The compressed size is one quarter of the original size, a 4:1 compression ratio.

That is a good compression result.
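
One common way to state the ratio is the original size divided by the compressed size. A tiny helper makes the arithmetic concrete:

const std = @import("std");

// 1000 original bytes compressed to 250 bytes is a ratio of 4.0,
// meaning the compressed data is 25% of the original size.
fn compressionRatio(original_len: usize, compressed_len: usize) f64 {
    return @as(f64, @floatFromInt(original_len)) /
        @as(f64, @floatFromInt(compressed_len));
}

test "1000 bytes down to 250 bytes is a 4:1 ratio" {
    try std.testing.expectEqual(@as(f64, 4.0), compressionRatio(1000, 250));
}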

But not all data compresses well.

This text compresses well:

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

This random-looking data may not:

8f 2a 90 11 c4 7b e2 09

Already compressed files often do not compress much more. Compressing a .jpg, .mp4, or .zip again may waste CPU and produce little benefit.

Compression Speed vs Size

Compression algorithms make tradeoffs.

Some are very fast but produce larger output.

Some produce smaller output but take more CPU time.

A log pipeline may prefer speed.

An archive stored for years may prefer size.

A network service may need a balance.

Many compressors have levels.

A low level is faster.

A high level usually produces smaller output.

Example conceptually:

level 1  -> faster, larger output
level 9  -> slower, smaller output

Do not assume the highest level is always best. Measure with your real data.
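
Measuring is simple with std.time.Timer. The compressAtLevel call below is a hypothetical wrapper around whatever library and level option you choose:

var timer = try std.time.Timer.start();

// compressAtLevel is a placeholder, not a std API.
const compressed = try compressAtLevel(allocator, input, level);
defer allocator.free(compressed);

const elapsed_ms = timer.read() / std.time.ns_per_ms;
std.debug.print("level {d}: {d} -> {d} bytes in {d} ms\n", .{
    level, input.len, compressed.len, elapsed_ms,
});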

Streaming Compression

Small data can be compressed all at once.

Large data should usually be compressed as a stream.

Streaming means processing data in chunks:

read chunk -> compress chunk -> write chunk
read chunk -> compress chunk -> write chunk
read chunk -> compress chunk -> write chunk

This avoids loading the whole file into memory.

The same idea applies to decompression:

read compressed chunk -> decompress chunk -> write plain chunk

Streaming is important for large files, network connections, and memory-constrained programs.
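
A sketch of the reading side. The chunk buffer has a fixed size, so memory use stays constant no matter how large the input is; feedChunk is a placeholder for the real compressor or decompressor, which keeps internal state between chunks:

var buf: [64 * 1024]u8 = undefined; // fixed 64 KiB chunk buffer

while (true) {
    const n = try input_file.read(&buf);
    if (n == 0) break; // end of input

    // feedChunk is hypothetical: the same compressor object must see
    // every chunk, because state carries over from one chunk to the next.
    try compressor.feedChunk(buf[0..n]);
}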

Buffer-Based Compression

For small inputs, buffer-based compression is simpler.

Conceptual shape:

const input = "hello hello hello hello\n";

const compressed = try compressToBuffer(allocator, input);
defer allocator.free(compressed);

const plain = try decompressToBuffer(allocator, compressed);
defer allocator.free(plain);

This is easy to understand, but it requires enough memory to hold the input, the compressed output, and the decompressed output at the same time.

For small configuration data or tests, that is fine.

For large files, prefer streaming.

File Compression Pattern

A command-line compression tool usually follows this shape:

open input file
defer close input file

open output file
defer close output file

create compressor

while true:
    read input chunk
    if end: break
    write compressed chunk

finish compressor

The final “finish” step matters.

Many compression formats need to write footer data, checksums, or final buffered bytes.

If you forget to finish or flush the compressor, the output file may be incomplete.
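
In Zig this is also a place where defer is the wrong tool: finishing can fail, and a defer cannot return an error. Call the finish step explicitly as the last step, whatever the real method is called in the library you use:

// After the loop, not inside it, and not in a defer.
// Footers, checksums, and buffered bytes are written here,
// and any error must reach the caller.
try compressor.finish(); // hypothetical name; check your library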

File Decompression Pattern

A decompression tool has the reverse shape:

open compressed file
defer close compressed file

open output file
defer close output file

create decompressor

while true:
    read compressed chunk
    if end: break
    write plain chunk

The decompressor must validate the input.

If the compressed stream is malformed, the program should return an error and avoid treating partial output as valid.
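
One practical way to honor this is to remove the partially written output file whenever decompression fails. A sketch, with "output.bin" standing in for the real output path:

const out_path = "output.bin"; // stand-in for the real output path

// If any later step fails, remove the partial output instead of keeping it.
errdefer std.fs.cwd().deleteFile(out_path) catch {};

const out_file = try std.fs.cwd().createFile(out_path, .{});
defer out_file.close();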

A Simple Design for a Compression API

If you write your own wrapper, make the direction clear.

Good names:

compressGzip
decompressGzip
compressZstd
decompressZstd

Avoid vague names:

process
convert
handleData

Compression code passes around several similar-looking byte streams. Clear names prevent mixing them up.

A good function signature might look like:

fn compressData(
    allocator: std.mem.Allocator,
    input: []const u8,
) ![]u8 {
    // returns allocated compressed bytes
}

This signature says:

It needs an allocator.

It reads input bytes.

It can fail.

It returns allocated output.

The caller must free the result.

Avoid Compression Bombs

A compression bomb is compressed data that expands to a huge size.

For example, a small compressed file might decompress into gigabytes of data.

If your program blindly decompresses into memory, it can run out of memory.

So when decompressing untrusted data, set limits.

Conceptually:

if (output_size > max_allowed_size) {
    return error.OutputTooLarge;
}

This is important for servers, archive tools, package managers, and anything that accepts files from users.
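
When decompressing in chunks, track the running total and stop as soon as it crosses a cap. A small self-contained sketch; the 64 MiB limit is only an example value:

const std = @import("std");

const max_output_size: usize = 64 * 1024 * 1024; // example cap: 64 MiB

// Call this after every decompressed chunk and keep the returned total.
fn addToTotal(total_so_far: usize, chunk_len: usize) error{OutputTooLarge}!usize {
    const new_total = total_so_far + chunk_len;
    if (new_total > max_output_size) return error.OutputTooLarge;
    return new_total;
}

test "a bomb-sized output is rejected" {
    var total: usize = 0;
    total = try addToTotal(total, 1024); // a normal chunk is fine
    try std.testing.expectError(error.OutputTooLarge, addToTotal(total, max_output_size));
}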

Checksums

Some compression formats include checksums.

A checksum helps detect corrupted data.

For example, if one byte in a compressed file changes, decompression may fail with a checksum error.
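
gzip, for example, stores a CRC-32 of the uncompressed data. The idea is easy to see with Zig's std.hash.Crc32, independent of any compression API:

const std = @import("std");

test "one flipped byte changes the checksum" {
    const original = "hello hello hello";
    var corrupted = original.*;
    corrupted[0] = 'x';

    // CRC-32 detects any single-byte change.
    const good = std.hash.Crc32.hash(original);
    const bad = std.hash.Crc32.hash(&corrupted);
    try std.testing.expect(good != bad);
}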

A checksum is not the same as encryption or authentication.

It can detect accidental corruption.

It does not prove that the data came from a trusted source.

For security, use cryptographic authentication such as a MAC or digital signature.

Compression Is Not Encryption

Compression makes data smaller.

Encryption makes data unreadable without a key.

They solve different problems.

Compressed data may still reveal information.

Encrypted data should hide the contents.

Do not store secrets in compressed files and assume they are protected.

When to Compress

Compression is useful when:

data has repeated patterns

data is large

storage or bandwidth matters

CPU cost is acceptable

data is stored for a long time and read rarely

Compression may be a poor choice when:

data is already compressed

latency matters more than size

CPU is the bottleneck

files are tiny

the format must support random access

Some formats support block compression for random access, but plain stream compression usually does not.

What Beginners Should Learn First

Do not start by memorizing every compression API.

Start with the concepts:

compressed data is bytes

decompression can fail

streaming avoids large memory use

compression has speed-size tradeoffs

finalization or flushing matters

untrusted compressed data needs limits

Then learn one specific format, such as gzip or zstd.

The Core Pattern

For small data:

const compressed = try compress(allocator, input);
defer allocator.free(compressed);

const output = try decompress(allocator, compressed);
defer allocator.free(output);

For files:

open input
defer close input

open output
defer close output

while reading chunks:
    compress or decompress
    write result

finish or flush

For safety:

if (decompressed_size > max_size) {
    return error.OutputTooLarge;
}

What You Should Remember

Compression makes data smaller.

Decompression restores the original bytes.

Lossless compression is required for code, text data, archives, databases, and structured files.

Lossy compression is for media where exact bytes are not required.

Compressed data is still byte data.

Compression can fail.

Decompression can fail.

Large files should be processed in chunks.

Always finish or flush compression streams.

Do not trust compressed input blindly.

Compression saves space or bandwidth by spending CPU time.