Better data compressors

Newer compressors

You can gain in performance, compression ratio, or both if you use a newer, more advanced data compressor. The main downside is a smaller install base, which matters if you want to share compressed files. The easiest place to adopt a new compressor is your own systems and code.

Newer, better compressors include the following.

Zstandard

Repository.

According to my tests, zstd -7 compresses as fast as or faster than gzip -9 on a wide range of hardware and achieves a better compression ratio. zstd -7 --long yields an even better ratio, though it uses several times more RAM. Zstandard is mature, maintained, and increasingly widely adopted. I would use it for backups and long-term archival (and do).
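As a rough illustration of the backup use case, the sketch below archives a directory with tar and compresses it with the settings discussed above. The function name zstd_backup and its arguments are hypothetical. It assumes tar and zstd are installed.

```sh
#! /bin/sh
# Sketch only: zstd_backup is a hypothetical helper, not part of zstd.
# Usage: zstd_backup DIRECTORY OUTPUT.tar.zst
zstd_backup() {
    src="$1"
    out="$2"
    # Archive with tar and compress the stream with zstd.
    # Note: a pipeline only reports the exit status of its last command,
    # so a tar failure is not detected here.
    tar -cf - "$src" | zstd -7 --long > "$out" || return 1
    # Check the integrity of the compressed file without extracting it.
    zstd -t -q "$out"
}
```

A call might look like `zstd_backup ~/documents documents.tar.zst`. The paths are placeholders.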

Long Range Zip

Repository.

On a large collection of JSON files, lrzip -z -L 3 compresses almost as well as xz -9 but 5x faster. With the right settings, it can often achieve a higher compression ratio in less time than Zstandard, but it is less mature: I have had lrzip crash on rare occasions. I would use it for data transfer and non-critical backups.
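With a less mature compressor, it may be worth checking that a file survives a round trip before relying on the archive. The following is a minimal sketch of such a check. The name roundtrip_ok is hypothetical, and the sketch assumes both commands work as filters that read standard input and write standard output. Whether a particular lrzip build can be used this way is something to verify against its manual.

```sh
#! /bin/sh
# Hypothetical helper: succeed only if FILE comes back unchanged after
# passing through the compress command and then the decompress command.
# Both commands must read stdin and write stdout.
# Usage: roundtrip_ok FILE 'COMPRESS COMMAND' 'DECOMPRESS COMMAND'
roundtrip_ok() {
    file="$1"
    compress="$2"
    decompress="$3"
    # Word splitting of the command strings is intentional.
    # shellcheck disable=SC2086
    $compress < "$file" | $decompress | cmp -s - "$file"
}
```

For example, `roundtrip_ok test.tar 'zstd -7 --long' 'zstd -d' && echo ok` should print "ok" if the decompressed stream matches the input byte for byte.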

A shell script for comparing compressors

This script runs different commands against the same input and prints the resulting compressed size, compression ratio, compression time, and peak memory usage.

Download.

#! /bin/sh
# compbench, a compressor benchmarking script.
# Tested on Ubuntu 22.04, Debian GNU/Linux 11,
# FreeBSD 13.1-RELEASE, NetBSD 9.3, and OpenBSD 7.2.
#
# Copyright (c) 2020-2023 D. Bohdan
#
# Permission to use, copy, modify, and/or distribute this software
# for any purpose with or without fee is hereby granted.
#
# THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL
# WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED
# WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE
# AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR
# CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM
# LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT,
# NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN
# CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

usage() {
    printf 'usage: compbench [-m] file command1 [command2 ...]\n\n'
    printf 'options:\n'
    # shellcheck disable=SC2016
    printf '  -m\n'
    printf '    output Markdown\n\n'
    printf 'example:\n'
    printf "  \$ compbench test.tar cat lz4 'gzip -9' 'zstd -7 --long'\\n"
}

for arg in "$@"; do
    if [ "$arg" = -h ] || [ "$arg" = --help ]; then
        usage
        exit 0
    fi
done

heading_format='=== %s\n'
size_format='%8.2f'
time_format='%2u:%02u.%02u'

# Markdown mode.
if [ "$1" = -m ]; then
    # This will produce invalid Markdown if a compressor command contains
    # '`'.
    # shellcheck disable=SC2016
    heading_format='### `%s`\n\n'
    size_format='- %.2f'
    time_format='- %u:%02u.%02u'
    shift
fi

file="$1"
if [ -z "$file" ]; then
    echo 'no file argument' >&2
    exit 2
fi
if [ ! -e "$file" ]; then
    echo "file doesn't exist" >&2
    exit 1
fi
if [ -d "$file" ]; then
    echo 'file argument is a directory' >&2
    exit 1
fi
shift

orig_size="$(wc -c "$file" | awk '{ print $1 }')"
temp_size="$(mktemp)"
temp_time="$(mktemp)"

clean_up() {
    rm "$temp_size" "$temp_time"
}
trap clean_up EXIT

if [ "$(uname)" = Linux ]; then
    # GNU, BusyBox.
    arg1=-f
    arg2q='%e real\n%M maximum resident size'
else
    # DragonFly/Free/Net/OpenBSD.
    arg1=
    arg2q=-l
fi

first=1
for comp in "$@"; do
    if [ "$first" = 1 ]; then
        first=0
    else
        printf '\n'
    fi

    # shellcheck disable=SC2059
    printf "$heading_format" "$comp"

    # shellcheck disable=SC2086
    command time $arg1 "$arg2q" $comp < "$file" 2> "$temp_time" \
    | wc -c \
    | awk \
        -v "orig_size=$orig_size" \
        -v "size_format=$size_format" \
        '
            {
                printf size_format " MiB compressed\n", $1 / 1024 / 1024
                printf size_format " ratio\n", $1 / orig_size
            }
        ' \
        > "$temp_size" \
        ;

    awk \
        -v "size_format=$size_format" \
        -v "time_format=$time_format" \
        '
            /real/ {
                m = $1 / 60
                s = $1 % 60
                # Round first: 0.29 * 100 is 28.99... in floating point.
                cs = int($1 * 100 + 0.5) % 100
                printf time_format " elapsed\n", m, s, cs
            }

            /maximum resident/ {
                printf size_format " MiB max RSS\n", $1 / 1024
            }
        ' \
        "$temp_time" \
        ;

    cat "$temp_size"
done
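The core of the size measurement can be reduced to a few lines. The sketch below is a simplified illustration, not a replacement for the script: comp_ratio is a hypothetical helper that pipes a file through one compressor command and prints the ratio of compressed size to original size, so lower is better.

```sh
#! /bin/sh
# Hypothetical helper, not part of compbench: print the compression
# ratio (compressed size / original size) for one compressor command.
# Usage: comp_ratio FILE COMMAND [ARG ...]
comp_ratio() {
    file="$1"
    shift
    orig_size="$(wc -c < "$file")"
    # "$@" is the compressor command; it must read stdin and write stdout.
    "$@" < "$file" | wc -c | awk -v "orig_size=$orig_size" \
        '{ printf "%.2f\n", $1 / orig_size }'
}
```

Running `comp_ratio test.tar cat` should print 1.00, since cat leaves the data unchanged. This makes cat a useful baseline, as in the usage example of the script.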

An MTGJSON test

The file AllPrintings.json was version 4.6.3+20200501 and 194 MiB in size. I used zstd version 1.4.4 and lrzip version 0.631.

Results

Compressor      Compression ratio  Compressed size (MiB)  Elapsed time (wall clock)  Max resident set (MiB)
lz4             0.36               69.34                  0:01.09                    7.08
gzip -9         0.23               45.20                  0:13.01                    1.89
zstd -7         0.16               31.60                  0:10.71                    40.09
bzip2 -9        0.15               28.39                  0:37.99                    8.56
zstd -7 --long  0.14               27.25                  0:10.80                    168.34
lrzip -z -L 3   0.12               23.19                  0:40.41                    342.72
xz -9           0.10               19.39                  2:38.82                    675.51
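
To compare speed more directly, the elapsed times above can be converted to throughput. The sketch below divides the input size (194 MiB, as stated earlier) by the elapsed time. The helper name throughput is hypothetical, and the figures it prints are derived from the table, not separately measured.

```sh
#! /bin/sh
# Hypothetical helper: print throughput in MiB/s given an input size in
# MiB and an elapsed time in the m:ss.cc format used in the table.
# Usage: throughput SIZE_MIB ELAPSED
throughput() {
    awk -v "size=$1" -v "elapsed=$2" 'BEGIN {
        # Split "m:ss.cc" into minutes and seconds.
        split(elapsed, t, ":")
        seconds = t[1] * 60 + t[2]
        printf "%.1f\n", size / seconds
    }'
}

throughput 194 0:13.01   # gzip -9
throughput 194 0:10.71   # zstd -7
throughput 194 2:38.82   # xz -9
```

With the table's numbers, this works out to roughly 14.9 MiB/s for gzip -9, 18.1 MiB/s for zstd -7, and 1.2 MiB/s for xz -9.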

See also