Better data compressors

A newer, more advanced data compressor can give you better speed, a better compression ratio, or both. The main downside is a smaller install base, which matters if you want to share compressed files. The easiest place to adopt a new compressor is your own systems and code.

Newer, better compressors include the following.

Zstandard

According to my tests, zstd -7 compresses as fast as or faster than gzip -9 on a wide range of hardware and achieves a better compression ratio. zstd -7 --long gives an even better ratio but uses several times more RAM. Zstandard is mature, maintained, and increasingly widely adopted. I would use it for backups and long-term archival (and do).
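
As a rough illustration of the backup use case, the sketch below streams a tar archive through zstd with the settings discussed here, checks the result, and restores it. The function names and paths are hypothetical and are not part of zstd. The only zstd options used are `-7`, `--long`, `-q`, `-o`, `-t`, `-d`, and `-c`.

```sh
#!/bin/sh
# Hypothetical helpers for illustration; the names are not part of zstd.

# backup_dir SRC_DIR DEST_FILE
# Archive SRC_DIR with tar(1) and compress the stream with zstd.
# Note: the pipeline's exit status is that of zstd, so a tar failure
# is not detected here. zstd refuses to overwrite an existing DEST_FILE.
backup_dir() {
    tar -cf - -C "$1" . | zstd -7 --long -q -o "$2"
}

# verify_backup FILE
# Test the integrity of the compressed stream without extracting it.
verify_backup() {
    zstd -t -q "$1"
}

# restore_backup FILE DEST_DIR
# Decompress FILE to standard output and unpack it into DEST_DIR.
restore_backup() {
    mkdir -p "$2" && zstd -d -q -c "$1" | tar -xf - -C "$2"
}
```

A call might look like `backup_dir "$HOME/documents" /mnt/backup/documents.tar.zst` followed by `verify_backup /mnt/backup/documents.tar.zst`. With the default `--long` window, decompression should not need any extra flags. A larger window set with `--long=N` would have to be passed to the decompressor as well.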

lrzip

lrzip -z -L 3 compresses almost as well as xz -9 on a large collection of JSON files but does it 5x faster. With the right settings, it can often achieve a higher compression ratio in less time than Zstandard, but it is less mature: I have had lrzip crash on rare occasions. I would use it for data transfer and non-critical backups.
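
Given the occasional crashes, one cautious pattern is to test every archive right after creating it. This is a minimal sketch, not a recommendation from the lrzip authors. The function name `lrz_compress_checked` is hypothetical, and the sketch assumes the `-q` (quiet), `-o` (output file), and `-t` (test integrity) options behave as described in the lrzip manual page.

```sh
#!/bin/sh
# Hypothetical helper for illustration; the name is not part of lrzip.

# lrz_compress_checked FILE
# Compress FILE to FILE.lrz with the settings discussed above, then test
# the new archive. Returns non-zero if either step fails, so the caller
# can fall back to another compressor.
lrz_compress_checked() {
    lrzip -q -z -L 3 -o "$1.lrz" "$1" && lrzip -q -t "$1.lrz"
}

# Example use (the path is an assumption):
#   lrz_compress_checked data.tar || printf 'lrzip failed\n' >&2
```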

This script runs different commands against the same input and prints the resulting compressed size, compression ratio, compression time, and peak memory usage.

#! /bin/sh
# compbench, a compressor benchmarking script.
# Tested on Ubuntu 24.04, Debian 12,
# FreeBSD 14.0-RELEASE, NetBSD 10.0, and OpenBSD 7.5.
#
# Copyright (c) 2020-2024 D. Bohdan
#
# Permission to use, copy, modify, and/or distribute this software
# for any purpose with or without fee is hereby granted.
#
# THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL
# WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED
# WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE
# AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR
# CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM
# LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT,
# NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN
# CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

usage() {
    printf 'usage: compbench [-m] [-p] [--] file [command1 command2 ...]\n\n'
    printf 'options:\n'
    printf '  -m\n'
    printf '    output Markdown\n'
    printf '  -p\n'
    printf '    show progress using pv(1)\n\n'
    printf 'example:\n'
    printf "  \$ compbench test.tar cat lz4 'gzip -6' 'zstd -7 --long'\\n"
}

for arg in "$@"; do
    if [ "$arg" = -h ] || [ "$arg" = --help ]; then
        usage
        exit 0
    fi
done

heading_format='=== %s\n'
size_format='%8.2f'
time_format='%2u:%02u.%02u'
use_pv=0

while getopts mp opt; do
    case "$opt" in
    # Markdown mode.
    m)
        # This will produce invalid Markdown if a compressor command contains
        # '`'.
        # shellcheck disable=SC2016
        heading_format='### `%s`\n\n'
        size_format='- %.2f'
        time_format='- %u:%02u.%02u'
        ;;
    # Enable pv(1).
    p)
        use_pv=1
        ;;
    ?)
        exit 2
        ;;
    esac
done
shift $((OPTIND - 1))

file="$1"
if [ -z "$file" ]; then
    printf 'no file argument\n' >/dev/stderr
    exit 2
fi
if [ ! -e "$file" ]; then
    printf "file doesn't exist\\n" >/dev/stderr
    exit 1
fi
if [ -d "$file" ]; then
    printf 'file argument is a directory\n' >/dev/stderr
    exit 1
fi
shift

orig_size="$(wc -c "$file" | awk '{ print $1 }')"
temp_size="$(mktemp)"
temp_time="$(mktemp)"

clean_up() {
    rm "$temp_size" "$temp_time"
}
trap clean_up EXIT

if [ "$(uname)" = Linux ]; then
    # GNU, BusyBox.
    arg1=-f
    arg2q='%e real\n%M maximum resident size'
else
    # DragonFly/Free/Net/OpenBSD.
    arg1=
    arg2q=-l
fi

first=1
for comp in "$@"; do
    if [ "$first" = 1 ]; then
        first=0
    else
        printf '\n'
    fi

    # shellcheck disable=SC2059
    printf "$heading_format" "$comp"

    # shellcheck disable=SC2086
    if [ "$use_pv" -eq 1 ]; then
        pv "$file" | command time $arg1 "$arg2q" $comp 2>"$temp_time"
    else
        command time $arg1 "$arg2q" $comp <"$file" 2>"$temp_time"
    fi |
        wc -c |
        awk \
            -v "orig_size=$orig_size" \
            -v "size_format=$size_format" \
            '
                {
                    printf size_format " MiB compressed\n", $1 / 1024 / 1024
                    printf size_format " ratio\n", $1 / orig_size
                }
            ' \
            >"$temp_size" \
        ;

    awk \
        -v "size_format=$size_format" \
        -v "time_format=$time_format" \
        '
            /real/ {
                m = $1 / 60
                s = $1 % 60
                cs = $1 * 100 % 100
                printf time_format " elapsed\n", m, s, cs
            }

            /maximum resident/ {
                printf size_format " MiB max RSS\n", $1 / 1024
            }
        ' \
        "$temp_time" \
        ;

    cat "$temp_size"
done

The results below are for the file AllPrintings.json, version 4.6.3+20200501, 194 MiB in size. I used zstd version 1.4.4 and lrzip version 0.631.

| Compressor | Compression ratio (compressed/original) | Compressed size (MiB) | Elapsed time (wall clock) | Max resident set size (MiB) |
|---|---|---|---|---|
| lz4 | 0.36 | 69.34 | 0:01.09 | 7.08 |
| gzip -9 | 0.23 | 45.20 | 0:13.01 | 1.89 |
| zstd -7 | 0.16 | 31.60 | 0:10.71 | 40.09 |
| bzip2 -9 | 0.15 | 28.39 | 0:37.99 | 8.56 |
| zstd -7 --long | 0.14 | 27.25 | 0:10.80 | 168.34 |
| lrzip -z -L 3 | 0.12 | 23.19 | 0:40.41 | 342.72 |
| xz -9 | 0.10 | 19.39 | 2:38.82 | 675.51 |
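
To compare speeds more directly, the elapsed times can be converted to approximate throughput. The sketch below is not part of compbench. The helper names `to_seconds` and `throughput` are hypothetical, and the figures are only approximate because the 194 MiB input size is rounded.

```sh
#!/bin/sh
# Hypothetical helpers for illustration; not part of compbench.

# to_seconds M:SS.CC
# Convert an elapsed time in the table's format to seconds.
to_seconds() {
    printf '%s\n' "$1" | awk -F: '{ printf "%.2f\n", $1 * 60 + $2 }'
}

# throughput SIZE_MIB M:SS.CC
# Print the input size divided by the elapsed time, in MiB/s.
throughput() {
    awk -v size="$1" -v secs="$(to_seconds "$2")" \
        'BEGIN { printf "%.2f\n", size / secs }'
}

# The input size and elapsed times are copied from the results above.
orig_mib=194
while read -r elapsed comp; do
    printf '%-16s %8s MiB/s\n' "$comp" "$(throughput "$orig_mib" "$elapsed")"
done <<'EOF'
0:01.09 lz4
0:13.01 gzip -9
0:10.71 zstd -7
0:37.99 bzip2 -9
0:10.80 zstd -7 --long
0:40.41 lrzip -z -L 3
2:38.82 xz -9
EOF
```

By this measure zstd -7 processes the input at roughly 18 MiB/s against roughly 15 MiB/s for gzip -9, while xz -9 manages a little over 1 MiB/s.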