# Better data compressors

## Contents

## Newer compressors {#list}

You can gain in performance, compression ratio, or both if you use a newer, more advanced data compressor. The main downside is a smaller install base, which matters if you want to share compressed files. The easiest place to adopt a new compressor is your own systems and code. Newer, better compressors include the following.

### Zstandard {#zstd}

[Repository](https://github.com/facebook/zstd). According to my tests `zstd -7` compresses as fast as or faster than `gzip -9` on a wide range of hardware with a better compression ratio. `zstd -7 --long` results in an even better ratio, though it uses several times more RAM. Zstandard is mature, maintained, and increasingly widely adopted. I would use it for backups and long-term archival (and do).

### Long Range Zip {#lrzip}

[Repository](https://github.com/ckolivas/lrzip). `lrzip -z -L 3` is almost as good as `xz -9` on a large collection of JSON files but compresses 5x faster. With the right settings it can often achieve a higher compression ratio in less time than Zstandard but is less mature. I have had lrzip crash on rare occasions. I would use it for data transfer and non-critical backups.

## A shell script for comparing compressors {#compbench}

This script runs different commands against the same input and prints the resulting compressed size, compression ratio, compression time, and peak memory usage. [Download](/compbench).

```sh
#! /bin/sh
# compbench, a compressor benchmarking script.
# Tested on Ubuntu 24.04, Debian 12,
# FreeBSD 14.0-RELEASE, NetBSD 10.0, and OpenBSD 7.5.
#
# Copyright (c) 2020-2024 D. Bohdan
#
# Permission to use, copy, modify, and/or distribute this software
# for any purpose with or without fee is hereby granted.
#
# THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL
# WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED
# WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE
# AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR
# CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM
# LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT,
# NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN
# CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

usage() {
    printf 'usage: compbench [-m] [-p] [--] file [command1 command2 ...]\n\n'
    printf 'options:\n'
    printf '  -m\n'
    printf '        output Markdown\n'
    printf '  -p\n'
    printf '        show progress using pv(1)\n\n'
    printf 'example:\n'
    printf "  \$ compbench test.tar cat lz4 'gzip -6' 'zstd -7 --long'\\n"
}

for arg in "$@"; do
    if [ "$arg" = -h ] || [ "$arg" = --help ]; then
        usage
        exit 0
    fi
done

heading_format='=== %s\n'
size_format='%8.2f'
time_format='%2u:%02u.%02u'
use_pv=0

while getopts mp opt; do
    case "$opt" in
        # Markdown mode.
        m)
            # This will produce invalid Markdown if a compressor command contains
            # '`'.
            # shellcheck disable=SC2016
            heading_format='### `%s`\n\n'
            size_format='- %.2f'
            time_format='- %u:%02u.%02u'
            ;;
        # Enable pv(1).
        p)
            use_pv=1
            ;;
        ?)
            exit 2
            ;;
    esac
done
shift $((OPTIND - 1))

file="$1"

if [ -z "$file" ]; then
    printf 'no file argument\n' >/dev/stderr
    exit 2
fi

if [ ! -e "$file" ]; then
    printf "file doesn't exist\\n" >/dev/stderr
    exit 1
fi

if [ -d "$file" ]; then
    printf 'file argument is a directory\n' >/dev/stderr
    exit 1
fi

shift

orig_size="$(wc -c "$file" | awk '{ print $1 }')"
temp_size="$(mktemp)"
temp_time="$(mktemp)"

clean_up() {
    rm "$temp_size" "$temp_time"
}
trap clean_up EXIT

if [ "$(uname)" = Linux ]; then
    # GNU, BusyBox.
    arg1=-f
    arg2q='%e real\n%M maximum resident size'
else
    # DragonFly/Free/Net/OpenBSD.
    arg1=
    arg2q=-l
fi

first=1
for comp in "$@"; do
    if [ "$first" = 1 ]; then
        first=0
    else
        printf '\n'
    fi

    # shellcheck disable=SC2059
    printf "$heading_format" "$comp"

    # shellcheck disable=SC2086
    if [ "$use_pv" -eq 1 ]; then
        pv "$file" | command time $arg1 "$arg2q" $comp 2>"$temp_time"
    else
        command time $arg1 "$arg2q" $comp <"$file" 2>"$temp_time"
    fi | wc -c | awk \
        -v "orig_size=$orig_size" \
        -v "size_format=$size_format" \
        '
            {
                printf size_format " MiB compressed\n", $1 / 1024 / 1024
                printf size_format " ratio\n", $1 / orig_size
            }
        ' \
        >"$temp_size" \
        ;

    awk \
        -v "size_format=$size_format" \
        -v "time_format=$time_format" \
        '
            /real/ {
                m = $1 / 60
                s = $1 % 60
                cs = $1 * 100 % 100
                printf time_format " elapsed\n", m, s, cs
            }

            /maximum resident/ {
                printf size_format " MiB max RSS\n", $1 / 1024
            }
        ' \
        "$temp_time" \
        ;

    cat "$temp_size"
done
```

## An [MTGJSON](https://mtgjson.com/) test {#test}

The file `AllPrintings.json` was version 4.6.3+20200501 and 194 MiB in size. I used zstd version 1.4.4 and lrzip version 0.631.

### Results

| Compressor | Compression ratio (compressed / original size) | Compressed size (MiB) | Elapsed time (m:ss.cc, wall clock) | Max resident set (MiB) |
|---|---|---|---|---|
| `lz4` | 0.36 | 69.34 | 0:01.09 | 7.08 |
| `gzip -9` | 0.23 | 45.20 | 0:13.01 | 1.89 |
| `zstd -7` | 0.16 | 31.60 | 0:10.71 | 40.09 |
| `bzip2 -9` | 0.15 | 28.39 | 0:37.99 | 8.56 |
| `zstd -7 --long` | 0.14 | 27.25 | 0:10.80 | 168.34 |
| `lrzip -z -L 3` | 0.12 | 23.19 | 0:40.41 | 342.72 |
| `xz -9` | 0.10 | 19.39 | 2:38.82 | 675.51 |
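
The derived figures in the table can be rechecked with a little awk. What follows is a minimal sketch and not part of compbench: the helper names `ratio`, `to_seconds`, and `speedup` are made up for illustration. It assumes that sizes are given in MiB and times in the `m:ss.cc` form used above, and that the ratio is compressed size divided by original size, as in the script.

```sh
#! /bin/sh
# Hypothetical helpers for rechecking the results table.
# They are illustrative only and are not part of compbench.

# ratio COMPRESSED ORIGINAL
# Compressed size divided by original size; lower is better.
ratio() {
    awk -v c="$1" -v o="$2" 'BEGIN { printf "%.2f\n", c / o }'
}

# to_seconds M:SS.CC
# Convert an elapsed time such as 2:38.82 to seconds.
to_seconds() {
    printf '%s\n' "$1" | awk -F: '{ printf "%.2f\n", $1 * 60 + $2 }'
}

# speedup NEW BASELINE
# How many times faster NEW finished than BASELINE.
speedup() {
    awk \
        -v new="$(to_seconds "$1")" \
        -v base="$(to_seconds "$2")" \
        'BEGIN { printf "%.1f\n", base / new }'
}

# `zstd -7 --long`: 27.25 MiB out of 194 MiB.
ratio 27.25 194            # prints 0.14

# `zstd -7` (0:10.71) against `gzip -9` (0:13.01).
speedup 0:10.71 0:13.01    # prints 1.2
```

The same helpers work on the plain-text output of compbench if you copy the numbers by hand. Parsing its output automatically would need more care and is not attempted here.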