Better data compressors
Newer compressors
You can gain in performance, compression ratio, or both if you use a newer, more advanced data compressor. The main downside is a smaller install base, which matters if you want to share compressed files. The easiest place to adopt a new compressor is your own systems and code.
Newer, better compressors include the following.
Zstandard
According to my tests, zstd -7 compresses as fast as or faster than gzip -9 on a wide range of hardware, with a better compression ratio. zstd -7 --long results in an even better ratio, though it uses several times more RAM. Zstandard is mature, maintained, and increasingly widely adopted. I would use it for backups and long-term archival (and do).
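As a rough illustration of the backup use case, here is a minimal sketch that pipes a tar stream through zstd -7 --long. The function names backup_dir and restore_dir are hypothetical, and the sketch assumes zstd and a tar that supports -C are installed.

```shell
#! /bin/sh
# Sketch only: backup_dir and restore_dir are hypothetical helper names.

# Archive directory $1 into the file $2 using zstd -7 --long.
# zstd refuses to overwrite an existing $2 because -f is not given.
backup_dir() {
    src="$1"
    dest="$2"
    tar -cf - -C "$(dirname "$src")" "$(basename "$src")" \
        | zstd -7 --long -q -o "$dest"
}

# Unpack the archive $1 into the existing directory $2.
restore_dir() {
    archive="$1"
    target="$2"
    zstd -d -c "$archive" | tar -xf - -C "$target"
}
```

For example, backup_dir "$HOME/photos" /mnt/backup/photos.tar.zst followed later by restore_dir /mnt/backup/photos.tar.zst /tmp/restore. With a plain --long, the default window should decompress without extra options; according to the zstd manual, a larger window such as --long=30 needs a matching --long (or --memory) on decompression.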
Long Range Zip
lrzip -z -L 3 is almost as good as xz -9 on a large collection of JSON files but compresses 5x faster. With the right settings it can often achieve a higher compression ratio in less time than Zstandard, but it is less mature. I have had lrzip crash on rare occasions. I would use it for data transfer and non-critical backups.
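Given the occasional crashes mentioned above, one cautious pattern is to test each archive right after creating it. The sketch below is only an illustration: lrzip_checked is a hypothetical name, and it assumes an lrzip build that supports the -q (quiet), -o (output file), and -t (test integrity) options.

```shell
#! /bin/sh
# Sketch only: lrzip_checked is a hypothetical helper name.

# Compress file $1 to $1.lrz with the settings discussed above,
# then verify the result.  Returns non-zero if either step fails.
# The original file is left in place.
lrzip_checked() {
    file="$1"
    lrzip -q -z -L 3 -o "$file.lrz" "$file" || return 1
    # -t decompresses the archive and checks its integrity
    # without keeping the output.
    lrzip -q -t "$file.lrz" || return 1
}
```

A caller could delete the original only after lrzip_checked succeeds, which keeps a crash or a corrupt archive from costing any data.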
A shell script for comparing compressors
This script runs different commands against the same input and prints the resulting compressed size, compression ratio, compression time, and peak memory usage.
#! /bin/sh
# compbench, a compressor benchmarking script.
# usage: compbench file command1 [command2 ...]
# For example:
# $ compbench test.tar cat lz4 'gzip -9' 'zstd --long -19'
# Tested on Ubuntu 22.04, Debian GNU/Linux 11,
# FreeBSD 13.1-RELEASE, NetBSD 9.3, and OpenBSD 7.2.
#
# Copyright (c) 2020-2023 D. Bohdan
#
# Permission to use, copy, modify, and/or distribute this software
# for any purpose with or without fee is hereby granted.
#
# THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL
# WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED
# WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE
# AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR
# CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM
# LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT,
# NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN
# CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
file="$1"
if [ ! -f "$file" ]; then
    echo 'no file'
    exit 1
fi
shift
if [ -z "$1" ]; then
    echo 'no compressor commands'
    exit 1
fi
origsize="$(wc -c "$file" | awk '{ print $1 }')"
tempsize="$(mktemp)"
temptime="$(mktemp)"
cleanup() {
    rm "$tempsize" "$temptime"
}
trap cleanup EXIT
if [ "$(uname)" = Linux ]; then
    # GNU, BusyBox.
    arg1=-f
    arg2q='%e real\n%M maximum resident size'
else
    # DragonFly/Free/Net/OpenBSD.
    arg1=
    arg2q=-l
fi
first=1
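for comp in "$@"; do
    # Separate the reports with a blank line.
    if [ "$first" = 1 ]; then
        first=0
    else
        printf '\n'
    fi
    echo "=== $comp"
    # $comp is deliberately unquoted so that it splits into
    # a command and its arguments.  time(1) writes its report
    # to stderr, which is saved in "$temptime".
    command time $arg1 "$arg2q" $comp < "$file" 2> "$temptime" \
        | wc -c \
        | awk -v "origsize=$origsize" '
            {
                printf "%8.2f MiB compressed\n", $1 / 1024 / 1024
                if (origsize > 0) {
                    printf "%8.2f ratio\n", $1 / origsize
                }
            }
        ' > "$tempsize"
    # Convert the elapsed seconds to m:ss.cc and the max RSS to MiB.
    awk '
        /real/ {
            m = $1 / 60
            s = $1 % 60
            cs = $1 * 100 % 100
            printf "%2u:%02u.%02u elapsed\n", m, s, cs
        }
        /maximum resident/ {
            printf "%8.2f MiB max RSS\n", $1 / 1024
        }
    ' "$temptime"
    cat "$tempsize"
done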
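The size measurement at the heart of the script can be tried in isolation. The following is a simplified sketch, not part of compbench: size_ratio is a hypothetical helper that takes the command as separate arguments instead of one string, and it leaves out the timing. It also shows why cat is a handy baseline in the usage example: it "compresses" to exactly the original size.

```shell
#! /bin/sh
# Sketch only: size_ratio is a hypothetical helper, not part of compbench.

# Print compressed size divided by original size for file $1,
# using the remaining arguments as the compressor command.
size_ratio() {
    file="$1"
    shift
    origsize="$(wc -c < "$file")"
    # Run the compressor as a filter and count the bytes it writes.
    "$@" < "$file" | wc -c | awk -v "origsize=$origsize" '
        { if (origsize > 0) printf "%.2f\n", $1 / origsize }
    '
}

# A highly repetitive sample file.
sample="$(mktemp)"
awk 'BEGIN { for (i = 0; i < 1000; i++) print "the same line again and again" }' > "$sample"

size_ratio "$sample" cat        # prints 1.00
size_ratio "$sample" gzip -9    # prints a much smaller ratio

rm -f "$sample"
```

A lower ratio is better here, since the script divides the compressed size by the original size.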
An MTGJSON test
The file AllPrintings.json was version 4.6.3+20200501 and 194 MiB in size. I used zstd version 1.4.4 and lrzip version 0.631.
Results
Compressor | Compression ratio | Compressed size (MiB) | Elapsed time (wall clock) | Max resident set (MiB) |
---|---|---|---|---|
lz4 | 0.36 | 69.34 | 0:01.09 | 7.08 |
gzip -9 | 0.23 | 45.20 | 0:13.01 | 1.89 |
zstd -7 | 0.16 | 31.60 | 0:10.71 | 40.09 |
bzip2 -9 | 0.15 | 28.39 | 0:37.99 | 8.56 |
zstd -7 --long | 0.14 | 27.25 | 0:10.80 | 168.34 |
lrzip -z -L 3 | 0.12 | 23.19 | 0:40.41 | 342.72 |
xz -9 | 0.10 | 19.39 | 2:38.82 | 675.51 |
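To compare speeds more directly, the elapsed times can be turned into throughput by dividing the 194 MiB input size by the time in seconds. The sketch below does that arithmetic; to_seconds and throughput are hypothetical helper names, and the figures it prints are derived from the table rather than measured separately.

```shell
#! /bin/sh
# Sketch only: to_seconds and throughput are hypothetical helper names.

# Convert an elapsed time in m:ss.cc format to seconds.
to_seconds() {
    printf '%s\n' "$1" | awk -F: '{ printf "%.2f\n", $1 * 60 + $2 }'
}

# Print MiB/s for an input of $1 MiB compressed in $2 (m:ss.cc).
throughput() {
    awk -v "mib=$1" -v "secs=$(to_seconds "$2")" \
        'BEGIN { printf "%.1f\n", mib / secs }'
}

throughput 194 0:10.71    # zstd -7: prints 18.1
throughput 194 2:38.82    # xz -9:   prints 1.2
```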
See also
- “Use Fast Data Algorithms”, Joey Lynch (2021)