Better data compressors
You can gain in performance, compression ratio, or both if you use a newer, more advanced data compressor. The main downside is a smaller install base, which matters if you want to share compressed files. The easiest place to adopt a new compressor is your own systems and code.
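When a script may also run on machines where a newer compressor is missing, one way to handle the smaller install base is to fall back to gzip. The following is a minimal sketch, assuming a POSIX shell and that gzip is always present. The function names `compress_file` and `decompress_file` are hypothetical, and the compression levels are simply the ones benchmarked later in this article.

```shell
#! /bin/sh
# Sketch only: compress_file and decompress_file are hypothetical
# helpers, not part of any tool mentioned in this article.

# Compress "$1" next to the original and print the new file's name.
# Prefers zstd; falls back to gzip when zstd is not installed.
compress_file() {
    if command -v zstd > /dev/null 2>&1; then
        zstd -7 -q -c < "$1" > "$1.zst" && printf '%s\n' "$1.zst"
    else
        gzip -9 -c < "$1" > "$1.gz" && printf '%s\n' "$1.gz"
    fi
}

# Decompress "$1" to standard output, choosing the tool by extension.
decompress_file() {
    case "$1" in
        *.zst) zstd -d -q -c < "$1" ;;
        *.gz) gzip -d -c < "$1" ;;
        *)
            printf 'unknown extension: %s\n' "$1" >&2
            return 1
            ;;
    esac
}
```

For example, `compress_file backup.tar` prints `backup.tar.zst` on a machine with zstd and `backup.tar.gz` elsewhere.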
Newer, better compressors include the following.
- According to my tests, `zstd -7` compresses as fast as or faster than `gzip -9` on a wide range of hardware with a better compression ratio. `zstd -7 --long` results in an even better ratio, though it uses several times more RAM. Zstandard is mature, maintained, and increasingly widely adopted. I would use it for backups and long-term archival (and do).
- `lrzip -z -L 3` is almost as good as `xz -9` on a large collection of JSON files but compresses 5x faster. With the right settings it can often achieve a higher compression ratio in less time than Zstandard, but it is less mature. I have had lrzip crash on rare occasions. I would use it for data transfer and non-critical backups.
The following script, `compbench`, runs different commands against the same input and prints the resulting compressed size, compression ratio, compression time, and peak memory usage for each.
#! /bin/sh
# compbench, a compressor benchmarking script.
# Tested on Ubuntu 22.04, Debian GNU/Linux 11,
# FreeBSD 13.1-RELEASE, NetBSD 9.3, and OpenBSD 7.2.
#
# Copyright (c) 2020-2023 D. Bohdan
#
# Permission to use, copy, modify, and/or distribute this software
# for any purpose with or without fee is hereby granted.
#
# THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL
# WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED
# WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE
# AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR
# CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM
# LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT,
# NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN
# CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
usage() {
    printf 'usage: compbench [-m] file command1 [command2 ...]\n\n'
    printf 'options:\n'
    # shellcheck disable=SC2016
    printf ' -m\n'
    printf ' output Markdown\n\n'
    printf 'example:\n'
    printf " \$ compbench test.tar cat lz4 'gzip -9' 'zstd -7 --long'\\n"
}
for arg in "$@"; do
    if [ "$arg" = -h ] || [ "$arg" = --help ]; then
        usage
        exit 0
    fi
done
heading_format='=== %s\n'
size_format='%8.2f'
time_format='%2u:%02u.%02u'
# Markdown mode.
if [ "$1" = -m ]; then
    # This will produce invalid Markdown if a compressor command contains
    # '`'.
    # shellcheck disable=SC2016
    heading_format='### `%s`\n\n'
    size_format='- %.2f'
    time_format='- %u:%02u.%02u'
    shift
fi
file="$1"
if [ -z "$file" ]; then
    echo 'no file argument'
    exit 2
fi
if [ ! -e "$file" ]; then
    echo "file doesn't exist"
    exit 1
fi
if [ -d "$file" ]; then
    echo 'file argument is a directory'
    exit 1
fi
shift
orig_size="$(wc -c "$file" | awk '{ print $1 }')"
temp_size="$(mktemp)"
temp_time="$(mktemp)"
clean_up() {
    rm "$temp_size" "$temp_time"
}
trap clean_up EXIT
if [ "$(uname)" = Linux ]; then
    # GNU, BusyBox.
    arg1=-f
    arg2q='%e real\n%M maximum resident size'
else
    # DragonFly/Free/Net/OpenBSD.
    arg1=
    arg2q=-l
fi
first=1
for comp in "$@"; do
    if [ "$first" = 1 ]; then
        first=0
    else
        printf '\n'
    fi

    # shellcheck disable=SC2059
    printf "$heading_format" "$comp"

    # shellcheck disable=SC2086
    command time $arg1 "$arg2q" $comp < "$file" 2> "$temp_time" \
        | wc -c \
        | awk \
            -v "orig_size=$orig_size" \
            -v "size_format=$size_format" \
            '
                {
                    printf size_format " MiB compressed\n", $1 / 1024 / 1024
                    printf size_format " ratio\n", $1 / orig_size
                }
            ' \
            > "$temp_size" \
        ;

    awk \
        -v "size_format=$size_format" \
        -v "time_format=$time_format" \
        '
            /real/ {
                m = $1 / 60
                s = $1 % 60
                cs = $1 * 100 % 100
                printf time_format " elapsed\n", m, s, cs
            }

            /maximum resident/ {
                printf size_format " MiB max RSS\n", $1 / 1024
            }
        ' \
        "$temp_time" \
        ;

    cat "$temp_size"
done
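The script's second awk program turns the seconds reported by `time` into minutes, seconds, and centiseconds. That arithmetic can be tried in isolation. This is a minimal sketch rather than part of `compbench`: the helper name `format_elapsed` is hypothetical, and it truncates explicitly with `int()` and prints with `%d` where the script relies on `%u` to drop the fraction.

```shell
#! /bin/sh
# Hypothetical helper: format a duration in seconds as m:ss.cs.
format_elapsed() {
    awk -v "secs=$1" 'BEGIN {
        m = int(secs / 60)          # whole minutes
        s = int(secs % 60)          # whole seconds left over
        cs = int(secs * 100 % 100)  # hundredths of a second
        printf "%d:%02d.%02d\n", m, s, cs
    }'
}

format_elapsed 90.5  # prints 1:30.50
format_elapsed 125   # prints 2:05.00
```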
The file `AllPrintings.json` was version 4.6.3+20200501 and 194 MiB in size. I used zstd version 1.4.4 and lrzip version 0.631.
Compressor | Compression ratio (compressed / original) | Compressed size (MiB) | Elapsed time (m:ss.cs, wall clock) | Max resident set size (MiB) |
---|---|---|---|---|
`lz4` | 0.36 | 69.34 | 0:01.09 | 7.08 |
`gzip -9` | 0.23 | 45.20 | 0:13.01 | 1.89 |
`zstd -7` | 0.16 | 31.60 | 0:10.71 | 40.09 |
`bzip2 -9` | 0.15 | 28.39 | 0:37.99 | 8.56 |
`zstd -7 --long` | 0.14 | 27.25 | 0:10.80 | 168.34 |
`lrzip -z -L 3` | 0.12 | 23.19 | 0:40.41 | 342.72 |
`xz -9` | 0.10 | 19.39 | 2:38.82 | 675.51 |
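The ratio column is the compressed size divided by the original size, so smaller is better. As a quick check of a row, here is a minimal sketch with a hypothetical `ratio` helper. It uses the rounded 194 MiB figure for the input, so it only approximates what the script computes from exact byte counts.

```shell
#! /bin/sh
# Hypothetical helper: print compressed size / original size.
ratio() {
    awk -v "compressed=$1" -v "original=$2" \
        'BEGIN { printf "%.2f\n", compressed / original }'
}

ratio 31.60 194  # zstd -7 row, prints 0.16
ratio 69.34 194  # lz4 row, prints 0.36
```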
- “Use Fast Data Algorithms”, Joey Lynch, 2021