Better data compressors
Lesser-known compressors
You can gain in performance and/or compression ratio if you use a more advanced but less common data compressor. These include:
- Zstandard. According to my tests, zstd -7 compresses as fast as or faster than gzip -9 on a wide range of hardware with a better compression ratio. zstd -7 --long results in an even better ratio, though it uses several times more RAM. Zstandard is mature, maintained, and increasingly widely deployed. I would use it for backups and long-term data archival (and I do!).
- Long Range Zip. lrzip -z -L 3 is almost as good as xz -9 on a large collection of JSON files but compresses 5x faster.
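As an illustration of the backup use case mentioned above, here is a minimal sketch that archives a directory with tar and compresses it with zstd -7 --long. The function name backup_dir and the output naming are made up for this example, and the gzip -9 fallback is an assumption about what to do when zstd is not installed, not something the tests above cover.

```shell
#! /bin/sh
# backup_dir DIR PREFIX (hypothetical helper, not part of any tool)
# Archives DIR with tar and compresses the stream.
# Uses zstd -7 --long when zstd is installed, otherwise gzip -9.
# Prints the name of the archive it created.
backup_dir() {
    dir="$1"
    prefix="$2"
    if command -v zstd > /dev/null 2>&1; then
        # -q silences progress output; -c writes to stdout.
        tar -cf - "$dir" | zstd -7 --long -q -c > "$prefix.tar.zst"
        echo "$prefix.tar.zst"
    else
        tar -cf - "$dir" | gzip -9 > "$prefix.tar.gz"
        echo "$prefix.tar.gz"
    fi
}

# Example: backup_dir photos photos-backup
# To restore a .tar.zst archive: zstd -dc photos-backup.tar.zst | tar -xf -
```

The pipeline streams data from tar straight into the compressor, so no uncompressed archive is written to disk.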
A shell script for comparing compressors
This script requires GNU time(1). It runs different commands against the same input and prints the resulting compressed size, compression ratio, compression time, and peak memory usage.
#! /bin/sh
# compbench, a compressor benchmarking script.
# Copyright (c) 2020-2022 D. Bohdan
#
# Permission to use, copy, modify, and/or distribute this software
# for any purpose with or without fee is hereby granted.
#
# THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL
# WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED
# WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE
# AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR
# CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM
# LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT,
# NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN
# CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
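file="$1"
if [ ! -f "$file" ]; then
    echo 'no file' >&2
    exit 1
fi
# Read from stdin so that wc prints only the byte count, not the filename.
size="$(wc -c < "$file")"
shift

for comp in "$@"; do
    echo "== $comp"
    # $comp is deliberately unquoted so that 'gzip -9' splits into a command
    # and its arguments. GNU time writes its report to stderr;
    # %M is the maximum resident set size in KiB.
    command time --format 'elapsed %E\nmax %M' $comp < "$file" \
        | wc -c \
        | awk -v "size=$size" '
            {
                printf "%.2f MiB\n", $1 / 1024.0 / 1024
                if (size > 0) {
                    printf "%.2f\n", $1 / size
                }
            }'
done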
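To make the arithmetic in the awk step easier to check, here is the same calculation as a standalone sketch. The function name summarize is hypothetical, and the byte counts in the example call are round numbers chosen for illustration, not measurements.

```shell
#! /bin/sh
# summarize COMPRESSED_BYTES ORIGINAL_BYTES (hypothetical helper)
# Prints the compressed size in MiB, then the compression ratio
# (compressed size divided by original size; lower is better).
summarize() {
    awk -v "comp=$1" -v "size=$2" 'BEGIN {
        printf "%.2f MiB\n", comp / 1024.0 / 1024
        if (size > 0) {
            printf "%.2f\n", comp / size
        }
    }'
}

# 1 MiB of output from 4 MiB of input.
summarize 1048576 4194304
# prints:
# 1.00 MiB
# 0.25
```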
Usage
./compbench.sh dir/file.ext cat lz4 'gzip -9'
An MTGJSON test
The file AllPrintings.json was version 4.6.3+20200501 and 194 MiB in size. I used zstd version 1.4.4 and lrzip version 0.631.
Results
Compressor | Compression ratio | Compressed size (MiB) | Elapsed time (wall clock) | Max resident set (MiB) |
---|---|---|---|---|
lz4 | 0.36 | 69.34 | 0:01.09 | 7.08 |
gzip -9 | 0.23 | 45.20 | 0:13.01 | 1.89 |
zstd -7 | 0.16 | 31.60 | 0:10.71 | 40.09 |
bzip2 -9 | 0.15 | 28.39 | 0:37.99 | 8.56 |
zstd -7 --long | 0.14 | 27.25 | 0:10.80 | 168.34 |
lrzip -z -L 3 | 0.12 | 23.19 | 0:40.41 | 342.72 |
xz -9 | 0.10 | 19.39 | 2:38.82 | 675.51 |
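The elapsed times are in the [hours:]minutes:seconds form that GNU time prints for %E, which makes them awkward to compare by eye. The following sketch converts them to seconds and computes how many times faster one run was than another. The helper names to_seconds and speedup are invented for this example; the two times in the example call are the gzip -9 and zstd -7 rows of this table.

```shell
#! /bin/sh
# to_seconds TIME (hypothetical helper)
# Converts [hours:]minutes:seconds to seconds.
to_seconds() {
    echo "$1" | awk -F: '{
        s = 0
        for (i = 1; i <= NF; i++) {
            s = s * 60 + $i
        }
        printf "%.2f\n", s
    }'
}

# speedup SLOWER_TIME FASTER_TIME (hypothetical helper)
# Prints how many times faster the second run was than the first.
speedup() {
    awk -v "a=$(to_seconds "$1")" -v "b=$(to_seconds "$2")" 'BEGIN {
        if (b > 0) {
            printf "%.2f\n", a / b
        }
    }'
}

# gzip -9 took 0:13.01 and zstd -7 took 0:10.71.
speedup 0:13.01 0:10.71
# prints: 1.21
```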
Tags: comparison, compression.