Better data compressors

Lesser-known compressors

You can gain in performance and/or compression ratio if you use a more advanced but less common data compressor. These include:

  • Zstandard. In my tests, zstd -7 compresses as fast as or faster than gzip -9 on a wide range of hardware and achieves a better compression ratio. zstd -7 --long improves the ratio further, though it uses several times more RAM. Zstandard is mature, maintained, and increasingly widely deployed. I would use it for backups and long-term data archival (and I do!).
  • Long Range Zip. On a large collection of JSON files, lrzip -z -L 3 compresses almost as well as xz -9 but 5x faster.
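
As a minimal sketch of the backup use case mentioned above, the following assumes that tar and zstd are installed. The function names archive_dir and list_archive and all paths are hypothetical and chosen only for illustration.

```shell
#!/bin/sh
# Hypothetical helpers; the names and paths are illustrative only.

# Archive directory $1 into file $2 using zstd -7 with long-distance matching.
# zstd refuses to overwrite an existing destination when it reads from a pipe.
archive_dir() {
  src="$1"
  dest="$2"
  tar -cf - -C "$(dirname "$src")" "$(basename "$src")" \
    | zstd -7 --long -q -o "$dest"
}

# Check the integrity of archive $1, then list the files it contains.
list_archive() {
  zstd -t -q "$1" && zstd -dc "$1" | tar -tf -
}
```

For example, archive_dir ~/photos photos.tar.zst followed by list_archive photos.tar.zst would create an archive and then verify and list it.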

A shell script for comparing compressors

This script requires GNU time(1). It runs different commands against the same input and prints the resulting compressed size, compression ratio, compression time, and peak memory usage.

#! /bin/sh
# compbench, a compressor benchmarking script.
# Copyright (c) 2020-2022 D. Bohdan
#
# Permission to use, copy, modify, and/or distribute this software
# for any purpose with or without fee is hereby granted.
#
# THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL
# WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED
# WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE
# AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR
# CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM
# LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT,
# NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN
# CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

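file="$1"
if [ ! -f "$file" ]; then
  echo "usage: $0 file command..." >&2
  exit 1
fi
# Read the file through stdin so that wc prints only the byte count.
size="$(wc -c < "$file")"
shift

for comp in "$@"; do
  echo "== $comp"

  # $comp is left unquoted on purpose so that a string like "gzip -9"
  # splits into a command and its arguments. GNU time writes its report
  # (elapsed time and peak memory, %M, in KiB) to stderr.
  command time --format 'elapsed %E\nmax %M' $comp < "$file" \
  | wc -c \
  | awk -v "size=$size" '
    {
      # Compressed size in MiB, then the ratio of compressed to original size.
      printf "%.2f MiB\n", $1 / 1024.0 / 1024
      if (size > 0) {
        printf "%.2f\n", $1 / size
      }
    }'
done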

Usage

$ ./compbench.sh dir/file.ext cat lz4 'gzip -9'
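
The script prints peak memory as GNU time's %M value, which GNU time reports in KiB, while the results table in the next section lists MiB. A small helper like the one below can do the conversion. The name kib_to_mib is hypothetical and not part of compbench.

```shell
# Hypothetical helper: convert a KiB count (as printed by GNU time's %M)
# to MiB with two decimal places.
kib_to_mib() {
  awk -v "kib=$1" 'BEGIN { printf "%.2f\n", kib / 1024 }'
}

kib_to_mib 2048   # prints 2.00
```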

An MTGJSON test

The file AllPrintings.json was version 4.6.3+20200501 and 194 MiB in size. I used zstd version 1.4.4 and lrzip version 0.631.

Results

Compressor      Compression ratio  Compressed size (MiB)  Elapsed time (wall clock)  Max resident set (MiB)
lz4             0.36               69.34                  0:01.09                    7.08
gzip -9         0.23               45.20                  0:13.01                    1.89
zstd -7         0.16               31.60                  0:10.71                    40.09
bzip2 -9        0.15               28.39                  0:37.99                    8.56
zstd -7 --long  0.14               27.25                  0:10.80                    168.34
lrzip -z -L 3   0.12               23.19                  0:40.41                    342.72
xz -9           0.10               19.39                  2:38.82                    675.51
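
To compare timings from the table, it can help to convert the elapsed values (GNU time's %E format, [h:]m:ss.ss) to seconds. The sketch below does that and computes a relative speed. The function names to_seconds and speedup are hypothetical.

```shell
# Hypothetical helper: convert [h:]m:ss.ss to seconds.
to_seconds() {
  echo "$1" | awk -F: '{
    s = 0
    for (i = 1; i <= NF; i++) s = s * 60 + $i
    printf "%.2f\n", s
  }'
}

# Hypothetical helper: print how many times faster timing $1 is than timing $2.
speedup() {
  awk -v "a=$(to_seconds "$1")" -v "b=$(to_seconds "$2")" \
    'BEGIN { if (a > 0) printf "%.1f\n", b / a }'
}

to_seconds 2:38.82        # prints 158.82
speedup 0:40.41 2:38.82   # lrzip -z -L 3 vs. xz -9 on this file: prints 3.9
```

These figures apply only to this one input file. Timings on other data and hardware can differ.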

Tags: comparison, compression.