Better data compressors

Newer compressors

You can gain in performance, compression ratio, or both if you use a newer, more advanced data compressor. The main downside is a smaller install base, which matters if you want to share compressed files. The easiest place to adopt a new compressor is your own systems and code.

Newer, better compressors include the following.

Zstandard

GitHub.

In my tests, zstd -7 compresses as fast as or faster than gzip -9 on a wide range of hardware and achieves a better compression ratio. zstd -7 --long improves the ratio further, though it uses several times more RAM. Zstandard is mature, maintained, and increasingly widely adopted. I would (and do) use it for backups and long-term archival.
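As a rough illustration of the backup use case, here is a minimal sketch that pipes a tar archive through zstd -7 --long and then checks the result. The function name zbackup and its arguments are hypothetical and are not part of zstd or of my own backup setup; it assumes tar and zstd are installed.

```shell
#! /bin/sh
# zbackup is a hypothetical helper for illustration only.
# usage: zbackup source-dir output.tar.zst
zbackup() {
  src="$1"
  out="$2"
  if [ ! -d "$src" ]; then
    echo 'no source directory' >&2
    return 1
  fi
  # -C switches to the parent directory so the archive stores
  # relative paths.  zstd reads stdin and writes to stdout here.
  # Note: in POSIX sh the pipeline status is that of zstd, so a
  # tar failure is not detected by this sketch.
  tar -cf - -C "$(dirname "$src")" "$(basename "$src")" \
  | zstd -7 --long -q > "$out" \
  && zstd -t -q "$out"  # Verify the compressed stream.
}
```

For example, zbackup ~/documents documents.tar.zst would create the archive, and zstd -dc documents.tar.zst | tar -xf - would restore it into the current directory.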

Long Range Zip

GitHub.

lrzip -z -L 3 compresses almost as well as xz -9 on a large collection of JSON files but runs 5x faster. With the right settings, it can often achieve a higher compression ratio in less time than Zstandard, but it is less mature. I have had lrzip crash on rare occasions. I would use it for data transfer and non-critical backups.

A shell script for comparing compressors

This script runs each of the given commands against the same input file and prints the resulting compressed size, compression ratio (compressed size divided by original size, so lower is better), elapsed time, and peak memory usage.

Download.

#! /bin/sh
# compbench, a compressor benchmarking script.
# usage: compbench file command1 [command2 ...]
# For example:
# $ compbench test.tar cat lz4 'gzip -9' 'zstd --long -19'
# Tested on Ubuntu 22.04, Debian GNU/Linux 11,
# FreeBSD 13.1-RELEASE, NetBSD 9.3, and OpenBSD 7.2.
#
# Copyright (c) 2020-2023 D. Bohdan
#
# Permission to use, copy, modify, and/or distribute this software
# for any purpose with or without fee is hereby granted.
#
# THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL
# WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED
# WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE
# AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR
# CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM
# LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT,
# NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN
# CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

file="$1"
if [ ! -f "$file" ]; then
  echo 'no file'
  exit 1
fi
shift
if [ -z "$1" ]; then
  echo 'no compressor commands'
  exit 1
fi

origsize="$(wc -c "$file" | awk '{ print $1 }')"
tempsize="$(mktemp)"
temptime="$(mktemp)"
cleanup() {
  rm "$tempsize" "$temptime"
}
trap cleanup EXIT

if [ "$(uname)" = Linux ]; then
  # GNU, BusyBox.
  arg1=-f
  arg2q='%e real\n%M maximum resident size'
else
  # DragonFly/Free/Net/OpenBSD.
  arg1=
  arg2q=-l
fi

first=1
for comp in "$@"; do
  if [ "$first" = 1 ]; then
    first=0
  else
    printf '\n'
  fi
  echo "=== $comp"

  command time $arg1 "$arg2q" $comp < "$file" 2> "$temptime" \
  | wc -c \
  | awk -v "origsize=$origsize" '
      {
        printf "%8.2f MiB compressed\n", $1 / 1024 / 1024
        if (origsize > 0) {
          printf "%8.2f ratio\n", $1 / origsize
        }
      }
    ' > "$tempsize"

  awk '
    /real/ {
      m = $1 / 60
      s = $1 % 60
      # Round before taking the remainder to avoid float truncation.
      cs = int($1 * 100 + 0.5) % 100
      printf "%2u:%02u.%02u elapsed\n", m, s, cs
    }
    /maximum resident/ {
      printf "%8.2f MiB max RSS\n", $1 / 1024
    }
  ' "$temptime"

  cat "$tempsize"
done

An MTGJSON test

The file AllPrintings.json was version 4.6.3+20200501 and 194 MiB in size. I used zstd version 1.4.4 and lrzip version 0.631.

Results

Compressor       Compression ratio   Compressed size (MiB)   Elapsed time (wall clock)   Max resident set (MiB)
lz4              0.36                69.34                   0:01.09                       7.08
gzip -9          0.23                45.20                   0:13.01                       1.89
zstd -7          0.16                31.60                   0:10.71                      40.09
bzip2 -9         0.15                28.39                   0:37.99                       8.56
zstd -7 --long   0.14                27.25                   0:10.80                     168.34
lrzip -z -L 3    0.12                23.19                   0:40.41                     342.72
xz -9            0.10                19.39                   2:38.82                     675.51
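
To make the elapsed times easier to compare, the following sketch converts them to seconds and derives input throughput in MiB/s. It assumes the 194 MiB input size stated above; the function name throughput and the comma-separated layout are made up for this example, and the times are copied from the table.

```shell
#! /bin/sh
# Compute throughput from the elapsed times in the results table.
# orig_mib is the input size stated in the text (194 MiB).
orig_mib=194

throughput() {
  awk -F ',' -v "orig=$orig_mib" '
    {
      # Split m:ss.cs into minutes and seconds.
      split($2, t, ":")
      secs = t[1] * 60 + t[2]
      printf "%-14s %7.2f s %6.1f MiB/s\n", $1, secs, orig / secs
    }
  ' <<'EOF'
lz4,0:01.09
gzip -9,0:13.01
zstd -7,0:10.71
bzip2 -9,0:37.99
zstd -7 --long,0:10.80
lrzip -z -L 3,0:40.41
xz -9,2:38.82
EOF
}

throughput
```

By this measure zstd -7 handles roughly 18 MiB/s of input on the test machine, while xz -9 manages about 1.2 MiB/s. These figures apply to this one file and machine only.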

See also