# Better data compressors

## Newer compressors {#list}

You can gain in performance, compression ratio, or both if you use a newer, more advanced data compressor. The main downside is a smaller install base, which matters if you want to share compressed files. The easiest place to adopt a new compressor is your own systems and code.

Newer, better compressors include the following.

### Zstandard {#zstd}

[Repository](https://github.com/facebook/zstd).

According to my tests `zstd -7` compresses as fast as or faster than `gzip -9` on a wide range of hardware with a better compression ratio. `zstd -7 --long` results in an even better ratio, though it uses several times more RAM. Zstandard is mature, maintained, and increasingly widely adopted. I would use it for backups and long-term archival (and do).

### Long Range Zip {#lrzip}

[Repository](https://github.com/ckolivas/lrzip).

`lrzip -z -L 3` is almost as good as `xz -9` on a large collection of JSON files but compresses 5x faster. With the right settings it can often achieve a higher compression ratio in less time than Zstandard but is less mature. I have had lrzip crash on rare occasions. I would use it for data transfer and non-critical backups.

## A shell script for comparing compressors {#compbench}

This script runs different commands against the same input and prints the resulting compressed size, compression ratio, compression time, and peak memory usage.

[Download](/compbench).

```sh
#! /bin/sh
# compbench, a compressor benchmarking script.
# Tested on Ubuntu 24.04, Debian 12,
# FreeBSD 14.0-RELEASE, NetBSD 10.0, and OpenBSD 7.5.
#
# Copyright (c) 2020-2024 D. Bohdan
#
# Permission to use, copy, modify, and/or distribute this software
# for any purpose with or without fee is hereby granted.
#
# THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL
# WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED
# WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE
# AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR
# CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM
# LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT,
# NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN
# CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

usage() {
    printf 'usage: compbench [-m] [-p] [--] file [command1 command2 ...]\n\n'
    printf 'options:\n'
    printf '  -m\n'
    printf '          output Markdown\n'
    printf '  -p\n'
    printf '          show progress using pv(1)\n\n'
    printf 'example:\n'
    printf "  \$ compbench test.tar cat lz4 'gzip -6' 'zstd -7 --long'\\n"
}

for arg in "$@"; do
    if [ "$arg" = -h ] || [ "$arg" = --help ]; then
        usage
        exit 0
    fi
done

heading_format='=== %s\n'
size_format='%8.2f'
time_format='%2u:%02u.%02u'
use_pv=0

while getopts mp opt; do
    case "$opt" in
        # Markdown mode.
        m)
            # This will produce invalid Markdown if a compressor command contains
            # '`'.
            # shellcheck disable=SC2016
            heading_format='### `%s`\n\n'
            size_format='- %.2f'
            time_format='- %u:%02u.%02u'
            ;;
        # Enable pv(1).
        p)
            use_pv=1
            ;;
        ?)
            exit 2
            ;;
    esac
done
shift $((OPTIND - 1))

file="$1"
if [ -z "$file" ]; then
    printf 'no file argument\n' >/dev/stderr
    exit 2
fi
if [ ! -e "$file" ]; then
    printf "file doesn't exist\\n" >/dev/stderr
    exit 1
fi
if [ -d "$file" ]; then
    printf 'file argument is a directory\n' >/dev/stderr
    exit 1
fi
shift

orig_size="$(wc -c "$file" | awk '{ print $1 }')"
temp_size="$(mktemp)"
temp_time="$(mktemp)"

clean_up() {
    rm "$temp_size" "$temp_time"
}
trap clean_up EXIT

if [ "$(uname)" = Linux ]; then
    # GNU, BusyBox.
    arg1=-f
    arg2q='%e real\n%M maximum resident size'
else
    # DragonFly/Free/Net/OpenBSD.
    arg1=
    arg2q=-l
fi

first=1
for comp in "$@"; do
    if [ "$first" = 1 ]; then
        first=0
    else
        printf '\n'
    fi

    # shellcheck disable=SC2059
    printf "$heading_format" "$comp"

    # shellcheck disable=SC2086
    if [ "$use_pv" -eq 1 ]; then
        pv "$file" | command time $arg1 "$arg2q" $comp 2>"$temp_time"
    else
        command time $arg1 "$arg2q" $comp <"$file" 2>"$temp_time"
    fi | wc -c | awk \
        -v "orig_size=$orig_size" \
        -v "size_format=$size_format" \
        '
            {
                printf size_format " MiB compressed\n", $1 / 1024 / 1024
                printf size_format " ratio\n", $1 / orig_size
            }
        ' \
        >"$temp_size" \
        ;

    awk \
        -v "size_format=$size_format" \
        -v "time_format=$time_format" \
        '
            /real/ {
                m = $1 / 60
                s = $1 % 60
                cs = $1 * 100 % 100
                printf time_format " elapsed\n", m, s, cs
            }

            /maximum resident/ {
                printf size_format " MiB max RSS\n", $1 / 1024
            }
        ' \
        "$temp_time" \
        ;
    cat "$temp_size"
done
```

## An [MTGJSON](https://mtgjson.com/) test {#test}

The file `AllPrintings.json` was version 4.6.3+20200501 and 194 MiB in size. I used zstd version 1.4.4 and lrzip version 0.631.

### Results

| Compressor | Compression ratio | Compressed size (MiB) | Elapsed time (wall clock) | Max resident set (MiB) |
|---|---|---|---|---|
| `lz4` | 0.36 | 69.34 | 0:01.09 | 7.08 |
| `gzip -9` | 0.23 | 45.20 | 0:13.01 | 1.89 |
| `zstd -7` | 0.16 | 31.60 | 0:10.71 | 40.09 |
| `bzip2 -9` | 0.15 | 28.39 | 0:37.99 | 8.56 |
| `zstd -7 --long` | 0.14 | 27.25 | 0:10.80 | 168.34 |
| `lrzip -z -L 3` | 0.12 | 23.19 | 0:40.41 | 342.72 |
| `xz -9` | 0.10 | 19.39 | 2:38.82 | 675.51 |
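
The compression ratio in these results is the compressed size divided by the original size, so lower is better. As a minimal sketch of that arithmetic, the snippet below recomputes the ratio for `zstd -7` from the sizes reported above. The function name `ratio` is hypothetical and is not part of compbench.

```sh
#! /bin/sh
# ratio: print compressed size divided by original size (lower is better).
# This helper is illustrative only; compbench does the same division in awk.
ratio() {
    awk -v "comp=$1" -v "orig=$2" 'BEGIN { printf "%.2f\n", comp / orig }'
}

# zstd -7: 31.60 MiB compressed from a 194 MiB original.
ratio 31.60 194
```

Both arguments only need to be in the same unit, so byte counts from `wc -c` work as well as MiB.
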

## See also

- ["Smaller and faster data compression with Zstandard"](https://engineering.fb.com/2016/08/31/core-infra/smaller-and-faster-data-compression-with-zstandard/), Yann Collet, Chip Turner (2016)
- ["Better Compression with Zstandard"](https://gregoryszorc.com/blog/2017/03/07/better-compression-with-zstandard/), Gregory Szorc (2017)
- ["Use Fast Data Algorithms"](https://jolynch.github.io/posts/use_fast_data_algorithms/), Joey Lynch (2021)

## Page metadata

URL:

Published 2020-05-03, updated 2024-12-24.

Tags:

- comparison
- compression
- data
- shell
- tools