Comparison of Data Compression Programs on linux-2.6.25.tar

NOTE: This is not meant to be taken seriously for real-world applications (see the comments section).

I decided to compare different compression programs, so I downloaded the source code of linux-2.6.25 and tar-ed the directory into linux-2.6.25.tar. Then I ran 7-zip, bzip2, gzip, lpaq8, paq8o6, rzip, sbc, and sr3a on the tar file. I used the time command to time these programs (even if they have a timer themselves).
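The timing method described above can also be scripted. This is only a rough sketch, assuming Python 3 is installed; `time_command` is a hypothetical helper name, and it measures wall-clock time only (the "real" figure that time prints), not user or system time.

```python
import subprocess
import sys
import time


def time_command(argv, stdout_path=None):
    """Run argv and return the elapsed wall-clock time in seconds.

    If stdout_path is given, the command's standard output is written
    there, mirroring shell redirection such as `gzip -9c in > out`.
    """
    start = time.perf_counter()
    if stdout_path is None:
        subprocess.run(argv, check=True)
    else:
        with open(stdout_path, "wb") as out:
            subprocess.run(argv, check=True, stdout=out)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere. For a real run,
    # substitute something like ["gzip", "-9c", "linux-2.6.25.tar"]
    # and pass stdout_path="linux-2.6.25.tar.gz".
    elapsed = time_command([sys.executable, "-c", "pass"])
    print(f"Time: {elapsed:.3f}s")
```

A single run like this is noisy, so repeating each command a few times and keeping the best time would give steadier numbers than the ones reported here.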

original file

The original filesize of the tar was: 284651520 bytes (271.5 MiB)
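Each result below is given as a byte count, a size in MiB, and a percentage of the original tar. A small sketch of that arithmetic, in case you want to check the figures; `describe_size` is a made-up name for illustration, and 1 MiB is taken as 2^20 bytes.

```python
ORIGINAL_SIZE = 284651520  # bytes, linux-2.6.25.tar


def describe_size(compressed_bytes, original_bytes=ORIGINAL_SIZE):
    """Format a compressed size as bytes, MiB, and percent of original."""
    mib = compressed_bytes / 2**20
    percent = 100 * compressed_bytes / original_bytes
    return f"{compressed_bytes} bytes ({mib:.1f} MiB) ({percent:.2f}%)"


# The gzip -9 result from this test.
print("Filesize:", describe_size(61524085))
# prints: Filesize: 61524085 bytes (58.7 MiB) (21.61%)
```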

gzip

gzip -9c linux-2.6.25.tar > linux-2.6.25.tar.gz

Filesize: 61524085 bytes (58.7 MiB) (21.61%)
Time: 31.334s

gzip sucks. Please don’t use it.

sr3a

sr3a c linux-2.6.25.tar linux-2.6.25.tar.sr3

SR3 file compressor (C) 2007, Matt Mahoney
Licensed under GPL, http://www.gnu.org/copyleft/gpl.html
Modified by Nania Francesco Antonio (Italy)
linux-2.6.25.tar: 284651520 -> 51872928 in 30.53 sec.

Filesize: 51872928 bytes (49.5 MiB) (18.22%)
Time: 30.094s

Wow, faster than gzip by 1 second and compresses better as well.

bzip2

bzip2 -9k linux-2.6.25.tar

Filesize: 48564607 bytes (46.3 MiB) (17.06%)
Time: 55.911s

Average compressor.

rzip

rzip -9kvvv linux-2.6.25.tar

hashsize = 8388608.  bits = 23. 64MB
Starting sweep for mask 1
Starting sweep for mask 3
Starting sweep for mask 7
Starting sweep for mask 15
Starting sweep for mask 31
5592404 total hashes
225878 in primary bucket (4.039%)
matches=999642 match_bytes=104996777
literals=902262 literal_bytes=179654743
true_tag_positives=33727524 false_tag_positives=48478942
inserts=19225657 match 0.584
linux-2.6.25.tar - compression ratio 6.071

Filesize: 46890792 bytes (44.7 MiB) (16.47%)
Time: 58.752s

Don’t use bzip2. Use rzip. It compresses better than bzip2 (and is slightly slower).

sbc

sbc c -m3 -b63 linux-2.6.25.tar.sbc linux-2.6.25.tar

-------------------------------------------------------------------------------
<>
-------------------------------------------------------------------------------
Creating archive: "linux-2.6.25.tar.sbc"...
Searching files...
Archive encryption: none
Sorting files...Done (time: 0.0 seconds)...
Compressing, method: advanced, blocks: 32.0 MB/analysis+name, mem.: 226.1 MB.
Compressing...

    linux-2.6.25.tar [blk 0000, 0.00 bpB, 0.0%]
    linux-2.6.25.tar [blk 0001, 1.74 bpB, 1.6%]
    linux-2.6.25.tar [blk 0002, 1.76 bpB, 1.7%]
    linux-2.6.25.tar [blk 0003, 1.78 bpB, 1.8%]
    linux-2.6.25.tar [blk 0004, 1.78 bpB, 3.6%]
    linux-2.6.25.tar [blk 0005, 1.18 bpB, 15.4%]
    linux-2.6.25.tar [blk 0006, 1.19 bpB, 27.2%]
    linux-2.6.25.tar [blk 0007, 1.18 bpB, 39.0%]
    linux-2.6.25.tar [blk 0008, 1.20 bpB, 50.7%]
    linux-2.6.25.tar [blk 0009, 1.20 bpB, 62.5%]
    linux-2.6.25.tar [blk 0010, 1.20 bpB, 63.1%]
    linux-2.6.25.tar [blk 0011, 1.19 bpB, 74.9%]
    linux-2.6.25.tar [blk 0012, 1.16 bpB, 86.7%]
    linux-2.6.25.tar [blk 0013, 1.17 bpB, 94.5%]
    linux-2.6.25.tar [blk 0014, 1.16 bpB, 99.4%]
    linux-2.6.25.tar [blk 0015, 1.16 bpB, 100.0%] 

  * Successfully compressed 284,651,520 into 41,322,279 (14.5%) bytes.
  * Compressor: 1.161 bpB, 1571.84 kB/s, 176.85 seconds.

Filesize: 41322279 bytes (39.4 MiB) (14.52%)
Time: 2m56.239s

Pretty good result, similar to 7-zip’s.

7za (7-zip)

7za a -mx=9 linux-2.6.25.tar.7z linux-2.6.25.tar

7-Zip (A) 4.57  Copyright (c) 1999-2007 Igor Pavlov  2007-12-06
p7zip Version 4.57 (locale=en_AU.UTF-8,Utf16=on,HugeFiles=on,4 CPUs)
Scanning

Creating archive linux-2.6.25.tar.7z

Compressing  linux-2.6.25.tar

Everything is Ok

Filesize: 39999865 bytes (38.1 MiB) (14.05%)
Time: 2m55.804s

Best general purpose compressor with a very good compression ratio.

lpaq8

lpaq8 7 linux-2.6.25.tar linux-2.6.25.tar.lpaq8

284651520 -> 29938803 in 947.320 sec. using 390 MB memory

Filesize: 29938803 bytes (28.6 MiB) (10.52%)
Time: 15m45.911s

Compresses a good 10 MB better than 7-Zip, but takes around 5 times longer.

paq8o6_sse (paq8o6)

paq8o6_sse -7 linux-2.6.25.tar

Creating archive linux-2.6.25.tar.paq8o6 with 1 file(s)...
linux-2.6.25.tar 284651520 -> PGM 225x289 25253916
284651520 -> 25253916
Time 500.56 sec, used 873327891 bytes of memory

Filesize: 25253916 bytes (24.1 MiB) (8.87%)
Time: 365m57.112s

Holy s***. 8.87%. Compresses 4 MB better than lpaq8, but takes 23 times longer (6 whole hours).

Conclusion

Use 7-zip. If you want faster compression, use rzip. And if you’ve got a supercomputer, use lpaq8. If you’ve got 23 supercomputers and can hack paq to use SMP, use paq. Please don’t use gzip or bzip2.

Filesize vs. Time graph
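For anyone who wants to redraw or re-check the comparison, the measurements above can be put into a small script. This is only a sketch: the sizes and times are copied from the results in this post, while `parse_time` and `slowdown` are hypothetical helper names, not part of any of the tested programs.

```python
import re

ORIGINAL_SIZE = 284651520  # bytes, linux-2.6.25.tar

# (program, compressed size in bytes, time as reported by `time`)
RESULTS = [
    ("gzip", 61524085, "31.334s"),
    ("sr3a", 51872928, "30.094s"),
    ("bzip2", 48564607, "55.911s"),
    ("rzip", 46890792, "58.752s"),
    ("sbc", 41322279, "2m56.239s"),
    ("7za", 39999865, "2m55.804s"),
    ("lpaq8", 29938803, "15m45.911s"),
    ("paq8o6", 25253916, "365m57.112s"),
]


def parse_time(text):
    """Convert a time such as '2m56.239s' or '31.334s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unrecognised time: {text!r}")
    minutes = int(match.group(1) or 0)
    return 60 * minutes + float(match.group(2))


def slowdown(program, baseline):
    """Return how many times longer `program` took than `baseline`."""
    seconds = {name: parse_time(t) for name, _, t in RESULTS}
    return seconds[program] / seconds[baseline]


# One row per program: name, percent of original size, seconds taken.
for name, size, elapsed in RESULTS:
    percent = 100 * size / ORIGINAL_SIZE
    print(f"{name:8s} {percent:6.2f}% {parse_time(elapsed):10.3f}s")

print(f"lpaq8 vs 7za:     {slowdown('lpaq8', '7za'):.1f}x slower")
print(f"paq8o6 vs lpaq8:  {slowdown('paq8o6', 'lpaq8'):.1f}x slower")
```

The last two lines reproduce the "around 5 times" and "23 times" figures quoted in the lpaq8 and paq8o6 sections.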

3 Responses to “Comparison of Data Compression Programs on linux-2.6.25.tar”

  1. Tel Says:

    It still makes a lot of sense to use gzip and bzip2 because they are widely accepted standards and well established programs, debugged, tested and supported by tar, rpm, dpkg, etc. Telling people “please don’t use” is just silly.

    To quote from rzip help “note that rzip cannot operate on stdin/stdout” which means that it cannot do the jobs that bzip2 does. Again, telling people not to use bzip2 is just plain dumb when your suggested replacement is lacking fundamental features.

    Also, you tested one single file as if this was the last word in compression testing. Other files will give different results. bzip2 will always beat rzip in both speed and compression ratio when compressing PGM image data if the number of colours is low (e.g. scanned documents). Whereas rzip gives better compression for source code because files often have common headers.

    rzip sometimes compresses worse at “-9” than it does at “-6” as well as being slower, thus to compress to best effect you need to wrap it in a script to test which option gives the best result (as above, scanned PGM documents with reduced colours will demonstrate this).

    You also completely ignore the time to decompress. gzip has a very small and simple decompressor which makes it great for bootstraps and embedded systems. rzip and bzip2 give approximately equal compression performance on executables, but while bzip2 is slower to compress, rzip is slower to decompress. An executable might get compressed once, but decompressed many times over so the speed of decompression is a bit more important. 7-zip beats nearly everything in the combination of good compression ratio and fast decompression but 7-zip is difficult to script with because it works a bit like tar (but missing the full features of tar) and a bit like a compressor.

  2. wj32 Says:

    What I meant was for personal use (backups, etc). And yes, everything you say is true. However, I did not mean for this to be “the last word in compression testing”. This was only meant to be a small test of a few compressors, and the real point in doing this was to compare PAQ to the other compressors (PAQ compresses way too slowly for any real work). I was trying to emphasize how slow PAQ is. I threw in the other compressors at the end.

    I also do not have the time to do rigorous testing. Again, this is not a “serious” test.

  3. Phillip J. San Says:

    Interesting… I didn’t know that some of these options even existed. I will take into account some of the interesting parts of what was said. Even a really dog-slow method might just save my bacon at some point.

    Kind of makes me think, even with external storage becoming so cheap.

    Maybe you could include a comparison with rar?
