When people think about compression, they usually think of zip, rar, bzip2, or gzip, and their only concern is the compression ratio.

Recently, I had to pick a suitable compressed file format for a Python project I was working on. Before that, I had used the zipfile and tarfile modules for the same task, but it turned out they were not good enough for this one. Even though the ratio is not bad, both compression and decompression are too slow, and resource consumption (CPU time + memory) is another concern.
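For context, this is roughly what that standard-library approach looks like; a minimal sketch with placeholder paths, not the project's actual code:

```python
import tarfile
import zipfile

# Pack a directory into a gzip-compressed tar archive
# ("data/" and the output names are placeholders).
with tarfile.open("data.tar.gz", "w:gz") as tar:
    tar.add("data/", arcname="data")

# The same idea with zipfile and DEFLATE compression.
with zipfile.ZipFile("data.zip", "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.write("data/file.bin", arcname="file.bin")
```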

Looking at the current state of this field, the last 10 years have produced more effective compression algorithms than you can count (or so they claim), perhaps because of the rapid growth of cloud services and storage. Here are a few names:

Each has been reinvented to fit a particular need. Some strike a good balance between ratio and compression speed, or are extremely fast at decompression. Some are geared toward specific types of data: binary sets, JSON text, fragmented data, …

I'm not an expert in this field; only the numbers can give you the best insight and advice. Check these links:

Those benchmarks compare the C implementations, which keeps the comparison fair; only a few of these algorithms have been ported to Python, and unofficially at that (by third parties). Back to this project: I'm not trying to do anything new, just a quick benchmark of these public modules to help make a decision.

👊👊👊 Source code: PyCompressTest

I checked all the working Python implementations of these algorithms (detailed list) and compared them on these terms (a simplified measurement sketch follows the list):

  • Compression ratio
  • Compression speed
  • Decompression speed
  • Memory usage during compressing/decompressing
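To give an idea of what those measurements involve, here is a simplified sketch using only standard-library codecs as stand-ins; the real harness lives in the PyCompressTest repo and covers the third-party modules as well:

```python
import bz2
import lzma
import time
import zlib

# Standard-library codecs used here purely as examples.
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2":  (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}

def benchmark(name, data):
    compress, decompress = CODECS[name]

    start = time.perf_counter()
    packed = compress(data)
    compress_time = time.perf_counter() - start

    start = time.perf_counter()
    unpacked = decompress(packed)
    decompress_time = time.perf_counter() - start

    assert unpacked == data  # sanity check: the round trip must be lossless
    return {
        "ratio": len(data) / len(packed),
        # Throughput in MB of original data per second.
        "compress_MBps": len(data) / compress_time / 1e6,
        "decompress_MBps": len(data) / decompress_time / 1e6,
    }

# Assumes silesia.tar is present in the working directory.
with open("silesia.tar", "rb") as f:
    payload = f.read()

for codec in CODECS:
    print(codec, benchmark(codec, payload))
```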

The following results were obtained using 1 core of an Intel(R) Core(TM) i5-8600K CPU @ 3.60GHz, 7.77 GiB RAM, Ubuntu 18.04 64-bit, with silesia.tar, which contains the tarred files of the Silesia compression corpus.
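For the memory column, one simple approach on Linux is to run each codec in a fresh interpreter and read its peak RSS afterwards; this is only a sketch of the idea, and the exact method used in PyCompressTest may differ:

```python
import subprocess
import sys

def peak_rss_mib(snippet):
    """Run a code snippet in a fresh Python process and return its peak RSS in MiB.

    Isolating each codec in its own process keeps the measurements independent.
    On Linux, ru_maxrss is reported in KiB.
    """
    wrapper = (
        "import resource\n"
        f"{snippet}\n"
        "print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024)"
    )
    result = subprocess.run(
        [sys.executable, "-c", wrapper],
        capture_output=True, text=True, check=True,
    )
    return float(result.stdout.strip().splitlines()[-1])

# Example: peak memory while bz2-compressing the test file.
print(peak_rss_mib("import bz2; bz2.compress(open('silesia.tar', 'rb').read())"))
```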

👉 Full charts with memory usage - check out this link