Python gzip benchmark
I was faced with the simple task of gzipping a file on a server after my Python program had finished processing it. “Shell tools are always more efficient”, I thought, so I immediately used subprocess to call the gzip command (my application always runs on Linux, so compatibility is not an issue).
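For reference, the subprocess approach looks roughly like the sketch below. The helper name `gzip_with_command` is made up for illustration and is not my original code; it assumes a `gzip` binary on the PATH that supports `-k`.

```python
import subprocess


def gzip_with_command(path, level=6):
    """Compress `path` into `path + '.gz'` by calling the gzip command.

    Hypothetical helper for illustration only.
    """
    # -f overwrites an existing .gz file, -k keeps the original file.
    # check=True raises CalledProcessError if gzip exits with an error.
    subprocess.run(["gzip", f"-{level}", "-f", "-k", path], check=True)
    return path + ".gz"
```

Passing the arguments as a list avoids going through a shell, so file names with spaces or special characters need no quoting.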
But are they? I decided to look up benchmarks, and only found a (now-archived) benchmark of decompressing a gzipped file. I found no comparison between the gzip command and Python’s gzip module. So I made my own.
First let’s create a file with a little over 1GB of sequential numbers (random data doesn’t compress very well, so it would be a bad test):
import time
start = time.time()
with open('megafile', 'w') as f:
    for i in range(128 * 1024 * 1024):
        f.write(f'{i % 100000000:08d}\n')
print(f'Generated file in {time.time() - start:.1f}s.')
# Generated file in 69.8s.
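The remark about random data can be checked on a small scale. The snippet below is only a sketch: the helper `compression_ratio` and the sample size of 200,000 lines are arbitrary choices of mine, and it compresses in memory instead of touching the big file.

```python
import gzip
import os


def compression_ratio(data, level=6):
    """Return compressed size divided by original size (lower is better)."""
    return len(gzip.compress(data, compresslevel=level)) / len(data)


# Same line format as the big file, just far fewer lines (arbitrary sample size).
n_lines = 200_000
sequential = "".join(f"{i:08d}\n" for i in range(n_lines)).encode()

# Random bytes of the same length, for comparison.
random_bytes = os.urandom(len(sequential))

print(f"sequential: {compression_ratio(sequential):.2f}")
print(f"random:     {compression_ratio(random_bytes):.2f}")
```

The random sample should come out at a ratio of about 1, meaning gzip cannot shrink it at all, while the sequential numbers shrink to a fraction of their size.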
Now let’s compress it with the shell command (using -k to keep the original file):
$ time gzip -1 -f -k megafile && du -h megafile.gz
real 0m11.061s
user 0m10.438s
sys 0m0.625s
283M megafile.gz
$ time gzip -9 -f -k megafile && du -h megafile.gz
real 0m56.707s
user 0m56.125s
sys 0m0.453s
273M megafile.gz
Now the same thing with Python’s gzip module:
$ time python3 -c 'import gzip, shutil
with open("megafile", "rb") as fin, \
     gzip.open("megafile.gz", "wb", compresslevel=1) as fout:
    shutil.copyfileobj(fin, fout)' && du -h megafile.gz
real 0m8.976s
user 0m8.344s
sys 0m0.625s
283M megafile.gz
$ time python3 -c 'import gzip, shutil
with open("megafile", "rb") as fin, \
     gzip.open("megafile.gz", "wb", compresslevel=9) as fout:
    shutil.copyfileobj(fin, fout)' && du -h megafile.gz
real 0m56.796s
user 0m56.063s
sys 0m0.656s
271M megafile.gz
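My conclusion: there is basically no performance difference between the shell’s gzip command and Python’s gzip module! At level 9 the two finished within a tenth of a second of each other, and at level 1 Python was actually about two seconds faster (9.0s vs. 11.1s).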
It also surprised me quite a lot that the compression level made very little difference in the output file size (level 9 was only about 4% smaller than level 1) but a massive difference in run time (five to six times longer), though that may just be this specific file’s profile.
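To see whether the time and size trade-off looks the same for another file, something like the sketch below could be used. The names `compress_file` and `benchmark_levels` are hypothetical helpers I made up for this, and the default levels (1, 6, 9) are an arbitrary choice; the compression itself is the same `gzip.open` plus `shutil.copyfileobj` combination used in the one-liners above.

```python
import gzip
import os
import shutil
import time


def compress_file(src, dst, level):
    """Stream `src` into a gzip file at `dst` with the given compression level."""
    with open(src, "rb") as fin, \
         gzip.open(dst, "wb", compresslevel=level) as fout:
        shutil.copyfileobj(fin, fout)


def benchmark_levels(src, levels=(1, 6, 9)):
    """Return a list of (level, seconds, compressed_bytes) tuples for `src`."""
    results = []
    for level in levels:
        dst = f"{src}.level{level}.gz"
        start = time.perf_counter()
        compress_file(src, dst, level)
        elapsed = time.perf_counter() - start
        results.append((level, elapsed, os.path.getsize(dst)))
        os.remove(dst)  # clean up the temporary output file
    return results
```

Calling `benchmark_levels('megafile')` returns one tuple per level, which makes it easy to compare how much extra time each level costs against how many bytes it saves.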
If you decide to run this test with a different file, please post your results in the comments. Thanks!