Dynamic Compression Levels

When I worked at Rackspace we had some issues with image backups of servers. They seemed to manifest the most on the OpenStack VPS deployments.

A snapshot was taken of the VM's VHD (Virtual Hard Disk, kind of a strange disk image format), gzipped, and uploaded to Cloud Files. This operation, multiplied across a lot of servers, was both a ton of internal bandwidth and a strain on the Cloud Files infrastructure. It worked pretty well for the most part, but there was some room for improvement.

If I recall correctly, one difference between the "legacy" (Slicehost) codebase and OpenStack was that the Slicehost codebase had gzip compression set at the default, -6, while OpenStack had it at -9.

One of my first OpenStack patches (there weren't very many) was making the gzip level configurable. I barely knew Python at all and it took me quite a while to get it in and approved. I think my patch broke a few things that other people had to fix. Whoops!

I think we reverted it back to -6. What was happening was that at -9, the CPU was pegged on gzip and backups were way slower, pretty painfully so. Backup speed was nowhere near optimal because the whole thing was CPU-bound. Further, I believe gzip is single-threaded, or at least it was back then. I could be wrong on that, but it would certainly explain some of the problem. In the end, going back to -6 gave us much faster backups that could finish in a timely manner.

It still wasn't quite optimal, though, and I've yet to see someone make the optimal solution.

A few words on compression

I have never written a compression algorithm. I certainly don't understand most of what goes into it. I just know a few traits and have certainly compressed and decompressed my fair share of files.

Compression algorithms work best in the middle range, at least when you consider the time/efficiency tradeoff. Most of the time, the defaults are best. If you have an archive that millions of people will download, going full -9er on it isn't a bad idea, but don't expect huge gains over -6. And if you want it to be fast, odds are -1 isn't tons faster and you're better off with a faster algorithm.
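
If you're curious how much those levels actually matter for your own data, it's easy to check. Here's a rough Python snippet that times gzip at -1, -6, and -9 and prints the resulting ratio; sample.vhd is just a placeholder path, so point it at whatever you actually care about:

    import gzip, time

    # Placeholder path; substitute your own test file.
    data = open("sample.vhd", "rb").read()

    for level in (1, 6, 9):
        start = time.monotonic()
        out = gzip.compress(data, compresslevel=level)
        took = time.monotonic() - start
        print(f"-{level}: {len(out) / len(data):.3f} of original, {took:.2f}s")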

Common algorithms

As a rule of thumb, it's usually good practice to think in terms of high, medium, and low settings and start in the middle. There's a quick comparison sketch after the list below if you want to eyeball the tradeoffs yourself.

  • LZO2: This one is super fast to compress and decompress. I generally consider this a "line speed" algorithm, able to keep up with as much network throughput as you can throw at it.
  • GZIP: This is a very good middle ground. Decompression is fast and compression isn't painful.
  • XZ/LZMA: When storage and/or bandwidth is expensive, this is your best bet. It's slow on compression, but decompression really isn't that bad. It's also multi-threaded, at least in the xz-utils variant.

There's one more that deserves a mention.

  • BZIP2: It's slower than XZ to compress, doesn't compress as well, and is absurdly slow on decompression. Please don't use this. Either move to gzip or XZ. There's almost no reason to use BZIP2 anymore. It was understandable ten years ago. Not anymore.
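
Here's that comparison sketch. Python's standard library happens to ship gzip, bz2, and lzma modules (no LZO, sadly), so a rough eyeball comparison only takes a few lines. Same caveat as before: sample.vhd is a placeholder path, and each module runs at its own default level here:

    import bz2, gzip, lzma, time

    # Placeholder path again; substitute your own test file.
    data = open("sample.vhd", "rb").read()

    for name, mod in (("gzip", gzip), ("bzip2", bz2), ("xz", lzma)):
        start = time.monotonic()
        packed = mod.compress(data)
        c_time = time.monotonic() - start

        start = time.monotonic()
        mod.decompress(packed)
        d_time = time.monotonic() - start

        print(f"{name}: ratio {len(packed) / len(data):.3f}, "
              f"compress {c_time:.2f}s, decompress {d_time:.2f}s")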

Dynamic compression

Now this is theoretical, but what if you aren't sure what your line speed is going to be and you want to get something compressed and shipped as fast as possible? Use too "quick" a compression and the network is the bottleneck; use too "slow" a compression and the CPU is the bottleneck.

pv /dev/sda | xz -9c | ssh root@somehost 'cat > /root/sda.xz'

How do you know how much to compress to balance it out? The metric is actually extremely simple: input read speed. Just like in the example above, you want to measure how fast you are reading the source file, because the rest of the pipeline will slow that down or speed it up accordingly. From there, vary your compression level dynamically: test a low level, test a high level, and refine from there, maybe in five-second intervals. Re-evaluate every so often after the base rate is set, in case the connection speed or the system load changes. How fast you can read the file into the pipeline determines how fast you'll be done sending it up.
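
To make that a little more concrete, here's a rough sketch in Python of what that probing loop could look like, assuming you settle on plain gzip as the single algorithm. Everything here is an assumption on my part: the chunk size, the five-second window, the simple hill-climb rule, and the hypothetical dyngzip.py name. It writes out concatenated gzip members, which gunzip and zcat will happily decompress as one stream.

    import gzip
    import sys
    import time

    CHUNK = 4 * 1024 * 1024   # read the source in 4 MiB chunks
    INTERVAL = 5.0            # re-evaluate roughly every five seconds

    def main():
        src, dst = sys.stdin.buffer, sys.stdout.buffer
        level, step = 6, 1    # start in the middle; step is the hill-climb direction
        prev_rate = 0.0
        window_start, window_bytes = time.monotonic(), 0
        while True:
            data = src.read(CHUNK)
            if not data:
                break
            # Each chunk becomes its own gzip member at the current level;
            # concatenated members still decompress as one stream.
            dst.write(gzip.compress(data, compresslevel=level))
            window_bytes += len(data)
            elapsed = time.monotonic() - window_start
            if elapsed >= INTERVAL:
                rate = window_bytes / elapsed           # input read speed, the metric
                if rate < prev_rate:
                    step = -step                        # the last change hurt throughput, so reverse
                level = min(9, max(1, level + step))    # nudge the level and try again
                prev_rate = rate
                window_start, window_bytes = time.monotonic(), 0
        dst.flush()

    if __name__ == "__main__":
        main()

Dropped into the pipe above in place of xz, it would look something like this:

pv /dev/sda | python3 dyngzip.py | ssh root@somehost 'cat > /root/sda.gz'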

Now, while the dynamic compression probing bit is simple, actually making your compressor dynamic and sensible is another matter. The devil is in the details there. And in some cases, it might be nice to be able to use anything from gzip, to LZO2, to XZ in the pipe. Mixing algorithms would undoubtedly lower the overall compression ratio, but it could be faster than ranging from -1 to -9 within any one of them.

I would love to see someone implement this, most likely with just a single algorithm, because I think it's an optimal compromise for a difficult problem while staying somewhat simple and sensible. Obviously, this is not what you'd use for one-and-done compression where taking your time isn't a big deal or you want a lot of consistency. It might be applicable to video transcoding as well, in some cases. But messing with your compression level too much and too quickly is likely going to have some weird results and end up suboptimal.

Anyway, until next time. Thanks for reading!