Compressing Data in Hadoop

By Dirk deRoos

The huge data volumes that are realities in a typical Hadoop deployment make compression a necessity. Data compression definitely saves you a great deal of storage space and is sure to speed up the movement of that data throughout your cluster. Not surprisingly, a number of available compression schemes, called codecs, are out there for you to consider.

In a Hadoop deployment, you’re dealing (potentially) with quite a large number of individual slave nodes, each of which has a number of large disk drives. It’s not uncommon for an individual slave node to have upwards of 45TB of raw storage space available for HDFS.

Even though Hadoop slave nodes are designed to be inexpensive, they’re not free, and with large volumes of data that have a tendency to grow at increasing rates, compression is an obvious tool to control extreme data volumes.

First, some basic terms: A codec, which is a shortened form of compressor/decompressor, is technology (software or hardware, or both) for compressing and decompressing data; it’s the implementation of a compression/decompression algorithm.

You need to know that some codecs support something called splittable compression and that codecs differ in both the speed with which they can compress and decompress data and the degree to which they can compress it.

Splittable compression is an important concept in a Hadoop context. The way Hadoop works is that files are split if they’re larger than the file’s block size setting, and individual file splits can be processed in parallel by different mappers.

With most codecs, text file splits cannot be decompressed independently of other splits from the same file, so those codecs are said to be non-splittable, so MapReduce processing is limited to a single mapper.

Because the file can be decompressed only as a whole, and not as individual parts based on splits, there can be no parallel processing of such a file, and performance might take a huge hit as a job waits for a single mapper to process multiple data blocks that can’t be decompressed independently.

Splittable compression is only a factor for text files. For binary files, Hadoop compression codecs compress data within a binary-encoded container, depending on the file type (for example, a SequenceFile, Avro, or ProtocolBuffer).

Speaking of performance, there’s a cost (in terms of processing resources and time) associated with compressing the data that is being written to your Hadoop cluster.

With computers, as with life, nothing is free. When compressing data, you’re exchanging processing cycles for disk space. And when that data is being read, there’s a cost associated with decompressing the data as well. Be sure to weigh the advantages of storage savings against the additional performance overhead.

If the input file to a MapReduce job contains compressed data, the time that is needed to read that data from HDFS is reduced and job performance is enhanced. The input data is decompressed automatically when it is being read by MapReduce.

The input filename extension determines which supported codec is used to automatically decompress the data. For example, a .gz extension identifies the file as a gzip-compressed file.

It can also be useful to compress the intermediate output of the map phase in the MapReduce processing flow. Because map function output is written to disk and shipped across the network to the reduce tasks, compressing the output can result in significant performance improvements.

And if you want to store the MapReduce output as history files for future use, compressing this data can significantly reduce the amount of needed space in HDFS.

There are many different compression algorithms and tools, and their characteristics and strengths vary. The most common trade-off is between compression ratios (the degree to which a file is compressed) and compress/decompress speeds. The Hadoop framework supports several codecs. The framework transparently compresses and decompresses most input and output file formats.

The following list identifies some common codecs that are supported by the Hadoop framework. Be sure to choose the codec that most closely matches the demands of your particular use case (for example, with workloads where the speed of processing is important, chose a codec with high decompression speeds):

  • Gzip: A compression utility that was adopted by the GNU project, Gzip (short for GNU zip) generates compressed files that have a .gz extension. You can use the gunzip command to decompress files that were created by a number of compression utilities, including Gzip.

  • Bzip2: From a usability standpoint, Bzip2 and Gzip are similar. Bzip2 generates a better compression ratio than does Gzip, but it’s much slower. In fact, Of all the available compression codecs in Hadoop, Bzip2 is by far the slowest.

    If you’re setting up an archive that you’ll rarely need to query and space is at a high premium, then maybe would Bzip2 be worth considering.

  • Snappy: The Snappy codec from Google provides modest compression ratios, but fast compression and decompression speeds. (In fact, it has the fastest decompression speeds, which makes it highly desirable for data sets that are likely to be queried often.)

    The Snappy codec is integrated into Hadoop Common, a set of common utilities that supports other Hadoop subprojects. You can use Snappy as an add-on for more recent versions of Hadoop that do not yet provide Snappy codec support.

  • LZO: Similar to Snappy, LZO (short for Lempel-Ziv-Oberhumer, the trio of computer scientists who came up with the algorithm) provides modest compression ratios, but fast compression and decompression speeds. LZO is licensed under the GNU Public License (GPL).

    LZO supports splittable compression, which enables the parallel processing of compressed text file splits by your MapReduce jobs. LZO needs to create an index when it compresses a file, because with variable-length compression blocks, an index is required to tell the mapper where it can safely split the compressed file. LZO is only really desirable if you need to compress text files.

Hadoop Codecs
Codec File Extension Splittable? Degree of Compression Compression Speed
Gzip .gz No Medium Medium
Bzip2 .bz2 Yes High Slow
Snappy .snappy No Medium Fast
LZO .lzo No, unless indexed Medium Fast

All compression algorithms must make trade-offs between the degree of compression and the speed of compression that they can achieve. The codecs that are listed provide you with some control over what the balance between the compression ratio and speed should be at compression time.

For example, Gzip lets you regulate the speed of compression by specifying a negative integer (or keyword), where –1 indicates the fastest compression level, and –9 indicates the slowest compression level. The default compression level is –6.