Compression FAQ

by Community Manager on 08-20-2015 12:59 PM

What does it mean to "enable compression" in HDFS?

 

There are several places throughout Hadoop and the various tools you're using where compression can be enabled:

  • Hadoop MapReduce tasks
  • Sqoop output
  • Hive tasks and access to tables

In your platform-wide configuration files you can enable compression for Hive, MapReduce programs, and others, and those programs or queries will emit their new output datasets in compressed form. A MapReduce program can also turn compression on or off explicitly in its own job configuration. Sqoop, for example, makes it easy to enable compression with the --compress flag, which uses GZip compression. Datasets you import into HDFS via Sqoop with that flag will be compressed.

 

What does it mean for files to be compressed in HDFS?

 

For a file to be compressed in HDFS, the process that writes it to HDFS has to compress the data. HDFS itself does not compress or uncompress files.
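
For example (the file and directory names here are placeholders), one simple way to end up with compressed data in HDFS is to compress the file on the client side and then copy the already-compressed bytes in:

gzip access_log.txt
hadoop fs -put access_log.txt.gz /data/logs/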

 

Can I combine compressed and not compressed files in HDFS? In Hive?

 

You can begin compressing new data without needing to recompress existing data. Enabling compression in any of the tasks that write to HDFS affects only the new data being written; existing files are not affected. Once a file has been written to HDFS, it never changes. It does not matter to HDFS whether the data is compressed or not.

Intermingling uncompressed text and GZip-compressed text files in the same Hive table works transparently.
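
For instance (the warehouse path and file names here are hypothetical), you can place a plain text file and a GZip-compressed file into the same table's directory and query the table as usual:

hadoop fs -put weblog_2015-08-19.txt /user/hive/warehouse/weblogs/
hadoop fs -put weblog_2015-08-20.txt.gz /user/hive/warehouse/weblogs/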

 

How does Hive handle compressed data?

 

Hive has parameters to enable compression of intermediate (hive.exec.compress.intermediate) and output data (hive.exec.compress.output).
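
For example, a minimal sketch of enabling both for a single run of the Hive CLI (my_query.hql is a placeholder for your own script; you could also set these once in your Hive configuration):

hive --hiveconf hive.exec.compress.intermediate=true \
     --hiveconf hive.exec.compress.output=true \
     -f my_query.hql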

On the reading side, Hive automatically recognizes that input files ending in .gz are compressed and reads them the same way it reads uncompressed data. It does this regardless of whether you write your new outputs in compressed form. You can intermix GZip-compressed files and plain text files in the same table and Hive will read them all.

 

How do I look at compressed data?

 

To view the contents of compressed files, use:

hadoop fs -text

The alternative, -cat, prints the uninterpreted bytes of a file literally to your terminal, so it won't handle compressed data. Using -text will decode a few file formats: in particular, GZip-compressed text is decompressed, and SequenceFiles are decoded into textual key/value pairs. You can always pipe the data through whatever decompression tool you need:

hadoop fs -cat foo.some.strange.format | strange-format-decompressor
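
For example (the path here is hypothetical), to peek at the first few lines of a GZip-compressed file:

hadoop fs -text /data/logs/part-00000.gz | head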

How do I compress MapReduce data?

 

There are multiple parameters that control compression:

  • mapred.compress.map.output controls compression of the intermediate map output
  • mapred.output.compress controls compression of the final job output
  • mapred.output.compression.type controls the compression type (record or block) for SequenceFile output

These parameters can either be specified in the configuration files or on the command line. There is also programmatic control from Java.
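
As a minimal sketch, assuming your driver class uses ToolRunner so that -D options on the command line are honored (the jar name, class name, and paths are placeholders):

hadoop jar my-job.jar MyDriver \
  -D mapred.compress.map.output=true \
  -D mapred.output.compress=true \
  -D mapred.output.compression.type=BLOCK \
  /input /output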

 

How do I enable Sqoop to compress data?

 

Compression in Sqoop refers to the output files being written. Enabling compression does not affect the connection between Sqoop and the database itself, so the data transferred from the database to the Sqoop writer processes is uncompressed. The data those processes then write into HDFS (and that HDFS replicates) is compressed. This may improve performance, but the real intent is to save HDFS space.

In Sqoop, use the --compress flag. This enables GZip compression for a single run of Sqoop. Subsequent datasets you import into HDFS via Sqoop will be compressed only if you continue to include the --compress flag.
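
For example (the connect string, table, and target directory are placeholders), the following run writes GZip-compressed output; running the same command again without --compress would write plain text:

sqoop import --connect jdbc:mysql://dbhost/sales --table orders --target-dir /data/orders --compress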

 

Can compressed files be split among multiple MapReduce readers?

 

Many compression algorithms do not produce files that can be split among MapReduce tasks; a GZip-compressed file, for example, cannot be split. To work around this, write multiple smaller compressed files into HDFS in parallel, such as when using Sqoop. Each Sqoop writer creates a separate file, so if you use eight parallel tasks, Sqoop creates eight files, which allows eight map tasks in a subsequent MapReduce job (or Hive query). If you're accumulating periodic imports over the course of many days, you'll eventually have plenty of files for parallelism anyway. A sketch of such an import follows.
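
As a sketch with placeholder connection details, an import that uses eight parallel tasks produces eight separate compressed files, each of which can feed its own map task in a later job:

sqoop import --connect jdbc:mysql://dbhost/sales --table orders --compress -m 8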

 

 
