Created on 08-20-2015 12:59 PM
There are several places throughout Hadoop and the various tools you're using where compression can be enabled:
In your platform-wide configuration files you can enable compression for Hive, MapReduce programs, and others, and those programs or queries will emit their new output datasets in a compressed form. A MapReduce program can also turn compression on or off explicitly during its own operation. Sqoop, for example, makes it easy to enable compression with the --compress flag, which uses GZip compression; datasets you import into HDFS via Sqoop with this flag will be compressed.
For a file to be compressed in HDFS, the process that writes it to HDFS has to compress the data. HDFS itself does not compress or uncompress files.
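Since the writing process is responsible for compression, one common pattern is to compress locally before (or while) loading into HDFS. A sketch, assuming a hypothetical local file access.log and HDFS directory /data/logs (both placeholders):

```shell
# HDFS stores bytes exactly as given; compress before writing.
gzip -k access.log                      # keeps access.log, produces access.log.gz
hadoop fs -put access.log.gz /data/logs/

# Or stream-compress on the way in, without a temporary file
# (hadoop fs -put reads from stdin when given "-"):
gzip -c access.log | hadoop fs -put - /data/logs/access.log.gz
```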
You can begin compressing new data without worrying about recompressing existing data. Enabling compression in any of the tasks that write to HDFS affects only the new data being written; existing files are not affected. Once a file has been written to HDFS it never changes, and HDFS does not care whether its contents are compressed.
Intermingling uncompressed text and GZip-compressed text files in the same Hive table works transparently.
Hive has parameters to enable compression of intermediate (hive.exec.compress.intermediate) and output data (hive.exec.compress.output).
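These can also be toggled per session or per query. A minimal sketch, run through the Hive CLI; the table names are placeholders for illustration:

```shell
hive -e "
  SET hive.exec.compress.intermediate=true;  -- compress intermediate map output
  SET hive.exec.compress.output=true;        -- compress the query's final output files
  INSERT OVERWRITE TABLE events_compressed SELECT * FROM events;
"
```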
On the reading side, Hive automatically recognizes that input files ending in .gz are compressed, and it reads them the same way it reads uncompressed data. It does this regardless of whether you write your new outputs in a compressed form. You can intermix GZip-compressed files and plain text files in the same table, and Hive is able to read both.
To see the contents of compressed files, use hadoop fs -text:
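A sketch of viewing a compressed file with -text (the HDFS path is a placeholder):

```shell
# -text decompresses GZip (and decodes SequenceFiles) before printing:
hadoop fs -text /data/events/part-00000.gz | head
```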
The alternative, -cat, prints the uninterpreted bytes of a file literally to your terminal, so it won't handle compressed data. Using -text will decode a few file formats: in particular, GZip-compressed text is decompressed, and SequenceFiles are decoded into textual key/value pairs. You can always pipe the data through the particular decompression you need:
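For example, piping raw HDFS bytes through a local gunzip; the HDFS path is a placeholder:

```shell
# On a cluster, stream the raw bytes out and decompress locally:
#   hadoop fs -cat /data/events/part-00000.gz | gunzip | head
# The pipe itself is ordinary shell plumbing; the same pattern works
# with any gzip stream, e.g. locally:
printf 'id,value\n1,foo\n' | gzip | gunzip
```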
There are multiple parameters that control compression:
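The original list of parameters is not reproduced here; for MapReduce jobs the usual ones (Hadoop 2.x property names) are mapreduce.output.fileoutputformat.compress, mapreduce.output.fileoutputformat.compress.codec, mapreduce.map.output.compress, and mapreduce.map.output.compress.codec. A sketch of a per-job command-line override (the jar, class, and paths are placeholders; -D overrides require a job that uses ToolRunner):

```shell
hadoop jar my-job.jar com.example.MyJob \
  -D mapreduce.output.fileoutputformat.compress=true \
  -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec \
  -D mapreduce.map.output.compress=true \
  /data/in /data/out
```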
These parameters can either be specified in the configuration files or on the command line. There is also programmatic control from Java.
Compression in Sqoop refers to the output files being written. Enabling it does not affect the wire protocol between Sqoop and the database itself, so the connections between the database and the initial set of Sqoop writer processes remain uncompressed. Because the writers compress the data before handing it to HDFS, the traffic to the additional HDFS replicas is compressed. This may improve performance, but the real intent is to save HDFS space.
In Sqoop, use the --compress flag. This enables GZip compression for a single run of Sqoop. Subsequent datasets you import into HDFS via Sqoop will be compressed only if you continue to include the --compress flag.
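A minimal sketch of a compressed import; the connection string and table name are placeholders:

```shell
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --table orders \
  --compress          # emits GZip-compressed part files in HDFS
```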
Many compression algorithms do not produce files that can be split among MapReduce tasks. For example, a GZip-compressed file cannot be split. To work around this behavior, consider using parallelism to read multiple smaller files into HDFS, such as when using Sqoop. Each writer in Sqoop will create a separate file. So if you use eight threads, then Sqoop creates eight files, which allow eight subsequent map tasks in another MapReduce job (or Hive query). If you're merging periodic imports over the course of many days, then you'll eventually have plenty of files for parallelism purposes anyway.
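The parallelism workaround above can be sketched with Sqoop's --num-mappers flag; again the connection string and table are placeholders:

```shell
# Eight parallel writers produce eight compressed part files,
# so a later MapReduce job or Hive query can run eight map tasks:
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --table orders \
  --num-mappers 8 \
  --compress
```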
NOTE: This article was taken from our internal knowledge base.