Is HDFS compression along with a MapReduce codec a good idea? Are there any disadvantages to doing this?

1 ACCEPTED SOLUTION

I'm not sure of your exact question, but it is typically a good idea to compress the output of the map step in MapReduce jobs. Map output is written to local disk and then transferred across the cluster to the reducers (the shuffle), and the CPU overhead of compressing and decompressing is almost always small compared with the gains from writing and sending significantly less data.

To set this for all of your jobs, use these properties in mapred-site.xml:

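<!-- Compress intermediate map output before it is spilled to disk and shuffled.
     Hadoop 2.x also accepts the newer name mapreduce.map.output.compress. -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>

<!-- Codec used for the map output.
     Hadoop 2.x also accepts the newer name mapreduce.map.output.compress.codec. -->
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>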

You can of course set the first value to false in mapred-site.xml and override it for each job (e.g. as a parameter on the command line or with a set statement at the top of a Pig script).
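As a rough sketch of that opt-in setup, the cluster-wide default in mapred-site.xml could look like the fragment below. It reuses the two property names from above. The choice to keep the codec configured while compression is off is my assumption, made so that a job only has to flip one flag.

```xml
<!-- Sketch only: cluster-wide default with map output compression turned off. -->
<property>
  <name>mapred.compress.map.output</name>
  <value>false</value>
  <!-- Do not add <final>true</final> here: a final property cannot be overridden by a job. -->
</property>

<!-- Assumption: the codec stays configured so a job only needs to enable compression. -->
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```

A job whose driver goes through ToolRunner (and therefore GenericOptionsParser) should then be able to enable compression for itself by passing -D mapred.compress.map.output=true on the command line.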

See this link for details: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_hdfs_admin_tools/content/ch04.html
