Created on 10-28-2015 04:39 PM - edited 09-16-2022 02:46 AM
From the code, it looks like GzipCodec uses Deflater.DEFAULT_COMPRESSION. Is there a way to tune the Gzip compression level for MapReduce output?
Created 11-02-2015 04:24 AM
Going through the code, I found the way to set GzipCodec compression level.
Right now, GzipCodec accepts the named levels BEST_COMPRESSION, BEST_SPEED, NO_COMPRESSION and DEFAULT. Gzip itself supports compression levels 1-9, so apart from NO_COMPRESSION, GzipCodec can use only BEST_COMPRESSION (9), BEST_SPEED (1) and DEFAULT (6). You can select one by setting zlib.compress.level to BEST_SPEED or BEST_COMPRESSION.
However, in our tests a compression level of 4 gave the best compression per unit of CPU time, and right now there is no way to set level 4.
P.S. From HDP 2.4 onwards, you can use other compression levels such as 4. See https://issues.apache.org/jira/browse/HADOOP-12794 for details.
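To get a feel for the level trade-off on your own data, something like the following can help. It is only a rough sketch using the JDK's java.util.zip classes, not Hadoop's GzipCodec; the class name GzipLevelDemo, the helper methods and the sample data are made up for illustration, and the timings it prints are not a proper benchmark.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipLevelDemo {

    /** GZIPOutputStream takes no level argument, so set it on the protected Deflater field. */
    static final class LeveledGzipStream extends GZIPOutputStream {
        LeveledGzipStream(OutputStream out, int level) throws IOException {
            super(out);
            def.setLevel(level);
        }
    }

    /** Gzip-compresses the input at the given zlib level (1-9). */
    static byte[] gzip(byte[] input, int level) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new LeveledGzipStream(sink, level)) {
            gz.write(input);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    /** Decompresses gzip data, used here only to check the round trip. */
    static byte[] gunzip(byte[] compressed) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                sink.write(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    /** Hypothetical, highly repetitive sample; substitute a slice of real job output. */
    static byte[] sampleData() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 20000; i++) {
            sb.append("row-").append(i % 97).append(",status=OK,region=us-east\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = sampleData();
        // BEST_SPEED is 1 and BEST_COMPRESSION is 9; DEFAULT_COMPRESSION is -1,
        // which zlib treats as level 6.
        int[] levels = {Deflater.BEST_SPEED, 4, 6, Deflater.BEST_COMPRESSION};
        for (int level : levels) {
            long start = System.nanoTime();
            byte[] compressed = gzip(data, level);
            long micros = (System.nanoTime() - start) / 1000;
            System.out.printf("level %d: %d -> %d bytes in %d us%n",
                    level, data.length, compressed.length, micros);
        }
    }
}
```

Running it against a representative sample of your output should show whether a middle level such as 4 is worth it for your data before you change the cluster configuration.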
Created 10-28-2015 09:38 PM
Looking at HADOOP-5879, it seems you can.
Created 10-28-2015 10:36 PM
@ravi@hortonworks.com
Ravi, this is an interesting find.
Created 10-28-2015 10:48 PM
Interesting find. It is still not clear how we can pass this configuration for MapReduce output. Right now, we just specify the codec.
Created 10-29-2015 10:19 AM
I agree. I am trying to reach out to the contributors on that thread. I found this article
Let me know if that helps you begin customizing the compression.
@ravi@hortonworks.com
Created 11-02-2015 11:58 AM
Thanks @ravi@hortonworks.com