Is there a way to change the Gzip compression level in GzipCodec?

Guru

Looking at the code, GzipCodec appears to use Deflater.DEFAULT_COMPRESSION. Is there a way to tweak the Gzip compression level for MapReduce output?

1 ACCEPTED SOLUTION

Guru

Going through the code, I found the way to set GzipCodec compression level.

Right now, GzipCodec supports only the named levels BEST_COMPRESSION (9), BEST_SPEED (1), NO_COMPRESSION (0) and DEFAULT (6), even though Gzip itself supports compression levels 1-9. You can select one by setting zlib.compress.level to a value such as BEST_SPEED or BEST_COMPRESSION.
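To make the name-to-number mapping concrete, here is a minimal sketch that uses only the JDK. The `LevelNames` class and its `Level` enum are hypothetical and written for illustration; they are not Hadoop's own classes, and the numeric values come from the constants in `java.util.zip.Deflater`.

```java
import java.util.zip.Deflater;

public class LevelNames {

    /** Hypothetical enum for illustration only; this is not Hadoop's class. */
    public enum Level {
        NO_COMPRESSION(Deflater.NO_COMPRESSION),            // 0
        BEST_SPEED(Deflater.BEST_SPEED),                    // 1
        BEST_COMPRESSION(Deflater.BEST_COMPRESSION),        // 9
        DEFAULT_COMPRESSION(Deflater.DEFAULT_COMPRESSION);  // -1, zlib then uses its default (6)

        public final int zlibLevel;

        Level(int zlibLevel) {
            this.zlibLevel = zlibLevel;
        }
    }

    /**
     * Resolves a configured level name to its numeric zlib level.
     * Enum.valueOf throws IllegalArgumentException for unknown or misspelled names.
     */
    public static int resolve(String configured) {
        return Level.valueOf(configured.trim()).zlibLevel;
    }

    public static void main(String[] args) {
        for (Level level : Level.values()) {
            System.out.println(level + " -> " + level.zlibLevel);
        }
    }
}
```

In this sketch a misspelled name is rejected rather than silently ignored, so it is worth double-checking the spelling of whatever value you pass.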

However, in our tests a compression level of 4 gave the best compression per unit of CPU time, and it is not currently possible to set level 4.
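If you want to run a similar size comparison on your own data, the sketch below gzips a buffer at several levels using only the JDK. It does not go through Hadoop's GzipCodec; the `GzipLevelDemo` class and the sample data are assumptions made for illustration, and it measures output size only, not CPU time.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipLevelDemo {

    /** Gzip-compresses data at the given zlib level (0-9, or -1 for the default). */
    public static byte[] gzip(byte[] data, int level) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        // GZIPOutputStream has no level argument, so an anonymous subclass
        // sets the level on the protected Deflater field 'def'.
        try (GZIPOutputStream out = new GZIPOutputStream(sink) {
            { def.setLevel(level); }
        }) {
            out.write(data);
        }
        return sink.toByteArray();
    }

    /** Decompresses gzip data, used here to check the round trip. */
    public static byte[] gunzip(byte[] compressed) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                sink.write(buffer, 0, read);
            }
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Made-up, repetitive sample records; substitute a slice of real job output.
        StringBuilder sample = new StringBuilder();
        for (int i = 0; i < 20000; i++) {
            sample.append("user-").append(i % 97).append(",event-").append(i % 13).append('\n');
        }
        byte[] data = sample.toString().getBytes(StandardCharsets.UTF_8);

        System.out.println("raw bytes: " + data.length);
        for (int level : new int[] {1, 4, 6, 9}) {
            System.out.println("level " + level + ": " + gzip(data, level).length + " bytes");
        }
    }
}
```

To compare cost as well as size, you could wrap each `gzip` call with `System.nanoTime()` and repeat the measurement several times, since single runs are noisy.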

P.S. From HDP 2.4 onwards, you can set other compression levels, such as 4. See https://issues.apache.org/jira/browse/HADOOP-12794 for more details.

6 REPLIES

Looking at HADOOP-5879, it seems you can.

Master Mentor

@ravi@hortonworks.com

Ravi, this is an interesting find.

Guru

Interesting find. It is still not clear how to pass this configuration for MapReduce output. Right now, we only specify the codec.

Master Mentor

I agree. I am trying to reach out to the contributors on that thread. I found this article

Let me know if that helps to begin with the customization of compression.

@ravi@hortonworks.com
