Support Questions

Find answers, ask questions, and share your expertise

How to compress HDFS data using zlib compression?

avatar
Super Collaborator

Hi Community team,

Can anyone help me enable zlib compression in HDP 2.4.2?

Thanks in advance.

1 ACCEPTED SOLUTION

avatar
Super Guru

@subhash parise

As @Artem Ervits shared, you get compression when storing your data in ORC format. However, if you want to store "raw" data on HDFS and compress it selectively, you can use a simple Pig script: load the data from HDFS, then write it out again with compression enabled.

set output.compression.enabled true;
set output.compression.codec org.apache.hadoop.io.compress.BZip2Codec;

inputFiles = LOAD '/input/directory/uncompressed' USING PigStorage();
STORE inputFiles INTO '/output/directory/compressed/' USING PigStorage();

You can either leave the uncompressed data or remove it, depending on what you are doing. This is an approach that I've used.

You can use different codecs depending on your needs:

set output.compression.codec com.hadoop.compression.lzo.LzopCodec;
set output.compression.codec org.apache.hadoop.io.compress.GzipCodec;
set output.compression.codec org.apache.hadoop.io.compress.BZip2Codec;
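The codecs above trade speed for compression ratio differently. As a rough illustration (using Python's standard library rather than Hadoop itself, and made-up sample data), the same underlying algorithms are available locally: zlib/DEFLATE backs DefaultCodec and GzipCodec, and bzip2 backs BZip2Codec, so you can compare their output sizes:

```python
import bz2
import gzip
import zlib

# Repetitive sample data, loosely resembling a delimited text file on HDFS.
data = b"cat_0,cat_1,cat_2,cat_3\n" * 1000

deflated = zlib.compress(data)   # zlib/DEFLATE: the format DefaultCodec writes
gzipped = gzip.compress(data)    # gzip: the format GzipCodec writes
bzipped = bz2.compress(data)     # bzip2: the format BZip2Codec writes

# bzip2 usually compresses repetitive text harder, at a higher CPU cost.
for name, blob in [("zlib", deflated), ("gzip", gzipped), ("bz2", bzipped)]:
    print(name, len(blob))

# All three round-trip back to the original bytes.
assert zlib.decompress(deflated) == data
assert gzip.decompress(gzipped) == data
assert bz2.decompress(bzipped) == data
```

As a rule of thumb this mirrors the Hadoop-side trade-off too: bzip2 is splittable and compresses well but is slow, while gzip/zlib is faster with a somewhat larger output.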


9 Replies

avatar
Master Mentor

Take a look at this article; it covers ways of setting compression, including zlib in Hive: http://hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/

It will help if you specify which product you're trying to enable zlib for. Since you categorized the question under data ingestion, I will assume it's for Sqoop. Here's an example of how to import with Sqoop using compression; just replace the Snappy codec class with the zlib one: https://community.hortonworks.com/questions/29648/sqoop-import-to-hive-with-compression.html

avatar
Super Collaborator
@Artem Ervits: Thank you for replying to my question. I am looking for zlib compression at the HDFS level.


avatar
Super Collaborator

@Michael Young: Could you please give me the syntax to set the compression codec to zlib?

avatar
Super Guru

@subhash parise

The default codec (org.apache.hadoop.io.compress.DefaultCodec) uses the zlib format. If you want to set it explicitly, use the following:

set output.compression.codec org.apache.hadoop.io.compress.DefaultCodec;
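Under the hood, DefaultCodec writes a zlib-format (DEFLATE) stream. As a minimal sketch, Python's built-in zlib module (not Hadoop, and with made-up sample bytes) produces the same wire format and shows that it round-trips:

```python
import zlib

original = b"hdfs block contents " * 500

# Compress into the zlib format (RFC 1950), the same stream format
# org.apache.hadoop.io.compress.DefaultCodec produces.
compressed = zlib.compress(original, level=6)

# A zlib stream starts with a CMF header byte; 0x78 indicates the
# DEFLATE method with the standard 32 KB window.
assert compressed[0] == 0x78

restored = zlib.decompress(compressed)
assert restored == original
print(f"{len(original)} bytes -> {len(compressed)} bytes")
```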

avatar
Super Guru

@subhash parise

I just posted an article demonstrating a very simple Pig + Hive example showing HDFS compression.

https://community.hortonworks.com/content/kbentry/50921/using-pig-to-convert-uncompressed-data-to-co...

avatar

Sample Hive script:

CREATE EXTERNAL TABLE test.temp3 (
  cat_0 bigint,
  cat_1 bigint,
  cat_2 bigint,
  cat_3 bigint,
  cat_4 bigint,
  cat_5 bigint,
  cat_6 bigint,
  cat_7 bigint,
  cat_8 bigint,
  cat_9 bigint
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS ORC
LOCATION '/test/'
TBLPROPERTIES ("orc.compress"="ZLIB");

avatar
Super Collaborator

@Divakar Annapureddy: Thank you for replying to my question. My case is a bit different: I need the zlib codec for HDFS data (Hadoop files), not a Hive table.