Double read for lzma compression codec

Rising Star

Hi experts!

I would really appreciate it if somebody could help me understand why I read twice as much data with the lzma codec (https://github.com/yongtang/hadoop-xz).

I've started to play with it and ran into something interesting. When I process data with lzma, I read twice as much data as I actually have on HDFS.

For example, the Hadoop client (hadoop fs -du) reports a size of about 100 GB.

Then I run a MapReduce job (something like select count(1)) over this data, check the MR counters, and find that "HDFS bytes read" is twice as large (about 200 GB).

With the gzip and bzip2 codecs, the file size reported by the Hadoop client and the MR counter are about the same.
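
To make the comparison concrete, here is a minimal sketch in Python of the ratio being described. It assumes the two figures are copied by hand from the hadoop fs -du output and from the job's "HDFS bytes read" counter. The function name read_amplification and the sample values are hypothetical and only mirror the 100 GB / 200 GB example above.

```python
def read_amplification(stored_bytes: int, bytes_read: int) -> float:
    """Return how many times more bytes were read than are stored."""
    if stored_bytes <= 0:
        raise ValueError("stored_bytes must be positive")
    return bytes_read / stored_bytes


GIB = 1024 ** 3

# Hypothetical figures matching the example in this post.
stored = 100 * GIB  # size reported by hadoop fs -du
read = 200 * GIB    # "HDFS bytes read" counter from the MR job

ratio = read_amplification(stored, read)
print(f"read amplification: {ratio:.2f}x")
```

A ratio close to 1.0 is what I see with gzip and bzip2. With lzma it comes out close to 2.0.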

 

Re: Double read for lzma compression codec

Super Collaborator

The codec is responsible for the reads, so you will need to ask the creator of the codec why this is happening.

 

Wilfred