
Hadoop input split for a compressed block


New Contributor

If I have a splittable compressed file of 1 GB, and the default block size and input split size are both 128 MB, then 8 blocks and 8 input splits are created. When a compressed block is read by MapReduce it is uncompressed, and say that after uncompression the size of the block becomes 200 MB. But the input split assigned for it is only 128 MB, so how are the remaining 72 MB processed?

 

1. Are they processed by the next input split?

2. Or is the input split size itself increased?


Re: Hadoop input split for a compressed block

Master Guru
The input split size is just a unit of work division. A task is not bound by any limit that prevents it from reading more or less than that.

I/O-wise you are still reading 128 MB, but you are expanding those bytes in memory when decompressing.

Among your choices, (2) is closer to how it actually works: the split describes the compressed bytes on disk, and the task processes whatever volume of data those bytes decompress to.
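To make the idea concrete, here is a small sketch (plain Python, not Hadoop code) using the bzip2 codec, which is the splittable codec Hadoop supports. The "split" is a fixed number of compressed bytes read from disk; the task decompresses them in memory and processes however many bytes come out, which can be far more than the split size:

```python
import bz2

# Build some repetitive record-like data so it compresses well.
data = b"some,record,text\n" * 100_000

# Compress with bzip2 (the splittable codec in Hadoop).
compressed = bz2.compress(data)

# The "input split" is measured against the compressed bytes on disk.
split_size = len(compressed)

# The task expands those bytes in memory and processes all of them.
decompressed = bz2.decompress(compressed)

print(f"compressed bytes read (the split): {split_size}")
print(f"decompressed bytes processed:      {len(decompressed)}")

# The task is not limited to processing only split_size bytes of output.
assert len(decompressed) > split_size
```

The split size governs how the I/O work is divided among tasks; the amount of uncompressed data a task ends up processing is simply a consequence of the compression ratio.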