Support Questions

Find answers, ask questions, and share your expertise

How will a huge compressed file get stored in HDFS? Is the data distributed across different nodes, or does it get stored on a single node?

Explorer
 
1 ACCEPTED SOLUTION

Master Guru

In addition to what Neeraj said: the data will be cut into blocks and distributed, but, perhaps more relevant, you will have a SINGLE mapper reading that file (and piecing it back together).

This is true for GZ, for example, which is a so-called "non-splittable" compression format. That means a map task cannot read a single block; it essentially needs to read the full file from the start.

So the rule of thumb is: if you have GZ-compressed files (which is perfectly fine and often used), make sure they are not big, and be aware that each of them will be read by a single map task. Depending on compression ratio and performance SLAs, you want to be below 128 MB.

There are other "splittable" compression algorithms supported (mainly LZO) in case you cannot guarantee that. And some native formats like HBase HFiles and Hive ORC files support compression inherently, mostly by compressing internal blocks or fields.
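
For illustration, here is a minimal sketch (the file names are just examples, not from this thread) that asks Hadoop's codec factory whether a file's compression codec is splittable. Gzip is not splittable, bzip2 is (its codec implements SplittableCompressionCodec), and LZO generally becomes splittable only after the file has been indexed:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class SplittabilityCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);

        // Example file names only; the codec is picked by file extension.
        for (String name : new String[] {"data.gz", "data.bz2", "data.txt"}) {
            CompressionCodec codec = factory.getCodec(new Path(name));
            boolean splittable = (codec == null)                       // uncompressed text: splittable for TextInputFormat
                    || (codec instanceof SplittableCompressionCodec);  // e.g. bzip2
            System.out.println(name + " -> codec="
                    + (codec == null ? "none" : codec.getClass().getSimpleName())
                    + ", splittable=" + splittable);
        }
    }
}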

8 REPLIES

Master Mentor

@Issaq Mohammad If the replication factor is not 1, the data will be distributed across different nodes. See the following details:

[attached image: 4131-name-node.png]
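
To see that distribution for a concrete file, a minimal sketch using the standard HDFS FileSystem API (the path /data/big-file.gz is hypothetical) that prints which hosts hold each block:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/big-file.gz");   // hypothetical path
        FileStatus status = fs.getFileStatus(file);

        // One entry per HDFS block, with the DataNodes that hold its replicas.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.println("offset=" + b.getOffset()
                    + " length=" + b.getLength()
                    + " hosts=" + String.join(",", b.getHosts()));
        }
    }
}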

avatar

Here is a great writeup on file compression in Hadoop - http://comphadoop.weebly.com/

Explorer

Thanks all for the replies, I appreciate it. Is it possible to use a single mapper to read the compressed file and then apply a codec mechanism to distribute the data across nodes? Please let me know.

Master Guru

I'm not exactly sure what you mean by "codec mechanism", but if you are asking whether you can transform a single big GZ file into smaller gz files, or into uncompressed files, you would most likely use Pig:

http://stackoverflow.com/questions/4968843/how-do-i-store-gzipped-files-using-pigstorage-in-apache-p...

To control the number of output files (writers), you will need to force reducers.

http://stackoverflow.com/questions/19789642/how-do-i-force-pigstorage-to-output-a-few-large-files-in...

And here are some tips on setting the number of reducers:

http://pig.apache.org/docs/r0.8.1/cookbook.html#Use+the+Parallel+Features

Instead of Pig you could also write a small MapReduce job; there you are more flexible, at the price of a bit of coding (see the sketch below). Spark might work too, or Hive using the DISTRIBUTE BY keyword.
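
For the MapReduce route, a minimal sketch (class names, paths, and the reducer count are just examples) of a job that reads one big .gz file with a single mapper and rewrites it through several reducers, producing smaller gzipped part files. Note that line order is not preserved:

import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ResplitGz {

    // Scatter lines across reducers with a random key.
    public static class ScatterMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
        private final Random rnd = new Random();
        private final IntWritable bucket = new IntWritable();
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            bucket.set(rnd.nextInt(Integer.MAX_VALUE));
            ctx.write(bucket, line);
        }
    }

    // Drop the synthetic key and write the lines back out.
    public static class PassThroughReducer extends Reducer<IntWritable, Text, NullWritable, Text> {
        @Override
        protected void reduce(IntWritable key, Iterable<Text> lines, Context ctx)
                throws IOException, InterruptedException {
            for (Text line : lines) {
                ctx.write(NullWritable.get(), line);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "resplit-gz");
        job.setJarByClass(ResplitGz.class);
        job.setMapperClass(ScatterMapper.class);
        job.setReducerClass(PassThroughReducer.class);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(10);                       // number of output part files

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        FileOutputFormat.setCompressOutput(job, true);   // write part-r-*.gz again
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}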

Super Collaborator

You can achieve this by reading the "non-splittable" compressed file with a single mapper and then distributing the data to multiple nodes through reducers.

HDFS will store the data on multiple nodes even if the files are compressed (with a non-splittable or splittable codec). HDFS splits the compressed file into blocks based on the block size. When reading the file back in an MR job, the job will have a single mapper if the file is compressed with a non-splittable codec; otherwise (with a splittable codec) the MR job will have multiple mappers reading the data.

How the data is distributed:

Suppose you have a 1024 MB compressed file and your Hadoop cluster has a 128 MB block size.

When you upload the compressed file to HDFS, it will be split into 8 blocks (128 MB each) and distributed to different nodes of the cluster. HDFS takes care of which node in the cluster receives each block, depending on cluster health, node health, and HDFS balance.
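
A quick check of that arithmetic, using the example numbers above:

public class BlockCount {
    public static void main(String[] args) {
        long fileSizeMb = 1024;   // compressed file size from the example
        long blockSizeMb = 128;   // HDFS block size from the example
        long blocks = (fileSizeMb + blockSizeMb - 1) / blockSizeMb;  // ceiling division
        System.out.println(blocks + " blocks");  // prints: 8 blocks
    }
}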
