Support Questions


How will a huge compressed file be stored in HDFS? Is the data distributed across different nodes, or is it stored on a single node?

avatar
Explorer
 
1 ACCEPTED SOLUTION

avatar
Master Guru
8 REPLIES

avatar
Master Mentor

@Issaq Mohammad HDFS splits the file into blocks and distributes them across different nodes; with a replication factor greater than 1, each block is additionally replicated to other nodes. See the following details:

4131-name-node.png
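As a rough sketch of the space arithmetic behind replication (the function name and defaults are illustrative, not a Hadoop API): every block is stored `replication_factor` times, on different nodes where possible, so the raw capacity consumed is the file size times the replication factor.

```python
def raw_storage_mb(file_size_mb, replication_factor=3):
    """Total raw HDFS capacity consumed by a file: each block is
    stored replication_factor times, on different nodes where possible."""
    return file_size_mb * replication_factor

# A 1024 MB file with the common default replication factor of 3
# consumes 3072 MB of raw cluster capacity.
```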


avatar

Here is a great writeup on file compression in Hadoop - http://comphadoop.weebly.com/

avatar
Explorer

Thanks all for the replies, appreciate it. Is it possible to use a single mapper to read the compressed file and apply a codec mechanism to distribute the data across nodes? Please let me know.

avatar
Master Guru

I'm not exactly sure what you mean by "codec mechanism". But if you are asking whether you can transform a single big GZ file into smaller gz files, or into uncompressed files, you would most likely use Pig:

http://stackoverflow.com/questions/4968843/how-do-i-store-gzipped-files-using-pigstorage-in-apache-p...

To control the number of output files, you will need to force a specific number of reducers.

http://stackoverflow.com/questions/19789642/how-do-i-force-pigstorage-to-output-a-few-large-files-in...

And here are some tips on setting the number of reducers:

http://pig.apache.org/docs/r0.8.1/cookbook.html#Use+the+Parallel+Features

Instead of Pig you could also write a small MapReduce job; there you are more flexible, at the price of a bit of coding. Spark might work too, or Hive using the DISTRIBUTE BY keyword.


avatar
Super Collaborator

You can achieve this by reading the non-splittable compressed file in a single mapper and then distributing the data to multiple nodes through the reducers.

HDFS will store the data on multiple nodes even if the file is compressed (with either a splittable or a non-splittable codec). HDFS splits the compressed file based on the block size. When reading the file back in an MR job, the job will have a single mapper if the file is compressed with a non-splittable codec; with a splittable codec, the MR job will have multiple mappers reading the data.
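The splittable/non-splittable distinction above can be summarized in a small table (a reference sketch, not a Hadoop API; the helper function is hypothetical):

```python
# Whether files compressed with common Hadoop codecs can be split
# into multiple input splits (and therefore multiple mappers).
SPLITTABLE = {
    "gzip":   False,  # single DEFLATE stream: must be read start to finish
    "snappy": False,  # raw snappy files are not splittable
    "bzip2":  True,   # block-oriented format, splittable
    "lzo":    False,  # splittable only after building a separate index
}

def expected_mappers(codec, file_blocks):
    """One mapper per HDFS block if the codec is splittable,
    otherwise a single mapper for the whole file."""
    return file_blocks if SPLITTABLE[codec] else 1
```

So for the 8-block gzip file discussed in this thread, the MR job would get one mapper; the same data compressed with bzip2 would get eight.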

How the data is distributed:

Suppose you have a 1024 MB compressed file and your Hadoop cluster has a 128 MB block size.

When you upload the compressed file to HDFS, it is split into 8 blocks (128 MB each) and distributed to different nodes of the cluster. HDFS takes care of which node receives each block, depending on cluster health, node health, and HDFS balance.
