<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: How is a huge compressed file stored in HDFS? Is the data distributed across different nodes, or does it get stored on a single node? in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-a-huge-compressed-file-will-get-stored-in-HDFS-system-Is/m-p/140165#M27844</link>
    <description>&lt;P&gt;Thanks, everyone, for the replies; I appreciate it. Is it possible to use a single mapper to read the compressed file and then apply a codec mechanism to distribute the data across nodes? Please let me know.&lt;/P&gt;</description>
    <pubDate>Tue, 10 May 2016 19:19:51 GMT</pubDate>
    <dc:creator>issaq_mohd</dc:creator>
    <dc:date>2016-05-10T19:19:51Z</dc:date>
    <item>
      <title>How is a huge compressed file stored in HDFS? Is the data distributed across different nodes, or does it get stored on a single node?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-a-huge-compressed-file-will-get-stored-in-HDFS-system-Is/m-p/140162#M27841</link>
      <description />
      <pubDate>Tue, 10 May 2016 18:13:03 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-a-huge-compressed-file-will-get-stored-in-HDFS-system-Is/m-p/140162#M27841</guid>
      <dc:creator>issaq_mohd</dc:creator>
      <dc:date>2016-05-10T18:13:03Z</dc:date>
    </item>
    <item>
      <title>Re: How is a huge compressed file stored in HDFS? Is the data distributed across different nodes, or does it get stored on a single node?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-a-huge-compressed-file-will-get-stored-in-HDFS-system-Is/m-p/140163#M27842</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/10357/issaq-mohd.html" nodeid="10357" target="_blank"&gt;@Issaq Mohammad&lt;/A&gt; If the replication factor is not 1, the data will be distributed across different nodes. See the following details:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="4131-name-node.png" style="width: 900px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/21790iCBE939FCFF98EDEC/image-size/medium?v=v2&amp;amp;px=400" role="button" title="4131-name-node.png" alt="4131-name-node.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 19 Aug 2019 08:23:11 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-a-huge-compressed-file-will-get-stored-in-HDFS-system-Is/m-p/140163#M27842</guid>
      <dc:creator>nsabharwal</dc:creator>
      <dc:date>2019-08-19T08:23:11Z</dc:date>
    </item>
    <item>
      <title>Re: How is a huge compressed file stored in HDFS? Is the data distributed across different nodes, or does it get stored on a single node?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-a-huge-compressed-file-will-get-stored-in-HDFS-system-Is/m-p/140164#M27843</link>
      <description>&lt;P&gt;In addition to what Neeraj said: the data will be cut into blocks and distributed, but, perhaps more relevantly, you will have a SINGLE mapper reading that file (and piecing it back together).&lt;/P&gt;&lt;P&gt;This is true for GZ, for example, which is a so-called "non-splittable" compression format: a map task cannot read a single block but essentially needs to read the full file from the start.&lt;/P&gt;&lt;P&gt;So the rule of thumb is: if you have GZ-compressed files (which is perfectly fine and often done), make sure they are not big, and be aware that each of them will be read by a single map task. Depending on compression ratio and performance SLAs, you want to stay below 128 MB.&lt;/P&gt;&lt;P&gt;There are other "splittable" compression algorithms supported (mainly LZO) in case you cannot guarantee that. And some native formats, such as HBase HFiles and Hive ORC files, support compression inherently, mostly by compressing internal blocks or fields.&lt;/P&gt;</description>
      <pubDate>Tue, 10 May 2016 18:51:39 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-a-huge-compressed-file-will-get-stored-in-HDFS-system-Is/m-p/140164#M27843</guid>
      <dc:creator>bleonhardi</dc:creator>
      <dc:date>2016-05-10T18:51:39Z</dc:date>
    </item>
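    <!-- Editorial sketch, not part of the thread: the rule of thumb in the post above (keep each gzip file under one block so each gets its own map task) can be illustrated with a plain-Python splitter. Filenames and chunk sizes here are illustrative; this splits on byte boundaries, whereas a real job would split on record boundaries.

    ```python
    import gzip

    def split_gzip(src_path, dest_prefix, max_bytes):
        """Split one large .gz file into smaller .gz parts.

        Each part holds at most max_bytes of uncompressed data, so a
        MapReduce job would run one map task per part instead of a
        single map task over the whole non-splittable file.
        """
        parts = []
        with gzip.open(src_path, "rb") as src:
            idx = 0
            while True:
                chunk = src.read(max_bytes)
                if not chunk:
                    break
                part_path = f"{dest_prefix}.part{idx:04d}.gz"
                with gzip.open(part_path, "wb") as out:
                    out.write(chunk)
                parts.append(part_path)
                idx += 1
        return parts
    ```

    For HDFS you would pick max_bytes at or below the block size (e.g. 128 MB), per the advice above. -->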
    <item>
      <title>Re: How is a huge compressed file stored in HDFS? Is the data distributed across different nodes, or does it get stored on a single node?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-a-huge-compressed-file-will-get-stored-in-HDFS-system-Is/m-p/140165#M27844</link>
      <description>&lt;P&gt;Thanks, everyone, for the replies; I appreciate it. Is it possible to use a single mapper to read the compressed file and then apply a codec mechanism to distribute the data across nodes? Please let me know.&lt;/P&gt;</description>
      <pubDate>Tue, 10 May 2016 19:19:51 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-a-huge-compressed-file-will-get-stored-in-HDFS-system-Is/m-p/140165#M27844</guid>
      <dc:creator>issaq_mohd</dc:creator>
      <dc:date>2016-05-10T19:19:51Z</dc:date>
    </item>
    <item>
      <title>Re: How is a huge compressed file stored in HDFS? Is the data distributed across different nodes, or does it get stored on a single node?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-a-huge-compressed-file-will-get-stored-in-HDFS-system-Is/m-p/140167#M27846</link>
      <description>&lt;P&gt;I'm not exactly sure what you mean by "codec mechanism", but if you are asking whether you could transform a single big GZ file into smaller gz files, or into uncompressed files, you would most likely use Pig:&lt;/P&gt;&lt;P&gt;&lt;A href="http://stackoverflow.com/questions/4968843/how-do-i-store-gzipped-files-using-pigstorage-in-apache-pig"&gt;http://stackoverflow.com/questions/4968843/how-do-i-store-gzipped-files-using-pigstorage-in-apache-pig&lt;/A&gt;&lt;/P&gt;&lt;P&gt;To specify the number of writers, you will need to force reducers:&lt;/P&gt;&lt;P&gt;&lt;A href="http://stackoverflow.com/questions/19789642/how-do-i-force-pigstorage-to-output-a-few-large-files-instead-of-thousands-of-ti"&gt;http://stackoverflow.com/questions/19789642/how-do-i-force-pigstorage-to-output-a-few-large-files-instead-of-thousands-of-ti&lt;/A&gt;&lt;/P&gt;&lt;P&gt;And here are some tips on setting the number of reducers:&lt;/P&gt;&lt;P&gt;&lt;A href="http://pig.apache.org/docs/r0.8.1/cookbook.html#Use+the+Parallel+Features"&gt;http://pig.apache.org/docs/r0.8.1/cookbook.html#Use+the+Parallel+Features&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Instead of Pig you could also write a small MapReduce job; there you are more flexible, at the price of a bit of coding. Spark might work too, or Hive using the DISTRIBUTE BY keyword.&lt;/P&gt;</description>
      <pubDate>Tue, 10 May 2016 19:32:58 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-a-huge-compressed-file-will-get-stored-in-HDFS-system-Is/m-p/140167#M27846</guid>
      <dc:creator>bleonhardi</dc:creator>
      <dc:date>2016-05-10T19:32:58Z</dc:date>
    </item>
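    <!-- Editorial sketch, not part of the thread: the "single reader, forced reducers" pattern described above can be mimicked in plain Python. One task reads the non-splittable gzip; N writers (standing in for Pig's PARALLEL reducers or Hive's DISTRIBUTE BY) emit smaller compressed files that later jobs can process in parallel. Names and the round-robin key are illustrative.

    ```python
    import gzip

    def redistribute(src_path, dest_prefix, num_writers):
        """Single reader fans records out to num_writers compressed outputs."""
        paths = [f"{dest_prefix}.r{i:02d}.gz" for i in range(num_writers)]
        outs = [gzip.open(p, "wt") for p in paths]
        try:
            with gzip.open(src_path, "rt") as src:
                for n, line in enumerate(src):
                    # round-robin partitioning; a real job would hash a key
                    outs[n % num_writers].write(line)
        finally:
            for out in outs:
                out.close()
        return paths
    ```
    -->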
    <item>
      <title>Re: How is a huge compressed file stored in HDFS? Is the data distributed across different nodes, or does it get stored on a single node?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-a-huge-compressed-file-will-get-stored-in-HDFS-system-Is/m-p/140168#M27847</link>
      <description>&lt;P&gt;You can achieve this by reading the "non-splittable" compressed format with a single mapper and then distributing the data to multiple nodes using a reducer.&lt;/P&gt;&lt;P&gt;HDFS will store the data on multiple nodes even if the files are compressed (with a non-splittable or splittable codec): HDFS splits the compressed file based on the block size. When reading the file back in an MR job, the job will have a single mapper if the file is compressed with a non-splittable codec; otherwise (splittable codec), it will have multiple mappers reading the data.&lt;/P&gt;&lt;P&gt;How the data is distributed:&lt;/P&gt;&lt;P&gt;Suppose you have a 1024 MB compressed file and your Hadoop cluster has a 128 MB block size.&lt;/P&gt;&lt;P&gt;When you upload the compressed file to HDFS, it will be split into 8 blocks (128 MB each) and distributed to different nodes of the cluster. HDFS takes care of which node receives each block, depending on cluster health, node health, and HDFS balance.&lt;/P&gt;</description>
      <pubDate>Tue, 10 May 2016 19:46:01 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-a-huge-compressed-file-will-get-stored-in-HDFS-system-Is/m-p/140168#M27847</guid>
      <dc:creator>pradeep_bhadani</dc:creator>
      <dc:date>2016-05-10T19:46:01Z</dc:date>
    </item>
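    <!-- Editorial sketch, not part of the thread: the block arithmetic in the post above (1024 MB file / 128 MB block size = 8 blocks) is just a ceiling division, computed here with integers to avoid float rounding.

    ```python
    def hdfs_block_count(file_bytes, block_bytes=128 * 1024 * 1024):
        """Blocks a file occupies in HDFS: ceil(file size / block size)."""
        return (file_bytes + block_bytes - 1) // block_bytes
    ```

    A 1024 MB file on a 128 MB block size yields 8 blocks; one extra byte would yield 9, since a partial final block still occupies a block. -->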
    <item>
      <title>Re: How is a huge compressed file stored in HDFS? Is the data distributed across different nodes, or does it get stored on a single node?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-a-huge-compressed-file-will-get-stored-in-HDFS-system-Is/m-p/140169#M27848</link>
      <description>&lt;P&gt;Hello &lt;A rel="user" href="https://community.cloudera.com/users/10357/issaq-mohd.html" nodeid="10357"&gt;@Issaq Mohammad&lt;/A&gt;,&lt;/P&gt;&lt;P&gt;Here are some useful posts on file formats:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;A href="http://www.semantikoz.com/blog/getting-started-with-hadoop-and-big-data-with-text-and-hive/"&gt;Getting started with Text and Apache Hive&lt;/A&gt;&lt;/LI&gt;&lt;LI&gt;&lt;A href="http://www.semantikoz.com/blog/optimising-hadoop-big-data-text-hive/"&gt;Optimising Hadoop with Text and Hive&lt;/A&gt;&lt;/LI&gt;&lt;LI&gt;&lt;A href="http://www.semantikoz.com/blog/faster-big-data-hadoop-hive-rcfile/"&gt;Faster Big Data on Hadoop with Hive and RCFile&lt;/A&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;I hope that helps you to navigate the space a bit better.&lt;/P&gt;</description>
      <pubDate>Tue, 10 May 2016 20:48:06 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-a-huge-compressed-file-will-get-stored-in-HDFS-system-Is/m-p/140169#M27848</guid>
      <dc:creator>christian_proko</dc:creator>
      <dc:date>2016-05-10T20:48:06Z</dc:date>
    </item>
    <item>
      <title>Re: How is a huge compressed file stored in HDFS? Is the data distributed across different nodes, or does it get stored on a single node?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-a-huge-compressed-file-will-get-stored-in-HDFS-system-Is/m-p/140170#M27849</link>
      <description>&lt;P&gt;Here is a great writeup on file compression in Hadoop - &lt;A href="http://comphadoop.weebly.com/" target="_blank"&gt;http://comphadoop.weebly.com/&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 11 May 2016 00:11:43 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-a-huge-compressed-file-will-get-stored-in-HDFS-system-Is/m-p/140170#M27849</guid>
      <dc:creator>SQLShaw</dc:creator>
      <dc:date>2016-05-11T00:11:43Z</dc:date>
    </item>
  </channel>
</rss>

