Created on 05-10-2016 11:22 AM - edited 08-19-2019 01:23 AM
@Issaq Mohammad If the replication factor is not 1, then the data will be distributed across the different nodes. See the following details:
Created 05-10-2016 11:51 AM
In addition to what Neeraj said: the data will be cut into blocks and distributed, but perhaps more relevant, you will have a SINGLE mapper reading that file (and piecing it back together).
This is true for GZ, for example, which is a so-called "non-splittable" compression format. That means a map task cannot read a single block; it essentially needs to read the full file from the start.
So the rule of thumb is: if you have GZ-compressed files (which is perfectly fine and often used), make sure they are not big. Be aware that each of them will be read by a single map task. Depending on compression ratio and performance SLAs you want to stay below 128MB.
There are other "splittable" compression algorithms supported (mainly LZO) in case you cannot guarantee that. And some native formats like HBase HFiles, Hive ORC files, ... support compression inherently, mostly by compressing internal blocks or fields.
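If you want to see the effect yourself, here is a minimal PySpark sketch (the HDFS paths are made-up examples) that compares the number of input splits for a gzip file versus the same data stored uncompressed:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gz-split-check").getOrCreate()
sc = spark.sparkContext

# A .gz file is non-splittable: the whole file becomes one input split,
# so the RDD has exactly one partition (one map task) no matter how big it is.
gz_rdd = sc.textFile("hdfs:///data/big_file.gz")
print(gz_rdd.getNumPartitions())     # -> 1

# The same data stored uncompressed (or with a splittable codec) is cut into
# block-sized splits, so several map tasks can read it in parallel.
plain_rdd = sc.textFile("hdfs:///data/big_file.txt")
print(plain_rdd.getNumPartitions())  # -> roughly file_size / block_size
```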
Created 05-10-2016 05:11 PM
Here is a great writeup on file compression in Hadoop - http://comphadoop.weebly.com/
Created 05-10-2016 12:19 PM
Thanks all for the replies, appreciate it. Is it possible to use a single mapper to read the compressed file and apply a codec mechanism to distribute the data across nodes? Please let me know.
Created 05-10-2016 12:32 PM
Not exactly sure what you are trying to say with "codec mechanism". But if you mean transforming a single big GZ file into smaller gz files, or into uncompressed files, you would most likely use Pig.
To specify the number of writers you will need to force reducers.
And here are some tips on setting the number of reducers:
http://pig.apache.org/docs/r0.8.1/cookbook.html#Use+the+Parallel+Features
Instead of Pig you could also write a small MapReduce job; here you are more flexible at the price of a bit of coding. Or Spark might work too (see the sketch below). Or Hive using the DISTRIBUTE BY keyword.
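For the Spark route, a minimal PySpark sketch of the same idea (the paths and the partition count of 16 are just assumptions, pick what fits your data): one task reads the big gzip file, then the rows are shuffled out to several writers so you end up with many small gzip part files.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("split-big-gz").getOrCreate()

# One task reads the non-splittable .gz file ...
lines = spark.read.text("hdfs:///staging/big_input.gz")

# ... then repartition() plays the role of "forcing reducers": the rows are
# shuffled to 16 tasks and each task writes its own, much smaller, gzip part file.
(lines
 .repartition(16)
 .write
 .option("compression", "gzip")
 .text("hdfs:///staging/small_gz_parts"))
```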
Created 05-10-2016 12:46 PM
You can achieve this by reading the "non-splittable" compressed format in a single mapper and then distributing the data to multiple nodes using a reducer.
HDFS will store the data on multiple nodes even if the files are compressed (with a non-splittable or splittable codec). HDFS splits the compressed file based on the block size. When reading the file back in an MR job, the job will have a single mapper if the file is compressed with a non-splittable codec; otherwise (splittable codec) the MR job will have multiple mappers reading the data.
How the data is distributed:
Suppose you have a 1024MB compressed file and your Hadoop cluster has a block size of 128MB.
When you upload the compressed file to HDFS, it will be split into 8 blocks (128MB each) and distributed to different nodes of the cluster. HDFS takes care of which node receives which block, depending on cluster health, node health and HDFS balance.
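As a tiny illustration of that arithmetic (assuming the 128MB block size from the example above), in Python:

```python
import math

def hdfs_block_count(file_size_mb, block_size_mb=128):
    # HDFS chops the file into fixed-size blocks regardless of the codec;
    # only the last block may be smaller than block_size_mb.
    return math.ceil(file_size_mb / block_size_mb)

print(hdfs_block_count(1024))  # 1024MB / 128MB -> 8 blocks spread across the cluster
print(hdfs_block_count(1000))  # 7 full blocks + 1 partial block -> 8
```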
Created 05-10-2016 01:48 PM
Hello @Issaq Mohammad,
Here are some useful posts on file formats:
I hope that helps you to navigate the space a bit better.