Created 07-05-2016 04:26 AM
Let's assume we have a file of 300 MB and the HDFS block size is 128 MB, so the file is divided into 128 MB (B1), 128 MB (B2) and 44 MB (B3).
1. Where does this splitting of the huge file take place?
Many people say "the client splits the file", but what is a client actually?
Is it the HDFS client (if yes, can you give me the flow from an executed command like -put to the HDFS client to the NameNode and DataNodes), or some other external tool (if yes, an example)?
2. Does the client form 3 pipelines for each block to replicate, which run in parallel?
3. Will DN1, which received B1, start sending the data to DN2 before 128 MB of its block is full?
And if my third point is true, doesn't that contradict the replication principle of "get the complete block of data and then start replicating", rather than replicating as soon as we get chunks of the total block?
Can you also provide the possible reasons why the flow is not the other way around?
Created 07-05-2016 10:45 AM
1. Where does this splitting of the huge file take place?
A client is a (mostly) Java program using the HDFS FileSystem API to write a file to HDFS. This can be the hadoop command-line client or a program running in the cluster (like MapReduce, ...). In that case each mapper/reducer that writes to HDFS writes one file (you may have seen MapReduce output folders containing part-0000, part-0001 files; these are the files written by each mapper/reducer). MapReduce treats these folders as if they were one big file.
If you write a file with a client (let's say one TB), the file is written into the API and transparently chunked into 128 MB blocks by the API.
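To make the "client" concrete, here is a minimal sketch (my own illustration, not from the original post) of a standalone Java program using the FileSystem API; the path, loop count and buffer size are made up. An `hdfs dfs -put` does essentially the same thing: it opens an output stream and copies bytes into it, while the API and the NameNode take care of block boundaries and DataNode placement.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);          // this object is "the HDFS client"
        Path target = new Path("/tmp/bigfile.bin");    // hypothetical target path

        byte[] buffer = new byte[64 * 1024];           // dummy data, one packet's worth at a time
        try (FSDataOutputStream out = fs.create(target)) {
            for (int i = 0; i < 10_000; i++) {
                out.write(buffer);                     // the API cuts a new block every 128 MB
            }
        }
        fs.close();
    }
}
```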
2. Does the client form 3 pipelines for each block to replicate, which run in parallel?
No, it's a chain. The block is committed once it is persisted on the first node, but it is written in parallel to the other two nodes in a chain:
Client -> Node1 -> Node2 (different rack) -> Node3 (same rack as Node2)
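If you want to see where the chain placed the replicas, a sketch along these lines (reusing the hypothetical file path from above) asks the NameNode for the block locations after the write has finished; each block should report three hosts.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/tmp/bigfile.bin")); // hypothetical file
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (int i = 0; i < blocks.length; i++) {
            // once replication has finished, each block should list three DataNodes
            System.out.println("block " + i + ": " + String.join(", ", blocks[i].getHosts()));
        }
        fs.close();
    }
}
```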
3. Will DN1, which received B1, start sending the data to DN2 before 128 MB of its block is full?
Yes, the HDFS API writes in buffered packets, I think 64 KB or so, so every buffered packet is written through the pipeline at the same time.
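As far as I know that packet size is the client-side setting dfs.client-write-packet-size (64 KB by default); treat the exact property name and default here as quoted from memory. A quick sketch to check or change it:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class PacketSizeSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // default is 65536 bytes (64 KB) if nothing is set in hdfs-site.xml
        System.out.println("write packet size = "
                + conf.getInt("dfs.client-write-packet-size", 65536) + " bytes");
        conf.setInt("dfs.client-write-packet-size", 128 * 1024); // applies to streams opened afterwards
        FileSystem fs = FileSystem.get(conf);
        fs.close();
    }
}
```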
"doesn't that contradict the replication principle where "we will get the complete block of data and then start replicating" "
Never heard of that replication principle. And it is definitely not true in HDFS. A file doesn't even need three copies to be written successful a put operation is successful if it is persisted on ONE node. The namenode would make sure that the correct replication level is reached eventually.
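Both knobs involved here are plain configuration properties; the names and defaults in the sketch below are quoted from memory, so double-check them against hdfs-default.xml for your version.

```java
import org.apache.hadoop.conf.Configuration;

public class ReplicationDefaultsSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // target number of replicas for new files (default 3)
        System.out.println("dfs.replication = " + conf.getInt("dfs.replication", 3));
        // minimum replicas that must be persisted before a write is considered successful (default 1)
        System.out.println("dfs.namenode.replication.min = "
                + conf.getInt("dfs.namenode.replication.min", 1));
    }
}
```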
"Can you also provide the possible reasons why the flow is not the otherwise.
Because its faster? If you had to wait for three copies sequentially a put would take much longer to succeed. A lot in HDFS is effiiency
Created 07-05-2016 11:04 AM
Created 07-05-2016 11:16 AM
Not sure what you mean by chunk. Essentially a stream of data is piped into the HDFS write API. Every 128 MB a new block is created internally. Inside each block the buffer sends data whenever a network packet is full (64 KB or so).
So essentially:
A 1 GB file is written into the HDFS API
- Block1 is created on (ideally local) Node1, with copies on Node2 and Node3
- Data is streamed into it, in 64 KB chunks, from the client to Node1; whenever the datanode receives a 64 KB chunk it writes it to disk into the block, tells the client that the write was successful, and at the same time sends a copy to Node2
- Node2 writes the chunk to its replica of the block and sends the data to Node3
- Node3 writes the chunk to its block on disk
- The next 64 KB chunk is sent from the client to Node1 ...
- Once 128 MB is full, the next block is created.
The write is successful once the client has received notification from Node1 that it successfully wrote the last block.
If Node1 dies during the write, the client will rewrite the blocks on a different node.
...
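Here is a small sketch of that streaming behaviour (hypothetical path and sizes): the client writes packet-sized chunks, and hflush() pushes everything buffered so far down the Node1 -> Node2 -> Node3 pipeline and makes it visible to readers, long before the 128 MB block is full.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StreamingWriteSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        byte[] chunk = new byte[64 * 1024];                         // roughly one packet of data
        try (FSDataOutputStream out = fs.create(new Path("/tmp/stream.bin"))) { // hypothetical path
            for (int i = 0; i < 100; i++) {
                out.write(chunk);   // handed to the pipeline packet by packet
                out.hflush();       // flush the buffered packets out to the DataNodes now
            }
        }                           // close() finalizes the (partial) last block
        fs.close();
    }
}
```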
Created 07-05-2016 12:46 PM
Thank you for your quick and explanatory answers. Can you please clarify a few more doubts I have:
1) What is the reason behind storing the output of MapReduce in HDFS? Why can't we send it directly to the client or display it? What happens to the output files? Are they stored permanently or flushed after some time? If so, on what basis?
2) Will MapReduce run when we read the data from HDFS?
Created 07-05-2016 12:59 PM
1) Normally MapReduce reads and creates very large amounts of data. The framework is also parallel and failed tasks can be rerun, so until all tasks have finished you are not sure what the output is. You can obviously write a program that returns data to the caller directly, but this is not the norm. Hive, for example, writes files to a tmp dir and then HiveServer uses the HDFS client to read the results. In Pig you have the option to store (save in HDFS) or dump (show on screen) data, but I'm not sure if Pig also uses a tmp file here. In MapReduce you can do whatever you want.
2) MapReduce is used when you want to run computations in parallel on the cluster, so Pig/Hive utilize it. But you can also just read the data directly using the client; however, in that case you have a single-threaded read.
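As an example of the single-threaded "just read it with the client" option, a sketch like this (the output directory is hypothetical) lists the part-* files of a finished job and streams them back one after another:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadJobOutputSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // every mapper/reducer wrote its own part-NNNNN file into the output folder
        FileStatus[] parts = fs.globStatus(new Path("/user/joe/job-output/part-*"));
        if (parts != null) {
            for (FileStatus part : parts) {
                try (BufferedReader reader =
                         new BufferedReader(new InputStreamReader(fs.open(part.getPath())))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        System.out.println(line);   // or hand the results to the caller
                    }
                }
            }
        }
        fs.close();
    }
}
```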
Created 09-02-2016 10:10 AM
@Benjamin Leonhardi "but it is written in parallel to the other two nodes in a chain" Can you explain this? What do you mean by a chain? Are you saying it's sequential?