
How is file storage done in HDFS? Please go through the details.

Expert Contributor

Let's assume we have a 300 MB file and an HDFS block size of 128 MB, so the file is divided into 128 MB (B1), 128 MB (B2), and 44 MB (B3).
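The split arithmetic described above can be sketched in a few lines (a hypothetical helper for illustration, not actual HDFS code):

```python
# Sketch of how a file is logically divided into HDFS blocks.
# Sizes are in MB, matching the example in the question.
BLOCK_SIZE_MB = 128

def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes of the blocks a file of the given size would occupy."""
    blocks = []
    remaining = file_size_mb
    while remaining > 0:
        # each block is full except possibly the last one
        blocks.append(min(block_size_mb, remaining))
        remaining -= block_size_mb
    return blocks

print(split_into_blocks(300))  # [128, 128, 44]
```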

1. Where does this splitting of the huge file take place?

Many people say "the client splits the file." What exactly is this client?

Is it the HDFS client (if yes, can you give me the flow from an executed command like -put, through the HDFS client, to the NameNode and DataNodes), or some other external tool (if yes, an example)?

2. Does the client form a pipeline of 3 replicas for each block, and do those pipelines run in parallel?

3. Will DN1, which receives B1, start sending the data to DN2 before its full 128 MB block has arrived?

And if my third point is true, doesn't that contradict the replication principle of "receive the complete block of data, then start replicating," rather than replicating chunks of the block as soon as they arrive?

Can you also give the possible reasons why the flow is not the other way around?

1 ACCEPTED SOLUTION

Master Guru
6 REPLIES


Expert Contributor

@Benjamin Leonhardi

I am satisfied with your answer. But for the second question, I am talking about each chunk the file is divided into, not about the replicas of a block.

Master Guru

I'm not sure what you mean by "chunk." Essentially, a stream of data is piped into the HDFS write API. Every 128 MB, a new block is created internally. Within each block, the buffer sends data whenever a network packet is full (64 KB or so).
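As an illustration of the sizes mentioned above (made-up constants for the sketch, not real HDFS client code), the number of packets needed to stream one full block works out as:

```python
# Illustrative constants: 128 MB blocks streamed as ~64 KB network packets.
BLOCK_SIZE = 128 * 1024 * 1024   # bytes in one full HDFS block
PACKET_SIZE = 64 * 1024          # bytes in one network packet

def packets_per_block(block_bytes):
    """Number of packets needed to stream a block of the given size."""
    return -(-block_bytes // PACKET_SIZE)  # ceiling division

print(packets_per_block(BLOCK_SIZE))  # 2048 packets for a full block
```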

So essentially:

A 1 GB file is written into the HDFS API.

- Block1 is created on (ideally the local) node1, with copies on node2 and node3.

- Data is streamed into it in 64 KB chunks from the client to node1; whenever the datanode receives a 64 KB chunk, it writes it to disk into the block, tells the client that the write was successful, and at the same time sends a copy to node2.

- node2 writes the chunk to its replica of the block and sends the data on to node3.

- node3 writes the chunk to its block on disk.

- The next 64 KB chunk is sent from the client to node1, and so on.

- Once 128 MB is full, the next block is created.

The write is successful once the client has received notification from node1 that it successfully wrote the last block.

If node1 dies during the write, the client will rewrite the blocks on a different node.
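The chained forwarding described in the steps above can be sketched as a toy simulation (node names and packet numbering are invented for illustration; the real pipeline also carries acknowledgements back up the chain):

```python
# Toy model of the HDFS write pipeline: each node in the chain stores the
# incoming packet locally, then forwards it to the next node downstream.
def forward(packet, chain, stored):
    """First node in the chain stores the packet and forwards the rest."""
    if not chain:
        return
    head, rest = chain[0], chain[1:]
    stored.setdefault(head, []).append(packet)  # write packet to local replica
    forward(packet, rest, stored)               # pipeline it to the next node

stored = {}
for packet_no in range(3):                      # three 64 KB chunks
    forward(packet_no, ["node1", "node2", "node3"], stored)

print(stored["node3"])  # [0, 1, 2] -- every replica received every packet
```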

...

Expert Contributor

@Benjamin Leonhardi

Thank you for your quick and explanatory answers. Can you please clarify a few more doubts I have:

1) What is the reason behind storing the output of MapReduce in HDFS? Why can't we directly send it to the client or display it? What happens to the output files? Are they stored permanently or flushed after some time, and if so, on what basis?

2) Will MapReduce run when we read data from HDFS?

Master Guru

1) Normally, MapReduce reads and creates very large amounts of data. The framework is also parallel, and failed tasks can be rerun, so until all tasks have finished you cannot be sure what the output is. You can obviously write a program that returns data to the caller directly, but this is not the norm. Hive, for example, writes files to a tmp dir, and then the HiveServer uses the HDFS client to read the results. In Pig you have the option to store (save in HDFS) or dump (show on screen) data, though I'm not sure whether Pig also uses a tmp file here. In MapReduce you can do whatever you want.

2) MapReduce is used when you want to run computations in parallel on the cluster, so Pig/Hive utilize it. But you can also just read the data directly using the client; in that case, however, you have a single-threaded read.

Rising Star

@Benjamin Leonhardi, regarding "but it is written in parallel to the other two nodes in a chain": can you explain this? What do you mean by a chain? Are you saying it is sequential?