
Role of edge node & worker node in file copying


I'm copying a file from a Unix server to HDFS. I believe the edge node acts as a gateway for ingesting data into HDFS. Say I have a 5 GB file that I'm trying to copy into HDFS. Where will the data be stored? I understand that it will end up on the data nodes, but before the entire file lands on a data node, is it placed in a staging/intermediate layer? Does the edge node hold that staging layer?


Master Mentor

@Bala Vignesh N V

Edge nodes are not designed to store data. The clients running on the edge nodes (such as the HDFS client) are responsible for performing operations like copying/putting files into HDFS: the metadata is stored on the NameNode, and the DataNodes actually store the data/content of the file.
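For illustration, here is a minimal sketch of that client-side copy using the Hadoop Java FileSystem API (the class name and paths below are placeholders, not anything from your setup):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CopyToHdfs {
        public static void main(String[] args) throws Exception {
            // Picks up fs.defaultFS from the core-site.xml on the edge node's classpath
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // The client streams the local file to the DataNodes;
            // the edge node does not keep a permanent copy.
            fs.copyFromLocalFile(new Path("file:///tmp/bigfile.dat"),   // placeholder local path
                                 new Path("/user/hadoop/bigfile.dat")); // placeholder HDFS path
            fs.close();
        }
    }

This does the same thing as running "hdfs dfs -put /tmp/bigfile.dat /user/hadoop/bigfile.dat" from the edge node.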


The following is an excerpt from an older version of the HDFS design doc: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html

A client request to create a file does not reach the NameNode immediately. In fact, initially the HDFS client caches the file data into a temporary local file. Application writes are transparently redirected to this temporary local file. When the local file accumulates data worth over one HDFS block size, the client contacts the NameNode. The NameNode inserts the file name into the file system hierarchy and allocates a data block for it. The NameNode responds to the client request with the identity of the DataNode and the destination data block. Then the client flushes the block of data from the local temporary file to the specified DataNode. When a file is closed, the remaining un-flushed data in the temporary local file is transferred to the DataNode. The client then tells the NameNode that the file is closed. At this point, the NameNode commits the file creation operation into a persistent store. If the NameNode dies before the file is closed, the file is lost.
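In code, the whole flow described above is hidden behind an ordinary output stream. A minimal sketch (the path and contents are placeholders):

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteSketch {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // create() registers the file with the NameNode; blocks are
            // allocated on DataNodes as the stream fills them.
            try (FSDataOutputStream out = fs.create(new Path("/user/hadoop/example.txt"))) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            } // close() flushes the remaining data and marks the file complete
            fs.close();
        }
    }

The application only sees the stream; the buffering and NameNode interaction described in the excerpt happen inside the client library.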


There is some good information available on edge nodes in the following links.

http://www.dummies.com/programming/big-data/hadoop/edge-nodes-in-hadoop-clusters/

https://dwbi.org/etl/bigdata/187-set-up-client-node-gateway-node-in-hadoop-cluster

See this post, which makes the good point of installing Hadoop binaries via Ambari so they are always up to date with the rest of the cluster: https://community.hortonworks.com/questions/39568/how-to-create-edge-node-for-kerberized-cluster.htm...

Expert Contributor

"Does the edge node hold that staging layer?"

1) The edge node normally has the Hadoop client installed. This HDFS client is responsible for copying/moving the data to the DataNodes, while the metadata is stored on the NameNode.

2) The HDFS client acts as the staging/intermediate layer between the DataNodes (DN) and the NameNode (NN).

3) Clusters normally keep the edge node, the master (NameNode) node, the data nodes, and the resource manager node separate.

Edge Node: holds the batch user IDs that are responsible for running batch jobs.

Data Node: contains the physical data of the Hadoop cluster.

Name Node: holds the metadata of the Hadoop cluster (see the sketch after this list).
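As a rough illustration of that metadata/data split, the client can ask the NameNode for a file's metadata and then list the DataNodes that actually hold each block (the file path is a placeholder):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WhereAreMyBlocks {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // Metadata (size, replication, block list) comes from the NameNode
            FileStatus status = fs.getFileStatus(new Path("/user/hadoop/bigfile.dat"));

            // Each block reports the DataNodes storing a replica of it
            for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println(block.getOffset() + " -> " + String.join(", ", block.getHosts()));
            }
            fs.close();
        }
    }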


"2) The HDFS client acts as the staging/intermediate layer between the DataNodes and the NameNode." --> Does this mean that whenever I copy a file from local to HDFS, the edge node acts as a staging layer via the HDFS client installed on it?

And that, in turn, the worker nodes don't have any role to play here? Is my understanding right?

Expert Contributor

2) "The HDFS client acts as the staging/intermediate layer between the DataNodes and the NameNode." As the design doc quoted above explains: the client contacts the NameNode, the NameNode inserts the file name into the file system hierarchy and allocates a data block for it, and the NameNode then responds to the client request with the identity of the DataNode and the destination data block.

3) "In turn the worker nodes don't have any role to play here. Is my understanding right?" No. The actual tasks are done by the worker nodes, as the jobs are assigned to them by the Resource Manager. Job workflow: HDFS client -> NameNode -> Resource Manager -> worker/data nodes; once all MR tasks complete, the DataNodes hold the actual data and the metadata is stored on the NameNode.
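A skeleton of that workflow from the client side, assuming a plain MapReduce job (the job name and paths are placeholders; a real job would also set mapper/reducer classes):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SubmitJobSketch {
        public static void main(String[] args) throws Exception {
            // The client (typically on the edge node) only prepares and submits the job
            Job job = Job.getInstance(new Configuration(), "example-job");
            job.setJarByClass(SubmitJobSketch.class);
            FileInputFormat.addInputPath(job, new Path("/user/hadoop/input"));    // placeholder
            FileOutputFormat.setOutputPath(job, new Path("/user/hadoop/output")); // placeholder

            // The Resource Manager schedules the actual tasks on the worker/data nodes
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }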
