Could you please tell me why we segregate the compute nodes from the storage nodes in the Hadoop world? Aren't we breaking the data-locality philosophy this way, since we then achieve only intra/inter-rack data locality rather than node-local data locality? And what hurdle did we face in the previous design of putting both (compute and data node) on the same node?
@Faizan123 We are not segregating the compute node and the data node. A compute node runs a NodeManager, while a DataNode is used for storage. When you submit a job, YARN will try to create the task containers on the nodes where the data is located. The term "node manager" or "compute node" refers to where YARN containers process the data; the term "data node" refers to where the data is stored. Both can run on a single node.
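To make the placement idea concrete, here is a minimal Python sketch of locality-preferring container placement. This is an illustration of the concept only, not YARN's actual scheduler code, and the hostnames are made up:

```python
# Hypothetical sketch: given the hosts holding a block's replicas and the
# hosts running NodeManagers, prefer a node-local placement, falling back
# to any available NodeManager when no co-located host exists.

def place_container(block_replica_hosts, nodemanager_hosts):
    """Return the host on which to launch the task container."""
    # Node-local: a NodeManager runs on a host that also stores a replica.
    for host in block_replica_hosts:
        if host in nodemanager_hosts:
            return host
    # Fallback: any NodeManager (the data must then be read over the network).
    return next(iter(nodemanager_hosts))

# Because DataNode and NodeManager share hosts, node-local placement wins:
replicas = ["node3", "node7", "node9"]
nms = {"node1", "node3", "node5"}
print(place_container(replicas, nms))  # node3 — the container runs where the data lives
```

The real YARN schedulers (Capacity/Fair) also consider rack-locality and may delay scheduling briefly to wait for a local slot, but the preference order is the same idea.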
Please let me know if you have any queries. Also mark "Accept as Solution" if my answer helps you!
NameNode [Master] and DataNode [Slave] are part of HDFS, the storage layer, while ResourceManager [Master] and NodeManager [Slave] are part of YARN, the resource negotiator. HDFS and YARN usually work together but are quite independent in design and architecture; their slave processes, a DataNode and a NodeManager, run together on the compute nodes.
Here is a high-level architecture of the RM and NM: the RM is the brain of YARN's resource management, and the NM is its per-node worker agent.
Below is a standard layout of a Hadoop cluster, though we could easily have added a second RM for HA.
On the 12 compute nodes, the NM and DN are co-located for localized processing.
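A tiny Python sketch of that layout may help. The hostnames here are assumptions for illustration, not taken from any real cluster: master daemons sit on a dedicated node, while every worker runs both the HDFS and YARN slave daemons side by side:

```python
# Illustrative cluster layout (hostnames are made up): one master node for
# the HDFS and YARN master daemons, twelve workers each running both slaves.

cluster = {
    "master1": ["NameNode", "ResourceManager"],
    **{f"worker{i}": ["DataNode", "NodeManager"] for i in range(1, 13)},
}

# Every worker hosts both slave daemons together:
workers = [h for h in cluster if h.startswith("worker")]
assert all({"DataNode", "NodeManager"} <= set(cluster[h]) for h in workers)
print(len(workers))  # 12 compute nodes, matching the layout described above
```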
It's illogical to separate the DN and NM onto different nodes. The NodeManager is YARN's per-node agent and takes care of the individual compute nodes in a Hadoop cluster: it updates the ResourceManager (RM) with the status of jobs running on the node, oversees the life-cycle management of containers, monitors the resource usage (memory, CPU) of individual containers, tracks node health, manages logs, and provides auxiliary services that may be exploited by different YARN applications.
DataNode is the name of the daemon that stores and manages data in a Hadoop cluster. File data is replicated across multiple DataNodes for reliability and so that localized computation can be executed near the data. That is why the DN and NM are co-located on the same VM/host.
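This is also where the intra/inter-rack distinction from the original question fits in. A short Python sketch, with an assumed host-to-rack mapping (names are illustrative only), classifies the three locality levels a read can fall into:

```python
# Hypothetical rack topology: which rack each host lives in.
RACKS = {
    "node1": "rack1", "node2": "rack1",
    "node3": "rack2", "node4": "rack2",
}

def locality(task_host, replica_hosts, racks=RACKS):
    """Classify how local a read is for a task running on task_host."""
    if task_host in replica_hosts:
        return "node-local"   # a replica is on the same host: no network read
    if any(racks[task_host] == racks[r] for r in replica_hosts):
        return "rack-local"   # a replica is on another host in the same rack
    return "off-rack"         # the read must cross the rack switch

print(locality("node1", ["node1", "node3"]))  # node-local
print(locality("node2", ["node1", "node4"]))  # rack-local
print(locality("node4", ["node1", "node2"]))  # off-rack
```

Because replicas are spread across racks and the NM runs on every DataNode host, most tasks land in the node-local case; rack-local and off-rack are the fallbacks, which is exactly the behavior the question was asking about.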
It could be very interesting to see a screenshot of the roles co-located with your data nodes.
Hope that gives you a clearer picture.
@Faizan123, has any of the replies helped resolve your issue? If so, can you kindly mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future?