In the YARN framework, how does the ResourceManager respect the notion of data locality in its scheduling logic? Does the ResourceManager communicate with the NameNode, which holds the metadata for HDFS? If it does not need the HDFS metadata held in the NameNode, how could it possibly allocate containers on the nodes that hold the data required for the job in question?
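To make my mental model concrete, here is a toy sketch (the names are invented, not real YARN APIs) of how I imagine the RM could honour locality without ever talking to the NameNode: each container request already carries a list of preferred hosts, and the RM simply matches that list against the nodes it knows about.

```python
# Toy model only: the RM never reads HDFS metadata itself; it just matches
# the preferred-host list arriving in each container request against the
# nodes that have registered with it. All names here are made up.

def allocate(preferred_hosts, available_nodes):
    """Return the first preferred host that is available, else fall back to any node."""
    for host in preferred_hosts:
        if host in available_nodes:
            return host                      # node-local placement
    return next(iter(available_nodes))       # no preference matched: off-switch placement

nodes = {"node1", "node2", "node3"}
print(allocate(["node2", "node3"], nodes))   # a preferred host is available -> "node2"
```

Is this roughly right, i.e. the locality knowledge lives in the request, not in the RM?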
Thanks for your response, but there is still a link missing in my mind. Please allow me to restate the problem in the following terms:
Suppose I am running a non-MapReduce application that reads and processes a file already stored in HDFS, and I submit my job to the cluster. My questions are:
1) Does the client communicate with the ResourceManager (RM) or the NameNode (NN) at job submission time?
2) From which component does the client get the file metadata needed to calculate the input splits: from the RM or from the NN?
3) In your previous response, you said the input splits are calculated and written to a descriptor file in HDFS. I suppose it is the information in this file that the ApplicationMaster (AM) uses when requesting containers from the RM, so that the containers can be allocated to the nodes hosting the data. Can you confirm, please?
4) Are input splits always calculated by the client on the client side, or can this be configured to be done on the cluster?
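To clarify what I mean in question 3, here is a toy sketch (again, invented names, not the real Hadoop API) of how I picture client-side split calculation: the client fetches per-block locations from the NameNode (here just a plain list) and turns each block into a split annotated with the hosts holding a replica, which the AM would later forward to the RM as locality hints.

```python
# Toy model of client-side split calculation. In my understanding, the
# client asks the NameNode for block locations; here they are supplied as
# plain (offset, length, hosts) tuples instead.

def compute_splits(block_locations):
    """Turn each (offset, length, hosts) block record into a split descriptor."""
    return [{"offset": off, "length": ln, "hosts": hosts}
            for off, ln, hosts in block_locations]

blocks = [(0, 128, ["node1", "node2"]), (128, 128, ["node2", "node3"])]
splits = compute_splits(blocks)
print(splits[0]["hosts"])  # the locality hints the AM could pass on to the RM
```

Is this the role the descriptor file in HDFS plays, carrying these host lists from the client to the AM?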