
Job Submission in YARN - Resource Manager / Data locality

New Contributor



In the YARN framework, how does the Resource Manager respect data locality in its scheduling logic? Does the Resource Manager communicate with the NameNode, which holds the HDFS metadata? If it doesn't, how does it know to schedule containers on the nodes that hold the data required for the job in question? If it does not need the HDFS metadata held in the NameNode, how can it possibly allocate containers to the nodes that hold the data for the job?


Re: Job Submission in YARN - Resource Manager / Data locality

Master Guru
YARN and HDFS are independent systems. The YARN RM does not fetch any metadata from HDFS. In fact, at a high level YARN has no dependency on HDFS; you can run YARN on any distributed file system.

Here's roughly how it works, taking an MR2 application on YARN as an example:

The client fetches the input file metadata, decides which map tasks to run, and writes a descriptor file to HDFS whose entries carry a list of locations for each map task.
The client launches the MR AM.
The MR AM loads the descriptor file and asks YARN for container resources on a set of hosts X (as originally written by the client).
YARN does its best to grant a container on one of the requested hosts.
The MR AM runs its task in the granted container, which is how the container indirectly becomes 'data local'.
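
The flow above can be sketched as a small toy model. This is not the YARN API: `Split`, `ToyScheduler`, and `run_app_master` are hypothetical names invented for illustration, and the node names and slot counts are made up. The real scheduler also considers rack locality and may delay a request before relaxing it, which this sketch leaves out. The point it illustrates is that the scheduler only sees the host hints carried in each request and never consults HDFS itself.

```python
from dataclasses import dataclass


@dataclass
class Split:
    """One input split plus the hosts the client recorded for it (hypothetical)."""
    name: str
    hosts: list


@dataclass
class ToyScheduler:
    """Stand-in for the RM scheduler: it only tracks free container slots per host."""
    free_slots: dict

    def allocate(self, preferred_hosts):
        # Try the hosts named in the request first (the 'data local' case).
        for host in preferred_hosts:
            if self.free_slots.get(host, 0) > 0:
                self.free_slots[host] -= 1
                return host, True
        # Otherwise fall back to any host with spare capacity (locality relaxed).
        for host, slots in self.free_slots.items():
            if slots > 0:
                self.free_slots[host] -= 1
                return host, False
        # No capacity anywhere.
        return None, False


def run_app_master(splits, scheduler):
    """Request one container per split, passing along the client's host hints."""
    placements = {}
    for split in splits:
        placements[split.name] = scheduler.allocate(split.hosts)
    return placements


# Made-up descriptor contents, standing in for what a client might have written.
splits = [
    Split("split-0", ["node1", "node2"]),
    Split("split-1", ["node1", "node3"]),
    Split("split-2", ["node1"]),
]
scheduler = ToyScheduler({"node1": 1, "node2": 1, "node3": 1, "node4": 1})
placements = run_app_master(splits, scheduler)

for name, (host, is_local) in placements.items():
    print(name, host, "local" if is_local else "non-local")
```

In this run, split-2 asks for node1 only, but node1's single slot is already taken, so the toy scheduler places it on another node. That mirrors the "does its best" wording: a host preference is a hint, not a guarantee.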

Re: Job Submission in YARN - Resource Manager / Data locality

New Contributor

Hello Harsh,


Thanks for your response, but there is still a missing link in my mind. Please allow me to restate the problem in the following terms:


Suppose I am running a non-MapReduce application that reads and processes data from a file already in HDFS, and I submit my job to the cluster. My questions are:


1) Does the client communicate with the ResourceManager (RM) or the NameNode (NN) at job submission time?


2) From which component does the client get the file metadata it needs to calculate the input splits: the RM or the NN?


3) In your previous response, you said the input splits are calculated and written to a descriptor file in HDFS. I suppose it is the information in this file that the ApplicationMaster (AM) uses when requesting containers from the RM. In this way, the various containers can be allocated on the nodes hosting the data. Can you confirm, please?


4) Are input splits always calculated by the client on the client side, or can this be configured to be done on the cluster?


Many thanks