Created on 02-10-2017 09:08 PM - edited 09-16-2022 04:03 AM
I'm not so clear on the Hadoop setup. I understand how it works, but looking at the Yarn configuration raises one question. Let's say we have 5 servers and 3 of those 5 are used as data node servers. In this case, how many node managers should be installed and run?
Should we have a node manager on all 5 servers, or only on the 3 data node servers?
Created 02-10-2017 10:22 PM
@Shigeru Takehara, Theoretically you can have any number of node managers. It typically depends on your workload and the scale of resources you need.
I would suggest installing at least 3 node managers (preferably on the servers where the datanodes are running). This way, containers launched by those node managers can read the data locally.
However, you can choose to add more node managers on other nodes if you need more resources to run your applications.
Created 02-10-2017 10:29 PM
When would we need more resources? Typically at the time we add more data node servers? That makes sense to me. What I'm not so clear about is this: without adding anything to my 5-server example, what does it mean to install node managers on all 5 servers? Even if we install more node managers, that does not mean we increase the resources, because the number of data node servers stays at 3. On the other hand, if we install node managers on all 5 servers, might the Yarn configuration give us wrong resource information?
Thank you,
Created 02-10-2017 11:11 PM
@Shigeru Takehara, datanodes are part of HDFS and node managers are part of Yarn. Datanodes store data on HDFS, whereas node managers launch containers for Yarn.
que: What I'm not so clear about is this: without adding anything to my 5-server example, what does it mean to install node managers on all 5 servers?
ans: When you run node managers on all 5 servers, Yarn can launch containers on 5 nodes. There is no strict rule that datanodes and node managers have to be on the same host. Containers running on hosts where no datanode is installed will still run the application by reading the data over the network from the datanodes.
que: Even if we install more node managers, that does not mean we increase the resources, because the number of data node servers stays at 3.
ans: Do not confuse the resource definitions for Yarn and HDFS. Yarn is a framework for running applications on a cluster, and HDFS is a file system for storing data. For Yarn, "resource" means the memory and vcores available for containers, which determines how many containers can run on the cluster at a time. By adding node managers, you are adding capacity to Yarn. There will be no change in the datanodes or in HDFS capacity.
que: On the other hand, if we install node managers on all 5 servers, might the Yarn configuration give us wrong resource information?
ans: You can install node managers on all 5 nodes. The Yarn configuration will not complain. The configurations for HDFS (hdfs-site.xml) and Yarn (yarn-site.xml) are separate.
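To make the point about separate resource definitions concrete, here is a minimal Python sketch of the 5-server example. It is only a model, not anything Yarn itself runs: the host names, the 12000 MB and 4 vcores per node manager, and the helper functions are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    datanode: bool         # host stores HDFS blocks
    nodemanager: bool      # host can run Yarn containers
    nm_memory_mb: int = 0  # assumed yarn.nodemanager.resource.memory-mb on this host
    nm_vcores: int = 0     # assumed yarn.nodemanager.resource.cpu-vcores on this host


def build_cluster(nm_on_all):
    """Hypothetical 5-server cluster: servers 1-3 are datanodes."""
    hosts = []
    for i in range(1, 6):
        is_dn = i <= 3
        has_nm = nm_on_all or is_dn
        hosts.append(Host(f"server{i}", is_dn, has_nm,
                          12000 if has_nm else 0,
                          4 if has_nm else 0))
    return hosts


def yarn_capacity(hosts):
    """Yarn capacity is summed over node managers only."""
    nms = [h for h in hosts if h.nodemanager]
    return sum(h.nm_memory_mb for h in nms), sum(h.nm_vcores for h in nms)


def datanode_count(hosts):
    """HDFS storage nodes are counted independently of Yarn."""
    return sum(1 for h in hosts if h.datanode)


for nm_on_all in (False, True):
    cluster = build_cluster(nm_on_all)
    print(nm_on_all, yarn_capacity(cluster), datanode_count(cluster))
```

Under these assumed figures, moving from 3 to 5 node managers raises the Yarn total from 36000 MB and 12 vcores to 60000 MB and 20 vcores, while the datanode count stays at 3.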
Created 02-10-2017 11:24 PM
I understand Yarn and HDFS are separately defined.
One thing that confuses me is that Yarn's resource manager UI shows a max memory size and max vcores which, I guess, are not actual figures from the data node servers but come from the Yarn configuration? If Yarn does not have the actual hardware information, how can we know accurate figures for the current processing? I also wonder how the resource manager can allocate resources correctly. (Sorry about so many questions...)
Thank you,
Created 02-11-2017 12:44 AM
You are correct. Yarn reads the memory/vcore information from the Yarn configuration only. Typically, the admin is responsible for specifying correct memory/vcore values to Yarn. In a Hadoop cluster, a node is shared by multiple services such as the datanode, region server, etc. Suppose a node has 36GB of memory and runs 3 daemons: datanode, node manager and region server. An admin may choose to give all 3 services equal memory. In this case, the admin will need to update yarn-site.xml to have "yarn.nodemanager.resource.memory-mb=12000". The same goes for vcores. Refer to http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/ to understand how to configure Yarn memory correctly.
You can use Ambari to install the cluster. It has a feature called stack advisor which sets up the cluster with recommended configs. It queries the hosts, gathers the necessary data such as disk space, RAM, etc., and sets up the configuration depending on which daemons are configured on each host.
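The equal-split example above comes down to simple arithmetic, sketched here in Python. The function name and the optional `os_reserved_gb` parameter are hypothetical and not part of Yarn or Ambari; the 12000 in the answer reads as a rounded version of the exact figure.

```python
def equal_share_mb(node_memory_gb, daemon_count, os_reserved_gb=0):
    """Split a node's memory evenly between its daemons, in MB.

    os_reserved_gb is an assumed knob for memory held back for the OS.
    """
    usable_mb = (node_memory_gb - os_reserved_gb) * 1024
    return usable_mb // daemon_count


# 36GB node shared by datanode, node manager and region server
print(equal_share_mb(36, 3))  # 12288, roughly the 12000 used above
```

The result is a candidate value for yarn.nodemanager.resource.memory-mb on that node; in practice an admin would usually hold some memory back for the operating system before splitting.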
Created 02-11-2017 02:42 AM
Yes, I use Ambari. I have one more question to clear my head.
Assume I have nodes with at least 20GB of memory and 4 cores each, and each node has a node manager installed.
If I add one old computer that has 5GB of memory and 2 cores, and I install a node manager on it, should the Yarn per-node memory setting be reduced to at most 5GB? I assume so because the Yarn configuration applies to all node managers.
I am asking because in my cluster, in Ambari's Yarn memory settings, the max memory size for a node is the lowest value among all nodes that have a node manager installed. In other words, if we want to make full use of the resources, should we add nodes that have similar memory sizes and numbers of cores?
Thank you,
Created 02-11-2017 05:28 AM
@Shigeru Takehara, You can definitely have nodes with different memory/CPU in a cluster. You can have a machine with 5GB of memory and 2 cores as one of your nodes without changing the max to 5GB globally. You can set yarn.nodemanager.resource.memory-mb=20000 on the machines with 20GB of memory and yarn.nodemanager.resource.memory-mb=5000 on the machine with 5GB of memory.
You can also manage different configurations for different node managers using Ambari. The feature is called host config groups.
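As a rough illustration of the per-machine settings above, this Python sketch renders the yarn-site.xml property each group of hosts would carry. The group names `large-nodes` and `small-nodes` are made up for the example, and the script does not talk to Ambari; it only shows what the differing values would look like.

```python
import xml.etree.ElementTree as ET

# Hypothetical groups of hosts and their node manager memory in MB
CONFIG_GROUPS = {"large-nodes": 20000, "small-nodes": 5000}


def render_property(name, value):
    """Build one <property> element in the yarn-site.xml style."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(prop, encoding="unicode")


def render_group_overrides(groups):
    """Map each group name to its memory-mb property snippet."""
    return {
        group: render_property("yarn.nodemanager.resource.memory-mb", mb)
        for group, mb in groups.items()
    }


for group, xml_snippet in render_group_overrides(CONFIG_GROUPS).items():
    print(group, xml_snippet)
```

Each node manager reports its own value to the resource manager, so the 5GB machine simply contributes less capacity than the 20GB machines.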