Name Nodes and Data Nodes Directories and related questions

New Contributor

Excluding the OS disks (RAID 1), I have 3 other disks in each of the management, NameNode, and Secondary NameNode hosts, and 6 other disks in each of DataNodes 1-3. They are mounted as shown in the table below:


Host          Management Node   NameNode   Secondary NameNode
File system   XFS               XFS        XFS
Mount points  /grid/1           /grid/1    /grid/1
              /grid/2           /grid/2    /grid/2
              /grid/3           /grid/3    /grid/3

Host          DataNode 1   DataNode 2   DataNode 3
File system   XFS          XFS          XFS
Mount points  /grid/1      /grid/1      /grid/1
              /grid/2      /grid/2      /grid/2
              /grid/3      /grid/3      /grid/3
              /grid/4      /grid/4      /grid/4
              /grid/5      /grid/5      /grid/5
              /grid/6      /grid/6      /grid/6

Question1

Do I specify the following under the "Customize Services" section for HDFS?

NameNode Directories:

/grid/1/hadoop/hdfs/namenode, /grid/2/hadoop/hdfs/namenode, /grid/3/hadoop/hdfs/namenode

DataNode Directories:

/grid/1/hadoop/hdfs/data, /grid/2/hadoop/hdfs/data, /grid/3/hadoop/hdfs/data, /grid/4/hadoop/hdfs/data, /grid/5/hadoop/hdfs/data, /grid/6/hadoop/hdfs/data

Question2

Will the Ambari wizard configure the same namenode directories for the secondary namenode?

Question3

Is it OK for me to specify the following directories under the YARN section for all DataNodes:

For yarn.nodemanager.local-dirs:

/grid/1/hadoop/yarn/local, /grid/2/hadoop/yarn/local, /grid/3/hadoop/yarn/local, /grid/4/hadoop/yarn/local, /grid/5/hadoop/yarn/local, /grid/6/hadoop/yarn/local

For yarn.nodemanager.log-dirs:

/grid/1/hadoop/yarn/log, /grid/2/hadoop/yarn/log, /grid/3/hadoop/yarn/log, /grid/4/hadoop/yarn/log, /grid/5/hadoop/yarn/log, /grid/6/hadoop/yarn/log

Question4

If I have assigned the following services to the management node:

WebHCat Server

HiveServer2

Infra Solr Instance*

Grafana Metrics Collector*

Activity Explorer*

Activity Analyzer*

HST Server

NFS Gateway (Assign Slaves and Clients Section)

Client (Assign Slaves and Clients Section)

Any recommendations on how I should best utilize the 3 disks (i.e., /grid/1, /grid/2, /grid/3) in the management node?

Question5

I have the following services under namenode and secondary namenode:

NameNode                 SNameNode
ResourceManager          ZooKeeper Server*
History Server
ZooKeeper Server*
Spark2 History Server
Hive Metastore
App Timeline Server

Will the secondary namenode be assigned all the services of the primary namenode besides ZooKeeper? Are there any concerns with the above services being assigned to the namenode?

Re: Name Nodes and Data Nodes Directories and related questions

New Contributor

Question 1: You only really need 2 copies of the namenode directory for redundancy, but there is no harm in having a third. The data node directories look correct. Here is a good article on this: https://community.hortonworks.com/questions/39988/name-node-and-data-node-directories.html

Question 2: Yes. Keep in mind the functionality of the secondary namenode: https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Secondary_N... - It is very different from a Standby NameNode (which functions in an HA capacity and does require a separate directory structure).

Question 3: Yes, it is OK to spread them across multiple directories. Here is a good article on log management as well: https://hortonworks.com/blog/simplifying-user-logs-management-and-access-in-yarn/
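The directory lists from Questions 1 and 3 all follow the same pattern (one subdirectory per mount point), so here is a rough sketch of how they could be generated instead of typed by hand. The helper name `dir_list` is made up for illustration, and I'm assuming the Ambari fields map to the usual Hadoop property names (`dfs.namenode.name.dir`, `dfs.datanode.data.dir`, `yarn.nodemanager.local-dirs`, `yarn.nodemanager.log-dirs`); the disk counts come from the table in the question.

```python
def dir_list(mount_count, suffix, root="/grid"):
    """Build a comma-separated list: /grid/1/<suffix>,/grid/2/<suffix>,..."""
    return ",".join(f"{root}/{i}/{suffix}" for i in range(1, mount_count + 1))


# Assumption: 3 data disks on the NameNode host, 6 on each DataNode host.
hdfs_site = {
    "dfs.namenode.name.dir": dir_list(3, "hadoop/hdfs/namenode"),
    "dfs.datanode.data.dir": dir_list(6, "hadoop/hdfs/data"),
}
yarn_site = {
    "yarn.nodemanager.local-dirs": dir_list(6, "hadoop/yarn/local"),
    "yarn.nodemanager.log-dirs": dir_list(6, "hadoop/yarn/log"),
}

# Print each value so it can be pasted into the matching Ambari field.
for name, value in {**hdfs_site, **yarn_site}.items():
    print(f"{name}={value}")
```

The printed values can be pasted into the corresponding fields under "Customize Services"; nothing here talks to Ambari itself.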

Question 4: Are you saying you have 1 management node where you want to isolate only the Ambari-related functions? There is a small issue with that, but first, if that is the case, you would want to remove WebHCat and HiveServer2 from that node. You would probably want to remove the NFS Gateway as well (again, if you really are looking for isolation of the management node). And in relation to Question 5, you should distribute those services out to the other master nodes (keeping HiveServer2 close to the metastore is generally not a bad idea).

That being said, you really need a ZooKeeper quorum. Having 2 ZooKeepers goes against best practices (it can cause deadlock scenarios, since losing either server loses the majority), so you should go with either 1 (no redundancy) or 3. In the case of 3, you don't want to put the third one on a data node, so your option is to put it on the management node. In that case, you have lost your isolation of the management node, but without knowing more about your scenario, I'd suggest going with 3 ZooKeepers and putting the 3rd one on the Ambari/management node.

I don't think I can prescribe your disk layout for this server, but here are some considerations. Do you have enough space for the product, product upgrades, and log files on the OS pair, or do you need to shave off space onto one of the other drives? Depending on your ZooKeeper workload, it is a best practice to give ZooKeeper a separate disk. Are you using Log Search? It can be configured to use local disk, in which case it may benefit from its own drive. How are you configuring the AMS service? This can also benefit from having its own drive.
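To put numbers on the ZooKeeper point above: an ensemble keeps working only while a strict majority of its servers is up. A minimal sketch of that arithmetic (the function names are just for illustration, not part of any ZooKeeper API):

```python
def quorum_size(servers: int) -> int:
    """Smallest strict majority of an ensemble with `servers` members."""
    return servers // 2 + 1


def tolerated_failures(servers: int) -> int:
    """How many servers can be down while a majority is still reachable."""
    return servers - quorum_size(servers)


# Columns: ensemble size, quorum size, tolerated failures.
for n in (1, 2, 3, 5):
    print(n, quorum_size(n), tolerated_failures(n))
```

Two servers need both members up to form a majority, so they tolerate no more failures than a single server does, while three servers tolerate one failure. That is why the choice is between 1 and 3.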

Question 5: You should definitely consider redistributing the master services across the master nodes. As already mentioned, use the management node for the 3rd ZooKeeper. The node you are calling "NameNode" would have the primary NameNode, and you can also put the standby ResourceManager there. I would then lump the Hive services together there (HiveServer2, Hive Metastore). On the 2nd master node (which you are calling SNameNode), I would put the primary ResourceManager and the other YARN services (history servers, timeline server, WebHCat). This node could also have the SNameNode, although in this architecture that seems a little strange. You would really want this box to have a Standby NameNode (not the Secondary NameNode). In fact, it's probably better to have the Secondary NameNode with the primary and put a Standby NameNode on the second master node. None of these recommendations are set in stone; there are many ways to slice and dice this.

Re: Name Nodes and Data Nodes Directories and related questions

New Contributor

Thanks, Alex, for your detailed explanation. No, I'm not trying to isolate Ambari-related functions to the management node. I also left ZooKeeper out of the management server listing above. You are right, I have 3 ZooKeeper servers in my setup.

It is stated in the "Assign Masters" section that "*HiveServer2 and WebHCat Server will be hosted on the same host". WebHCat Server was automatically assigned to the management server in the Ambari setup wizard. It cannot be changed.

From your explanation, the secondary namenode need not be a passive standby for failover but can take over some services for load balancing during normal operation? A primary namenode can double up as a secondary namenode. Also, a standby namenode is not necessarily the secondary namenode. Master node and name node are not the same. Am I reading it correctly? How do I spell out which node should be my standby namenode? Is SNameNode in the "Assign Masters" section referring to the Standby NameNode or the Secondary NameNode?