
Trying to build a new HA cluster setup

Explorer

Hi Team,

I have been trying to build an HA cluster setup, but could not get it working. Every time, it fails while starting the ZKFC service, and I am not sure what went wrong.

This is what shows up when I try to start the ZKFC controller after starting the JournalNode daemons:

```
17/08/25 04:48:41 INFO zookeeper.ZooKeeper: Initiating client connection, connectString= master1:2181,master2:2181:slave1:2181 sessionTimeout=5000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@1417e278
17/08/25 04:48:51 FATAL tools.DFSZKFailoverController: Got a fatal error, exiting now
java.net.UnknownHostException: master1
at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:922)
at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1316)
at java.net.InetAddress.getAllByName0(InetAddress.java:1269)
at java.net.InetAddress.getAllByName(InetAddress.java:1185)
at java.net.InetAddress.getAllByName(InetAddress.java:1119)
at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61)
at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:380)
at org.apache.hadoop.ha.ActiveStandbyElector.getNewZooKeeper(ActiveStandbyElector.java:628)
at org.apache.hadoop.ha.ActiveStandbyElector.createConnection(ActiveStandbyElector.java:767)
at org.apache.hadoop.ha.ActiveStandbyElector.<init>(ActiveStandbyElector.java:227)
at org.apache.hadoop.ha.ZKFailoverController.initZK(ZKFailoverController.java:350)
at org.apache.hadoop.ha.ZKFailoverController.doRun(ZKFailoverController.java:191)
at org.apache.hadoop.ha.ZKFailoverController.access$000(ZKFailoverController.java:61)
at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:172)
at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:168)
at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:412)
at org.apache.hadoop.ha.ZKFailoverController.run(ZKFailoverController.java:168)
at org.apache.hadoop.hdfs.tools.DFSZKFailoverController.main(DFSZKFailoverController.java:181)
root@master1:~#
```

Thanks
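
For reference, the usual manual bootstrap order for quorum-journal HA is roughly the following, assuming a Hadoop 2.x tarball layout (paths are relative to the Hadoop install directory):

```
# On each JournalNode host
sbin/hadoop-daemon.sh start journalnode

# On the first NameNode: format the ZooKeeper state, then the NameNode itself
bin/hdfs zkfc -formatZK
bin/hdfs namenode -format
sbin/hadoop-daemon.sh start namenode

# On the second NameNode: copy the freshly formatted state over
bin/hdfs namenode -bootstrapStandby
sbin/hadoop-daemon.sh start namenode

# On both NameNodes: start the failover controllers
sbin/hadoop-daemon.sh start zkfc
```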

1 ACCEPTED SOLUTION

Super Mentor

@Kasim Shaik

The following error indicates that you might not have configured the FQDN properly in your cluster:

```
java.net.UnknownHostException: master1
```

Can you please check whether the "hostname -f" command actually returns the desired FQDN?

Example:

```
root@master1:~# hostname -f
```

https://docs.hortonworks.com/HDPDocuments/Ambari-2.5.1.0/bk_ambari-installation-ppc/content/set_the_...

Every node in your cluster should be able to resolve every other node by its FQDN.
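
If it does not, a minimal sketch of setting this up (the hostnames, domain, and IPs below are placeholders; adjust to your environment):

```
# Set the FQDN on each node (hypothetical name shown)
hostnamectl set-hostname master1.example.com

# Make sure every node can resolve every other node, e.g. via /etc/hosts
# (IPs and names below are placeholders)
cat >> /etc/hosts <<'EOF'
10.104.10.1   master1.example.com  master1
10.104.10.2   master2.example.com  master2
10.104.10.3   slave1.example.com   slave1
EOF

# Verify
hostname -f
```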


14 REPLIES


Explorer

Hi Jay,

Thanks for the reply.

I replaced the hostname with the FQDN and ran the same command; it worked successfully. However, I ran into another problem: after formatting ZKFC, I ran the "hdfs namenode -format" command and hit the error below.

```

17/08/25 05:43:09 INFO common.Storage: Storage directory /home/kasim/journal/tmp/dfs/name has been successfully formatted.
17/08/25 05:43:09 WARN namenode.NameNode: Encountered exception during format:
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Could not format one or more JournalNodes. 1 exceptions thrown:
10.104.10.16:8485: Cannot create directory /home/kasim/dfs/jn/ha-cluster/current
at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.clearDirectory(Storage.java:337)
at org.apache.hadoop.hdfs.qjournal.server.JNStorage.format(JNStorage.java:190)
at org.apache.hadoop.hdfs.qjournal.server.Journal.format(Journal.java:217)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.format(JournalNodeRpcServer.java:141)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.format(QJournalProtocolServerSideTranslatorPB.java:145)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25419)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)

at org.apache.hadoop.hdfs.qjournal.client.QuorumException.create(QuorumException.java:81)
at org.apache.hadoop.hdfs.qjournal.client.QuorumCall.rethrowException(QuorumCall.java:223)
at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.format(QuorumJournalManager.java:214)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.formatNonFileJournals(FSEditLog.java:392)
at org.apache.hadoop.hdfs.server.namenode.FSImage.format(FSImage.java:162)
at org.apache.hadoop.hdfs.server.namenode.NameNode.format(NameNode.java:992)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1434)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1559)
17/08/25 05:43:09 ERROR namenode.NameNode: Failed to start namenode.
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Could not format one or more JournalNodes. 1 exceptions thrown:
10.104.10.16:8485: Cannot create directory /home/kasim/dfs/jn/ha-cluster/current
at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.clearDirectory(Storage.java:337)
at org.apache.hadoop.hdfs.qjournal.server.JNStorage.format(JNStorage.java:190)
at org.apache.hadoop.hdfs.qjournal.server.Journal.format(Journal.java:217)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.format(JournalNodeRpcServer.java:141)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.format(QJournalProtocolServerSideTranslatorPB.java:145)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25419)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)

at org.apache.hadoop.hdfs.qjournal.client.QuorumException.create(QuorumException.java:81)
at org.apache.hadoop.hdfs.qjournal.client.QuorumCall.rethrowException(QuorumCall.java:223)
at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.format(QuorumJournalManager.java:214)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.formatNonFileJournals(FSEditLog.java:392)
at org.apache.hadoop.hdfs.server.namenode.FSImage.format(FSImage.java:162)
at org.apache.hadoop.hdfs.server.namenode.NameNode.format(NameNode.java:992)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1434)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1559)
17/08/25 05:43:09 INFO util.ExitUtil: Exiting with status 1
17/08/25 05:43:09 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at odc-c-01.prc.eucalyptus-systems.com/10.104.10.1
************************************************************/

```

I checked the folder structure; it has already been created.

```

/home/kasim/dfs/jn/ha-cluster/current

[root@odc-c-01 name]# cd current/
[root@odc-c-01 current]# ls
seen_txid VERSION
[root@odc-c-01 current]# pwd
/home/kasim/journal/tmp/dfs/name/current
[root@odc-c-01 current]#

```

Thanks,

Super Mentor

@Kasim Shaik

The error is:

```
WARN namenode.NameNode: Encountered exception during format: org.apache.hadoop.hdfs.qjournal.client.QuorumException: Could not format one or more JournalNodes. 1 exceptions thrown:
10.104.10.16:8485: Cannot create directory /home/kasim/dfs/jn/ha-cluster/current
```

Please check the permissions on the directory. The user who is running the NameNode format should be able to write to that directory:

```
# ls -ld /home/kasim/dfs/
# ls -ld /home/kasim/dfs/jn
# ls -ld /home/kasim/dfs/jn/ha-cluster
# ls -ld /home/kasim/dfs/jn/ha-cluster/current
# ls -lart /home/kasim/dfs/jn/ha-cluster/current
```
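
If the ownership or mode turns out to be wrong, a minimal sketch of the fix (the hdfs:hadoop owner is an assumption; substitute whichever user actually runs your NameNode and JournalNodes):

```
# The owner below is hypothetical -- use the user that runs your HDFS daemons
chown -R hdfs:hadoop /home/kasim/dfs/jn
chmod -R 755 /home/kasim/dfs/jn
```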

Explorer

@Jay SenSharma

The folder is created on a filer. I am running as the root user, and "root" has all privileges on that folder.

```

[root@odc-c-01 kasim]# ls -ld /home/kasim/dfs/
drwxr-xr-x 3 nobody nobody 4096 Aug 23 03:31 /home/kasim/dfs/
[root@odc-c-01 kasim]# ls -ld /home/kasim/dfs/jn
drwxr-xr-x 3 nobody nobody 4096 Aug 25 05:43 /home/kasim/dfs/jn
[root@odc-c-01 kasim]# ls -ld /home/kasim/dfs/jn/ha-cluster
drwxr-xr-x 3 nobody nobody 4096 Aug 25 05:59 /home/kasim/dfs/jn/ha-cluster
[root@odc-c-01 kasim]# ls -ld /home/kasim/dfs/jn/ha-cluster/current
drwxr-xr-x 3 nobody nobody 4096 Aug 25 05:59 /home/kasim/dfs/jn/ha-cluster/current
[root@odc-c-01 kasim]# ls -lart /home/kasim/dfs/jn/ha-cluster/current
total 16
drwxr-xr-x 3 nobody nobody 4096 Aug 25 05:59 ..
-rwxr-xr-x 1 nobody nobody 154 Aug 25 05:59 VERSION
drwxr-xr-x 2 nobody nobody 4096 Aug 25 05:59 paxos
drwxr-xr-x 3 nobody nobody 4096 Aug 25 05:59 .
[root@odc-c-01 kasim]#

```

Mentor

@Kasim Shaik

Do you know how to use blueprints? I could help you with that, to deploy without any fuss!

Explorer

Yes, please.

Mentor

@Kasim Shaik

Can you tell me the number of master nodes, datanodes, and edge nodes you want in your cluster?

Explorer

I have a total of 6 machines in my setup: one for the active NameNode, one for the standby NameNode, one for the ResourceManager, and the remaining 3 machines for datanodes. My question is whether the dfs.journalnode.edits.dir location should be a remote shared directory, or whether it can be on the local filesystem with a uniform directory structure across all JournalNodes.
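
For reference, with the quorum journal manager each JournalNode writes to its own local directory, so a local path that exists on every JournalNode host is the usual layout. A minimal hdfs-site.xml sketch (the path below is illustrative):

```
<property>
  <name>dfs.journalnode.edits.dir</name>
  <!-- Local directory on each JournalNode host; the path is illustrative -->
  <value>/hadoop/hdfs/journal</value>
</property>
```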

Mentor

@Kasim Shaik

With 6 machines you could have

2 master nodes for HDFS HA
1 edge node with clients/Ambari server
3 data nodes

What version of HDP? Will you use MySQL for Hive/Ranger/Oozie?

Is that fine for you?

Mentor

@Kasim Shaik

In cluster_config.json, change the following:

Stack_version to match your version, e.g. "2.x"

In hostmap.json, change masterx, datanodex, or ambari-server to match the FQDNs of your machines.

Make sure you have internal repos matching the entries in repo.json and dputil-repo.json.

In cli.txt, change "ambari-server" to match the FQDN of your Ambari server, and launch the commands in that order (see the sketch below).

Remember to rename the *.json.txt files to *.json, as HCC doesn't accept .json uploads.
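
For context, once the JSON files are edited, a blueprint deployment boils down to two REST calls against the Ambari server. A minimal sketch (the server FQDN, credentials, and the "ha-cluster" name are placeholders; the files are the ones referenced above):

```
# Register the blueprint (FQDN, credentials, and cluster name are placeholders)
curl -u admin:admin -H "X-Requested-By: ambari" -X POST \
  -d @cluster_config.json \
  http://ambari-server.example.com:8080/api/v1/blueprints/ha-cluster

# Create the cluster from the blueprint using the host mapping
curl -u admin:admin -H "X-Requested-By: ambari" -X POST \
  -d @hostmap.json \
  http://ambari-server.example.com:8080/api/v1/clusters/ha-cluster
```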

Explorer

Hi @Jay SenSharma,

Finally, I was able to configure the HA cluster successfully. Failover happens when I trigger it with the "hdfs haadmin -failover" command. However, I noticed the fsimage and edit log files exist on only one server.

```
[root@odc-c-01 current]# hdfs haadmin -getServiceState odc-c-01
standby
[root@odc-c-01 current]# hdfs haadmin -getServiceState odc-c-16
active
[root@odc-c-01 current]#
```

```
<property>
  <name>hadoop.tmp.dir</name>
  <value>/shared/kasim/journal/tmp</value>
</property>
<property>
  <name>dfs.journalnode.edits.dir</name>
  <value>/shared/kasim/dfs/jn</value>
</property>
</configuration>
```

```
[root@odc-c-01 current]# ls fsimage_0000000000000003698
fsimage_0000000000000003698
[root@odc-c-01 current]#
[root@odc-c-01 current]# pwd
/shared/kasim/journal/tmp/dfs/name/current
[root@odc-c-01 current]#
```

I still do not understand why it writes the fsimage and edit log information to only one server, and to a different directory that I have not mentioned in "dfs.journalnode.edits.dir". Could you shed some light on that part?

Thanks,

Mentor

@Kasim Shaik

Can you paste a screenshot of the directories below?

Ambari UI --> HDFS --> Configs --> NameNode directories

If you have ONLY one directory path, then that explains why you have only one copy.
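
As a side note on the directory question above: if dfs.namenode.name.dir is not set explicitly, it defaults to file://${hadoop.tmp.dir}/dfs/name, which would account for the fsimage landing under /shared/kasim/journal/tmp/dfs/name/current; dfs.journalnode.edits.dir only controls where the JournalNodes keep edits, not where the NameNode writes its fsimage. A minimal sketch of setting it explicitly with redundant copies (the paths are illustrative):

```
<property>
  <name>dfs.namenode.name.dir</name>
  <!-- Comma-separated list; the NameNode writes a full copy of its metadata to each (paths illustrative) -->
  <value>/data/1/hdfs/namenode,/data/2/hdfs/namenode</value>
</property>
```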

Explorer

@Geoffrey Shelton Okot

I have had used apache hadoop tar file to configure HA, not amabari GUI

Thanks,

Mentor

@Kasim Shaik

It doesn't matter whether you used the tarball or the blueprint I sent you. After the installation, how are you managing your cluster? By Ambari, I guess, no?

Just check how many directories are listed under Ambari UI --> HDFS --> Configs --> NameNode directories.
