Created 08-25-2017 11:59 AM
Hi Team,
I have been trying to build an HA cluster but could not set it up properly. Every time, it fails while starting the ZKFC service, and I am not sure where I went wrong.
This is what shows up when I tried to start the ZKFC controller after starting the JournalNodes:
17/08/25 04:48:41 INFO zookeeper.ZooKeeper: Initiating client connection, connectString= master1:2181,master2:2181:slave1:2181 sessionTimeout=5000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@1417e278
17/08/25 04:48:51 FATAL tools.DFSZKFailoverController: Got a fatal error, exiting now
java.net.UnknownHostException: master1
at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:922)
at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1316)
at java.net.InetAddress.getAllByName0(InetAddress.java:1269)
at java.net.InetAddress.getAllByName(InetAddress.java:1185)
at java.net.InetAddress.getAllByName(InetAddress.java:1119)
at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61)
at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:380)
at org.apache.hadoop.ha.ActiveStandbyElector.getNewZooKeeper(ActiveStandbyElector.java:628)
at org.apache.hadoop.ha.ActiveStandbyElector.createConnection(ActiveStandbyElector.java:767)
at org.apache.hadoop.ha.ActiveStandbyElector.<init>(ActiveStandbyElector.java:227)
at org.apache.hadoop.ha.ZKFailoverController.initZK(ZKFailoverController.java:350)
at org.apache.hadoop.ha.ZKFailoverController.doRun(ZKFailoverController.java:191)
at org.apache.hadoop.ha.ZKFailoverController.access$000(ZKFailoverController.java:61)
at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:172)
at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:168)
at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:412)
at org.apache.hadoop.ha.ZKFailoverController.run(ZKFailoverController.java:168)
at org.apache.hadoop.hdfs.tools.DFSZKFailoverController.main(DFSZKFailoverController.java:181)
root@master1:~#
Thanks
Created 08-25-2017 12:27 PM
The following error indicates that you might not have configured the FQDN properly in your cluster:
java.net.UnknownHostException: master1
Can you please check whether the "hostname -f" command actually returns the desired FQDN?
Example:
root@master1:~# hostname -f
Every node of your cluster should be able to resolve every other node by its FQDN.
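On a small cluster, the usual way to guarantee that is identical static entries in /etc/hosts on every node; a minimal sketch, where the IPs and the example.com domain are hypothetical stand-ins for your own:
```
# /etc/hosts -- keep identical on all nodes (addresses are examples only)
192.168.1.11  master1.example.com  master1
192.168.1.12  master2.example.com  master2
192.168.1.13  slave1.example.com   slave1
```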
Created 08-25-2017 12:51 PM
Hi Jay,
Thanks for the reply.
I replaced the hostnames with FQDNs and re-ran the same command; it worked successfully. However, I ran into another problem: after formatting ZKFC, I ran the "hdfs namenode -format" command and hit the following.
```
17/08/25 05:43:09 INFO common.Storage: Storage directory /home/kasim/journal/tmp/dfs/name has been successfully formatted.
17/08/25 05:43:09 WARN namenode.NameNode: Encountered exception during format:
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Could not format one or more JournalNodes. 1 exceptions thrown:
10.104.10.16:8485: Cannot create directory /home/kasim/dfs/jn/ha-cluster/current
at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.clearDirectory(Storage.java:337)
at org.apache.hadoop.hdfs.qjournal.server.JNStorage.format(JNStorage.java:190)
at org.apache.hadoop.hdfs.qjournal.server.Journal.format(Journal.java:217)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.format(JournalNodeRpcServer.java:141)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.format(QJournalProtocolServerSideTranslatorPB.java:145)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25419)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
at org.apache.hadoop.hdfs.qjournal.client.QuorumException.create(QuorumException.java:81)
at org.apache.hadoop.hdfs.qjournal.client.QuorumCall.rethrowException(QuorumCall.java:223)
at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.format(QuorumJournalManager.java:214)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.formatNonFileJournals(FSEditLog.java:392)
at org.apache.hadoop.hdfs.server.namenode.FSImage.format(FSImage.java:162)
at org.apache.hadoop.hdfs.server.namenode.NameNode.format(NameNode.java:992)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1434)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1559)
17/08/25 05:43:09 ERROR namenode.NameNode: Failed to start namenode.
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Could not format one or more JournalNodes. 1 exceptions thrown:
10.104.10.16:8485: Cannot create directory /home/kasim/dfs/jn/ha-cluster/current
at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.clearDirectory(Storage.java:337)
at org.apache.hadoop.hdfs.qjournal.server.JNStorage.format(JNStorage.java:190)
at org.apache.hadoop.hdfs.qjournal.server.Journal.format(Journal.java:217)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.format(JournalNodeRpcServer.java:141)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.format(QJournalProtocolServerSideTranslatorPB.java:145)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25419)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
at org.apache.hadoop.hdfs.qjournal.client.QuorumException.create(QuorumException.java:81)
at org.apache.hadoop.hdfs.qjournal.client.QuorumCall.rethrowException(QuorumCall.java:223)
at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.format(QuorumJournalManager.java:214)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.formatNonFileJournals(FSEditLog.java:392)
at org.apache.hadoop.hdfs.server.namenode.FSImage.format(FSImage.java:162)
at org.apache.hadoop.hdfs.server.namenode.NameNode.format(NameNode.java:992)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1434)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1559)
17/08/25 05:43:09 INFO util.ExitUtil: Exiting with status 1
17/08/25 05:43:09 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at odc-c-01.prc.eucalyptus-systems.com/10.104.10.1
************************************************************/
```
I checked the folder structure; it had already been created:
```
/home/kasim/dfs/jn/ha-cluster/current
[root@odc-c-01 name]# cd current/
[root@odc-c-01 current]# ls
seen_txid VERSION
[root@odc-c-01 current]# pwd
/home/kasim/journal/tmp/dfs/name/current
[root@odc-c-01 current]#
```
Thanks,
Created 08-25-2017 12:56 PM
The error is:
WARN namenode.NameNode: Encountered exception during format: org.apache.hadoop.hdfs.qjournal.client.QuorumException: Could not format one or more JournalNodes. 1 exceptions thrown: 10.104.10.16:8485: Cannot create directory /home/kasim/dfs/jn/ha-cluster/current
- Please check the permissions on the directory; the user who is running the NameNode format should be able to write to it:
# ls -ld /home/kasim/dfs/
# ls -ld /home/kasim/dfs/jn
# ls -ld /home/kasim/dfs/jn/ha-cluster
# ls -ld /home/kasim/dfs/jn/ha-cluster/current
# ls -lart /home/kasim/dfs/jn/ha-cluster/current
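If those listings show the wrong owner, a minimal fix sketch; the user and group below are placeholders for whichever account actually runs the JournalNode and the format. Note that on an NFS export with root_squash, root is mapped to nobody, so ownership may have to be corrected from the filer side instead:
```
# placeholders -- substitute the account that runs the JournalNode
chown -R hdfs:hadoop /home/kasim/dfs/jn
chmod -R 755 /home/kasim/dfs/jn
```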
Created 08-25-2017 01:03 PM
The folder is created on a filer. I am running as the root user, and "root" has all privileges on that folder.
```
[root@odc-c-01 kasim]# ls -ld /home/kasim/dfs/
drwxr-xr-x 3 nobody nobody 4096 Aug 23 03:31 /home/kasim/dfs/
[root@odc-c-01 kasim]# ls -ld /home/kasim/dfs/jn
drwxr-xr-x 3 nobody nobody 4096 Aug 25 05:43 /home/kasim/dfs/jn
[root@odc-c-01 kasim]# ls -ld /home/kasim/dfs/jn/ha-cluster
drwxr-xr-x 3 nobody nobody 4096 Aug 25 05:59 /home/kasim/dfs/jn/ha-cluster
[root@odc-c-01 kasim]# ls -ld /home/kasim/dfs/jn/ha-cluster/current
drwxr-xr-x 3 nobody nobody 4096 Aug 25 05:59 /home/kasim/dfs/jn/ha-cluster/current
[root@odc-c-01 kasim]# ls -lart /home/kasim/dfs/jn/ha-cluster/current
total 16
drwxr-xr-x 3 nobody nobody 4096 Aug 25 05:59 ..
-rwxr-xr-x 1 nobody nobody 154 Aug 25 05:59 VERSION
drwxr-xr-x 2 nobody nobody 4096 Aug 25 05:59 paxos
drwxr-xr-x 3 nobody nobody 4096 Aug 25 05:59 .
[root@odc-c-01 kasim]#
```
Created 08-25-2017 01:15 PM
Do you know how to use blueprints? I could help you with that, to deploy without any fuss!
Created 08-25-2017 01:17 PM
Yes, please.
Created 08-25-2017 02:11 PM
Can you tell me the number of master nodes, data nodes, and edge nodes you want in your cluster?
Created 08-25-2017 02:17 PM
I have a total of 6 machines in my setup: one for the active NameNode, one for the standby NameNode, one for the ResourceManager, and the remaining 3 machines for DataNodes. My question is: should the dfs.journalnode.edits.dir location be a remote shared directory, or can it be on the local filesystem with a uniform directory structure across all JournalNodes?
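For what it's worth, with the Quorum Journal Manager each JournalNode keeps its own copy of the edits on its own local disk, so a local path that is identical on every JournalNode is the usual choice; a shared remote directory is not required. A minimal hdfs-site.xml sketch, with an example path:
```
<property>
  <name>dfs.journalnode.edits.dir</name>
  <!-- local directory on each JournalNode; same path on all of them -->
  <value>/data/hadoop/dfs/jn</value>
</property>
```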
Created 08-25-2017 02:30 PM
With 6 machines you could have:
- 2 master nodes for HDFS HA
- 1 edge node with clients/Ambari server
- 3 data nodes
What version of HDP? Will you use MySQL for Hive/Ranger/Oozie?
Is that fine for you?
Created 08-25-2017 02:48 PM
In cluster_config.json, change Stack_version to match your version ("2.x").
In hostmap.json, change masterx, datanodex, and ambari-server to match the FQDNs of the machines.
Make sure you have internal repos matching the entries in repo.json and dputil-repo.json.
In cli.txt, change "ambari-server" to match the FQDN of your Ambari server, and launch them in that order.
Remember to rename the *.json.txt files to *.json, as HCC doesn't accept the .json file type for uploads.
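For reference, a blueprint deployment typically comes down to two REST calls against the Ambari server. A sketch only, assuming the default admin:admin login, a host named ambari-server, a blueprint name of ha-cluster, and that cluster_config.json holds the blueprint while hostmap.json holds the cluster-creation template (the exact roles depend on the attached files):
```
# Register the blueprint
curl -u admin:admin -H "X-Requested-By: ambari" -X POST \
  -d @cluster_config.json http://ambari-server:8080/api/v1/blueprints/ha-cluster
# Create the cluster from it, mapping host groups to FQDNs
curl -u admin:admin -H "X-Requested-By: ambari" -X POST \
  -d @hostmap.json http://ambari-server:8080/api/v1/clusters/ha-cluster
```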
Created 08-28-2017 08:03 AM
Finally, I was able to configure the HA cluster successfully. Failover happens when I trigger it with the "hdfs haadmin -failover" command. However, I noticed the fsimage & edit log files are on only one server.
[root@odc-c-01 current]# hdfs haadmin -getServiceState odc-c-01
standby
[root@odc-c-01 current]# hdfs haadmin -getServiceState odc-c-16
active
[root@odc-c-01 current]#
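For reference, the manual failover mentioned above takes the two NameNode service IDs as arguments, so with the IDs from the getServiceState calls it would look like:
```
hdfs haadmin -failover odc-c-01 odc-c-16
```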
<property>
<name>hadoop.tmp.dir</name>
<value>/shared/kasim/journal/tmp</value>
</property>
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/shared/kasim/dfs/jn</value>
</property>
</configuration>
[root@odc-c-01 current]# ls fsimage_0000000000000003698
fsimage_0000000000000003698
[root@odc-c-01 current]#
[root@odc-c-01 current]# pwd
/shared/kasim/journal/tmp/dfs/name/current
[root@odc-c-01 current]#
I still do not understand why it is writing the fsimage & edit log information to only one server, and into a different directory from the one I specified in "dfs.journalnode.edits.dir". Could you shed some light on that part?
Thanks,
Created 08-28-2017 08:11 AM
Can you paste a screenshot of the directories below?
Ambari UI-->HDFS-->Configs-->NameNode directories
If you have ONLY one directory path, then that explains why you have only one copy.
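For context: the fsimage is written under dfs.namenode.name.dir, not dfs.journalnode.edits.dir, and when dfs.namenode.name.dir is left unset it defaults to file://${hadoop.tmp.dir}/dfs/name, which matches the /shared/kasim/journal/tmp/dfs/name/current path seen above. To keep redundant copies, list more than one directory; a sketch with example paths:
```
<property>
  <name>dfs.namenode.name.dir</name>
  <!-- comma-separated list; the NameNode writes fsimage/edits to each -->
  <value>file:///data/1/dfs/nn,file:///data/2/dfs/nn</value>
</property>
```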
Created 08-28-2017 08:15 AM
Created 08-28-2017 08:26 AM
It doesn't matter whether you used the tarball or the blueprint I sent you. After the installation, how are you managing your cluster? By Ambari, I guess, no?
Just check how many directories are listed under Ambari UI-->HDFS-->Configs-->NameNode directories.