Created 08-25-2017 11:59 AM
Hi Team,
I have been trying to build an HA cluster, but could not set it up properly. Every time it fails while starting the ZKFC service. Not sure where it went wrong.
This is what shows up when I try to start the ZKFC controller after starting the JournalNode daemons.
17/08/25 04:48:41 INFO zookeeper.ZooKeeper: Initiating client connection, connectString= master1:2181,master2:2181:slave1:2181 sessionTimeout=5000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@1417e278
17/08/25 04:48:51 FATAL tools.DFSZKFailoverController: Got a fatal error, exiting now
java.net.UnknownHostException: master1
at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:922)
at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1316)
at java.net.InetAddress.getAllByName0(InetAddress.java:1269)
at java.net.InetAddress.getAllByName(InetAddress.java:1185)
at java.net.InetAddress.getAllByName(InetAddress.java:1119)
at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61)
at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:380)
at org.apache.hadoop.ha.ActiveStandbyElector.getNewZooKeeper(ActiveStandbyElector.java:628)
at org.apache.hadoop.ha.ActiveStandbyElector.createConnection(ActiveStandbyElector.java:767)
at org.apache.hadoop.ha.ActiveStandbyElector.<init>(ActiveStandbyElector.java:227)
at org.apache.hadoop.ha.ZKFailoverController.initZK(ZKFailoverController.java:350)
at org.apache.hadoop.ha.ZKFailoverController.doRun(ZKFailoverController.java:191)
at org.apache.hadoop.ha.ZKFailoverController.access$000(ZKFailoverController.java:61)
at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:172)
at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:168)
at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:412)
at org.apache.hadoop.ha.ZKFailoverController.run(ZKFailoverController.java:168)
at org.apache.hadoop.hdfs.tools.DFSZKFailoverController.main(DFSZKFailoverController.java:181)
root@master1:~#
Thanks
Created 08-25-2017 12:27 PM
The following error indicates that you might not have configured the FQDN properly in your cluster.
java.net.UnknownHostException: master1
Can you please check whether the "hostname -f" command actually returns the desired FQDN?
Example:
root@master1:~# hostname -f
Every node of your cluster should be able to resolve all the other nodes correctly by their FQDNs.
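For example, an /etc/hosts sketch that lets every node resolve the others (the IPs and domain here are placeholders; substitute your real addresses):

```
192.168.1.11  master1.example.com  master1
192.168.1.12  master2.example.com  master2
192.168.1.13  slave1.example.com   slave1
```

You can verify resolution on each node with `getent hosts master1`. Also note that the connectString in your log reads "master2:2181:slave1:2181" with a colon instead of a comma before slave1, which is worth double-checking in your ZooKeeper quorum setting.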
Created 08-25-2017 02:48 PM
In cluster_config.json, change the following:
Stack_version to match your stack version ("2.x")
In hostmap.json, change the masterx, datanodex, or ambari-server entries to match the FQDNs of your machines.
Make sure you have internal repos to match the entries in repo.json and dputil-repo.jso
In cli.txt, change "ambari-server" to match the FQDN of your Ambari server, and launch them in that order.
Remember to rename the *.json.txt files to *.json, as HCC doesn't accept .json file uploads.
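The rename step can be scripted; a minimal sketch (the demo directory and sample files below are just for illustration, run the loop in your actual download directory):

```shell
# demo directory with sample downloaded files
tmpdir=$(mktemp -d)
cd "$tmpdir"
touch cluster_config.json.txt hostmap.json.txt

# rename every *.json.txt back to *.json
for f in *.json.txt; do
  mv "$f" "${f%.txt}"   # strip the trailing .txt suffix
done

ls "$tmpdir"
```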
Created 08-28-2017 08:03 AM
Finally, I was able to configure the HA cluster successfully. Failover happens when I try it with the "hdfs haadmin -failover" command. However, I noticed the fsimage & edit log files are on only one server.
[root@odc-c-01 current]# hdfs haadmin -getServiceState odc-c-01
standby
[root@odc-c-01 current]# hdfs haadmin -getServiceState odc-c-16
active
[root@odc-c-01 current]#
<property>
<name>hadoop.tmp.dir</name>
<value>/shared/kasim/journal/tmp</value>
</property>
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/shared/kasim/dfs/jn</value>
</property>
</configuration>
[root@odc-c-01 current]# ls fsimage_0000000000000003698
fsimage_0000000000000003698
[root@odc-c-01 current]#
[root@odc-c-01 current]# pwd
/shared/kasim/journal/tmp/dfs/name/current
[root@odc-c-01 current]#
Still, I do not understand why it is writing the fsimage & edit log information to only one server, and to a directory different from the one I specified in "dfs.journalnode.edits.dir". Could you shed some light on that part?
Thanks,
Created 08-28-2017 08:11 AM
Can you paste a screenshot of the below directories?
Ambari UI --> HDFS --> Configs --> NameNode directories
If you have ONLY one directory path, then that explains why you have only one copy.
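Also, the fsimage path you found (/shared/kasim/journal/tmp/dfs/name/current) is expected: dfs.namenode.name.dir defaults to file://${hadoop.tmp.dir}/dfs/name, so it follows your hadoop.tmp.dir setting rather than dfs.journalnode.edits.dir, which only controls where the JournalNodes store shared edits. A sketch of pinning it explicitly in hdfs-site.xml (the path below is just an example):

```xml
<!-- hdfs-site.xml: set the NameNode metadata directory explicitly
     instead of inheriting file://${hadoop.tmp.dir}/dfs/name -->
<property>
  <name>dfs.namenode.name.dir</name>
  <!-- example path; use your own, and list multiple comma-separated
       directories if you want redundant local copies -->
  <value>file:///data/hdfs/namenode</value>
</property>
```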
Created 08-28-2017 08:15 AM
Created 08-28-2017 08:26 AM
It doesn't matter whether you used the tarball and blueprint I sent you. After the installation, how are you managing your cluster? By Ambari, I guess?
Just check how many directories are in Ambari UI-->HDFS-->Configs-->NameNode directories