Created 01-05-2016 06:46 PM
I'm using Ambari 2.1.2 to install a highly available HDP 2.3.4 cluster. The service installation succeeds on the nodes, but the services fail to start. Digging into the logs and the config files, I found that the %HOSTGROUP::name%:port placeholders never got replaced with the actual hostnames defined in the cluster creation template. As a result, the config files contain invalid URIs like these:
<property>
  <name>dfs.namenode.http-address</name>
  <value>%HOSTGROUP::master_2%:50070</value>
  <final>true</final>
</property>
<property>
  <name>dfs.namenode.http-address.mycluster.nn1</name>
  <value>%HOSTGROUP::master_2%:50070</value>
</property>
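For comparison, after a successful substitution the same property should hold the mapped host's address (the FQDN below is purely illustrative):

```xml
<property>
  <name>dfs.namenode.http-address</name>
  <value>master2.example.com:50070</value>
  <final>true</final>
</property>
```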
Sure enough, the errors while starting the services also pointed to the same reason:
[root@worker1 azureuser]# cat /var/log/hadoop/hdfs/hadoop-hdfs-datanode-worker1.log
2016-01-05 02:24:22,601 INFO datanode.DataNode (LogAdapter.java:info(45)) - STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting DataNode
STARTUP_MSG:   host = worker1.012g3iyhe01upgbu35npgl5l4a.gx.internal.cloudapp.net/10.0.0.9
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 2.7.1.2.3.4.0-3485
. . .
2016-01-05 02:34:27,068 FATAL datanode.DataNode (DataNode.java:secureMain(2533)) - Exception in secureMain
java.lang.IllegalArgumentException: Does not contain a valid host:port authority: %HOSTGROUP::master_2%:8020
        at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:198)
        at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:164)
        at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:153)
        at org.apache.hadoop.hdfs.DFSUtil.getAddressesForNameserviceId(DFSUtil.java:687)
        at org.apache.hadoop.hdfs.DFSUtil.getAddressesForNsIds(DFSUtil.java:655)
        at org.apache.hadoop.hdfs.DFSUtil.getNNServiceRpcAddressesForCluster(DFSUtil.java:872)
        at org.apache.hadoop.hdfs.server.datanode.BlockPoolManager.refreshNamenodes(BlockPoolManager.java:155)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:1152)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:430)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:2411)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:2298)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:2345)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:2526)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:2550)
2016-01-05 02:34:27,072 INFO util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 1
2016-01-05 02:34:27,076 INFO datanode.DataNode (LogAdapter.java:info(45)) - SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down DataNode at worker1.012g3iyhe01upgbu35npgl5l4a.gx.internal.cloudapp.net/10.0.0.9
************************************************************/
Interestingly, the Ambari server log shows that host group resolution succeeded for the master host groups, but I couldn't find equivalent entries for the worker host groups:
05 Jan 2016 02:13:40,271 INFO [pool-2-thread-1] TopologyManager:598 - TopologyManager.ConfigureClusterTask areHostGroupsResolved: host group name = master_5 has been fully resolved, as all 1 required hosts are mapped to 1 physical hosts.
05 Jan 2016 02:13:40,272 INFO [pool-2-thread-1] TopologyManager:598 - TopologyManager.ConfigureClusterTask areHostGroupsResolved: host group name = master_1 has been fully resolved, as all 1 required hosts are mapped to 1 physical hosts.
05 Jan 2016 02:13:40,273 INFO [pool-2-thread-1] TopologyManager:598 - TopologyManager.ConfigureClusterTask areHostGroupsResolved: host group name = master_2 has been fully resolved, as all 1 required hosts are mapped to 1 physical hosts.
05 Jan 2016 02:13:40,273 INFO [pool-2-thread-1] TopologyManager:598 - TopologyManager.ConfigureClusterTask areHostGroupsResolved: host group name = master_3 has been fully resolved, as all 1 required hosts are mapped to 1 physical hosts.
05 Jan 2016 02:13:40,274 INFO [pool-2-thread-1] TopologyManager:598 - TopologyManager.ConfigureClusterTask areHostGroupsResolved: host group name = master_4 has been fully resolved, as all 1 required hosts are mapped to 1 physical hosts.
(though even the master nodes had service startup failures)
Here's the config Blueprint Gist: https://gist.github.com/DhruvKumar/355af66897e584b...
And here's the cluster creation template: https://gist.github.com/DhruvKumar/9b971be81389317...
Here's the result of blueprint exported from Ambari server after installation (using /api/v1/clusters/clusterName?format=blueprint): https://gist.github.com/DhruvKumar/373cd7b05ca818c...
Edit: Ambari Server Log: https://gist.github.com/DhruvKumar/e2c06a94388c51e...
Note that my non-HA Blueprint, which doesn't use the %HOSTGROUP% syntax, works without issue on the same infrastructure.
Can someone please help me debug why the hostnames aren't being mapped correctly? Is it a problem in the HA Blueprint? I have all the logs from the installation and I'll keep the cluster alive for debugging.
Thanks.
Created 01-05-2016 08:11 PM
It looks like you have Oozie HA configured incorrectly.
From the Ambari server log:
05 Jan 2016 02:13:40,471 ERROR [pool-2-thread-1] TopologyManager:553 - TopologyManager.ConfigureClusterTask: An exception occurred while attempting to process cluster configs and set on cluster: java.lang.IllegalArgumentException: Unable to update configuration property 'oozie.base.url' with topology information. Component 'OOZIE_SERVER' is mapped to an invalid number of hosts '2'.
Looking at the config processor code, I see that it determines whether Oozie HA is enabled by checking the property
oozie-site/oozie.services.ext
To enable Oozie HA, the property must be specified and must contain the following in its value:
org.apache.oozie.service.ZKLocksService
Since HA isn't properly configured here, configuration processing fails: in non-HA environments, OOZIE_SERVER can only be mapped to a single host.
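For reference, a minimal sketch of what the oozie-site entry in the blueprint's configurations section could look like. Only ZKLocksService is what the processor checks for; the additional ZK services listed are the ones typically enabled alongside it for Oozie HA, so treat them as an assumption and verify against your Oozie version:

```json
{
  "oozie-site": {
    "properties": {
      "oozie.services.ext": "org.apache.oozie.service.ZKLocksService,org.apache.oozie.service.ZKXLogStreamingService,org.apache.oozie.service.ZKJobsConcurrencyService"
    }
  }
}
```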
For Ambari Blueprint HA support, please refer to:
Created 01-05-2016 06:53 PM
This example might help: https://github.com/uprush/ambari-blueprint-example...
It seems to be referencing host_groups instead of hosts.
Search for the string below in the link above:
%HOSTGROUP::host_group_master_1%
Created 01-05-2016 06:59 PM
Not sure if that matters. To the best of my knowledge, the name of a host group can be anything; it's just a string, and the blueprint processor should substitute the correct hosts as long as the names match up with the cluster creation template. See Sean's HA blueprint here, which doesn't use the "host_groups" suffix:
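To illustrate the point: substitution should work as long as the "name" in the cluster creation template matches a host_groups entry declared in the blueprint, whatever that name is. A sketch (blueprint name and FQDN below are made up):

```json
{
  "blueprint": "my-ha-blueprint",
  "host_groups": [
    {
      "name": "master_2",
      "hosts": [ { "fqdn": "master2.example.com" } ]
    }
  ]
}
```

With this mapping, %HOSTGROUP::master_2% would resolve to master2.example.com.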
Created 01-05-2016 07:06 PM
Makes sense... it should work, then.
@rnettleton any ideas on this one?
Created 01-05-2016 07:23 PM
@Dhruv Kumar Please provide the entire blueprint and cluster creation template so we can reproduce/debug.
Created 01-05-2016 07:27 PM
Hi John, blueprint and cluster creation template are linked from the question's description. Please see the links at the end of the description.
I've also added the Ambari Server log just now.
Created 01-05-2016 07:27 PM
Ambari Server log Gist:
Created 01-05-2016 07:29 PM
Ah, I see it now. Thanks.
Created 01-05-2016 07:32 PM
I believe that if the specified Blueprint uses an external PostgreSQL database, Ambari cannot perform hostname substitution.
Created 01-05-2016 07:41 PM
Hi Ancil - I'm not using an external db for Ambari. I set up Ambari using "ambari-server setup -s -j /path/to/jdk", which accepts all defaults and only uses my custom JDK path. The default Ambari server db is embedded Postgres.
The Blueprint Processor class is responsible for substituting hostnames, and it should just be a string replacement once the topology has been correctly resolved. So I'm not sure how choosing an external DB would affect it.
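That substitution can be pictured as a simple token replacement over the resolved topology. A minimal sketch in Python; this is an illustration of the idea, not Ambari's actual Java implementation, and the mapping below is made up:

```python
import re

# Hypothetical stand-in for Ambari's resolved topology: host group name -> FQDN.
TOPOLOGY = {"master_2": "master2.example.com"}

def resolve_hostgroups(value, topology):
    """Replace every %HOSTGROUP::name% token with the mapped hostname."""
    return re.sub(
        r"%HOSTGROUP::([^%]+)%",
        lambda m: topology[m.group(1)],
        value,
    )

print(resolve_hostgroups("%HOSTGROUP::master_2%:8020", TOPOLOGY))
# -> master2.example.com:8020
```

If the topology were resolved correctly, every placeholder in the exported configs would disappear this way; the fact that they survive suggests the failure happens before this step, during topology resolution.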