Unable to start Hadoop Services from Ambari Console

Hello Experts,

We are unable to start some of the services via Ambari. We tried "Start All" as well as starting each service manually, but neither worked. Everything was working fine before an OS-level reboot; since then, these services have not come up. The details are below. Could anyone help us fix this?

The services that are up and running are:

1. App timeline server - YARN

2. History Server - MapReduce2

3. HiveServer2 - Hive

4. Infra Solr Instance - Ambari Infra

5. Metrics Controller - Ambari Metrics

6. Grafana - Ambari Metrics

7. MySQL server - Hive

8. NameNode - HDFS

9. ResourceManager - YARN

10. SNameNode - HDFS

11. ZooKeeper Server - ZooKeeper

12. DataNode - HDFS

13. Metrics Monitor - Ambari Metrics

14. NFSGateway - HDFS

15. NodeManager - YARN

The services that fail to start, with the error reported for each, are:

1. HBase Master - HBase

resource_management.core.exceptions.ExecutionFailed: Execution of '/usr/hdp/current/hbase-master/bin/ --config /usr/hdp/current/hbase-master/conf start master' returned 127. -bash: /usr/hdp/current/hbase-master/bin/ No such file or directory

2. Hive Metastore - Hive

resource_management.core.exceptions.ExecutionFailed: Execution of 'export HIVE_CONF_DIR=/usr/hdp/current/hive-metastore/conf/conf.server ; /usr/hdp/current/hive-metastore/bin/schematool -initSchema -dbType mysql -userName hive -passWord [PROTECTED] -verbose' returned 3. Missing Hive CLI Jar

3. Knox Gateway - Knox

resource_management.core.exceptions.ExecutionFailed: Execution of '/usr/hdp/current/knox-server/bin/ create-master --master [PROTECTED]' returned 127. -bash: /usr/hdp/current/knox-server/bin/ No such file or directory

4. Oozie Server - Oozie

resource_management.core.exceptions.ExecutionFailed: Execution of 'cd /var/tmp/oozie && /usr/hdp/current/oozie-server/bin/' returned 127. -bash: /usr/hdp/current/oozie-server/bin/ No such file or directory

5. Ranger Admin - Ranger

resource_management.core.exceptions.ExecutionFailed: Execution of 'cp -f /usr/hdp/current/ranger-admin/ews/webapp/WEB-INF/classes/conf.dist/ranger-admin-default-site.xml /usr/hdp/current/ranger-admin/conf/ranger-admin-default-site.xml' returned 1. cp: cannot stat '/usr/hdp/current/ranger-admin/ews/webapp/WEB-INF/classes/conf.dist/ranger-admin-default-site.xml': No such file or directory

6. Ranger KMS Server - Ranger KMS

resource_management.core.exceptions.Fail: Applying Directory['/usr/hdp/current/ranger-kms/ews/webapp/WEB-INF/classes/lib'] failed, parent directory /usr/hdp/current/ranger-kms/ews/webapp/WEB-INF/classes doesn't exist

7. Ranger UserSync - Ranger

resource_management.core.exceptions.Fail: Applying File['/usr/hdp/current/ranger-usersync/conf/'] failed, parent directory /usr/hdp/current/ranger-usersync/conf doesn't exist

8. Spark History Server - Spark

resource_management.core.exceptions.Fail: Applying File['/usr/hdp/current/spark-historyserver/conf/spark-defaults.conf'] failed, parent directory /usr/hdp/current/spark-historyserver/conf doesn't exist

9. SparkController - SparkController

Starting HANA Spark Controller ...  Class path is /usr/hdp/*:/usr/hdp/*:/usr/hdp/*:/usr/hdp/*:/usr/hdp/*:/usr/hdp/*:/usr/hdp/*:/usr/hdp/*:mysql-connector-java.jar:/usr/sap/spark/controller/bin/../conf:/etc/hadoop/conf:/etc/hive/conf:../*:../lib/*:/usr/hdp/*:/usr/hdp/lib/*:/*:/lib/*
./hanaes: line 105: /var/run/hanaes/hana.spark.controller: No such file or directory

10. WebHCat Server - Hive

resource_management.core.exceptions.ExecutionFailed: Execution of 'cd /var/run/webhcat ; /usr/hdp/current/hive-webhcat/sbin/ start' returned 127. -bash: /usr/hdp/current/hive-webhcat/sbin/ No such file or directory

11. RegionServer - HBase

resource_management.core.exceptions.ExecutionFailed: Execution of '/usr/hdp/current/hbase-regionserver/bin/ --config /usr/hdp/current/hbase-regionserver/conf start regionserver' returned 127. -bash: /usr/hdp/current/hbase-regionserver/bin/ No such file or directory

12. Spark Thrift Server - Spark

resource_management.core.exceptions.Fail: Applying File['/usr/hdp/current/spark-thriftserver/conf/spark-defaults.conf'] failed, parent directory /usr/hdp/current/spark-thriftserver/conf doesn't exist

@Hardeep Singh Most of the errors are caused by missing files or directories. Did you check whether those files are actually there at the OS level?

ls -l /usr
ls -l /usr/hdp
ls -l /usr/hdp/current
ls -l /usr/hdp/current/spark-thriftserver/conf
ls -l /usr/hdp/current/hbase-regionserver/bin/
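To go with the ls checks above, here is a rough sketch for spotting dangling links in one pass. It assumes the entries under /usr/hdp/current are symlinks into a versioned directory, which is the usual HDP layout but may not match your install. The function name list_broken_links is made up for this example.

```shell
# Hypothetical helper: print every symlink directly under a directory
# whose target can no longer be resolved.
list_broken_links() {
    dir="$1"
    for link in "$dir"/*; do
        # -L: the entry is a symlink; ! -e: its target does not exist
        if [ -L "$link" ] && [ ! -e "$link" ]; then
            printf '%s -> %s\n' "$link" "$(readlink "$link")"
        fi
    done
}

# Only runs on a host that actually has the directory (assumed path).
if [ -d /usr/hdp/current ]; then
    list_broken_links /usr/hdp/current
fi
```

If this prints nothing, the links themselves resolve and the missing pieces are more likely inside the target directories (for example a bin or conf subdirectory).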

@Hardeep Singh

Check whether all the disk partitions still exist after the reboot.
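One possible way to check this is to compare the mount points listed in /etc/fstab with what the kernel reports as mounted. This is only a sketch: the function name unmounted_fstab_entries is made up, and entries marked noauto will show up as false positives.

```shell
# Hypothetical helper: print fstab mount points that are not currently mounted.
# $1 = fstab-style file (default /etc/fstab)
# $2 = mounts table     (default /proc/mounts)
unmounted_fstab_entries() {
    awk '
        NR == FNR { mounted[$2] = 1; next }     # first file: mounted paths
        /^[ \t]*#/ || NF < 3 { next }           # skip comments and blank lines
        $3 == "swap" || $2 == "none" { next }   # swap has no mount point
        !($2 in mounted) { print $2 }           # in fstab but not mounted
    ' "${2:-/proc/mounts}" "${1:-/etc/fstab}"
}

# Only runs where both default files are present (a typical Linux host).
if [ -r /etc/fstab ] && [ -r /proc/mounts ]; then
    unmounted_fstab_entries
fi
```

Any path printed here is a file system that was expected at boot but did not get mounted, which would explain directories appearing empty or missing after the reboot.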


What was your OS level change?

Was there a permission change?

Upon troubleshooting further, we found that some files and folders have gone missing, especially under conf and bin (as Felix also suggested). This appears to be the root cause of the services not starting. However, we are still not sure what caused these files to disappear, which is quite strange. To fix this, we have a few options in mind:

1. Copy the missing bin/conf files from a working environment to the affected environment, then try starting the services.

2. Remove/uninstall the affected services and then re-install them from Ambari. But will this overwrite the existing service configurations?

3. Upgrade the entire HDP version. But will the upgrade fail if files are missing in the existing version? Or is the upgrade independent of the existing files, so that it copies the new files during installation?
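For option 1, before copying anything it may help to list exactly which paths are missing. Below is a rough sketch that compares a reference copy of a component directory (for example one copied over from the working environment) against the affected one. The function name list_missing_paths is made up, and it assumes both environments run the same HDP version.

```shell
# Hypothetical helper: print paths (relative to the reference directory)
# that exist under the reference tree but not under the target tree.
# $1 = reference directory, $2 = target directory
list_missing_paths() {
    reference="$1"
    target="$2"
    (cd "$reference" && find . -mindepth 1 | sort) | while IFS= read -r rel; do
        # Count a path as present if it exists or is a (possibly dangling) symlink
        if [ ! -e "$target/$rel" ] && [ ! -L "$target/$rel" ]; then
            printf '%s\n' "${rel#./}"
        fi
    done
}
```

It would be called as list_missing_paths REFERENCE_DIR TARGET_DIR. It only reports what is absent and does not copy or modify anything, so it is safe to run before deciding between the options.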

Thanks, guys, for your responses so far; much appreciated.



