We have an issue where we are unable to start some of the services via Ambari. We tried "Start All" as well as a manual start for each service, but no luck. Everything was working fine until an OS-level reboot, after which the services are not coming up. Details are below; could anyone assist in fixing this?
The services which are up and running fine are:
1. App timeline server - YARN
2. History Server - MapReduce2
3. HiveServer2 - Hive
4. Infra Solr Instance - Ambari Infra
5. Metrics Controller - Ambari Metrics
6. Grafana - Ambari Metrics
7. MySQL server - Hive
8. NameNode - HDFS
9. ResourceManager - YARN
10. SNameNode - HDFS
11. ZooKeeper Server - ZooKeeper
12. DataNode - HDFS
13. Metrics Monitor - Ambari Metrics
14. NFSGateway - HDFS
15. NodeManager - YARN
The services which are failing to start are:
1. HBase Master - HBase
resource_management.core.exceptions.ExecutionFailed: Execution of '/usr/hdp/current/hbase-master/bin/hbase-daemon.sh --config /usr/hdp/current/hbase-master/conf start master' returned 127. -bash: /usr/hdp/current/hbase-master/bin/hbase-daemon.sh: No such file or directory
2. Hive Metastore - Hive
resource_management.core.exceptions.ExecutionFailed: Execution of 'export HIVE_CONF_DIR=/usr/hdp/current/hive-metastore/conf/conf.server ; /usr/hdp/current/hive-metastore/bin/schematool -initSchema -dbType mysql -userName hive -passWord [PROTECTED] -verbose' returned 3. Missing Hive CLI Jar
3. Knox Gateway - Knox
resource_management.core.exceptions.ExecutionFailed: Execution of '/usr/hdp/current/knox-server/bin/knoxcli.sh create-master --master [PROTECTED]' returned 127. -bash: /usr/hdp/current/knox-server/bin/knoxcli.sh: No such file or directory
4. Oozie Server - Oozie
resource_management.core.exceptions.ExecutionFailed: Execution of 'cd /var/tmp/oozie && /usr/hdp/current/oozie-server/bin/oozie-start.sh' returned 127. -bash: /usr/hdp/current/oozie-server/bin/oozie-start.sh: No such file or directory
5. Ranger Admin - Ranger
resource_management.core.exceptions.ExecutionFailed: Execution of 'cp -f /usr/hdp/current/ranger-admin/ews/webapp/WEB-INF/classes/conf.dist/ranger-admin-default-site.xml /usr/hdp/current/ranger-admin/conf/ranger-admin-default-site.xml' returned 1. cp: cannot stat '/usr/hdp/current/ranger-admin/ews/webapp/WEB-INF/classes/conf.dist/ranger-admin-default-site.xml': No such file or directory
6. Ranger KMS Server - Ranger KMS
resource_management.core.exceptions.Fail: Applying Directory['/usr/hdp/current/ranger-kms/ews/webapp/WEB-INF/classes/lib'] failed, parent directory /usr/hdp/current/ranger-kms/ews/webapp/WEB-INF/classes doesn't exist
7. Ranger UserSync - Ranger
resource_management.core.exceptions.Fail: Applying File['/usr/hdp/current/ranger-usersync/conf/ranger-usersync-env-logdir.sh'] failed, parent directory /usr/hdp/current/ranger-usersync/conf doesn't exist
8. Spark History Server - Spark
resource_management.core.exceptions.Fail: Applying File['/usr/hdp/current/spark-historyserver/conf/spark-defaults.conf'] failed, parent directory /usr/hdp/current/spark-historyserver/conf doesn't exist
9. SparkController - SparkController
Starting HANA Spark Controller ... Class path is /usr/hdp/2.3.4.0-3485/spark/lib/spark-assembly-1.5.2.2.3.4.0-3485-hadoop2.7.1.2.3.4.0-3485.jar::/usr/hdp/2.3.4.0-3485/hadoop/*:/usr/hdp/2.3.4.0-3485/hadoop/lib/*:/usr/hdp/2.3.4.0-3485/hadoop-mapreduce/*:/usr/hdp/2.3.4.0-3485/hadoop-mapreduce/lib/*:/usr/hdp/2.3.4.0-3485/hadoop-yarn/*:/usr/hdp/2.3.4.0-3485/hadoop-yarn/lib/*:/usr/hdp/2.3.4.0-3485/hadoop-hdfs/*:/usr/hdp/2.3.4.0-3485/hive/lib/*:mysql-connector-java.jar:/usr/sap/spark/controller/bin/../conf:/etc/hadoop/conf:/etc/hive/conf:../*:../lib/*:/usr/hdp/*:/usr/hdp/lib/*:/*:/lib/*
./hanaes: line 105: /var/run/hanaes/hana.spark.controller: No such file or directory
FAILED TO WRITE PID
10. WebHCat Server - Hive
resource_management.core.exceptions.ExecutionFailed: Execution of 'cd /var/run/webhcat ; /usr/hdp/current/hive-webhcat/sbin/webhcat_server.sh start' returned 127. -bash: /usr/hdp/current/hive-webhcat/sbin/webhcat_server.sh: No such file or directory
11. RegionServer - HBase
resource_management.core.exceptions.ExecutionFailed: Execution of '/usr/hdp/current/hbase-regionserver/bin/hbase-daemon.sh --config /usr/hdp/current/hbase-regionserver/conf start regionserver' returned 127. -bash: /usr/hdp/current/hbase-regionserver/bin/hbase-daemon.sh: No such file or directory
12. Spark Thrift Server - Spark
resource_management.core.exceptions.Fail: Applying File['/usr/hdp/current/spark-thriftserver/conf/spark-defaults.conf'] failed, parent directory /usr/hdp/current/spark-thriftserver/conf doesn't exist
@Hardeep Singh Most of the errors are caused by missing files/directories. Did you check whether the files exist at the OS level?
ls -l /usr
ls -l /usr/hdp
ls -l /usr/hdp/current
ls -l /usr/hdp/current/spark-thriftserver/conf
ls -l /usr/hdp/current/hbase-regionserver/bin/hbase-daemon.sh
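To check all of the affected paths in one pass rather than one `ls` at a time, a small shell function like the following can help. This is just a sketch: the `check_paths` name is illustrative, and the path list is taken from the errors above; adjust it for your stack.

```shell
#!/bin/sh
# Report which expected HDP component paths are missing at the OS level.
check_paths() {
  missing=0
  for p in "$@"; do
    if [ ! -e "$p" ]; then
      echo "MISSING: $p"
      missing=$((missing + 1))
    fi
  done
  echo "$missing path(s) missing"
}

# Paths taken from the failing-service errors above.
check_paths \
  /usr/hdp/current/hbase-master/bin/hbase-daemon.sh \
  /usr/hdp/current/knox-server/bin/knoxcli.sh \
  /usr/hdp/current/oozie-server/bin/oozie-start.sh \
  /usr/hdp/current/hive-webhcat/sbin/webhcat_server.sh \
  /usr/hdp/current/spark-thriftserver/conf \
  /usr/hdp/current/ranger-usersync/conf
```

On an HDP node, `hdp-select status` can also show which stack version the `/usr/hdp/current` symlinks are expected to resolve to, which helps confirm whether the symlinks or their targets are what went missing.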
Upon troubleshooting further, we found that some files/folders have gone missing, especially the conf and bin directories (as Felix also commented). This seems to be the root cause of the services not starting. However, we are still not sure what caused these files to disappear, which is quite strange. To fix this, we have a few options in mind:
1. Copy the missing bin/conf files from a working environment to the affected environment, then try starting the services.
2. Remove/uninstall the affected services and then re-install them from Ambari. But won't this overwrite the existing service configurations?
3. Upgrade the entire HDP version. But will the upgrade fail because of the missing files in the existing version, or is the upgrade independent of the existing files, copying in new files during installation?
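For option 1, note that the entries under /usr/hdp/current are symlinks into the versioned stack directory (/usr/hdp/&lt;version&gt;/...), so the files should be restored into the versioned path on the affected node, not into `current` itself. A rough sketch follows; the `restore_dir` helper and the `goodnode` host in the comments are placeholders, and it would be wise to back up anything that still exists before copying over it.

```shell
#!/bin/sh
# Sketch for option 1: restore a missing directory tree from a healthy node.

# Copy the contents of one directory into another, preserving attributes
# (permissions, timestamps) where the filesystem allows.
restore_dir() {
  src="$1"
  dst="$2"
  mkdir -p "$dst"
  cp -a "$src/." "$dst/"
}

# On the affected node, pulling from a working node might look like:
#   rsync -a root@goodnode:/usr/hdp/<version>/hbase/bin/ /usr/hdp/<version>/hbase/bin/
# Afterwards, re-check ownership (hbase, hive, oozie, etc. service users)
# before retrying the service start from Ambari.
```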
Thanks guys for your responses so far, appreciate it.