Created 01-03-2019 09:06 PM
I have an HDP-3.1.0.0, Ambari-managed cluster on AWS. I just successfully ran the Kerberos wizard, as well as the sync-ldap command from the CLI with an existing Active Directory.
After completing the Kerberos wizard in the UI, the ZooKeeper service on the master node failed to start. Since then, I can manually start the services on some of the nodes, but not on others. Once I issue a start/restart command from the UI, the operation gets stuck at 9% and shows no output.
After googling a bit, I found this question and tried all the suggested fixes (memory is more than enough, and there are no STOPPING rows in hostcomponentstate). I restarted the agent and server processes on all nodes, and even rebooted the whole cluster (services are set to auto-start). Back in the UI, some services are up but others are not. Start commands work for a while; after one fails, they keep failing and stop showing progress or output.
Looking at the ambari-server logs, I get these suspicious warnings:
2019-01-03 19:08:09,423 WARN [agent-report-processor-0] ActionManager:155 - The task 1546517134 is invalid
2019-01-03 19:08:09,680 WARN [agent-report-processor-0] HeartbeatProcessor:358 - Can't fetch HostRoleCommand with taskId = 1546517135
2019-01-03 19:08:09,680 WARN [agent-report-processor-0] ActionManager:155 - The task 1546517135 is invalid
2019-01-03 19:08:15,989 WARN [agent-report-processor-2] HeartbeatProcessor:358 - Can't fetch HostRoleCommand with taskId = 1546517218
2019-01-03 19:08:15,990 INFO [agent-report-processor-2] ServiceComponentHostImpl:1054 - Host role transitioned to a new state, serviceComponentName=RANGER_TAGSYNC, hostName=worker3.devbigdata.spendhq.net, oldState=INSTALLED, currentState=STARTED
2019-01-03 19:08:15,991 WARN [agent-report-processor-2] ActionManager:155 - The task 1546517218 is invalid
2019-01-03 19:08:16,207 WARN [agent-report-processor-2] HeartbeatProcessor:358 - Can't fetch HostRoleCommand with taskId = 1546517219
2019-01-03 19:08:16,207 WARN [agent-report-processor-2] ActionManager:155 - The task 1546517219 is invalid
And the Ambari agents on all hosts show lines like these:
INFO 2019-01-03 19:25:37,747 __init__.py:82 - Event from server at /user/ (correlation_id=4683): {u'status': u'OK', u'id': 2510}
INFO 2019-01-03 19:25:38,521 ComponentStatusExecutor.py:183 - Status for DATANODE has changed to INSTALLED
INFO 2019-01-03 19:25:38,521 RecoveryManager.py:174 - current status is set to INSTALLED for DATANODE
INFO 2019-01-03 19:25:38,728 security.py:135 - Event to server at /reports/component_status (correlation_id=4684): {'clusters': defaultdict(<function <lambda> at 0x7f9a3c569410>, {u'2': [{'status': 'INSTALLED', 'componentName': u'ZOOKEEPER_SERVER', 'serviceName': u'ZOOKEEPER', 'clusterId': u'2', 'command': u'STATUS'}, {'status': 'INSTALLED', 'componentName': u'SECONDARY_NAMENODE', 'serviceName': u'HDFS', 'clusterId': u'2', 'command': u'STATUS'}, {'status': 'INSTALLED', 'componentName': u'DATANODE', 'serviceName': u'HDFS', 'clusterId': u'2', 'command': u'STATUS'}]})}
INFO 2019-01-03 19:25:38,730 __init__.py:82 - Event from server at /user/ (correlation_id=4684): {u'status': u'OK'}
Any help is greatly appreciated!
Created 01-04-2019 03:14 AM
Hi @Leonel Atencio ,
Looking at the attached ambari-server warnings, it seems that some task is not in a completed state in the Ambari database, which might be the reason for this error.
Can you try the following and see if it helps?
1) Stop the Ambari server and log in to the Ambari database.
2) Execute the following commands:
mysql> select distinct(status) from host_role_command;
(Look for any status other than ABORTED, COMPLETED, or FAILED, inspect those tasks, and set them to ABORTED.)
mysql> select status,start_time,end_time from host_role_command where task_id='1546517218';
(If my assumption is right, the status will be PENDING or PENDING_HOLDING and end_time will be -1. In that case, update the status of this task to ABORTED:)
mysql> update host_role_command set status='ABORTED' where task_id='1546517218';
If my assumption is correct, your tasks won't be stuck at 9% after you restart the Ambari server.
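To make the logic of the queries above concrete, here is a minimal sketch that runs the same idea against an in-memory SQLite table. It is only a stand-in: the table has just the four columns named in the queries, the three rows are made up for illustration, and the helper name `abort_unfinished` is hypothetical. It is not the real Ambari schema and should not be pointed at a live Ambari database.

```python
import sqlite3

# States treated as finished, as listed in the steps above.
TERMINAL = ("ABORTED", "COMPLETED", "FAILED")

# In-memory stand-in table; NOT the real Ambari schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE host_role_command ("
    "task_id INTEGER PRIMARY KEY, status TEXT, "
    "start_time INTEGER, end_time INTEGER)"
)
# Made-up sample rows, for illustration only.
conn.executemany(
    "INSERT INTO host_role_command VALUES (?, ?, ?, ?)",
    [
        (975, "COMPLETED", 100, 200),
        (976, "FAILED", 210, 260),
        (977, "PENDING", -1, -1),
    ],
)


def abort_unfinished(connection):
    """Mark every task that is not in a terminal state as ABORTED.

    Returns the list of task IDs that were changed.
    """
    placeholders = ",".join("?" for _ in TERMINAL)
    rows = connection.execute(
        "SELECT task_id FROM host_role_command "
        f"WHERE status NOT IN ({placeholders}) ORDER BY task_id",
        TERMINAL,
    ).fetchall()
    task_ids = [row[0] for row in rows]
    connection.executemany(
        "UPDATE host_role_command SET status = 'ABORTED' WHERE task_id = ?",
        [(task_id,) for task_id in task_ids],
    )
    connection.commit()
    return task_ids


aborted = abort_unfinished(conn)
remaining = {
    row[0]
    for row in conn.execute("SELECT DISTINCT status FROM host_role_command")
}
print(aborted, sorted(remaining))
```

If the first query returns only ABORTED, COMPLETED, and FAILED, the helper changes nothing, which means this particular fix does not apply.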
Please see if this helps, and please log in and accept this answer if it did.
Note: Please take a database backup before any DB operations, for your safety and for recovery in case something goes wrong.
How to take an Ambari DB dump (sample commands, assuming the DB name is ambari and the username is ambari):
If PostgreSQL: pg_dump -U ambari ambari > ./ambari_$(date +"%Y%m%d%H%M%S").sql
If MySQL: mysqldump -u ambari -p ambari > ./ambari_$(date +"%Y%m%d%H%M%S").sql
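To collect the task IDs worth checking with the second query above, a small sketch like the following could pull them out of the ambari-server log. The two patterns are taken from the WARN lines quoted in the question; the function name `extract_task_ids` is hypothetical, and other Ambari versions may word these messages differently.

```python
import re

# Patterns copied from the WARN lines quoted in the question.
TASK_PATTERNS = [
    re.compile(r"The task (\d+) is invalid"),
    re.compile(r"HostRoleCommand with taskId = (\d+)"),
]


def extract_task_ids(log_text):
    """Return the sorted, de-duplicated task IDs mentioned in the warnings."""
    task_ids = set()
    for pattern in TASK_PATTERNS:
        task_ids.update(int(match) for match in pattern.findall(log_text))
    return sorted(task_ids)


# Two lines from the log excerpt in the question.
sample = (
    "2019-01-03 19:08:09,423 WARN [agent-report-processor-0] "
    "ActionManager:155 - The task 1546517134 is invalid\n"
    "2019-01-03 19:08:09,680 WARN [agent-report-processor-0] "
    "HeartbeatProcessor:358 - Can't fetch HostRoleCommand with "
    "taskId = 1546517135\n"
)
print(extract_task_ids(sample))
```

In practice the text would come from the server log file rather than an inline string.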
Created 01-04-2019 08:50 PM
Hello @Akhil S Naik, thanks for your reply.
I did query host_role_command and, unfortunately, I only got the "ABORTED", "COMPLETED" and "FAILED" states.
What's more interesting is that the biggest task_id in that table is "977". I don't really know where the 1546517218 in the logs is coming from.
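One observation, offered only as a guess: if the number is read as Unix epoch seconds, it falls on the same day as the log entries, so these IDs may be derived from a clock rather than from the host_role_command sequence. That is an assumption, not a confirmed cause. A quick check (the variable names are just for illustration):

```python
from datetime import datetime, timezone

# Task ID copied from the ambari-server warnings.
task_id = 1546517218

# Assumption: treat the ID as seconds since the Unix epoch.
as_utc = datetime.fromtimestamp(task_id, tz=timezone.utc)
print(as_utc.isoformat())  # 2019-01-03T12:06:58+00:00
```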
Kindly let me know if you have additional thoughts, please.
Created 03-04-2019 09:03 PM
Hi Akhil S Naik, @Leonel Atencio, @Geoffrey Shelton Okot,
I have the same issue. I tried setting every component state to INSTALLED and then restarted the Ambari server, but the services still show up in red. How did you solve this? Did you try something else? I look forward to your comments. Regards