Tasks are stuck at 9% after kerberizing HDP 3.1 cluster

I have an HDP-3.1.0.0, Ambari-managed cluster on AWS. I just successfully ran the Kerberos wizard, as well as the sync-ldap command from the CLI with an existing Active Directory.

After completing the UI Kerberos Wizard, the ZooKeeper service on the master node failed to start. After that, I could manually start the services on some of the nodes, but not on others. Once I issue a start/restart command from the UI, the operation gets stuck at 9% and shows no output.

After googling a bit, I found this question and tried all the suggested fixes (memory is more than enough, and there are no STOPPING rows in hostcomponentstate). I restarted the agent and server processes on all nodes. I even rebooted the whole cluster (services are set to auto-start). Back in the UI, some services are up but others are not. Issuing start commands works for a while. After one fails, every later command fails too and stops showing progress/output.

Looking at the ambari-server logs, I see these suspicious warnings:

2019-01-03 19:08:09,423  WARN [agent-report-processor-0] ActionManager:155 - The task 1546517134 is invalid
2019-01-03 19:08:09,680  WARN [agent-report-processor-0] HeartbeatProcessor:358 - Can't fetch HostRoleCommand with taskId = 1546517135
2019-01-03 19:08:09,680  WARN [agent-report-processor-0] ActionManager:155 - The task 1546517135 is invalid
2019-01-03 19:08:15,989  WARN [agent-report-processor-2] HeartbeatProcessor:358 - Can't fetch HostRoleCommand with taskId = 1546517218
2019-01-03 19:08:15,990  INFO [agent-report-processor-2] ServiceComponentHostImpl:1054 - Host role transitioned to a new state, serviceComponentName=RANGER_TAGSYNC, hostName=worker3.devbigdata.spendhq.net, oldState=INSTALLED, currentState=STARTED
2019-01-03 19:08:15,991  WARN [agent-report-processor-2] ActionManager:155 - The task 1546517218 is invalid
2019-01-03 19:08:16,207  WARN [agent-report-processor-2] HeartbeatProcessor:358 - Can't fetch HostRoleCommand with taskId = 1546517219
2019-01-03 19:08:16,207  WARN [agent-report-processor-2] ActionManager:155 - The task 1546517219 is invalid
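To compare the task IDs from these warnings against the database, it can help to pull them out of the log first. A minimal Python sketch, assuming the two warning formats shown above; the function name extract_task_ids and the SAMPLE text are only illustrative:

```python
import re

# Matches the two warning formats seen in the ambari-server log excerpt:
#   "The task <id> is invalid"
#   "Can't fetch HostRoleCommand with taskId = <id>"
TASK_ID_RE = re.compile(r"The task (\d+) is invalid|taskId = (\d+)")


def extract_task_ids(log_text):
    """Return the sorted, de-duplicated task IDs mentioned in the warnings."""
    ids = set()
    for match in TASK_ID_RE.finditer(log_text):
        # Exactly one of the two groups is set for any given match.
        ids.add(int(match.group(1) or match.group(2)))
    return sorted(ids)


# Three lines copied from the log excerpt above.
SAMPLE = """\
2019-01-03 19:08:09,423  WARN [agent-report-processor-0] ActionManager:155 - The task 1546517134 is invalid
2019-01-03 19:08:09,680  WARN [agent-report-processor-0] HeartbeatProcessor:358 - Can't fetch HostRoleCommand with taskId = 1546517135
2019-01-03 19:08:09,680  WARN [agent-report-processor-0] ActionManager:155 - The task 1546517135 is invalid
"""

if __name__ == "__main__":
    print(extract_task_ids(SAMPLE))
```

For a real run, the text would come from the server log file instead of SAMPLE.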

And ambari agents on all hosts show lines like these:

INFO 2019-01-03 19:25:37,747 __init__.py:82 - Event from server at /user/ (correlation_id=4683): {u'status': u'OK', u'id': 2510}
INFO 2019-01-03 19:25:38,521 ComponentStatusExecutor.py:183 - Status for DATANODE has changed to INSTALLED
INFO 2019-01-03 19:25:38,521 RecoveryManager.py:174 - current status is set to INSTALLED for DATANODE
INFO 2019-01-03 19:25:38,728 security.py:135 - Event to server at /reports/component_status (correlation_id=4684): {'clusters': defaultdict(<function <lambda> at 0x7f9a3c569410>, {u'2': [{'status': 'INSTALLED', 'componentName': u'ZOOKEEPER_SERVER', 'serviceName': u'ZOOKEEPER', 'clusterId': u'2', 'command': u'STATUS'}, {'status': 'INSTALLED', 'componentName': u'SECONDARY_NAMENODE', 'serviceName': u'HDFS', 'clusterId': u'2', 'command': u'STATUS'}, {'status': 'INSTALLED', 'componentName': u'DATANODE', 'serviceName': u'HDFS', 'clusterId': u'2', 'command': u'STATUS'}]})}
INFO 2019-01-03 19:25:38,730 __init__.py:82 - Event from server at /user/ (correlation_id=4684): {u'status': u'OK'}

Any help is greatly appreciated!

3 REPLIES

Hi @Leonel Atencio ,

Looking at the Ambari server warnings you posted, it seems some task never reached a completed state in the Ambari database, which might be the reason for this error.

Can you try the following and see if it helps?

1) Stop the Ambari server and log in to the Ambari database.

2) Execute the following commands:

-- Find any status other than ABORTED, COMPLETED, or FAILED; inspect those tasks and set them to ABORTED.
mysql> select distinct(status) from host_role_command;
mysql> select status, start_time, end_time from host_role_command where task_id='1546517218';
-- If my assumption is right, the status will be a non-terminal value such as PENDING or PENDING_HOLDING and end_time will be -1. In that case, update the status of this task to ABORTED:
mysql> update host_role_command set status='ABORTED' where task_id='1546517218';

If my assumption is correct, your tasks won't be stuck at 9% after a restart of the Ambari server.

Please see if this helps, and please log in and accept this answer if it did.

Note: Please take a database backup before any DB operations, for your safety and for recovery if anything goes wrong.
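The check-and-update steps above can be sketched in Python. This is only an illustration of the logic: it uses an in-memory SQLite database as a stand-in for the real Ambari database (which is MySQL or PostgreSQL), and the table is reduced to the four columns mentioned above. The function names and the sample rows are made up for the demo.

```python
import sqlite3

# Terminal states named in the steps above; anything else counts as unfinished.
TERMINAL_STATES = ("ABORTED", "COMPLETED", "FAILED")


def make_demo_db():
    """Build a tiny stand-in host_role_command table (hypothetical sample rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE host_role_command ("
        "task_id INTEGER PRIMARY KEY, status TEXT, "
        "start_time INTEGER, end_time INTEGER)"
    )
    conn.executemany(
        "INSERT INTO host_role_command VALUES (?, ?, ?, ?)",
        [(975, "COMPLETED", 1, 2), (976, "PENDING", 1, -1), (977, "FAILED", 1, 2)],
    )
    conn.commit()
    return conn


def find_unfinished(conn):
    """Return (task_id, status) rows whose status is not a terminal state."""
    placeholders = ", ".join("?" for _ in TERMINAL_STATES)
    query = (
        "SELECT task_id, status FROM host_role_command "
        f"WHERE status NOT IN ({placeholders}) ORDER BY task_id"
    )
    return conn.execute(query, TERMINAL_STATES).fetchall()


def abort_tasks(conn, task_ids):
    """Mark the given tasks as ABORTED."""
    conn.executemany(
        "UPDATE host_role_command SET status = 'ABORTED' WHERE task_id = ?",
        [(task_id,) for task_id in task_ids],
    )
    conn.commit()


if __name__ == "__main__":
    db = make_demo_db()
    stuck = find_unfinished(db)
    print("unfinished before:", stuck)
    abort_tasks(db, [task_id for task_id, _ in stuck])
    print("unfinished after:", find_unfinished(db))
```

Against a real Ambari database the same two queries would be run through the MySQL or PostgreSQL client, with the Ambari server stopped and a backup taken first.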

How to take Ambari DB dump (sample commands if DB name is ambari and username is ambari )  :
If PostgreSQL:
pg_dump -U ambari ambari > ./ambari_$(date +"%Y%m%d%H%M%S").sql
If MySQL:
mysqldump -u ambari -p ambari > ./ambari_$(date +"%Y%m%d%H%M%S").sql 

Hello @Akhil S Naik, thanks for your reply.

I queried host_role_command and, unfortunately, I only got the "ABORTED", "COMPLETED" and "FAILED" states.

What's more interesting is that the highest task_id in that table is 977. I don't really know where the 1546517218 in the logs is coming from.
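One observation, offered as a guess rather than a confirmed explanation: the IDs in the log have the magnitude of Unix epoch seconds, which would suggest they were not drawn from the database's task_id sequence. A minimal Python sketch to decode one of them (the function name as_utc is only illustrative):

```python
from datetime import datetime, timezone


def as_utc(epoch_seconds):
    """Interpret an integer as Unix epoch seconds and format it in UTC."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


if __name__ == "__main__":
    # Task ID taken from the ambari-server warnings above.
    print(as_utc(1546517218))  # prints 2019-01-03 12:06:58
```

That falls on the same date as the log entries, which is consistent with a clock-derived ID, though it does not prove where the ID was generated.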

Kindly let me know if you have additional thoughts, please.

Hi @Akhil S Naik, @Leonel Atencio, @Geoffrey Shelton Okot,

I have the same issue. I tried setting every component state to INSTALLED and then restarting the Ambari server, but the services still show up in red.

How did you solve it? Did you try something else?

I look forward to your comments.

Regards
