Created 05-21-2019 03:58 PM
We had attempted to repair a cluster after one of our nodes went into a bad state due to an issue with AWS. I ran the following command:
cb cluster repair --name <cluster> --host-groups <host_groups>
What I'm seeing now is that Cloudbreak seems to be stuck in a loop, repeatedly logging the following:
cloudbreak_1 | 2019-05-21 13:30:21,119 [reactorDispatcher-68] pollWithTimeout:32 INFO c.s.c.s.PollingService - [owner:6476a4d7-bab8-4bf9-bfcd-aa6a43aa1d5f] [type:CLUSTER] [id:2] [name:emea-hdp] [flow:438c526a-325b-40c8-b86a-cc15aad4728a] [tracking:669a1784-a361-4509-9cd2-c57847a15cbb] Polling attempt 16277.
cloudbreak_1 | 2019-05-21 13:30:21,134 [reactorDispatcher-68] checkStatus:48 INFO c.s.c.s.c.f.AmbariOperationsStatusCheckerTask - [owner:6476a4d7-bab8-4bf9-bfcd-aa6a43aa1d5f] [type:CLUSTER] [id:2] [name:<cluster>] [flow:438c526a-325b-40c8-b86a-cc15aad4728a] [tracking:669a1784-a361-4509-9cd2-c57847a15cbb] Ambari operation: 'Stopping components on the decommissioned hosts', Progress: 0
uluwatu_1 | 2019-05-21T13:30:21.141Z INFO [owner: ] [email: ] /notification endpoint: {"eventType":"STOP_SERVICES_AMBARI_PROGRESS_STATE","eventTimestamp":1558445421137,"eventMessage":"0","owner":null,"account":null,"userIdV3":"email@email.com","cloud":"AWS","region":"eu-central-1","availabilityZone":null,"blueprintId":null,"blueprintName":null,"clusterId":2,"clusterName":"<cluster>","stackId":2,"stackName":"<cluster>","stackStatus":"AVAILABLE","nodeCount":null,"instanceGroup":null,"clusterStatus":"UPDATE_IN_PROGRESS","workspaceId":1}
cb cluster list shows the following:
[
  {
    "Name": "<cluster>",
    "Description": "",
    "CloudPlatform": "AWS",
    "StackStatus": "AVAILABLE",
    "ClusterStatus": "UPDATE_IN_PROGRESS"
  }
]
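For anyone who wants to script this check, here is a minimal Python sketch. It assumes the `cb cluster list` output is a JSON array shaped like the one above; the function name `clusters_in_progress` and the "ends with `_IN_PROGRESS`" rule are my own illustration, not something the CLI provides.

```python
import json
from typing import List


def clusters_in_progress(cb_list_output: str) -> List[str]:
    """Return the names of clusters whose ClusterStatus ends in _IN_PROGRESS.

    Assumption: the input is the JSON array printed by `cb cluster list`,
    with the "Name" and "ClusterStatus" keys shown in this thread.
    """
    clusters = json.loads(cb_list_output)
    return [
        c["Name"]
        for c in clusters
        if c.get("ClusterStatus", "").endswith("_IN_PROGRESS")
    ]


# Sample copied from the output in this post (cluster name is a placeholder).
sample = """
[ { "Name": "<cluster>", "Description": "", "CloudPlatform": "AWS",
    "StackStatus": "AVAILABLE", "ClusterStatus": "UPDATE_IN_PROGRESS" } ]
"""

stuck = clusters_in_progress(sample)
print(stuck)
```

In practice the string would come from capturing the CLI output, for example with `subprocess.run(["cb", "cluster", "list"], capture_output=True, text=True).stdout`.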
At this point we'd just like to stop the action and go back to a normal state. Any advice would be great.
Created 05-22-2019 11:43 AM
Hi @Oliver Fox,
It looks like removing the node from Ambari is stuck. Could you check the Ambari UI/logs to see if it has any issues?
Created 05-22-2019 01:18 PM
The only thing that I've found that looks like an error is in the ambari-audit.log, nothing in the Ambari UI:
2019-05-22T13:14:03.946Z, User(null), RemoteIp(<IP>), Operation(User login), Roles( ), Status(Failed), Reason(Authentication required)
2019-05-22T13:14:03.947Z, User(cloudbreak), RemoteIp(<IP>), Operation(User login), Roles( Ambari: Ambari Administrator ), Status(Success)
The EC2 node that was causing problems is actually in a good state now, so it doesn't need to be removed.
Created on 05-23-2019 08:46 AM - edited 08-17-2019 03:23 PM
If you click on the background operations icon (the gear in the upper-right corner) in Ambari, do you see a job called "Stop all components on host"?
This is what CB is waiting for according to the logs.
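If the UI is hard to read, the same check can probably be done against Ambari's REST API. The sketch below assumes the usual `/api/v1/clusters/<cluster>/requests` endpoint and the `Requests/request_status` and `Requests/request_context` fields; the set of "active" statuses, the function names, and the sample payload are my own assumptions for illustration, so verify them against your Ambari version.

```python
import base64
import json
import urllib.request
from typing import Any, Dict, List

# Assumption: these are the statuses of requests Ambari has not finished yet.
ACTIVE_STATUSES = {"PENDING", "QUEUED", "IN_PROGRESS"}


def fetch_requests(base_url: str, cluster: str, user: str, password: str) -> Dict[str, Any]:
    """Fetch the request list for a cluster from Ambari (endpoint shape assumed)."""
    url = (
        f"{base_url}/api/v1/clusters/{cluster}/requests"
        "?fields=Requests/request_context,Requests/request_status"
    )
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def active_requests(payload: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the Requests entries that are still pending, queued, or running."""
    return [
        item["Requests"]
        for item in payload.get("items", [])
        if item.get("Requests", {}).get("request_status") in ACTIVE_STATUSES
    ]


# Hand-written sample payload (NOT real Ambari output) to show the filtering.
sample_payload = {
    "items": [
        {"Requests": {"id": 41, "request_context": "Install components",
                      "request_status": "COMPLETED"}},
        {"Requests": {"id": 42,
                      "request_context": "Stopping components on the decommissioned hosts",
                      "request_status": "IN_PROGRESS"}},
    ]
}

for r in active_requests(sample_payload):
    print(r["id"], r["request_status"], r["request_context"])
```

Against a live server you would call `active_requests(fetch_requests("http://<ambari-host>:8080", "<cluster>", "<user>", "<password>"))`; an empty result would mean Ambari has nothing running for CB to wait on.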
You can also try restarting CB; that may get it out of the loop.
Created 05-23-2019 01:46 PM
There are completed background operations, but none for "Stop All Components on hosts" and no pending operations.
We ended up restarting CB as suggested. The repair then completed: it removed the node and added a new one to the cluster.
Thanks for the help.