Support Questions


Cloudbreak stuck in a loop after attempting to repair a cluster

Explorer

We attempted to repair a cluster after one of our nodes went into a bad state because of an issue with AWS. I ran the following command:

cb cluster repair --name <cluster> --host-groups <host_groups>

Cloudbreak now seems to be stuck in a loop, repeatedly logging the following:

cloudbreak_1   | 2019-05-21 13:30:21,119 [reactorDispatcher-68] pollWithTimeout:32 INFO  c.s.c.s.PollingService - [owner:6476a4d7-bab8-4bf9-bfcd-aa6a43aa1d5f] [type:CLUSTER] [id:2] [name:emea-hdp] [flow:438c526a-325b-40c8-b86a-cc15aad4728a] [tracking:669a1784-a361-4509-9cd2-c57847a15cbb] Polling attempt 16277.
cloudbreak_1   | 2019-05-21 13:30:21,134 [reactorDispatcher-68] checkStatus:48 INFO  c.s.c.s.c.f.AmbariOperationsStatusCheckerTask - [owner:6476a4d7-bab8-4bf9-bfcd-aa6a43aa1d5f] [type:CLUSTER] [id:2] [name:<cluster>] [flow:438c526a-325b-40c8-b86a-cc15aad4728a] [tracking:669a1784-a361-4509-9cd2-c57847a15cbb] Ambari operation: 'Stopping components on the decommissioned hosts', Progress: 0
uluwatu_1      | 2019-05-21T13:30:21.141Z INFO [owner: ] [email: ] /notification endpoint:  {"eventType":"STOP_SERVICES_AMBARI_PROGRESS_STATE","eventTimestamp":1558445421137,"eventMessage":"0","owner":null,"account":null,"userIdV3":"email@email.com","cloud":"AWS","region":"eu-central-1","availabilityZone":null,"blueprintId":null,"blueprintName":null,"clusterId":2,"clusterName":"<cluster>","stackId":2,"stackName":"<cluster>","stackStatus":"AVAILABLE","nodeCount":null,"instanceGroup":null,"clusterStatus":"UPDATE_IN_PROGRESS","workspaceId":1}
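To see at a glance whether the loop is making any progress, the two Cloudbreak log lines above can be summarized with a small script. This is only a sketch: the regular expressions are inferred from the excerpt in this post rather than from any documented log format, the sample lines are abbreviated copies of that excerpt, and `summarize` is a hypothetical helper name.

```python
import re

# Patterns inferred from the log excerpt in this thread (an assumption,
# not a documented Cloudbreak log format).
POLL_RE = re.compile(r"Polling attempt (\d+)\.")
PROGRESS_RE = re.compile(r"Ambari operation: '([^']+)', Progress: (\d+(?:\.\d+)?)")


def summarize(lines):
    """Return (highest polling attempt seen, {operation: last reported progress})."""
    max_attempt = 0
    progress = {}
    for line in lines:
        poll = POLL_RE.search(line)
        if poll:
            max_attempt = max(max_attempt, int(poll.group(1)))
        op = PROGRESS_RE.search(line)
        if op:
            progress[op.group(1)] = float(op.group(2))
    return max_attempt, progress


# Abbreviated copies of the two cloudbreak_1 lines quoted above.
sample = [
    "... c.s.c.s.PollingService - [type:CLUSTER] [id:2] Polling attempt 16277.",
    "... AmbariOperationsStatusCheckerTask - [type:CLUSTER] [id:2] Ambari operation: "
    "'Stopping components on the decommissioned hosts', Progress: 0",
]

attempts, progress = summarize(sample)
print(attempts, progress)
```

A very high attempt count paired with a progress value that never moves off 0 is the pattern described in this post.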

cb cluster list shows the following:

[
  {
    "Name": "<cluster>",
    "Description": "",
    "CloudPlatform": "AWS",
    "StackStatus": "AVAILABLE",
    "ClusterStatus": "UPDATE_IN_PROGRESS"
  }
]
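If you want to watch for this state from a script, the JSON printed by `cb cluster list` can be checked for clusters whose status is still in progress. The sketch below assumes the output has exactly the shape shown above; `clusters_in_progress` is a hypothetical helper, and treating any `ClusterStatus` ending in `_IN_PROGRESS` as "busy" is an assumption based on the single value seen here.

```python
import json


def clusters_in_progress(cluster_list_json):
    """Return names of clusters whose ClusterStatus ends in _IN_PROGRESS."""
    clusters = json.loads(cluster_list_json)
    return [
        c["Name"]
        for c in clusters
        if c.get("ClusterStatus", "").endswith("_IN_PROGRESS")
    ]


# The output quoted in this post, with the cluster name still redacted.
sample_output = """
[
  {
    "Name": "<cluster>",
    "Description": "",
    "CloudPlatform": "AWS",
    "StackStatus": "AVAILABLE",
    "ClusterStatus": "UPDATE_IN_PROGRESS"
  }
]
"""

print(clusters_in_progress(sample_output))
```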

At this point we'd just like to stop the action and go back to a normal state. Any advice would be great.

4 REPLIES

Expert Contributor

Hi @Oliver Fox,

It looks like removing the node from Ambari is stuck. Could you check the Ambari UI/logs to see if it has any issues?

Explorer

The only thing I've found that looks like an error is in ambari-audit.log; there is nothing in the Ambari UI:

2019-05-22T13:14:03.946Z, User(null), RemoteIp(<IP>), Operation(User login), Roles(
), Status(Failed), Reason(Authentication required)
2019-05-22T13:14:03.947Z, User(cloudbreak), RemoteIp(<IP>), Operation(User login), Roles(
    Ambari: Ambari Administrator
), Status(Success)

The EC2 node that was causing problems is actually in a good state now; it doesn't need to be removed.

Expert Contributor

If you click the background operations icon (the gear in the upper-right corner) in Ambari, do you see a job called "Stop all components on host"?

This is what CB is waiting for according to the logs.

You can also try restarting CB; that may get it out of the loop.
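The background operations mentioned above can also be inspected outside the UI through Ambari's REST API, which lists them under the cluster's `requests` resource. The following is a rough sketch, not a verified recipe for this Ambari version: the set of "active" status names is an assumption, the sample payload is invented purely for illustration, and `requests_url` and `active_requests` are hypothetical helpers. It only builds the URL and filters an already-fetched response; fetching it (with Ambari admin credentials) is left out.

```python
# Statuses treated as "still running" (assumed; check your Ambari version).
ACTIVE_STATUSES = {"PENDING", "QUEUED", "IN_PROGRESS", "HOLDING"}


def requests_url(base_url, cluster):
    """Build the URL that lists background operations for a cluster."""
    fields = "Requests/request_context,Requests/request_status"
    return f"{base_url}/api/v1/clusters/{cluster}/requests?fields={fields}"


def active_requests(payload):
    """Return (id, context, status) for every request that has not finished."""
    result = []
    for item in payload.get("items", []):
        req = item.get("Requests", {})
        if req.get("request_status") in ACTIVE_STATUSES:
            result.append(
                (req.get("id"), req.get("request_context"), req.get("request_status"))
            )
    return result


# Invented payload for illustration only; not taken from this cluster.
sample_payload = {
    "items": [
        {"Requests": {"id": 41, "request_context": "Restart all components",
                      "request_status": "COMPLETED"}},
        {"Requests": {"id": 42, "request_context": "Stop all components on host",
                      "request_status": "IN_PROGRESS"}},
    ]
}

print(requests_url("https://ambari.example.com", "mycluster"))
print(active_requests(sample_payload))
```

If the filtered list is empty while Cloudbreak keeps polling, Cloudbreak is waiting on an operation that Ambari no longer considers active.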

Explorer

There are completed background operations, but none for "Stop All Components on hosts", and there are no pending operations.

We ended up restarting CB as suggested. The repair then completed: it removed the node and added a new one to the cluster.

Thanks for the help.
