
can't delete bad node from the cluster


I have a 64-node cluster.

One node has gone bad; it no longer communicates.

I want to remove it from the cluster.

 

Name   IP            Rack       CDH Version   Cluster     Roles       Health        Last Heartbeat
ch-8   10.71.0.108   /default   CDH 5         Cluster 1   2 Role(s)   Good Health   12.38s ago
ch-9   10.71.0.109   /default   Unknown       Cluster 1   2 Role(s)   Bad Health    None

 

Here it is in the Hosts list; it's a DataNode and a NodeManager (YARN).

 

When I try to delete, it tells me

Delete Hosts
The following 1 host(s) cannot be deleted because they have role instances or are not completely decommissioned:

Host    Role Instances
ch-9    nodemanager (ch-9) and 1 other role(s).

If I try "Remove Hosts From Cluster", it doesn't work either:

Removing these hosts will stop and delete all roles running on them and then remove them from their clusters. The hosts will still be managed by Cloudera Manager and can be utilized after being added to new or existing clusters.
Role data directories will not be deleted.

Host    Role
ch-9    NodeManager, DataNode

Decommission Roles (Warning: Removing the hosts without decommissioning the roles running on them can result in permanent data loss.)
Skip Management Roles
 
 
Command Details: Hosts Decommission

Command: Hosts Decommission   Status: Finished   Started at: Mar 31, 2014 1:09:28 PM PDT   Ended at: Mar 31, 2014 1:09:28 PM PDT
Result: Command 'DecommissionWithWait' failed for service 'yarn'

Child Commands
Command: Decommission (2)   Context: YARN (MR2 Included)   Status: Finished   Started at: Mar 31, 2014 1:09:28 PM PDT   Ended at: Mar 31, 2014 1:09:28 PM PDT
Result: Failed to perform decommission.


 
 

Basically, if I can't talk to the node, I can't stop, decommission, or delete it. How should I do this?

1 ACCEPTED SOLUTION

The decommission step may have done the same thing as the stop command I suggested. If this happens again, I'd try the decommission command, let it fail, then delete host. If that doesn't work, then try my stop suggestion.

View solution in original post

5 REPLIES

Try stopping all roles on that host (via CM), then removing it. The stop commands will fail, but they'll mark the state as stopped so it'll let you remove it from the cluster.
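In case anyone wants to script this instead of clicking through CM, roughly the same sequence is exposed by the CM REST API. The sketch below is illustrative only: the API version, CM hostname, admin credentials, and the exact role name are placeholders you would need to look up on your own instance.

# Sketch only: API version, CM host, credentials, and the role name are placeholders.
# List the YARN roles to find the NodeManager role name that lives on ch-9:
curl -u admin:admin "http://cm-host:7180/api/v6/clusters/Cluster%201/services/yarn/roles"

# Issue a stop for that role; against a dead host it will fail, but the role gets marked stopped:
curl -u admin:admin -X POST \
  -H "Content-Type: application/json" \
  -d '{ "items": ["<nodemanager-role-name-on-ch-9>"] }' \
  "http://cm-host:7180/api/v6/clusters/Cluster%201/services/yarn/roleCommands/stop"

# Repeat the same stop for the DataNode role under the hdfs service, then remove the
# host from the cluster (the hostId comes from GET /api/v6/hosts):
curl -u admin:admin -X DELETE "http://cm-host:7180/api/v6/clusters/Cluster%201/hosts/<hostId-of-ch-9>"

The API version shown (v6) is just a guess for a CM 5.x install; GET /api/version reports the highest version your server actually supports.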


thanks for the quick response.

I thought I had tried that... but

I was showing someone else the problem and told him to just try it, and he managed to delete the node by doing exactly what had failed for me. So maybe the node had gotten itself into the desired state.

 

In any case, I can't try your recommendation immediately.

However, I think I'll get nodes into a bad state again sometime soon and will try what you recommend then.

 

thanks again

(Sorry I can't confirm exactly right now, but my node is gone now, which is good.)

 

-kevin

The decommission step may have done the same thing as the stop command I suggested. If this happens again, I'd try the decommission command, let it fail, then delete host. If that doesn't work, then try my stop suggestion.
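(If this does happen again and you would rather drive it through the API, the decommission-then-delete sequence can be scripted with something like the calls below. This is only a sketch: the API version, CM host, and credentials are placeholders.)

# Sketch only: API version, CM host, and credentials are placeholders.
# Ask CM to decommission the host; for an unreachable node this is expected to fail:
curl -u admin:admin -X POST \
  -H "Content-Type: application/json" \
  -d '{ "items": ["ch-9"] }' \
  "http://cm-host:7180/api/v6/cm/commands/hostsDecommission"

# Then delete the host from Cloudera Manager (hostId comes from GET /api/v6/hosts):
curl -u admin:admin -X DELETE "http://cm-host:7180/api/v6/hosts/<hostId-of-ch-9>"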

Explorer

Remove the dead/decommissioned host from mammoth -c output or CM.

 

We have already deleted the host.

We are about to start the upgrade process from 5.14.2 to 6.0, and as a prerequisite we are cleaning up old hosts.

When running ./mammoth -c, it still reports information about a host which is no longer part of the cluster. We are also thinking of removing it from the HOSTS table in the scm database. In MySQL, under the scm database, I can see:

 

mysql> select * from HOSTS;
+---------+-------------------------+--------------------------------------+-----------------------------+---------------+----------+--------+
| HOST_ID | OPTIMISTIC_LOCK_VERSION | HOST_IDENTIFIER                      | NAME                        | IP_ADDRESS    | RACK_ID  | STATUS |
+---------+-------------------------+--------------------------------------+-----------------------------+---------------+----------+--------+
|       1 |                     248 | 260772a1-a89a-42b8-af4c-0406ac0c21bd | bdk1n07.bnet.luxds.net      | 192.168.11.16 | /default | NA     |
|       2 |                     251 | 19103582-a94d-4961-aeb8-5a2023480fa5 | bdk1n09.bnet.luxds.net      | 192.168.11.18 | /default | NA     |
|       3 |                     254 | e57f3aa9-ab4f-4b3c-925d-2be272237928 | bdk1n08.bnet.luxds.net      | 192.168.11.17 | /default | NA     |
|       4 |                      89 | 0317c86d-b693-4280-ba25-0bbcc46e567c | xl11lsrv0428.bnet.luxds.net | 10.178.65.98  | /default | NA     |
+---------+-------------------------+--------------------------------------+-----------------------------+---------------+----------+--------+

The one with hostId "0317c86d-b693-4280-ba25-0bbcc46e567c" (which was an edge node before) has already been removed from Cloudera, so is there any way to clean this node out of CM? On the Cloudera Manager Hosts screen I can only see 3 nodes.
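(If you do go down the database route, below is a heavily hedged sketch of what that cleanup might look like. Editing the scm database by hand is not a supported procedure; the HOST_ID and UUID come from the SELECT above, but the database name and credentials are placeholders, and foreign keys in other tables may still reference that row.)

# Unsupported, last-resort sketch only; database name and credentials are placeholders.
# Stop the server and back up the scm database before touching anything:
service cloudera-scm-server stop
mysqldump -u scm -p scm > scm_backup_before_host_cleanup.sql

# Remove the stale row for the re-imaged edge node (HOST_ID 4 in the SELECT above).
# If the delete fails because other tables still reference this HOST_ID, stop there
# and raise it with support rather than forcing it:
mysql -u scm -p scm -e "DELETE FROM HOSTS WHERE HOST_ID = 4 AND HOST_IDENTIFIER = '0317c86d-b693-4280-ba25-0bbcc46e567c';"

service cloudera-scm-server start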

 

Is that server xl11lsrv0428.bnet.luxds.net | 10.178.65.98 still running separately?
Ans. It is running separately and has even been re-imaged.

Is the CM agent still running or stopped on the server xl11lsrv0428.bnet.luxds.net | 10.178.65.98?
Ans. No CM agent is running on it currently.

Is it showing in the CM portal?
Ans. On CM, there is no entry for xl11lsrv0428.

Master Guru

Hi @pra_big,

 

Please do not add onto a solved thread from 5 years ago. It is very unlikely that the current issue you face is identical, so it is best to start a new conversation.

 

Please outline what you are trying to do, what you expect to have happen, and what is actually occurring.

 

From your description, it appears you are running a script that may be an Oracle script (mammoth). That is not a Cloudera script, so please consult the vendor that supplied you with "mammoth" if you need assistance with it.

 

It is hard to tell what you are asking about with respect to the host in Cloudera Manager. If you want to delete a host in CM, go to the Hosts tab and select "All Hosts". Then find the host you wish to delete, check the box next to it, and choose "Delete" from the drop-down menu.
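(For completeness, the same delete can also be done through the CM REST API; the sketch below is illustrative only, with the API version, CM host, and credentials as placeholders. GET /api/version on your server reports the highest API version it supports.)

# Sketch only: API version, CM host, and credentials are placeholders.
# List hosts to find the hostId (CM identifies hosts by UUID, not hostname):
curl -u admin:admin "http://cm-host:7180/api/v19/hosts"

# Delete that host from Cloudera Manager:
curl -u admin:admin -X DELETE "http://cm-host:7180/api/v19/hosts/<hostId>"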

 

Maybe you could show screenshots or explain more about what you are having trouble with.

NOTE:  when the Cloudera Manager Agent heartbeats to CM, CM identifies the host by "uuid" not hostname.  So, if you re-imaged and accidentally reused a UUID from another host, that could lead to some confusion.
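(A quick way to check for that, assuming a default cloudera-scm-agent installation: the agent stores its identity in a small uuid file, and removing that file before a restart makes it generate a fresh UUID on the next heartbeat. Sketch below; verify the path on your own nodes.)

# Sketch only; the path assumes a default cloudera-scm-agent install.
# Show the UUID this host presents to CM when it heartbeats:
cat /var/lib/cloudera-scm-agent/uuid

# If a re-imaged host is reusing an old identity, force a fresh UUID:
service cloudera-scm-agent stop
rm /var/lib/cloudera-scm-agent/uuid
service cloudera-scm-agent start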

 

We need to clearly understand what problem you are seeing to provide the best help.