Cluster Down - HostCleanup deleted service Jars

New Contributor

<This may be a re-post. I originally posted in Hadoop Core but my profile is showing 0 questions.>

Hello,

Thank you in advance for any assistance you're able to provide. Over the past few days I have been getting notifications that disk space (OS) on one of my nodes was over 50%. This has happened before, and zipping up the log files has always been enough to fix it. This time I found that 10GB of drive space was being used by old HDP versions (left over after several upgrades). I googled around for some clever solutions and came across this post which suggests using HostCleanup.py (DON'T DO THIS).

I first ran the script with the --help flag to see what it was about. Descriptions were limited. I did, however, see that '--silent' mode skips all prompts. I left that flag off for my run, thinking that I could stop the script at a prompt if there was something I didn't like. Instead the script ran straight through, deleted all service directories in /etc, silently (no logging) uninstalled many of the service jars/binaries, and deleted many or all of the symlinks for them. This all happened on the standby namenode.

Most services shut down immediately because there were 5 core services on this node. I now have a semi-production cluster with terabytes of data in an unusable state. I attempted to migrate the data off to a different cluster, but my primary namenode, after several rounds of install and service-move attempts, ended up in 'standby.' I desperately need help to restore services on this cluster. Below are the current state and what I believe to be possible next steps.

CURRENT STATE:

NN1 is in 'standby' and NN2 (the victim) will not start. NN2 reports 'NameNode not initialized.'

Most services are down and will not restart without a valid/responsive NN.

I cannot back up the namespace without a responsive namenode.

I cannot list, export, move, or otherwise interact with HDFS without a responsive namenode.

POTENTIAL STEPS:

1. Somehow get NN2 to start, export/backup all data, and take further action.

2. Magically restore all services and config files on NN2 following the mass-deletion.

3. Wipe it all, restore, try to recoup data by other aggregation methods, hope nobody notices.

4. Something else.

I am very much the 'long time listener, first time caller' and really appreciate any help you good folks can offer. I don't like to be the guy with the problem, but I sure hope someone finds this thread before the original one, which led me to act carelessly and aggressively with scripts I was not familiar with.

Re: Cluster Down - HostCleanup deleted service Jars

New Contributor

Further Detail:

The last upgrade was from 2.3.0.0 to 2.4.2, which is the current version. The cluster is HA-enabled for all capable services.

Re: Cluster Down - HostCleanup deleted service Jars

Guru

@Rob C

Can you try "hdfs haadmin -transitionToActive nn1" from the standby namenode (nn1) and see if you are able to bring up the namenode?
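
Before attempting the transition, it may help to confirm what each NameNode currently reports. The following is only a minimal sketch, assuming the NameNode service IDs are nn1 and nn2 as used in this thread; `HDFS_CMD` and `ha_state` are hypothetical names introduced here for illustration, not part of Hadoop.

```shell
#!/bin/sh
# Sketch: report the HA state of each NameNode, then try the manual
# transition only if nn1 really is in standby.
# Assumption: service IDs are nn1 and nn2 (taken from this thread).

# Hypothetical override so the sketch can be dry-run without a cluster.
HDFS_CMD="${HDFS_CMD:-hdfs}"

ha_state() {
    # "hdfs haadmin -getServiceState <serviceId>" prints the NameNode's HA state.
    # If the call fails (NameNode down, binaries missing), print "unreachable".
    $HDFS_CMD haadmin -getServiceState "$1" 2>/dev/null || echo "unreachable"
}

for nn in nn1 nn2; do
    echo "$nn: $(ha_state "$nn")"
done

# Attempt the transition suggested above only when nn1 reports standby.
if [ "$(ha_state nn1)" = "standby" ]; then
    $HDFS_CMD haadmin -transitionToActive nn1
fi
```

If nn2 shows as unreachable, the transition may still be refused; the output at least makes clear which side is blocking it.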

Re: Cluster Down - HostCleanup deleted service Jars

New Contributor

Thanks @srai,

I had tried this previously, but nn1 will not transition because nn2 is non-responsive. I should note that I had some success with YARN on nn2 before nn1 transitioned to standby: I re-installed the packages for app_timeline_server on nn2, copied core-site.xml from nn1 to nn2, and then used Ambari to move that service to a different host. I'm open to similar ideas for the other master services still on nn2, but I think I have to get nn1 to be active first.