<This may be a re-post. I originally posted in Hadoop Core but my profile is showing 0 questions.>
Thank you in advance for any assistance you're able to provide. Over the past few days I have been getting notifications that disk usage on the OS drive of one of my nodes was over 50%. This has happened before, and simply zipping up the log files has been enough to fix it. This time I found 10GB of drive space taken up by old HDP versions left behind after several upgrades. I googled around for some clever solutions and came across this post which suggests using HostCleanup.py (DON'T DO THIS). I first ran the script with the --help flag to see what it was about. Descriptions were limited. I did, however, see that '--silent' mode skips all prompts. I left that flag off for my run, thinking that I could stop the script if there was something I didn't like. Instead the script ran straight through: it deleted all service directories in /etc, silently (no logging) uninstalled many of the service jars/binaries, and deleted many/all of the symlinks for them. This all happened on the standby namenode.
Most services shut down immediately because there were 5 core services on this node. I now have a semi-production cluster with terabytes of data in an unusable state. I attempted to migrate the data off to a different cluster, but my primary namenode, after several rounds of install and service-movement attempts, ended up in 'standby.' I desperately need help to restore services on this cluster. Below are the current state and what I believe to be possible next steps.
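For anyone else sizing up old version directories, a read-only report may be a safer first step than any cleanup script. The sketch below only measures and never deletes. It assumes the old versions sit as subdirectories of one parent directory; the `/usr/hdp` path and the function names `dir_size_bytes` and `report_versions` are illustrative assumptions, not something taken from HostCleanup.py.

```python
import os


def dir_size_bytes(path):
    """Sum the sizes of regular files under path, skipping symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):  # does not follow directory symlinks
        for name in files:
            full = os.path.join(root, name)
            if os.path.islink(full):
                continue
            try:
                total += os.path.getsize(full)
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return total


def report_versions(parent):
    """Return (name, size_in_bytes) for each real subdirectory, largest first."""
    rows = []
    for entry in os.listdir(parent):
        full = os.path.join(parent, entry)
        if os.path.isdir(full) and not os.path.islink(full):
            rows.append((entry, dir_size_bytes(full)))
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    parent = "/usr/hdp"  # assumed install location; adjust for your layout
    if os.path.isdir(parent):
        for name, size in report_versions(parent):
            print(f"{size / 1e9:8.2f} GB  {name}")
```

Nothing here decides which version is still in use; that check has to happen separately before anything is removed by hand.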
NN1 is in 'standby' and NN2 (the victim) will not start. NN2 reports 'NameNode not initialized.'
Most services are down and will not restart without a valid/responsive NN.
I cannot back up the namespace without a responsive namenode.
I cannot list, export, move, or otherwise interact with HDFS without a responsive namenode.
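The state listed above can be re-checked from a shell on either namenode host. The following is a minimal sketch that wraps `hdfs haadmin -getServiceState`. It assumes the configured NameNode service IDs are literally `nn1` and `nn2` (they may be named differently in hdfs-site.xml) and that the `hdfs` client is still installed on the host where it runs. The `namenode_state` helper is hypothetical.

```python
import subprocess


def namenode_state(service_id, runner=subprocess.run):
    """Return the reported HA state, or 'unreachable' if the query fails."""
    cmd = ["hdfs", "haadmin", "-getServiceState", service_id]
    try:
        result = runner(cmd, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        # hdfs binary missing (as on a wiped host) or the call hung
        return "unreachable"
    if result.returncode != 0:
        return "unreachable"
    return result.stdout.strip()


if __name__ == "__main__":
    for sid in ("nn1", "nn2"):  # assumed service IDs
        print(sid, namenode_state(sid))
```

The `runner` parameter exists only so the function can be exercised without a live cluster; in normal use it is left at its default.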
1. Somehow get NN2 to start, export/backup all data, and take further action.
2. Magically restore all services and config files on NN2 following the mass-deletion.
3. Wipe it all, restore, try to recoup data by other aggregation methods, hope nobody notices.
4. Something else.
I am very much the 'long time listener, first time caller' and really appreciate any help you good folks can offer. I don't like to be the guy with the problem - but I sure hope someone ends up finding this thread before the original one which caused me to act carelessly and aggressively with scripts I was not familiar with.
Last upgrade was from 220.127.116.11 to 2.4.2, which is the current version. The cluster is HA-enabled for all capable services.
I had tried this previously, but nn1 will not transition because nn2 is non-responsive. I should note that I had some success with YARN on nn2 before nn1 transitioned to standby: I re-installed the packages for app_timeline_server on nn2, copied core-site.xml from nn1 to nn2, and then used Ambari to move that service to a different host. I'm open to similar ideas for the other master services still on nn2, but I think I have to get nn1 to be active first.
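When copying files such as core-site.xml between hosts, it can help to confirm afterwards that both copies hold the same properties. The sketch below parses two Hadoop-style `*-site.xml` documents and lists every property whose value differs. The helper names and the sample values are hypothetical; only the `<configuration>/<property>/<name>/<value>` layout is the standard Hadoop format.

```python
import xml.etree.ElementTree as ET


def load_properties(xml_text):
    """Parse Hadoop *-site.xml text into a {name: value} dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def diff_properties(a, b):
    """Return {name: (value_in_a, value_in_b)} for properties that differ."""
    changed = {}
    for name in sorted(set(a) | set(b)):
        if a.get(name) != b.get(name):
            changed[name] = (a.get(name), b.get(name))
    return changed


if __name__ == "__main__":
    # Sample contents are made up for illustration.
    nn1_xml = """<configuration>
      <property><name>fs.defaultFS</name><value>hdfs://mycluster</value></property>
      <property><name>ha.zookeeper.quorum</name><value>zk1:2181</value></property>
    </configuration>"""
    nn2_xml = """<configuration>
      <property><name>fs.defaultFS</name><value>hdfs://localhost:8020</value></property>
    </configuration>"""
    for name, (v1, v2) in diff_properties(
        load_properties(nn1_xml), load_properties(nn2_xml)
    ).items():
        print(f"{name}: nn1={v1!r} nn2={v2!r}")
```

In practice the two strings would be read from the files on each host. A property missing on one side shows up with `None` for that side.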