Created 05-17-2019 04:04 PM
I have been facing frequent NameNode failover and checkpointing issues on my 6-node cluster on VMs; mostly, the standby NameNode would remain in a stopped state after a failover until started manually. I have tried increasing the QJM timeouts, which helped with the failovers; however, the checkpointing issue remains. I have 3 journal nodes, on node02, node03 and node04. I found multiple edits_inprogress files on one of the journal nodes (node02), while the other two (node03 and node04) each had one edits_inprogress.empty file. After taking a backup, I deleted the extra edits_inprogress files (leaving the most recent one) and the edits_inprogress.empty files on the other two, then restarted all the JNs one by one. However, after checking the current directory on my journal node (node02) later, unfortunately it has two edits_inprogress files yet again. I can't seem to understand the behavior of this journal node. How and what is causing it to generate more than one edits_inprogress file on this particular node (node02)? Kindly help.
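For reference, this is roughly how I have been checking for the in-progress edits on each journal node (the path follows my cluster's layout; the actual cluster name is masked here):
$ ls -lt /hadoop/hdfs/journal/{cluster_name}/current/ | grep edits_inprogress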
Created 05-17-2019 06:01 PM
Do you have Namenode HA enabled?
Created 05-18-2019 05:08 AM
@Geoffrey Shelton Okot : Thanks for responding. Yes, Namenode is HA enabled.
Created 05-18-2019 08:20 AM
Perfect, that's an issue I have resolved before, so let me document the process for you. Can you share a screenshot of the path to your edits_000000 files?
The path is /hadoop/hdfs/journal/{cluster_name}/current/. On the 3 journal nodes, count the number of files in that directory on each journal node to identify the healthy journal node. Then, on all the journal nodes, in the edits directory, run:
$ cat last-promised-epoch
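For the file count in the first step, something like this can be run on each journal node (a minimal sketch; substitute your actual cluster name in the path):
$ ls /hadoop/hdfs/journal/{cluster_name}/current/ | wc -l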
Once I have the above output, I will show you the steps.
Created on 05-20-2019 08:53 AM - edited 08-17-2019 03:25 PM
Number of files present in the current directory (/hadoop/hdfs/journal/XXXXcluster/current) of each journal node -
node02 : 11078
node03 : 11082
node04 : 11081
Last promised epoch -
node02: 542
node03: 542
node04: 542
As asked, attaching a screenshot of the path to the edits_000000 files (on node02).
Created 05-20-2019 04:31 PM
How do I fix one corrupted JN's edits?
Instructions to fix that one journal node.
1) Put both NameNodes in safe mode from the active NameNode (NN HA)
$ hdfs dfsadmin -safemode enter
2) Save Namespace
$ hdfs dfsadmin -saveNamespace
3) Back up the edits_* files in /hadoop/hdfs/journal/{cluster_name}/current/ on node02 and node04, and take note of the file permissions (a screenshot would be important). Write the backup tarball outside the journal directory (e.g. /tmp) so it is not removed by the cleanup below.
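If it helps, the original permissions can also be recorded to a file before touching anything (an illustrative command; the /tmp output path is just an example):
# ls -l /hadoop/hdfs/journal/{cluster_name}/current/ > /tmp/jn_perms_before.txt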
On node02
# cd /hadoop/hdfs/journal/{cluster_name}/current/
# tar -czvf /tmp/node02.tar.gz *
# rm -rf *
On node04
# cd /hadoop/hdfs/journal/{cluster_name}/current/
# tar -czvf /tmp/node04.tar.gz *
# rm -rf *
On the good node03
Zip/tar the journal directory on the working JN so it can be copied to the non-working JNs (node02 and node04). On node03:
# cd /hadoop/hdfs/journal/{cluster_name}/current/
# tar -czvf node03.tar.gz *
I hope you have the root password for the cluster.
From node03, in the /hadoop/hdfs/journal/{cluster_name}/current/ directory, run the commands below to copy the good edits_* to node02 and node04:
# scp node03.tar.gz root@node02:/hadoop/hdfs/journal/{cluster_name}/current/
# scp node03.tar.gz root@node04:/hadoop/hdfs/journal/{cluster_name}/current/
Having copied the zipped edits_* files, open 2 windows, connect as root to node02 and node04, and run the steps below.
On node02
# cd /hadoop/hdfs/journal/{cluster_name}/current/
# tar -xzvf node03.tar.gz
On node04
# cd /hadoop/hdfs/journal/{cluster_name}/current/
# tar -xzvf node03.tar.gz
Check that the file permissions are okay (they should match what you noted before the backup).
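If the root copy changed the ownership, it can be restored with something like the following (assuming the journal files are owned by hdfs:hadoop, as is typical on HDP; verify against the permissions you recorded earlier):
# chown -R hdfs:hadoop /hadoop/hdfs/journal/{cluster_name}/current/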
Stop the journal nodes on node02 and node04
Open 2 windows and run the below on node02 and node04
# su -l hdfs -c "/usr/hdp/current/hadoop-hdfs-journalnode/../hadoop/sbin/hadoop-daemon.sh stop journalnode"
Restart the journal nodes
Open 2 windows and run the below on node02 and node04
# su -l hdfs -c "/usr/hdp/current/hadoop-hdfs-journalnode/../hadoop/sbin/hadoop-daemon.sh start journalnode"
All these commands should run successfully
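To confirm the JournalNode process came back on each node, a quick check such as this can be used (illustrative only):
# ps -ef | grep -i [j]ournalnode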
To validate, once everything has started well, you can restart all HDFS components.
4) Restart HDFS
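If the NameNodes come back still in safe mode after the restart, the state can be checked and safe mode left manually with the standard dfsadmin commands:
$ hdfs dfsadmin -safemode get
$ hdfs dfsadmin -safemode leave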
Your issue should now be resolved! Please revert.
HTH
Created 05-21-2019 12:39 PM
Any updates?
Created 05-21-2019 05:20 PM
Hi, just saw your comment. Will apply the mentioned changes and revert with updates. Thank you.
Created 05-21-2019 07:04 PM
@Geoffrey Shelton Okot : Hi,
I have a doubt pertaining to the number of files present in the current directory on each of the JNs. (I have checked it a couple of times within a 5-minute period; the number of files on both node03 and node04 remains the same, while only node02 differs.)
node02 : 11040
node03 : 11043
node04 : 11043
Does this imply that the edits on only node02 are corrupted while node03 and node04 are OK? Kindly confirm whether the above conjecture is correct or otherwise, and whether I should go ahead with the changes on both node02 and node04 or just node02. (The last promised epoch on all 3 nodes is 577.)
Thanks.
Created 05-21-2019 08:09 PM
Which is the active node? Can you check again the last epoch on all the 3 nodes? Are you still experiencing the same problem?
Please revert so I can analyze your problem again.
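In the meantime, the active NameNode and the current epoch can be checked with something along these lines (nn1/nn2 are placeholder NameNode IDs from hdfs-site.xml, and the journal path follows this thread's layout; adjust both to your cluster):
$ hdfs haadmin -getServiceState nn1
$ hdfs haadmin -getServiceState nn2
$ cat /hadoop/hdfs/journal/{cluster_name}/current/last-promised-epoch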