
Multiple edits_inprogress files on one of the journal nodes.

Explorer

I have been facing frequent NameNode failover and checkpointing issues on my 6-node cluster on VMs; mostly, the standby NN would remain in a stopped state after a failover until started manually. I have tried increasing the QJM timeouts, which helped with the failover. However, the checkpointing issue remains.

I have 3 journal nodes, on node02, node03, and node04. I found multiple edits_inprogress files on one of the journal nodes (node02), while the other two (node03 and node04) had one edits_inprogress.empty file each. After taking a backup, I deleted the extra edits_inprogress files (leaving the most recent one) and the edits_inprogress.empty files on the other two, then restarted all the JNs one by one. However, after checking the current directory on my journal node (node02) later, unfortunately it has two edits_inprogress files yet again.

I can't seem to understand this behavior of this journal node. How and what is causing it to generate more than one edits_inprogress file on this particular node (node02)? Kindly help.
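(A quick way to spot the stray files on all three journal nodes in one pass - just a sketch, assuming the journal directory lives under /hadoop/hdfs/journal as on this cluster and that SSH to the nodes works:)

$ for n in node02 node03 node04; do echo "== $n =="; ssh $n 'ls -l /hadoop/hdfs/journal/*/current/edits_inprogress*'; done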

edits_inprogress.png


@Jay Kumar SenSharma

@Geoffrey Shelton Okot

18 REPLIES

Mentor

@Farhana Khan

Do you have Namenode HA enabled?

Explorer

@Geoffrey Shelton Okot: Thanks for responding. Yes, NameNode HA is enabled.

Mentor

@Farhana Khan

Perfect, that's an issue I resolved some time ago, so let me document the process for you. Can you share a screenshot of the listing of your edits_000000 files under this path?

/hadoop/hdfs/journal/{cluster_name}/current/

On the 3 JournalNodes, count the number of files in each JournalNode's current directory and note the count on the healthy one. Then, on each JournalNode, in that same directory, run:

$ cat last-promised-epoch

After I get the above output I will show you the steps
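If it helps, both checks can be done in one pass from any node (a sketch; it assumes root SSH to the three JournalNodes and a single nameservice directory under /hadoop/hdfs/journal):

$ for n in node02 node03 node04; do echo "== $n =="; ssh root@$n 'ls /hadoop/hdfs/journal/*/current | wc -l; cat /hadoop/hdfs/journal/*/current/last-promised-epoch'; done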

Explorer

@Geoffrey Shelton Okot :

Number of files present in the current directory (/hadoop/hdfs/journal/XXXXcluster/current) of each journal node -

node02 : 11078

node03 : 11082

node04 : 11081

Last promised epoch -

node02: 542

node03: 542

node04: 542


As asked, attaching a screenshot of the path to the edits_000000 files (on node02).

108786-jn-current-edits.png

Mentor

@Farhana Khan

How do I fix one corrupted JN's edits?

Instructions to fix the affected journal node(s):

1) Put both NNs in safe mode from the active NameNode (NN HA)

$ hdfs dfsadmin -safemode enter 

2) Save Namespace

$ hdfs dfsadmin -saveNamespace 

3) Back up the edits_* files in /hadoop/hdfs/journal/{cluster_name}/current/ on node02 and node04, and take note of the file permissions (a screenshot would be important).
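Before removing anything, it can also help to save the current ownership and permissions to a file you can refer back to later (a sketch; the output location under /tmp is arbitrary):

# cd /hadoop/hdfs/journal/{cluster_name}/current/
# ls -l > /tmp/jn_perms_$(hostname).txt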

On node02

# cd /hadoop/hdfs/journal/{cluster_name}/current/
# tar -czvf node2.tar.gz *
# rm -rf *

On node04

# cd /hadoop/hdfs/journal/{cluster_name}/current/
# tar -czvf node4.tar.gz *
# rm -rf *

On the good node03

Tar the journal dir on this working JN so it can be copied to the non-working JNs (node02 and node04). Run this on node03:

# cd /hadoop/hdfs/journal/{cluster_name}/current/
# tar -czvf node03.tar.gz *

Hoping you have the root password for the cluster

From node03, in the /hadoop/hdfs/journal/{cluster_name}/current/ directory, run the below commands to copy the good edits_* to node02 and node04:

# scp node03.tar.gz root@node02:/hadoop/hdfs/journal/{cluster_name}/current/
# scp node03.tar.gz root@node04:/hadoop/hdfs/journal/{cluster_name}/current/

Having copied the tarred edits_* files, open two windows, connect as root to node02 and node04, and run the steps below.

On node02

# cd /hadoop/hdfs/journal/{cluster_name}/current/
# tar -xzvf node03.tar.gz

On node04

# cd /hadoop/hdfs/journal/{cluster_name}/current/
# tar -xzvf node03.tar.gz


Check that the file permissions and ownership are okay.
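If the extracted files ended up owned by root, they can be returned to the HDFS service user (a sketch; hdfs:hadoop is the usual owner in an HDP install, but confirm against the backup/screenshot you took earlier):

# cd /hadoop/hdfs/journal/{cluster_name}/current/
# chown hdfs:hadoop *
# ls -l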


Stop the journal nodes on node02 and node04

Open 2 windows and run the below on node02 and node04

# su -l hdfs -c "/usr/hdp/current/hadoop-hdfs-journalnode/../hadoop/sbin/hadoop-daemon.sh stop journalnode"

Restarting the journalnodes

Open 2 windows and run the below on node02 and node04

# su -l hdfs -c "/usr/hdp/current/hadoop-hdfs-journalnode/../hadoop/sbin/hadoop-daemon.sh start journalnode"

All these commands should run successfully.

To validate that all is well after the JournalNodes start, you can restart all HDFS components.

4) Restart HDFS
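As a quick sanity check after the restart (a sketch; the log path follows the typical HDP layout and may differ on your cluster):

On either NameNode, confirm HDFS is out of safe mode:

$ hdfs dfsadmin -safemode get

On node02 and node04, watch the JournalNode log for errors while the NameNodes sync:

# tail -f /var/log/hadoop/hdfs/hadoop-hdfs-journalnode-*.log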

Your issue should now be resolved !!! Please revert

HTH

Mentor

@Farhana Khan

Any updates?

Explorer

Hi, just saw your comment. Will apply the mentioned changes and revert with updates. Thank you.

Explorer

@Geoffrey Shelton Okot : Hi,

I have a doubt pertaining to the number of files present in the current directory on each of the JNs. (I have checked it a couple of times within a 5-minute period; the number of files on both node03 & node04 remains the same, while only node02 differs.)

node02 : 11040

node03 : 11043
node04 : 11043

Does this imply that the edits on only node02 are corrupted while node03 & node04 are okay? Kindly confirm whether the above conjecture is correct or otherwise, and whether I should go ahead with the changes on both node02 & node04 or just node02. (Last promised epoch on all 3 nodes: 577.)

Thanks.

Mentor

@Farhana Khan

Which is the active node? Can you check the last epoch on all the 3 nodes again? Are you still experiencing the same problem?
Please revert so I can analyze your problem again.

Explorer

@Geoffrey Shelton Okot:

Currently node03 is active NN.

Last promised epoch on all 3 nodes: 591.

Number of files in /hadoop/hdfs/journal/XXXXcluster/current is as follows:

node02 : 10281

node03 : 10284
node04 : 10284



Yes, still experiencing the same problem.

  • Two edits_inprogress files in node02's ..../journal/../current directory.
  • The standby NN goes into stopped state frequently.
  • Checkpointing does not occur every 14400 seconds as configured in dfs.namenode.checkpoint.period (see the check below).
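For reference, the effective checkpoint settings can be read from the client configuration (a sketch; dfs.namenode.checkpoint.period and dfs.namenode.checkpoint.txns are the standard property names, and a checkpoint is triggered by whichever threshold is reached first):

$ hdfs getconf -confKey dfs.namenode.checkpoint.period
$ hdfs getconf -confKey dfs.namenode.checkpoint.txns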

(Last night, the standby NN, i.e. node02, was in stopped state and had to be started manually from Ambari. Checked the timestamps:

Last checkpoint: Tue May 21 2019 17:14:59, as of

NTP time: Wed May 22 2019 1:03:36)


As for today:

Last checkpoint time: Wed May 22 2019 10:18:33, whereas

NTP time: Wed May 22 2019 19:10:52


Mentor

@Farhana Khan

node02 is the problem; proceed with the documentation I compiled before and follow it diligently.

Very important: always back up the files on all the nodes before starting the procedure! A simple zip is okay. Then copy over the good files from node03's (or node04's) /hadoop/hdfs/journal/XXXXcluster/current/* to the same directory on node02 and follow the steps.

Please revert, I am sure there will be smiles 🙂



Explorer

@Geoffrey Shelton Okot:

Completed the steps on node02 & node03 using the healthy set of edits from node04.

(There were 2 edits_inprogress files on node02 and one edits_inprogress.empty on node03.)

There were some permission issues, but I got through them. However, while restarting the HDFS components through Ambari, the NameNode on node03 was stuck and not able to start. After a pretty long time the operation completed, but the standby NN on node03 was in stopped state. The errors are as attached (unable to upload the log files, hence the screenshots).


108944-stderr-nn3-2019-05-25-19-53-04.png


108935-stdout-nn3-2019-05-25-19-53-41.png


Both NNs are up and running for now. Observing the cluster for another 24 hours to see if the issue has resolved for the better. Will keep you posted about the checkpointing status.

Thanks a lot for your invaluable help.


Thanks and Regards,

Farhana.


Explorer

Also, found this edits_inprogress.empty file in the namenode/current directory on node02. 😞

108945-nn-node02.png

@Geoffrey Shelton Okot


Mentor

@Farhana Khan

Please keep me posted, but normally there shouldn't be an issue; I have performed these steps on production clusters. I have seen the screenshot, but that edits_inprogress is dated May 10 while the rest have correct date stamps.


Anyways let's keep watch

Explorer

@Geoffrey Shelton Okot:

Hi,

Checkpointing issue remains. The last checkpoint was done 18 hours back; no fsimage after that.

Also, the good node, i.e. node04, now has two edits_inprogress files. I don't know what to make of this.

108922-node04-2019-05-26-14-53-56.png


Mentor

@Farhana Khan

The initial screenshot showed a discrepancy of edits_* on node02 with last epoch 542, and the new screenshot is of node04. Can you provide a comparison table of the 3 nodes (edits_* count and last epoch)? I would also like to see a screenshot of the Ambari UI showing the primary and standby NameNode status.


To test the failover, you can stop the active NameNode; it should fail over to the standby. The process can take a few seconds. Once the standby has transitioned to primary, restart the ex-primary, which should become the standby NameNode.


What is the status of the ZKFailoverControllers?
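A minimal sketch for checking both, assuming the HA service IDs are nn1 and nn2 (they come from dfs.ha.namenodes and may be named differently on this cluster):

$ hdfs haadmin -getServiceState nn1
$ hdfs haadmin -getServiceState nn2

And on the NameNode hosts (node02 and node03), confirm the ZKFC daemon is running:

# ps -ef | grep -i DFSZKFailoverController | grep -v grep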

Explorer

@Geoffrey Shelton Okot

Hi,


          node02    node03    node04
edits_*   10414     10414     10412
epoch     684       684       684


109011-ambari-2019-05-28-05-17-47.png


NN failover tested OK.

ZKFC status OK.



Mentor

@Farhana Khan

Any updates on this issue?