New Contributor
Posts: 1
Registered: ‎12-22-2017

HDFS Checkpoint status (HA)

Hi,

 

On one of our clusters, one of the NameNodes (HA setup) is in bad health because of its checkpoint status:

 

The filesystem checkpoint is 10 hour(s), 30 minute(s) old. This is 1,051.25% of the configured checkpoint period of 1 hour(s). Critical threshold: 400.00%. 211,775 transactions have occurred since the last filesystem checkpoint. This is 21.18% of the configured checkpoint transaction target of 1,000,000.
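For anyone reading along, the percentages in that alert can be reproduced with a little arithmetic. This is only a sketch: the function name and the rule that the critical threshold applies to the age percentage are assumptions, not something taken from the monitoring tool. The input values are the ones quoted in the alert.

```python
def checkpoint_status(age_seconds, period_seconds, txns_since, txn_target,
                      critical_pct=400.0):
    """Return (age_pct, txn_pct, is_critical) for a checkpoint alert.

    Assumption: the alert is critical once the checkpoint age, expressed as
    a percentage of the configured period, reaches critical_pct.
    """
    age_pct = 100.0 * age_seconds / period_seconds
    txn_pct = 100.0 * txns_since / txn_target
    return age_pct, txn_pct, age_pct >= critical_pct


# Values from the alert text. The age there is rounded to whole minutes,
# which is presumably why this gives a slightly lower figure than 1,051.25%.
age_pct, txn_pct, critical = checkpoint_status(
    age_seconds=10 * 3600 + 30 * 60,   # 10 hours, 30 minutes
    period_seconds=3600,               # configured checkpoint period: 1 hour
    txns_since=211_775,
    txn_target=1_000_000,
)
print(f"age: {age_pct:.2f}% of period, "
      f"txns: {txn_pct:.2f}% of target, critical: {critical}")
```

So the alert is driven by the age of the checkpoint, not by the number of transactions, which is still well below its target.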

 

When I check the contents of /opt/hadoop/dfs/clustername-ns/current, I see the following for our 3 JournalNodes:

 

192.168.12.1: -rw-r--r-- 1 hdfs hadoop 1048576 Dec 22 05:33 /opt/hadoop/dfs/jn/CDH01-DWAS-ns/current/edits_inprogress_0000000000266642338.empty
192.168.12.1: -rw-r--r-- 1 hdfs hadoop 1048576 Dec 22 09:38 /opt/hadoop/dfs/jn/CDH01-DWAS-ns/current/edits_inprogress_0000000000266743500.empty
192.168.12.1: -rw-r--r-- 1 hdfs hadoop 1048576 Dec 22 15:09 /opt/hadoop/dfs/jn/CDH01-DWAS-ns/current/edits_inprogress_0000000000266834571
 
192.168.12.2: -rw-r--r-- 1 hdfs hadoop 1048576 Dec 22 15:10 /opt/hadoop/dfs/jn/CDH01-DWAS-ns/current/edits_inprogress_0000000000266834571
 
192.168.12.3: -rw-r--r-- 1 hdfs hadoop 1048576 Dec 22 05:09 /opt/hadoop/dfs/jn/CDH01-DWAS-ns/current/edits_inprogress_0000000000266631469
192.168.12.3: -rw-r--r-- 1 hdfs hadoop 1048576 Dec 22 15:10 /opt/hadoop/dfs/jn/CDH01-DWAS-ns/current/edits_inprogress_0000000000266834571

 

So, on the first node there are three edits_inprogress files, two of which end in .empty. On the third node there are two edits_inprogress files, one of which has not been updated in 10 hours.

 

What do I do with these to get rid of that checkpoint status error (I'm assuming it is related to the checkpoint status error)?

 

Thanks,

Olivier

Posts: 1,760
Kudos: 379
Solutions: 282
Registered: ‎07-31-2013

Re: HDFS Checkpoint status (HA)

> (I'm assuming it is related to the checkpoint status error)?

Checkpoints are done by either the Standby role NameNode (in HDFS HA mode) or by the Secondary NameNode role (in non-HA mode). The JournalNodes are not really involved in this operation, at least not directly.

Check your Standby NameNode logs for checkpoint-related messages to begin investigating why that operation is failing.
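
If it helps, here is a minimal sketch of one way to pull the checkpoint-related lines out of a log file. It assumes nothing about the exact message format: it simply matches the word "checkpoint" case-insensitively and marks lines that also contain WARN, ERROR, FATAL or "Exception". The function names are made up for this example, and the log location depends on your installation.

```python
import re

CHECKPOINT_RE = re.compile(r"checkpoint", re.IGNORECASE)
PROBLEM_RE = re.compile(r"\b(WARN|ERROR|FATAL)\b|Exception")


def checkpoint_lines(lines):
    """Yield (is_problem, line) for every line that mentions a checkpoint."""
    for line in lines:
        if CHECKPOINT_RE.search(line):
            yield bool(PROBLEM_RE.search(line)), line.rstrip("\n")


def scan_log(path):
    """Print checkpoint-related lines from a log file, marking likely problems."""
    # errors="replace" keeps the scan going if the log contains odd bytes.
    with open(path, errors="replace") as handle:
        for is_problem, line in checkpoint_lines(handle):
            print(("!! " if is_problem else "   ") + line)
```

Call scan_log() with the path to the Standby NameNode's log file on that host. Lines prefixed with "!!" are the ones to read first.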