HDFS Checkpoint status (HA)

Hi,

 

On one of our clusters, one of the NameNodes (HA setup) is reporting bad health because of its checkpoint status:

 

The filesystem checkpoint is 10 hour(s), 30 minute(s) old. This is 1,051.25% of the configured checkpoint period of 1 hour(s). Critical threshold: 400.00%. 211,775 transactions have occurred since the last filesystem checkpoint. This is 21.18% of the configured checkpoint transaction target of 1,000,000.
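The percentages in that alert can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not Cloudera Manager's actual health-check code; the function name `checkpoint_health` and its parameters are made up for illustration. The 1-hour period and 1,000,000-transaction target presumably correspond to the HDFS settings `dfs.namenode.checkpoint.period` and `dfs.namenode.checkpoint.txns`, though the alert text itself does not say so.

```python
from datetime import timedelta


def checkpoint_health(age, period, txns, txn_target, critical_pct=400.0):
    """Hypothetical helper: express checkpoint age and transaction count
    as percentages of their configured limits."""
    age_pct = 100.0 * (age / period)      # timedelta / timedelta gives a float
    txn_pct = 100.0 * txns / txn_target
    return {
        "age_pct": age_pct,
        "txn_pct": txn_pct,
        "critical": age_pct >= critical_pct,
    }


# Values taken from the alert text above.
status = checkpoint_health(
    age=timedelta(hours=10, minutes=30),
    period=timedelta(hours=1),
    txns=211_775,
    txn_target=1_000_000,
)
print(status)
```

With the age rounded to 10 h 30 min this gives 1,050% rather than the reported 1,051.25%; the small difference is consistent with the alert truncating the displayed age to whole minutes (1,051.25% of one hour is 10 h 30 min 45 s). The point is that the age, not the transaction count, is what trips the 400% critical threshold.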

 

When I check the contents of /opt/hadoop/dfs/clustername-ns/current on our 3 JournalNodes, I see the following:

 

192.168.12.1: -rw-r--r-- 1 hdfs hadoop 1048576 Dec 22 05:33 /opt/hadoop/dfs/jn/CDH01-DWAS-ns/current/edits_inprogress_0000000000266642338.empty
192.168.12.1: -rw-r--r-- 1 hdfs hadoop 1048576 Dec 22 09:38 /opt/hadoop/dfs/jn/CDH01-DWAS-ns/current/edits_inprogress_0000000000266743500.empty
192.168.12.1: -rw-r--r-- 1 hdfs hadoop 1048576 Dec 22 15:09 /opt/hadoop/dfs/jn/CDH01-DWAS-ns/current/edits_inprogress_0000000000266834571
 
192.168.12.2: -rw-r--r-- 1 hdfs hadoop 1048576 Dec 22 15:10 /opt/hadoop/dfs/jn/CDH01-DWAS-ns/current/edits_inprogress_0000000000266834571
 
192.168.12.3: -rw-r--r-- 1 hdfs hadoop 1048576 Dec 22 05:09 /opt/hadoop/dfs/jn/CDH01-DWAS-ns/current/edits_inprogress_0000000000266631469
192.168.12.3: -rw-r--r-- 1 hdfs hadoop 1048576 Dec 22 15:10 /opt/hadoop/dfs/jn/CDH01-DWAS-ns/current/edits_inprogress_0000000000266834571

 

So the first node has three edits_inprogress files, two of which end in .empty. The third node has two edits_inprogress files, one of which has not been updated in 10 hours.
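For a larger listing, the same observation can be extracted mechanically. This is only a sketch that summarizes the `ls` output pasted above; the names `LISTING`, `LINE_RE`, and `leftover_segments` are invented for the example, and it says nothing about whether the leftover files are safe to touch.

```python
import re
from collections import defaultdict

# The listing from the post, one "host: ls -l output" entry per line.
LISTING = """\
192.168.12.1: -rw-r--r-- 1 hdfs hadoop 1048576 Dec 22 05:33 /opt/hadoop/dfs/jn/CDH01-DWAS-ns/current/edits_inprogress_0000000000266642338.empty
192.168.12.1: -rw-r--r-- 1 hdfs hadoop 1048576 Dec 22 09:38 /opt/hadoop/dfs/jn/CDH01-DWAS-ns/current/edits_inprogress_0000000000266743500.empty
192.168.12.1: -rw-r--r-- 1 hdfs hadoop 1048576 Dec 22 15:09 /opt/hadoop/dfs/jn/CDH01-DWAS-ns/current/edits_inprogress_0000000000266834571
192.168.12.2: -rw-r--r-- 1 hdfs hadoop 1048576 Dec 22 15:10 /opt/hadoop/dfs/jn/CDH01-DWAS-ns/current/edits_inprogress_0000000000266834571
192.168.12.3: -rw-r--r-- 1 hdfs hadoop 1048576 Dec 22 05:09 /opt/hadoop/dfs/jn/CDH01-DWAS-ns/current/edits_inprogress_0000000000266631469
192.168.12.3: -rw-r--r-- 1 hdfs hadoop 1048576 Dec 22 15:10 /opt/hadoop/dfs/jn/CDH01-DWAS-ns/current/edits_inprogress_0000000000266834571
"""

# Host prefix, then anything, then a path ending in edits_inprogress_<txid>[.suffix].
LINE_RE = re.compile(
    r"^(?P<host>\S+): .* \S*edits_inprogress_(?P<txid>\d+)(?P<suffix>\.\w+)?$"
)


def leftover_segments(listing):
    """Group in-progress segments by host and return, per host, those whose
    starting transaction ID is not the newest one seen anywhere."""
    by_host = defaultdict(list)
    for line in listing.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            by_host[m["host"]].append((int(m["txid"]), m["suffix"] or ""))
    newest = max(txid for segs in by_host.values() for txid, _ in segs)
    return {
        host: sorted(seg for seg in segs if seg[0] != newest)
        for host, segs in by_host.items()
    }


leftover = leftover_segments(LISTING)
for host, segs in leftover.items():
    print(host, segs)
```

All three nodes share the segment starting at transaction 266834571; the extra files on the first and third nodes start at older transaction IDs.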

 

What do I do with these to get rid of that checkpoint status error (I'm assuming it is related to the checkpoint status error)?

 

Thanks,

Olivier

1 REPLY

Re: HDFS Checkpoint status (HA)

> (I'm assuming it is related to the checkpoint status error)?

Checkpoints are performed by the Standby NameNode (in HDFS HA mode) or by the Secondary NameNode (in non-HA mode). The JournalNodes are not involved in this operation, at least not directly.

To begin investigating why that operation is failing, check your Standby NameNode logs for checkpoint-related messages.
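One way to start that search is to filter the Standby NameNode log for lines that mention checkpoints. The sketch below assumes nothing about the exact wording of NameNode log messages: it simply matches the word "checkpoint" case-insensitively and can optionally narrow the result to lines containing a level such as WARN or ERROR. The function name `checkpoint_lines` is invented here, and the sample lines are placeholders, not real NameNode output.

```python
import re
from typing import Iterable, Iterator, Sequence

CHECKPOINT_RE = re.compile(r"checkpoint", re.IGNORECASE)


def checkpoint_lines(lines: Iterable[str], levels: Sequence[str] = ()) -> Iterator[str]:
    """Yield lines mentioning 'checkpoint'; if `levels` is non-empty,
    keep only lines that also contain one of those level strings."""
    for line in lines:
        if not CHECKPOINT_RE.search(line):
            continue
        if levels and not any(level in line for level in levels):
            continue
        yield line.rstrip("\n")


# Placeholder input, NOT real NameNode log output.
sample = [
    "PLACEHOLDER INFO line about something else\n",
    "PLACEHOLDER ERROR line that mentions Checkpoint\n",
    "PLACEHOLDER INFO line that mentions checkpoint\n",
]
all_hits = list(checkpoint_lines(sample))
error_hits = list(checkpoint_lines(sample, levels=("WARN", "ERROR")))
print(len(all_hits), len(error_hits))
```

Against a real log, the same function can be fed an open file object, for example `with open(path) as f: for line in checkpoint_lines(f): print(line)`, where `path` is wherever your deployment writes the Standby NameNode log. Starting without a level filter is usually safer, since the informational lines show when the last checkpoint attempt happened.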