Checkpoint is not occurring

Dear all, I recently enabled HA on my NameNode.
Since then I have started to see an issue with the checkpoint process: no checkpoint has occurred for the past 5 hours.

Here are my observations. Have you seen this case before, or am I hitting a bug?

Kindly share your advice on how to resolve this issue.
 
As per the checkpoint process,
when the updated fsimage is downloaded to the NameNode from the Standby NameNode,
"fsimage.ckpt_txid" must be renamed to "fsimage_txid". That is not happening in my case.
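
The symptom described above (a "fsimage.ckpt_txid" file that never becomes "fsimage_txid") can be checked with a short script. This is only a minimal sketch: it assumes the file names follow the pattern shown in this post, and the function name `unfinalized_checkpoints` is made up for illustration. Point it at your own name directory, for example the `current` directory under dfs.name.dir.

```python
import os
import re
import tempfile

# File name patterns as they appear in this post:
#   fsimage.ckpt_<txid>  -> image still in its temporary (checkpoint) name
#   fsimage_<txid>       -> image after the final rename
CKPT_RE = re.compile(r"^fsimage\.ckpt_(\d+)$")
FINAL_RE = re.compile(r"^fsimage_(\d+)$")


def unfinalized_checkpoints(current_dir):
    """Return sorted txids that have fsimage.ckpt_<txid> but no fsimage_<txid>."""
    ckpt, final = set(), set()
    for name in os.listdir(current_dir):
        match = CKPT_RE.match(name)
        if match:
            ckpt.add(match.group(1))
            continue
        match = FINAL_RE.match(name)
        if match:
            final.add(match.group(1))
    return sorted(ckpt - final)


if __name__ == "__main__":
    # Demo on a throwaway directory with empty placeholder files,
    # so nothing on a real NameNode is touched.
    with tempfile.TemporaryDirectory() as demo_dir:
        for name in (
            "fsimage.ckpt_0000000000604392126",
            "fsimage_0000000000604000000",
            "fsimage_0000000000604000000.md5",
        ):
            open(os.path.join(demo_dir, name), "w").close()
        print(unfinalized_checkpoints(demo_dir))
```

Any txid it reports has a downloaded image that was never renamed, which matches what I am seeing.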
 
I do not see any file named "fsimage_txid" on my NameNode; all of them look like "fsimage.ckpt_txid".
So I compared "fsimage.ckpt_txid" and "fsimage_txid", and both have the same checksum value.
 
fsimage.ckpt_txid is from the NameNode.
fsimage_txid is from the Secondary NameNode.
 
namenode:
=========
root@namenode:/mnt/sdb/name/current# cksum fsimage.ckpt_0000000000604392126
3708522794 2148716968 fsimage.ckpt_0000000000604392126
 
secondary-namenode:
================
root@secondary-namenode:/mnt/sdd/name/current# cksum fsimage_0000000000604392126
3708522794 2148716968 fsimage_0000000000604392126
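
The same comparison can be scripted. This is a rough sketch, not part of any Hadoop tooling: the helper names `file_digest` and `same_content` are made up here, and it uses SHA-256 from Python's hashlib, so the digest will not match the CRC value printed by cksum. Since the two files live on different hosts, you would run `file_digest` on each host and compare the resulting strings.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Hash a file in chunks so a multi-gigabyte fsimage is never read into memory at once."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True when both files hash to the same digest (only useful when both are locally readable)."""
    return file_digest(path_a) == file_digest(path_b)
```

Matching digests only show that the download itself completed intact; they say nothing about why the rename step did not happen.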
 
NOTE: I did not see any network issue; I am able to download the fsimage using the "wget" command.
 
I am using CDH 4.1.3 and Cloudera Enterprise 4.6.3.
 
Best regards,
Bommuraj
1 ACCEPTED SOLUTION

Thank you, Harsh, for your email! I was hitting the issue below. I increased "dfs.image.transfer.timeout" and that fixed it. https://issues.apache.org/jira/browse/HDFS-4301 Checkpointing had been working fine, but the problem started when my fsimage size reached 2.1 GB. Best regards, Bommuraj
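
For anyone sizing this setting, here is a rough way to pick a value. It is a sketch under stated assumptions, not the way Hadoop computes anything: it treats the timeout as needing to cover the whole image transfer, the throughput and safety factor are guesses you must replace with your own measurements, and the helper names are made up. Only the property name "dfs.image.transfer.timeout" and the fsimage size (the byte count from the cksum output above) come from this thread.

```python
import math


def min_transfer_timeout_ms(image_bytes, throughput_bytes_per_sec, safety_factor=2.0):
    """Estimate a timeout in milliseconds that leaves headroom for one full image transfer.

    throughput_bytes_per_sec and safety_factor are assumptions, not Hadoop settings.
    """
    if throughput_bytes_per_sec <= 0:
        raise ValueError("throughput_bytes_per_sec must be positive")
    seconds = image_bytes / throughput_bytes_per_sec
    return math.ceil(seconds * safety_factor * 1000)


def hdfs_site_property(name, value):
    """Render a <property> element in the usual hdfs-site.xml layout."""
    return (
        "<property>\n"
        "  <name>{}</name>\n"
        "  <value>{}</value>\n"
        "</property>"
    ).format(name, value)


if __name__ == "__main__":
    image_bytes = 2148716968          # fsimage size in bytes, from the cksum output in the question
    assumed_rate = 20 * 1024 * 1024   # hypothetical 20 MiB/s effective transfer rate
    timeout_ms = min_transfer_timeout_ms(image_bytes, assumed_rate)
    print(hdfs_site_property("dfs.image.transfer.timeout", timeout_ms))
```

The printed fragment would go into hdfs-site.xml on the NameNodes. Measure your real transfer rate (for example with the wget download mentioned in the question) before trusting the number.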

2 REPLIES

It is difficult to say whether you are hitting a bug without looking at the relevant Checkpointer entries in the Standby NameNode (SBN) logs.

There may be issues with transferring the file between the SBN and the NN, probably because of timeouts or a similar failure.
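
To pull the relevant entries out of a large SBN log, a simple keyword filter is usually enough to start with. The sketch below is only a starting point: the keyword list is a guess based on Hadoop class names (StandbyCheckpointer, TransferFsImage) and a common Java exception name, not a list of exact log messages, and the log path in the comment is hypothetical.

```python
import re

# Assumed keywords; adjust them to whatever your SBN log actually contains.
KEYWORDS = re.compile(
    r"Checkpointer|TransferFsImage|SocketTimeoutException|timed out",
    re.IGNORECASE,
)


def checkpoint_log_lines(lines):
    """Return the lines (without trailing newline) that mention checkpointing or image transfer."""
    return [line.rstrip("\n") for line in lines if KEYWORDS.search(line)]


# Example usage (the path is hypothetical):
# with open("/var/log/hadoop-hdfs/standby-namenode.log", errors="replace") as log:
#     for entry in checkpoint_log_lines(log):
#         print(entry)
```

Lines that mention a timeout around the time a checkpoint was attempted would support the transfer-timeout theory.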
