I keep getting the following health error message:
The filesystem checkpoint is 22 hour(s), 40 minute(s) old. This is 2,267.75% of the configured checkpoint period of 1 hour(s). Critical threshold: 400.00%. 10,775 transactions have occurred since the last filesystem checkpoint. This is 1.08% of the configured checkpoint transaction target of 1,000,000.
What is causing this, and how can I get it to stop?
Logs:
Number of transactions: 8 Total time for transactions(ms): 1 Number of transactions batched in Syncs: 0 Number of syncs: 6 SyncTimes(ms): 132
Number of transactions: 8 Total time for transactions(ms): 1 Number of transactions batched in Syncs: 0 Number of syncs: 7 SyncTimes(ms): 155
Finalizing edits file /dfs/nn/current/edits_inprogress_0000000000000021523 -> /dfs/nn/current/edits_0000000000000021523-0000000000000021530
Starting log segment at 21531
Rescanning after 30000 milliseconds
Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
list corrupt file blocks returned: 0
list corrupt file blocks returned: 0
BLOCK* allocateBlock: /tmp/.cloudera_health_monitoring_canary_files/.canary_file_2014_10_09-07_29_47. BP-941526827-192.168.0.1-1412692043930 blk_1073744503_3679{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-a7270ad4-959d-4756-b731-83457af7c6a3:NORMAL|RBW]]}
BLOCK* addStoredBlock: blockMap updated: 192.168.0.102:50010 is added to blk_1073744503_3679{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-eaca52a9-2713-4901-b978-e331c17800fc:NORMAL|RBW]]} size 0
DIR* completeFile: /tmp/.cloudera_health_monitoring_canary_files/.canary_file_2014_10_09-07_29_47 is closed by DFSClient_NONMAPREDUCE_592472068_72
BLOCK* addToInvalidates: blk_1073744503_3679 192.168.0.102:50010
BLOCK* BlockManager: ask 192.168.0.102:50010 to delete [blk_1073744503_3679]
Rescanning after 30001 milliseconds
Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
Created 12-11-2015 01:20 AM
I found the fix for my issue.
After the NameNode was reformatted, checkpointing on the SNN was not happening because the SNN's VERSION file still had the old namespaceID and blockpoolID.
After deleting the files under /data/dfs/snn and restarting the NameNode and SNN, I found checkpointing working fine again.
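For anyone hitting the same situation, a rough sketch of that recovery, assuming the /data/dfs/snn path from this post and that the roles can be stopped first (the exact stop/start steps depend on how the cluster is managed, e.g. Cloudera Manager):
# On the SecondaryNameNode host, with the NameNode and SecondaryNameNode roles stopped:
# Move the stale checkpoint directory aside rather than deleting it outright.
mv /data/dfs/snn /data/dfs/snn.bak.$(date +%F)
# Restart the NameNode and SecondaryNameNode; the SNN recreates its directory
# with the new namespaceID and blockpoolID, and the next checkpoint should
# clear the health alert.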
Created on 02-03-2016 08:52 AM - edited 02-03-2016 08:55 AM
Hi Harsh,
I am getting the exception below on the NameNode, though it doesn't affect my services. However, on one occasion an automatic failover did not happen even though it was enabled. I found the following error logs:
...
Forwardable Ticket true
Forwarded Ticket false
Proxiable Ticket false
Proxy Ticket false
Postdated Ticket false
Renewable Ticket false
Initial Ticket false
Auth Time = Wed Feb 03 13:49:37 CET 2016
Start Time = Wed Feb 03 13:49:40 CET 2016
End Time = Wed Feb 03 23:49:37 CET 2016
Renew Till = null
Client Addresses Null
2016-02-03 14:49:49,093 ERROR org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Exception in doCheckpoint
java.io.IOException: Exception during image upload: java.io.IOException: org.apache.hadoop.security.authentication.client.AuthenticationException: GSSException: No valid credentials provided (Mechanism level: Server not found in Kerberos database (7))
at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:221)
at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$1400(StandbyCheckpointer.java:62)
at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.doWork(StandbyCheckpointer.java:353)
at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.access$700(StandbyCheckpointer.java:260)
at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread$1.run(StandbyCheckpointer.java:280)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:360)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1651)
at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:410)
at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.run(StandbyCheckpointer.java:276)
Caused by: java.io.IOException: org.apache.hadoop.security.authentication.client.AuthenticationException: GSSException: No valid credentials provided (Mechanism level: Server not found in Kerberos database (7))
at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImage(TransferFsImage.java:298)
at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImageFromStorage(TransferFsImage.java:222)
at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$1.call(StandbyCheckpointer.java:207)
...
Created 02-03-2016 12:03 PM
This is a Kerberos configuration issue, most likely with the principal for the second NameNode. When a checkpoint is attempted (copying the fsimage file from the Standby NameNode to the Active), the connection is failing during GSSAPI authentication with the Kerberos credential.
The failover controller logs will probably contain similar messages.
Since the server is able to start, your basic Kerberos setup is allowing it to obtain its initial credential, but it appears the credential is expiring.
A few possible causes:
* The principal needs to have renewable tickets. In your output this is set to false. The problem could be with the /etc/krb5.conf file on the Standby or with the principal in your KDC.
* Reverse DNS lookup for the hostname is not working. The packet sent from one server has the information "my hostname is: server2.example.com, IP: 10.1.2.3". The source does a reverse DNS lookup for 10.1.2.3 and is not receiving a hostname or is receiving a hostname that does not match the one provided.
* You are having an intermittent outage of your KDC or DNS that is causing the above-mentioned problems.
Depending upon the type of KDC in use and how it is configured, there may be additional issues. Since you report the rest of the cluster is functional (no loss to the DataNodes), this is most likely isolated to the one NameNode's principal.
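A few quick ways to check those points from the Standby NameNode host (a rough sketch using the example names from the bullets above, server2.example.com and 10.1.2.3; substitute your own host, keytab, principal and realm):
# Confirm the flags on the ticket the NameNode's principal actually gets;
# look for the R (renewable) flag in the output.
kinit -kt /path/to/hdfs.keytab hdfs/server2.example.com@EXAMPLE.COM   # keytab path and realm are placeholders
klist -f
# Confirm forward and reverse DNS agree for the host.
host server2.example.com
dig -x 10.1.2.3 +short   # should return the same FQDN the host reports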
David Wilder, Community Manager
Created 02-08-2016 05:25 AM
I found out that
principal of namenode : hdfs/xyz.munich.com@ABC.com
hostname : xyz.paris.com
hostname --fqdn : xyz.munich.com
So from the above three values you can see that the hostname is not the same as the principal and the FQDN.
But as far as I know, only the FQDN matters.
Still, do you think an incorrect hostname can cause this issue?
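For reference, the "Server not found in Kerberos database" error generally means the service principal the client built from the hostname it resolved does not exist in the KDC, so a hostname/FQDN mismatch can matter. A rough way to see how the names line up on that host (the keytab path is a placeholder; note that the image upload in the stack trace goes over HTTP/SPNEGO, so the HTTP/<fqdn> principal is involved as well as hdfs/<fqdn>):
# What the host believes its names are.
hostname
hostname --fqdn
# Which principals the NameNode keytab actually contains.
klist -kt /path/to/hdfs.keytab   # placeholder path; use the NameNode's keytab
# The service principals built at runtime are hdfs/<fqdn>@REALM and
# HTTP/<fqdn>@REALM, so the FQDN reported above must match the keytab entries
# and must resolve consistently in DNS on both NameNode hosts.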
Created 02-08-2016 05:27 AM
Please find below a part of my krb5.conf:
dns_lookup_realm = false
dns_lookup_kdc = false
ticket_lifetime = 24h
renew_lifetime = 7d
forwardable = true
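Those client settings look reasonable, but renew_lifetime only has an effect if the KDC side also allows renewal, both for your principal and for the realm's krbtgt principal. A quick sketch to confirm, using the principal from your earlier post (kadmin.local is the MIT Kerberos tool, run on the KDC host):
# Check "Maximum renewable life" for both principals; if either is 0, tickets
# come back non-renewable no matter what krb5.conf says.
kadmin.local -q "getprinc hdfs/xyz.munich.com@ABC.com"
kadmin.local -q "getprinc krbtgt/ABC.com@ABC.com"
# With a ticket already in the cache, an explicit renewal is a quick test;
# it fails if the ticket is not renewable.
kinit -R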
Created 10-22-2018 01:13 AM
In my case, I had reinstalled HDFS in CDH.
On the SNN machine, /hadoop/dfs/snn/current/fsimage_* was different from the NameNode's /hadoop/dfs/nn/current/fsimage_*.
Deleting /hadoop/dfs/snn on the SNN machine and then restarting the SNN fixed it.
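If someone wants to confirm that kind of mismatch before removing anything, a minimal check (assuming the same /hadoop/dfs paths as above):
# On the NameNode host:
cat /hadoop/dfs/nn/current/VERSION
ls /hadoop/dfs/nn/current/fsimage_*
# On the SecondaryNameNode host:
cat /hadoop/dfs/snn/current/VERSION
ls /hadoop/dfs/snn/current/fsimage_*
# After a reinstall, the namespaceID/clusterID in the two VERSION files no
# longer match and the SNN's fsimage transaction IDs stop advancing; that
# mismatch is what blocks checkpointing.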