Created 11-20-2015 06:37 AM
We have been seeing errors consistently in the NN logs related to checkpointing. Our NNs are not able to automatically perform a checkpoint - the only way is for us to put them in Safe Mode and manually run a Save Namespace command. We see these errors over and over in the logs:
Exception in doCheckpoint
java.io.IOException: Exception during image upload: org.apache.hadoop.hdfs.server.namenode.TransferFsImage$HttpPutFailedException: org.apache.hadoop.security.authentication.util.SignerException: Invalid signature
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:221)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$1400(StandbyCheckpointer.java:62)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.doWork(StandbyCheckpointer.java:353)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.access$700(StandbyCheckpointer.java:260)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread$1.run(StandbyCheckpointer.java:280)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:360)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1651)
    at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:410)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.run(StandbyCheckpointer.java:276)
Caused by: org.apache.hadoop.hdfs.server.namenode.TransferFsImage$HttpPutFailedException: org.apache.hadoop.security.authentication.util.SignerException: Invalid signature
    at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImage(TransferFsImage.java:294)
    at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImageFromStorage(TransferFsImage.java:222)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$1.call(StandbyCheckpointer.java:207)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$1.call(StandbyCheckpointer.java:204)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Exception in doCheckpoint
java.lang.NullPointerException
    at org.apache.hadoop.io.Text.encode(Text.java:450)
    at org.apache.hadoop.io.Text.encode(Text.java:431)
    at org.apache.hadoop.io.Text.writeString(Text.java:491)
    at org.apache.hadoop.fs.permission.PermissionStatus.write(PermissionStatus.java:117)
    at org.apache.hadoop.hdfs.server.namenode.FSImageSerialization.writePermissionStatus(FSImageSerialization.java:99)
    at org.apache.hadoop.hdfs.server.namenode.FSImageSerialization.writeINodeFileAttributes(FSImageSerialization.java:216)
    at org.apache.hadoop.hdfs.server.namenode.snapshot.FileDiff.write(FileDiff.java:81)
    at org.apache.hadoop.hdfs.server.namenode.snapshot.SnapshotFSImageFormat.saveINodeDiffs(SnapshotFSImageFormat.java:89)
    at org.apache.hadoop.hdfs.server.namenode.snapshot.SnapshotFSImageFormat.saveFileDiffList(SnapshotFSImageFormat.java:102)
    at org.apache.hadoop.hdfs.server.namenode.FSImageSerialization.writeINodeFile(FSImageSerialization.java:196)
    at org.apache.hadoop.hdfs.server.namenode.FSImageSerialization.saveINode2Image(FSImageSerialization.java:332)
    at org.apache.hadoop.hdfs.server.namenode.FSImageFormat$Saver.saveINode2Image(FSImageFormat.java:1433)
    at org.apache.hadoop.hdfs.server.namenode.FSImageFormat$Saver.saveChildren(FSImageFormat.java:1335)
    at org.apache.hadoop.hdfs.server.namenode.FSImageFormat$Saver.saveImage(FSImageFormat.java:1393)
    at org.apache.hadoop.hdfs.server.namenode.FSImageFormat$Saver.saveImage(FSImageFormat.java:1408)
    at org.apache.hadoop.hdfs.server.namenode.FSImageFormat$Saver.saveImage(FSImageFormat.java:1408)
    at org.apache.hadoop.hdfs.server.namenode.FSImageFormat$Saver.saveImage(FSImageFormat.java:1408)
    at org.apache.hadoop.hdfs.server.namenode.FSImageFormat$Saver.saveImage(FSImageFormat.java:1408)
    at org.apache.hadoop.hdfs.server.namenode.FSImageFormat$Saver.saveImage(FSImageFormat.java:1408)
    at org.apache.hadoop.hdfs.server.namenode.FSImageFormat$Saver.save(FSImageFormat.java:1279)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.saveLegacyOIVImage(FSImage.java:973)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:193)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$1400(StandbyCheckpointer.java:62)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.doWork(StandbyCheckpointer.java:353)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.access$700(StandbyCheckpointer.java:260)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread$1.run(StandbyCheckpointer.java:280)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:360)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1651)
    at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:410)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.run(StandbyCheckpointer.java:276)
Has anyone seen this or found a solution for it?
We are running CM 5.4.7 and CDH 5.4.0
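For anyone landing here with the same symptom, the manual workaround described above (Safe Mode plus Save Namespace) might be scripted roughly as follows. This is only a sketch: the three `hdfs dfsadmin` subcommands are the standard ones, but the function name `manual_checkpoint`, the `dry_run` flag, and the injectable `run` parameter are hypothetical, and it assumes the script runs as the HDFS superuser on a host with the `hdfs` client configured.

```python
import subprocess

# Standard `hdfs dfsadmin` subcommands, in the order the workaround needs them.
MANUAL_CHECKPOINT_STEPS = [
    ["hdfs", "dfsadmin", "-safemode", "enter"],
    ["hdfs", "dfsadmin", "-saveNamespace"],
    ["hdfs", "dfsadmin", "-safemode", "leave"],
]


def manual_checkpoint(run=subprocess.run, dry_run=True):
    """Return the commands as strings; execute them only when dry_run is False.

    `run` is injectable (hypothetical convenience) so the sequence can be
    exercised without a cluster. With check=True the first failing command
    raises, so a failed saveNamespace leaves the NameNode in Safe Mode for
    inspection instead of silently leaving it.
    """
    executed = []
    for cmd in MANUAL_CHECKPOINT_STEPS:
        executed.append(" ".join(cmd))
        if not dry_run:
            run(cmd, check=True)
    return executed


if __name__ == "__main__":
    # Dry run by default: just print what would be executed.
    for line in manual_checkpoint():
        print(line)
```

Calling `manual_checkpoint(dry_run=False)` would actually run the commands; keep in mind that the NameNode rejects writes while it is in Safe Mode.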
Created 11-20-2015 09:29 AM
David Wilder, Community Manager
Tyler,
During normal operation, the Standby NameNode sends an HTTP (or HTTPS) ping to the Active NameNode every hour to let it know a new checkpoint is ready. The Active NameNode then makes an HTTP (or HTTPS) request back to the Standby and downloads the checkpoint file.
From your stack trace, it appears there is an issue in this communication flow.
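The hourly cadence mentioned above matches the default of `dfs.namenode.checkpoint.period` (3600 seconds); a checkpoint is also due once the number of uncheckpointed transactions reaches `dfs.namenode.checkpoint.txns` (1,000,000 by default). The following is a simplified sketch of that decision, not the actual `StandbyCheckpointer` code; the names `CheckpointConf` and `needs_checkpoint` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CheckpointConf:
    # Defaults mirror the stock HDFS settings; a cluster may override them.
    period_secs: int = 3600          # dfs.namenode.checkpoint.period
    txn_threshold: int = 1_000_000   # dfs.namenode.checkpoint.txns


def needs_checkpoint(
    secs_since_last: int,
    uncheckpointed_txns: int,
    conf: CheckpointConf = CheckpointConf(),
) -> Optional[str]:
    """Return the reason a checkpoint is due ("txns" or "period"), else None."""
    if uncheckpointed_txns >= conf.txn_threshold:
        return "txns"
    if secs_since_last >= conf.period_secs:
        return "period"
    return None


if __name__ == "__main__":
    # Two hours since the last checkpoint with few edits: due because of time.
    print(needs_checkpoint(secs_since_last=7200, uncheckpointed_txns=1200))
```

If checkpoints keep failing as in the traces above, the elapsed time grows well past the configured period, which is a quick way to spot the problem from monitoring.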
Created 12-07-2015 05:56 AM
Thanks, David.
It turns out the fix for the error we were seeing wasn't included in the version of CDH we are running. Once we upgrade to a version that includes the fix, we should no longer see this issue.
Created 04-14-2016 07:51 AM
Hi Tyler,
Could you please indicate which CDH version you upgraded to in order to have this issue fixed?
Thanks and regards,
Javier.
Created 04-14-2016 08:07 AM
@Javier - I don't know the exact version this was released in, but I think the JIRA that we were hitting was HDFS-7798