Created 04-14-2016 07:02 AM
Hello All,
Below are the errors seen on the secondary namenode. There are also about 6000 under-replicated blocks; I'm not sure whether that is related to this issue. DataNode health is fine. Appreciate any pointers.
==
2016-04-12 15:45:03,660 INFO namenode.SecondaryNameNode (SecondaryNameNode.java:run(453)) - Image has not changed. Will not download image.
2016-04-12 15:45:03,661 INFO namenode.TransferFsImage (TransferFsImage.java:getFileClient(394)) - Opening connection to http://ey9omprna005.vzbi.com:50070/imagetransfer?getedit=1&startTxId=236059442&endTxId=23608.
2016-04-12 15:45:03,665 ERROR namenode.SecondaryNameNode (SecondaryNameNode.java:doWork(399)) - Exception in doCheckpoint
java.io.IOException: Unable to download to any storage directory
Created 04-14-2016 07:31 AM
@gsharma - Looking at the error, it appears to be a problem with the secondary namenode's local storage.
Can you please check the value of dfs.namenode.checkpoint.dir and see whether there are any issues such as a read-only mount, a full disk, or a bad disk? (A quick way to run these checks is sketched below.)
Also, the under-replicated block issue is not related to this one.
How many datanodes do you have? What is the replication factor? Are all the datanodes healthy?
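A minimal way to run these storage checks on the SNN host, assuming a single local checkpoint directory (the path below is only an example; use whatever value the getconf call returns):
----
# Read the configured checkpoint directory (value is cluster-specific)
hdfs getconf -confKey dfs.namenode.checkpoint.dir

# Suppose it returns file:///hadoop/hdfs/namesecondary -- substitute your actual path
CKPT_DIR=/hadoop/hdfs/namesecondary

# Free space and mount options for the filesystem holding the checkpoint dir
df -h "$CKPT_DIR"
findmnt -T "$CKPT_DIR" -o TARGET,OPTIONS     # look for "ro" in OPTIONS

# Simple write test as the hdfs user to rule out a bad or read-only disk
sudo -u hdfs touch "$CKPT_DIR/.write_test" && sudo -u hdfs rm "$CKPT_DIR/.write_test"
----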
Created 04-15-2016 05:33 AM
@Kuldeep Kulkarni Here are my responses to your queries.
1. Checked; no mount points are read-only.
2. Checked df -h on both nodes; no space issues.
3. We have 5 DNs and the replication factor is 3.
4. All datanodes seem to be healthy, with only 50% DFS utilization (a quick way to double-check this is sketched below).
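For completeness, the standard commands that report datanode health, DFS utilization, and the under-replicated block count (nothing here is specific to this cluster):
----
# Per-datanode capacity, DFS used %, and last-contact times
sudo -u hdfs hdfs dfsadmin -report

# Filesystem check; the summary at the end includes the under-replicated block count
sudo -u hdfs hdfs fsck / | tail -30
----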
==
Here is my investigation so far.
>> The last successful fsimage on the NN is from April 5th:
----
-rw-r--r-- 1 hdfs hadoop 411408570 Apr 5 20:11 fsimage_0000000000236059441
-----
>> Before the above, the last checkpoint file goes back to Feb, a gap of 40+ days.
-----
-rw-r--r-- 1 hdfs hadoop 144021898 Feb 24 2015 fsimage.ckpt_0000000000039014860
----
>> The same fsimage reaches the secondary namenode at:
----
-rw-r--r-- 1 hdfs hadoop 411408570 Apr 5 22:11 fsimage_0000000000236059441
-----
The secondary namenode then merges the edits with the recently acquired fsimage and creates a new fsimage to be fetched by the primary NN:
-----
-rw-r--r-- 1 hdfs hadoop 42688512 Apr 6 10:12 fsimage.ckpt_0000000000236367214
-----
>> No further checkpoints are visible on either the NameNode or the Secondary after that.
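For reference, the listings above can be gathered with something like the following; the directory paths are examples only (use the values of dfs.namenode.name.dir on the NN and dfs.namenode.checkpoint.dir on the SNN):
----
# On the primary NN: newest fsimage/edits files (example path)
ls -lt /hadoop/hdfs/namenode/current | head

# On the secondary NN: newest downloaded fsimage and merged checkpoint (example path)
ls -lt /hadoop/hdfs/namesecondary/current | head
----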
===
In the secondary NN's hadoop-hdfs-secondarynamenode-xxx-yy.out, I can see the errors below:
===
java.io.IOException: No space left on device
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:345)
    at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.receiveFile(TransferFsImage.java:517)
    at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.doGetUrl(TransferFsImage.java:431)
    at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.getFileClient(TransferFsImage.java:395)
    at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.downloadEditsToStorage(TransferFsImage.java:167)
    at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$2.run(SecondaryNameNode.java:465)
    at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$2.run(SecondaryNameNode.java:444)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
    at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.downloadCheckpointFiles(SecondaryNameNode.java:443)
    at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:540)
    at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doWork(SecondaryNameNode.java:395)
    at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$1.run(SecondaryNameNode.java:361)
    at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:412)
    at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:357)
    at java.lang.Thread.run(Thread.java:745)
log4j:ERROR Failed to flush writer,
====
And the error below comes after the above:
====
java.io.IOException: Unable to download to any storage directory
    at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.receiveFile(TransferFsImage.java:505)
    at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.doGetUrl(TransferFsImage.java:431)
    at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.getFileClient(TransferFsImage.java:395)
    at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.downloadEditsToStorage(TransferFsImage.java:167)
    at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$2.run(SecondaryNameNode.java:465)
    at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$2.run(SecondaryNameNode.java:444)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
    at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.downloadCheckpointFiles(SecondaryNameNode.java:443)
    at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:540)
    at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doWork(SecondaryNameNode.java:395)
    at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$1.run(SecondaryNameNode.java:361)
    at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:412)
    at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:357)
    at java.lang.Thread.run(Thread.java:745)
=====
Now, apart from our investigation, I wanted to clarify:
1. Does the "No space left on device" error come from the primary NN when it tries to fetch the fsimage from the SNN, or from the SNN itself when it is not able to download the old fsimage it gets from the primary NN?
There are no timestamps in the .out file, so I can't actually sequence the issues / patterns / errors (a possible workaround is sketched below).
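One way to work around the missing timestamps: the daemon's .log file (unlike the .out file) is written through log4j and does carry timestamps, so the same exceptions can usually be sequenced there. The log directory below is only an example:
----
cd /var/log/hadoop/hdfs   # example; use your configured HADOOP_LOG_DIR

# Each failure together with the timestamped log lines just before it
grep -B 2 "No space left on device" hadoop-hdfs-secondarynamenode-*.log
grep -B 2 "Unable to download to any storage directory" hadoop-hdfs-secondarynamenode-*.log
----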
Created 04-15-2016 06:21 AM
@gaurav sharma - If you look at the logs carefully, you'll notice the message below:
java.io.IOException: No space left on device
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:345)
    at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.receiveFile(TransferFsImage.java:517)
    at ...
1. Can you please move the existing fsimage on the SNN to some other location and make sure that the disk on the SNN has capacity to store the fsimage from the NN (check the size of the fsimage on the NN and see whether the disk capacity on the SNN is sufficient to store it)?
2. Shut down the Secondary NN.
3. Run the command below to force the secondary NN to do a checkpoint (a consolidated sketch of all three steps follows):
hadoop secondarynamenode -checkpoint force
Note - Please run the above command as the hdfs user.
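Putting the three steps together, a rough sketch of what this looks like; the paths are examples only (use your dfs.namenode.name.dir / dfs.namenode.checkpoint.dir values) and stop the daemon however you normally manage it (e.g. Ambari):
----
# 0. On the NN: size of the latest fsimage (example path)
ls -lh /hadoop/hdfs/namenode/current/fsimage_* | tail -1

# 1. On the SNN: move the existing fsimage aside and confirm there is enough free space
CKPT_DIR=/hadoop/hdfs/namesecondary          # example; use your dfs.namenode.checkpoint.dir
mkdir -p /backup/snn_checkpoint_old
mv "$CKPT_DIR"/current/fsimage* /backup/snn_checkpoint_old/
df -h "$CKPT_DIR"

# 2. Stop the SecondaryNameNode (here via the daemon script; Ambari works too)
sudo -u hdfs hadoop-daemon.sh stop secondarynamenode

# 3. Force a checkpoint, run as the hdfs user
sudo -u hdfs hadoop secondarynamenode -checkpoint force
----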
Created 04-14-2016 12:18 PM
Did you try restarting the secondary namenode? If not, then first we would try a restart.
Created 04-15-2016 05:21 AM
No Jitendra, not tried yet since it's a prod environment. How about first trying to force a manual checkpoint rather than a restart? I need your suggestion from the perspective of applying this action plan in a prod environment.
Created 04-15-2016 11:39 AM
Yes, you can try forcing a checkpoint first, but I doubt it will work. Also, can you check whether you have sufficient local disk space on the SNN node as well as on HDFS? If disk space is not the problem, then we can restart the SNN, since it will not cause any issue to the primary NN or to running jobs.
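For reference, both checks can be done with standard commands; the checkpoint-directory path is only an example:
----
# Local disk space on the SNN host (example path; use your dfs.namenode.checkpoint.dir)
df -h /hadoop/hdfs/namesecondary

# Overall HDFS capacity and usage
hdfs dfs -df -h /
----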