Getting error "Exception in doCheckpoint java.io.IOException: Unable to download to any storage directory"
Labels: Apache Hadoop
Created 04-14-2016 07:02 AM
Hello All,
Below are the errors seen on the secondary namenode. There are also about 6,000 under-replicated blocks; I am not sure whether that is related to this issue. DataNode health is fine. Any pointers would be appreciated.
==
2016-04-12 15:45:03,660 INFO namenode.SecondaryNameNode (SecondaryNameNode.java:run(453)) - Image has not changed. Will not download image.
2016-04-12 15:45:03,661 INFO namenode.TransferFsImage (TransferFsImage.java:getFileClient(394)) - Opening connection to http://ey9omprna005.vzbi.com:50070/imagetransfer?getedit=1&startTxId=236059442&endTxId=23608.
2016-04-12 15:45:03,665 ERROR namenode.SecondaryNameNode (SecondaryNameNode.java:doWork(399)) - Exception in doCheckpoint java.io.IOException: Unable to download to any storage directory
==
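For reference, the DataNode health and under-replicated block count can be confirmed with standard HDFS tooling; a minimal sketch, with nothing cluster-specific assumed:
-----
# DataNode liveness and capacity summary
hdfs dfsadmin -report | head -n 20

# Filesystem health summary; the tail includes the under-replicated block count
hdfs fsck / | tail -n 30
-----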
Created 04-14-2016 07:31 AM
@gsharma - Looking at the error, it appears to be a problem with the secondary namenode's local storage.
Can you please check the value of dfs.namenode.checkpoint.dir and look for issues such as a read-only mount, a full disk, or a bad disk (a quick check is sketched below)?
Also, the under-replicated block issue is not related to this one.
How many datanodes do you have? What is the replication factor? Are all the datanodes healthy?
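A minimal sketch of those storage checks, run on the secondary namenode host; the directory in the df line is only an example, so substitute whatever the getconf command returns:
-----
# Show the configured checkpoint directory (read from hdfs-site.xml)
hdfs getconf -confKey dfs.namenode.checkpoint.dir

# Check free space on that directory (example path shown)
df -h /hadoop/hdfs/namesecondary

# List any mounts that are currently read-only
awk '$4 ~ /(^|,)ro(,|$)/ {print $2, $4}' /proc/mounts
-----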
Created 04-15-2016 05:33 AM
@Kuldeep Kulkarni Here are my responses to your queries.
1. Checked; no mount points are read-only.
2. Checked df -h on both nodes; no space issues.
3. We have 5 DNs, and the replication factor is 3.
4. All datanodes seem to be healthy, with only 50% DFS utilization.
==
Here is my investigation so far.
>> The last successful fsimage on the NN is dated April 5th:
----
-rw-r--r-- 1 hdfs hadoop 411408570 Apr 5 20:11 fsimage_0000000000236059441
-----
>> Before the above, the last checkpoint file goes back to February, a gap of 40+ days (see the checkpoint-interval check sketched after this list):
-----
-rw-r--r-- 1 hdfs hadoop 144021898 Feb 24 2015 fsimage.ckpt_0000000000039014860
----
>> The same fsimage reaches the secondary namenode at:
----
-rw-r--r-- 1 hdfs hadoop 411408570 Apr 5 22:11 fsimage_0000000000236059441
-----
>> Now the secondary namenode merges the edits with the recently acquired fsimage and creates a new fsimage to be fetched by the primary NN:
-----
-rw-r--r-- 1 hdfs hadoop 42688512 Apr 6 10:12 fsimage.ckpt_0000000000236367214
-----
>> No transactions are visible on either the NameNode or the Secondary after that.
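Given that gap, the effective checkpoint trigger settings on the SNN may also be worth confirming; a minimal check using the standard property names (the defaults quoted in the comment are the usual Hadoop 2.x values):
-----
# How often a checkpoint should be triggered
# (defaults are typically 3600 seconds and 1,000,000 transactions)
hdfs getconf -confKey dfs.namenode.checkpoint.period
hdfs getconf -confKey dfs.namenode.checkpoint.txns
-----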
===
In the secondary NN's hadoop-hdfs-secondarynamenode-xxx-yy.out, I can see the errors below:
===
java.io.IOException: No space left on device
        at java.io.FileOutputStream.writeBytes(Native Method)
        at java.io.FileOutputStream.write(FileOutputStream.java:345)
        at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.receiveFile(TransferFsImage.java:517)
        at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.doGetUrl(TransferFsImage.java:431)
        at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.getFileClient(TransferFsImage.java:395)
        at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.downloadEditsToStorage(TransferFsImage.java:167)
        at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$2.run(SecondaryNameNode.java:465)
        at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$2.run(SecondaryNameNode.java:444)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
        at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.downloadCheckpointFiles(SecondaryNameNode.java:443)
        at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:540)
        at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doWork(SecondaryNameNode.java:395)
        at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$1.run(SecondaryNameNode.java:361)
        at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:412)
        at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:357)
        at java.lang.Thread.run(Thread.java:745)
log4j:ERROR Failed to flush writer,
====
And below error comes after the above : -
====
java.io.IOException: Unable to download to any storage directory
        at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.receiveFile(TransferFsImage.java:505)
        at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.doGetUrl(TransferFsImage.java:431)
        at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.getFileClient(TransferFsImage.java:395)
        at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.downloadEditsToStorage(TransferFsImage.java:167)
        at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$2.run(SecondaryNameNode.java:465)
        at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$2.run(SecondaryNameNode.java:444)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
        at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.downloadCheckpointFiles(SecondaryNameNode.java:443)
        at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:540)
        at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doWork(SecondaryNameNode.java:395)
        at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$1.run(SecondaryNameNode.java:361)
        at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:412)
        at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:357)
        at java.lang.Thread.run(Thread.java:745)
=====
Now, apart from our investigation, I wanted to clarify:
1. Does the "No space left on device" error come from the primary NN when it tries to fetch the fsimage from the SNN? Or does it come from the SNN itself, being unable to download the fsimage/edits it gets from the primary NN?
There are no timestamps in the .out file, so I can't actually put the issues/patterns/errors in order.
Created 04-15-2016 06:21 AM
@gaurav sharma - Looking at the logs carefully, I noticed the message below:
java.io.IOException: No space left on device
        at java.io.FileOutputStream.writeBytes(Native Method)
        at java.io.FileOutputStream.write(FileOutputStream.java:345)
        at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.receiveFile(TransferFsImage.java:517)
1. Can you please move the existing fsimage on the SNN to some other location, and make sure the disk on the SNN has capacity to store the fsimage from the NN (check the size of the fsimage on the NN and confirm the total disk capacity on the SNN is sufficient to store it)?
2. Shut down the Secondary NN.
3. Run the command below to force the secondary NN to do a checkpoint (a consolidated sketch of these steps follows the note):
hadoop secondarynamenode -checkpoint force
Note - Please run the above command as the hdfs user.
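A consolidated sketch of the above, assuming the SNN checkpoint directory is /hadoop/hdfs/namesecondary and a throwaway backup location under /tmp (both paths are examples; use the value of dfs.namenode.checkpoint.dir and any location with enough free space):
-----
# 1. Move the existing fsimage files on the SNN aside (example paths)
mkdir -p /tmp/snn-fsimage-backup
mv /hadoop/hdfs/namesecondary/current/fsimage* /tmp/snn-fsimage-backup/

# Confirm the SNN disk can hold the fsimage currently on the NN
df -h /hadoop/hdfs/namesecondary

# 2./3. With the SecondaryNameNode daemon stopped, force a checkpoint as hdfs
su - hdfs -c "hadoop secondarynamenode -checkpoint force"
-----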
Created 04-14-2016 12:18 PM
Did you try restarting the secondary namenode? If not, then I would first try a restart.
Created 04-15-2016 05:21 AM
No Jitendra, not tried yet since it's a prod environment. How about first trying to force a manual checkpoint rather than a restart? I need your suggestion on applying this action plan from a production-environment perspective.
Created 04-15-2016 11:39 AM
Yes, you can try forcing a checkpoint first, but I doubt it will work. Also, can you check whether you have sufficient local disk space on the SNN node as well as on HDFS (a quick check is sketched below)? If disk space is not a problem, then we can restart the SNN, since it will not cause any issue to the PNN or to running jobs.
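A minimal sketch of those space checks; the NN and SNN directories shown are examples, so substitute the values of dfs.namenode.name.dir and dfs.namenode.checkpoint.dir on your cluster:
-----
# Size of the latest fsimage on the NameNode (example path)
ls -lh /hadoop/hdfs/namenode/current/fsimage_* | tail -n 1

# Free local space in the SNN checkpoint directory (example path)
df -h /hadoop/hdfs/namesecondary

# Overall HDFS capacity and usage
hdfs dfsadmin -report | head -n 20
-----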
