Unable to read block from a DataNode
Labels: MapReduce
Created on 02-27-2017 07:09 AM - edited 02-27-2017 01:02 PM
Hi,
Can anyone help me understand this error?
The IP 10.160.96.6 is the standby NameNode.
2017-02-26 01:35:40,427 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1486363199991_126195_m_000026_3: Error: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-76191351-10.160.96.6-1447247246852:blk_1157585017_83846591 file=/user/dataint/.staging/job_1486363199991_126195/job.split
at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:963)
at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:610)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:851)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:904)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:704)
at java.io.DataInputStream.readByte(DataInputStream.java:265)
at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
at org.apache.hadoop.io.WritableUtils.readVIntInRange(WritableUtils.java:348)
at org.apache.hadoop.io.Text.readString(Text.java:471)
at org.apache.hadoop.io.Text.readString(Text.java:464)
at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:426)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1707)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
2017-02-26 01:35:40,427 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1486363199991_126195_m_000026_3 TaskAttempt Transitioned from RUNNING to FAIL_FINISHING_CONTAINER
2017-02-26 01:35:40,429 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1486363199991_126195_m_000026 Task Transitioned from RUNNING to FAILED
2017-02-26 01:35:40,429 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 20
2017-02-26 01:35:40,429 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Job failed as tasks failed. failedMaps:1 failedReduces:0
2017-02-26 01:35:40,430 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: job_1486363199991_126195Job Transitioned from RUNNING to FAIL_WAIT
2017-02-26 01:35:40,435 ERROR [Thread-53] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Could not deallocate container for task attemptId attempt_1486363199991_126195_r_000000_0
2017-02-26 01:35:40,435 INFO [Thread-53] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Processing the event EventType: CONTAINER_DEALLOCATE
2017-02-26 01:35:40,435 ERROR [Thread-53] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Could not deallocate container for task attemptId attempt_1486363199991_126195_r_000001_0
2017-02-26 01:35:40,435 INFO [Thread-53] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Processing the event EventType: CONTAINER_DEALLOCATE
2017-02-26 01:35:40,435 ERROR [Thread-53] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Could not deallocate container for task attemptId attempt_1486363199991_126195_r_000002_0
Created 02-27-2017 12:35 PM
The weird thing is that it happens sporadically, with the same error but on a different block each time:
When I try to list the file in HDFS I can't find it. I suspect it happens on a specific disk on a specific DataNode, but it only happens with one job: after 3 failures on the same node the job blacklists that node, until it has blacklisted all the DataNodes and then fails.
On the next run it succeeds.
2017-02-27 13:36:03,460 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1486363199991_135199_m_000014_0: Error: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-76191351-10.160.96.6-1447247246852:blk_1158119244_84380902 file=/user/dataint/.staging/job_1486363199991_135199/job.split at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:963)
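As a side note on the blacklisting: the per-node failure threshold the MR ApplicationMaster uses should be the job setting mapreduce.job.maxtaskfailures.per.tracker (default 3, if I remember correctly). One way to check whether it is overridden on the cluster (the config path below is a common default and may differ):
# shows the property only if it is explicitly set; otherwise the default of 3 applies
grep -B1 -A2 'maxtaskfailures' /etc/hadoop/conf/mapred-site.xml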
Created 02-27-2017 10:52 PM
Have you run fsck against the parent folder, or against all of HDFS, with the -list-corruptfileblocks and -blocks options? Are any other blocks missing or corrupt?
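For example, something like this (the staging path is taken from the error above; adjust as needed):
# list any corrupt files across the whole filesystem
hdfs fsck / -list-corruptfileblocks
# check the parent folder of the failing split file, with per-block details and locations
hdfs fsck /user/dataint/.staging -files -blocks -locations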
Created 02-27-2017 11:01 PM
Yes I did, but since I didn't catch the issue in time I got:
The filesystem under path '/user/dataint/.staging' has 0 CORRUPT files
I will try to catch the issue when it happens. I also suspect a bad disk might be causing it. Why would such a directory have only 1 replica? Is there a default for this? My whole cluster uses a replication factor of 3.
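For reference, the actual replication of a job's staging files can be checked while the job is still running, e.g. (the job ID below is just an example):
# %r prints the replication factor, %n the file name
hdfs dfs -stat "%r %n" /user/dataint/.staging/job_1486363199991_126195/*
# note: the submission files are written with mapreduce.client.submit.file.replication
# (default 10, as far as I know), not with dfs.replication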
Created 02-27-2017 11:15 PM
I don't know if this will help, but the file itself is what the job creates to track its input splits.
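If you manage to catch one while a job is still running, listing its staging directory should show it next to the other submission files (the job ID below is just an example):
hdfs dfs -ls /user/dataint/.staging/job_1486363199991_126195/
# typically contains job.jar, job.xml, job.split and job.splitmetainfo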
Created 02-28-2017 12:05 AM
Since I can't catch the issue in time, I can't tell which host had the problem ...
Can I conclude it from the block name or something like that? Can such an issue be investigated retrospectively, without monitoring the job until it fails?
I looked at all the DataNodes and all the disks are well distributed.
Created 02-28-2017 10:40 AM
It will only find it if the block was actually created, though, which may not be the case given that it says the job.split file doesn't exist any longer.
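Something along these lines, using the block ID from the error above without the generation stamp (this needs Hadoop 2.7 or later, if I remember correctly):
# prints the block's status and the DataNode(s) holding its replicas, if the block still exists
hdfs fsck -blockId blk_1157585017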
Created 02-28-2017 10:52 AM
Since the job failed, the file has been deleted and is no longer available. Also, the error refers to a file that includes the job ID under that folder, but I can't find it; I also tried fsck, but the file doesn't exist.
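One retrospective option, assuming the NameNode and DataNode logs still cover that time window (the log paths below are typical defaults and may differ on your cluster), would be to grep them for the block ID from the error:
# on the active NameNode: block allocation/deletion records
grep blk_1157585017 /var/log/hadoop-hdfs/*namenode*.log*
# on each DataNode: receiving/serving/deleting that replica
grep blk_1157585017 /var/log/hadoop-hdfs/*datanode*.log*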
Created 03-01-2017 10:02 AM
Created 03-01-2017 10:23 AM
Only this job failed with this error.
Since I have only 10G for ts, I have history for 3 days. I restarted the DataNode at that time and the job hasn't failed in the 3 days since, but I'm still curious to understand the cause and how to handle it next time.
