Created on 02-27-2017 07:09 AM - edited 02-27-2017 01:02 PM
Hi,
Can anyone help me understand this error?
The IP 10.160.96.6 is the standby NameNode.
2017-02-26 01:35:40,427 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1486363199991_126195_m_000026_3: Error: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-76191351-10.160.96.6-1447247246852:blk_1157585017_83846591 file=/user/dataint/.staging/job_1486363199991_126195/job.split
at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:963)
at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:610)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:851)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:904)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:704)
at java.io.DataInputStream.readByte(DataInputStream.java:265)
at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
at org.apache.hadoop.io.WritableUtils.readVIntInRange(WritableUtils.java:348)
at org.apache.hadoop.io.Text.readString(Text.java:471)
at org.apache.hadoop.io.Text.readString(Text.java:464)
at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:426)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1707)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
2017-02-26 01:35:40,427 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1486363199991_126195_m_000026_3 TaskAttempt Transitioned from RUNNING to FAIL_FINISHING_CONTAINER
2017-02-26 01:35:40,429 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1486363199991_126195_m_000026 Task Transitioned from RUNNING to FAILED
2017-02-26 01:35:40,429 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 20
2017-02-26 01:35:40,429 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Job failed as tasks failed. failedMaps:1 failedReduces:0
2017-02-26 01:35:40,430 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: job_1486363199991_126195Job Transitioned from RUNNING to FAIL_WAIT
2017-02-26 01:35:40,435 ERROR [Thread-53] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Could not deallocate container for task attemptId attempt_1486363199991_126195_r_000000_0
2017-02-26 01:35:40,435 INFO [Thread-53] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Processing the event EventType: CONTAINER_DEALLOCATE
2017-02-26 01:35:40,435 ERROR [Thread-53] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Could not deallocate container for task attemptId attempt_1486363199991_126195_r_000001_0
2017-02-26 01:35:40,435 INFO [Thread-53] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Processing the event EventType: CONTAINER_DEALLOCATE
2017-02-26 01:35:40,435 ERROR [Thread-53] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Could not deallocate container for task attemptId attempt_1486363199991_126195_r_000002_0
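For reference, this is roughly how I look up the file and its block locations from a trace like this (the path and job id are taken from the log above, and it only works while the job's staging files still exist):

# List the job's staging directory mentioned in the trace.
hdfs dfs -ls /user/dataint/.staging/job_1486363199991_126195/

# Show the blocks of the split file and which DataNodes currently hold them.
hdfs fsck /user/dataint/.staging/job_1486363199991_126195/job.split -files -blocks -locations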
Created 02-27-2017 12:35 PM
The weird thing is that it happens sporadically, and it is always the same error but on a different block:
When I try to list the file in HDFS I can't find it. I suspect it happens on a specific disk of a specific DataNode, but it only happens with this one job. After 3 failures on the same node the job blacklists that node, until it has blacklisted all of the DataNodes and then fails (the relevant settings are sketched after the log below).
On the next run it succeeds.
2017-02-27 13:36:03,460 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1486363199991_135199_m_000014_0: Error: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-76191351-10.160.96.6-1447247246852:blk_1158119244_84380902 file=/user/dataint/.staging/job_1486363199991_135199/job.split at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:963)
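For completeness, these are the MapReduce settings that drive the blacklisting behaviour I describe above, if I read the defaults right; the grep is just a quick way to see whether they are overridden (the config path is a guess for our install):

# A node is blacklisted for a single job after this many task failures on it
# (mapreduce.job.maxtaskfailures.per.tracker, MapReduce default 3).
# A map task gives up after this many attempts overall
# (mapreduce.map.maxattempts, default 4).
grep -B1 -A2 -E 'maxtaskfailures.per.tracker|map.maxattempts' /etc/hadoop/conf/mapred-site.xml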
Created 02-27-2017 11:01 PM
Yes I did, but since I didn't catch the issue in time, all I got was:
The filesystem under path '/user/dataint/.staging' has 0 CORRUPT files
I will try to catch the issue while it is happening. I also suspect a bad disk might be causing it. Why would such a directory have only 1 replica? Is there a default for this? My whole cluster uses replication factor 3.
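One thing worth checking on the replication question: if I remember correctly, the replication of the job submission files under .staging (job.split, job.jar) is controlled by mapreduce.client.submit.file.replication (MapReduce default 10), not by dfs.replication, so a low override there would explain having few replicas. A quick way to check, assuming standard paths:

# Replication actually applied to the split file (%r = replication factor);
# only possible while the job is still running, since .staging is cleaned up afterwards.
hdfs dfs -stat "%r" /user/dataint/.staging/job_1486363199991_126195/job.split

# See whether the submit-file replication is overridden on the client
# (the config path is a guess for our install).
grep -B1 -A2 'mapreduce.client.submit.file.replication' /etc/hadoop/conf/mapred-site.xml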
Created 02-28-2017 12:05 AM
Since I can't catch the issue in time, I don't know which host had the problem...
Can I conclude it from the block name or something like that? Can such an issue be investigated retrospectively, without monitoring the job until it fails?
I looked at all the DNs and the data is well distributed across all the disks.
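What I plan to try next time, to work backwards from a block id to a host: fsck can resolve a block id directly (I believe this needs Hadoop 2.7 or later), and the NameNode/DataNode logs usually still mention the block even after the file is gone. The log path below is a guess for our install:

# Resolve a block id to its file and replica locations (Hadoop 2.7+).
hdfs fsck -blockId blk_1158119244

# Retrospectively: the NN/DN logs normally record where the block was written
# and any errors serving it, even after the file has been deleted.
# Run on the NameNode and on the suspect DataNodes.
grep blk_1158119244 /var/log/hadoop-hdfs/*.log*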
Created 02-28-2017 10:52 AM
Since the job failed, the file is no longer available (it gets deleted). The error refers to a file that includes the job id under that folder, but I can't find it; I also tried fsck, but the file doesn't exist.
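An untested idea for catching it in the act: if I understand the MR app master's cleanup correctly, setting mapreduce.task.files.preserve.failedtasks=true on the job should stop it from deleting the .staging directory of a failed job, so job.split could still be inspected with fsck afterwards. This is an assumption I haven't verified on our version, and the -D flag only works if the job uses ToolRunner:

# Hypothetical: re-run the failing job with staging-file preservation enabled
# (my-job.jar / MyJob are placeholders for the real job; unverified assumption).
hadoop jar my-job.jar MyJob -Dmapreduce.task.files.preserve.failedtasks=true <other args>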
Created 03-01-2017 10:23 AM
Only this job fails with this error.
Since I have only 10G for this, I only have 3 days of history. I restarted the DN at that time, and the job hasn't failed in the 3 days since, but I'm still curious to understand the cause and how to handle it next time.
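For next time, these are the checks I'd do before restarting the DataNode, to confirm or rule out a bad disk (the log path is a guess for our install):

# Look for disk/IO errors in the suspect DataNode's log around the failure time.
grep -iE 'ioexception|error' /var/log/hadoop-hdfs/hadoop-*-datanode-*.log | tail -n 50

# Kernel-level disk errors on the DataNode host itself.
dmesg | grep -i error

# Overall DataNode state as the NameNode sees it (dead nodes, capacity, last contact).
hdfs dfsadmin -report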