
Unable to read block from a DataNode

Master Collaborator

Hi,

Can anyone help me understand this error? The IP 10.160.96.6 is the standby NameNode.

2017-02-26 01:35:40,427 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1486363199991_126195_m_000026_3: Error: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-76191351-10.160.96.6-1447247246852:blk_1157585017_83846591 file=/user/dataint/.staging/job_1486363199991_126195/job.split
at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:963)
at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:610)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:851)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:904)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:704)
at java.io.DataInputStream.readByte(DataInputStream.java:265)
at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
at org.apache.hadoop.io.WritableUtils.readVIntInRange(WritableUtils.java:348)
at org.apache.hadoop.io.Text.readString(Text.java:471)
at org.apache.hadoop.io.Text.readString(Text.java:464)
at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:426)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1707)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)


2017-02-26 01:35:40,427 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1486363199991_126195_m_000026_3 TaskAttempt Transitioned from RUNNING to FAIL_FINISHING_CONTAINER
2017-02-26 01:35:40,429 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1486363199991_126195_m_000026 Task Transitioned from RUNNING to FAILED
2017-02-26 01:35:40,429 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 20
2017-02-26 01:35:40,429 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Job failed as tasks failed. failedMaps:1 failedReduces:0
2017-02-26 01:35:40,430 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: job_1486363199991_126195Job Transitioned from RUNNING to FAIL_WAIT

2017-02-26 01:35:40,435 ERROR [Thread-53] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Could not deallocate container for task attemptId attempt_1486363199991_126195_r_000000_0
2017-02-26 01:35:40,435 INFO [Thread-53] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Processing the event EventType: CONTAINER_DEALLOCATE
2017-02-26 01:35:40,435 ERROR [Thread-53] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Could not deallocate container for task attemptId attempt_1486363199991_126195_r_000001_0
2017-02-26 01:35:40,435 INFO [Thread-53] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Processing the event EventType: CONTAINER_DEALLOCATE
2017-02-26 01:35:40,435 ERROR [Thread-53] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Could not deallocate container for task attemptId attempt_1486363199991_126195_r_000002_0

10 REPLIES

Master Collaborator

The weird thing is that it happens sporadically, with the same error on a different block each time:

When I try to list the file in HDFS I can't find it. I suspect the problem is with a specific disk on a specific DataNode, but it only affects this one job: after 3 failures on the same node the job blacklists that node, until it has blacklisted all the DataNodes and then fails.

On the next run it succeeds.

2017-02-27 13:36:03,460 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1486363199991_135199_m_000014_0: Error: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-76191351-10.160.96.6-1447247246852:blk_1158119244_84380902 file=/user/dataint/.staging/job_1486363199991_135199/job.split
	at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:963)

 

Champion
This looks disturbing to me. The staging directory is used by the job to create job-specific data, so this file was written out after the job started, yet the job is unable to read any of the blocks for the file. This could indicate a DataNode that is about to fail, or maybe a few bad disks. It is likely that the file was created with a replication factor of 1.

Have you run fsck against the parent folder, or against all of HDFS, with -list-corruptfileblocks and -blocks? Are any other blocks missing or corrupt?
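
For reference, a minimal sketch of those fsck invocations; the /user/dataint/.staging path is taken from the error above, and -files/-locations are added here so the replica hosts get printed as well:

# List any files under the staging directory that have corrupt blocks
hdfs fsck /user/dataint/.staging -list-corruptfileblocks

# Show each file's block layout, replica count and replica locations
hdfs fsck /user/dataint/.staging -files -blocks -locations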

Master Collaborator

Yes, I did, but since I didn't catch the issue in time I got:

The filesystem under path '/user/dataint/.staging' has 0 CORRUPT files

I will try to catch the issue the next time it happens. I also suspect a bad disk might be causing it. Why would such a directory be created with 1 replica? Is there a default for this? My whole cluster uses a replication factor of 3.

Champion
Since it is temporary data generated for a job and removed when the job is done, pass or fail, it could be created with a replication factor of 1. I don't know that for certain, but I could see it being set that way for that reason.

I don't know if this will help, but the file itself is the one the job creates to track its input splits.
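
If you manage to catch a run while it is still in flight, you can check what replication factor those staging files actually get. A quick check, assuming the staging path from the error above (the second column of the -ls output, or %r in -stat, is the replication factor):

# List the job's staging directory; column 2 is the replication factor
hdfs dfs -ls /user/dataint/.staging/job_1486363199991_126195

# Or print just the replication factor and file name for each file
hdfs dfs -stat "%r %n" /user/dataint/.staging/job_1486363199991_126195/*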

Master Collaborator

Since I can't catch the issue in time, I can't know which host had the problem ...

Can I conclude it from the block name or something like that? Can such an issue be investigated retrospectively, without monitoring the job until it fails?

I looked at all the DNs and the data is well distributed across all the disks.

Champion
Try hdfs fsck -blockId blk_1157585017_83846591 if it is available. It is on my CDH 5.8.2 cluster but not on my CDH 5.4.5 cluster. I am getting an error on my CDH 5.8.2 cluster, so I don't know whether it will have the output you are looking for. You could also try scanning all of the DFS data directories for the block manually.

It will only find it if the block was actually created, though, which may not be the case if it says the job.split file no longer exists.
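
A rough sketch of both checks. The data-directory path below (/data/*/dfs/dn) is only an assumption about where dfs.datanode.data.dir points on your DataNodes, so substitute your own, and the generation-stamp suffix (_83846591) may need to be dropped for fsck -blockId:

# Ask the NameNode which file this block belongs to and which nodes/racks hold it
hdfs fsck -blockId blk_1157585017

# On each DataNode, search the configured data directories for the block file
find /data/*/dfs/dn -name 'blk_1157585017*' 2>/dev/null

Even after the block files are gone, the NameNode and DataNode logs normally still mention the block id (allocation, replication, deletion), so grepping those logs for blk_1157585017 is another way to work out retrospectively which hosts held the replicas.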

Master Collaborator

Since the job failed, the file was deleted and is no longer available. In the error I can see it is complaining about a file (including the job id) under that folder, but I can't find it. I also tried fsck, but the file doesn't exist.

Champion
It is just this one job, right? Can you provide the full job logs?

Master Collaborator

Only this job failed with this error.

Since I have only 10G for ts, I have history for 3 days. I restarted the DN at that time, and the job hasn't failed in the 3 days since, but I'm still curious to understand the cause and how to handle it next time.