Created on 11-27-2018 02:18 PM - edited 09-16-2022 06:56 AM
Hi,
We have long-running YARN mapper tasks that take 2 to 3 hours each, and a few thousand of them run in parallel. Every day around 5 to 20 of them fail after progressing to 60 to 80%, with an error message saying the datanodes are bad and the write is aborting. When these tasks are rerun they succeed, but at the cost of another 2 to 3 hours.
Version of hadoop - Hadoop 2.6.0-cdh5.5.1
Number of datanodes ~ 800
A few of the settings we tried increasing, without any benefit, are listed below (an illustrative hdfs-site.xml sketch follows the list):
1. increased open files
2. increase dfs.datanode.handler.count
3. increase dfs.datanode.max.xcievers
4. increase dfs.datanode.max.transfer.threads
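For context, the last three of these are hdfs-site.xml properties on the datanodes (dfs.datanode.max.xcievers is the older, misspelled property name that dfs.datanode.max.transfer.threads supersedes), while the open files limit is the OS ulimit for the DataNode process. A sketch of the kind of change, with purely illustrative values rather than the ones we actually used:

<!-- hdfs-site.xml on the datanodes; values are illustrative only -->
<property>
  <name>dfs.datanode.handler.count</name>
  <value>64</value>
</property>
<property>
  <!-- legacy name, superseded by dfs.datanode.max.transfer.threads -->
  <name>dfs.datanode.max.xcievers</name>
  <value>8192</value>
</property>
<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>8192</value>
</property>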
What could cause this? The source server fails to connect to itself and to the other two replica servers across three retries, which suggests that something on the source node itself is hitting a ceiling. Any thoughts would help.
Error: java.io.IOException: java.io.IOException: All datanodes DatanodeInfoWithStorage[73.06.205.146:50010,DS-e037c4c3-571a-4cc3-ae3e-85d08790e188,DISK] are bad. Aborting...
    at com.turn.platform.cheetah.storage.dmp.analytical_profile.merge.IncrementalProfileMergerMapper.close(IncrementalProfileMergerMapper.java:1185)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.io.IOException: All datanodes DatanodeInfoWithStorage[50.116.205.146:50010,DS-e037c4c3-571a-4cc3-ae3e-85d08790e188,DISK] are bad. Aborting...
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1328)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1119)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:622)
Thanks
Kumar
Created 12-05-2018 06:10 PM
Grepping the logs on all 3 nodes involved in this operation found no matches for "XceiverCount" or for "exceeds the limit of concurrent xceivers".
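For reference, the search was along these lines; the log path assumes the usual CDH layout and may differ on your cluster:

grep -i "XceiverCount" /var/log/hadoop-hdfs/*DATANODE*.log*
grep -i "exceeds the limit of concurrent xceivers" /var/log/hadoop-hdfs/*DATANODE*.log*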
It looks like the pastebin link has expired; can you share it again? I will post the logs for the relevant time window. I did not see anything unusual in them, though.
Thanks
Created 12-05-2018 07:56 PM
You can go to pastebin again and click "+ new paste" to get a new text field for posting the logs. Once done, scroll down and click "create new paste"; a link will be generated. Share that link with us.
Created 12-05-2018 08:05 PM
Hey,
Once the job fails, does the disk space usage disappear?
Can you check whether the disk space issue occurs on the application master node?
I assume this is in the container logs, and you can check this while the job is running.
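For example (the paths and application id below are placeholders), something like this shows the free space on the YARN local and log directories while the job is running, and pulls the container logs once the job has finished and log aggregation has collected them:

df -h /data/yarn/nm /data/yarn/container-logs
yarn logs -applicationId application_1543000000000_0001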
Created 12-06-2018 12:19 AM
This is not due to disk space, as there is sufficient disk space even when this job is running.
Thanks
RK
Created 01-08-2019 03:00 PM
Hi Fawze,
This is not a disk space issue. There is sufficient space on these large drives.
Thanks
Created 12-06-2018 12:18 AM
Here are pastebin links with excerpts from the log files.
There is one container log plus the logs from the 3 datanodes that were part of the same pipeline write operation.
container log
https://pastebin.com/SbMvr52W
node 1
https://pastebin.com/hXbNXCe1
node 2
https://pastebin.com/qAMfkVsg
node 3
https://pastebin.com/RL5W2qfp
Thanks