We have long-running YARN mapper tasks that each take 2 to 3 hours, and a few thousand of them run in parallel. Every day around 5 to 20 of them fail after reaching 60 to 80% progress. They fail with an error message saying that the disks are bad and the operation is aborting. When these tasks are rerun they succeed, but that takes an additional 2 to 3 hours.
Version of hadoop - Hadoop 2.6.0-cdh5.5.1
Number of datanodes ~ 800
A few of the settings that we tried increasing, without any benefit:
1. open files limit
2. dfs.datanode.handler.count
3. dfs.datanode.max.xcievers
4. dfs.datanode.max.transfer.threads
What could cause this? The source server fails to connect to itself and to the other 2 replica servers across 3 retries. This suggests that something on the source itself might be hitting a ceiling. Any thoughts would help.
Error: java.io.IOException: java.io.IOException: All datanodes DatanodeInfoWithStorage[73.06.205.146:50010,DS-e037c4c3-571a-4cc3-ae3e-85d08790e188,DISK] are bad. Aborting...
    at com.turn.platform.cheetah.storage.dmp.analytical_profile.merge.IncrementalProfileMergerMapper.close(IncrementalProfileMergerMapper.java:1185)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.io.IOException: All datanodes DatanodeInfoWithStorage[184.108.40.206:50010,DS-e037c4c3-571a-4cc3-ae3e-85d08790e188,DISK] are bad. Aborting...
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1328)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1119)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:622)
Check the disk status for the DataNode that is mentioned in the exception.
Do you see any warning on your CM dashboard? If yes, can you post it?
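If there are many failed task attempts, it may help to tally which DataNodes the exceptions actually name before checking disks by hand. The following is only a rough sketch: it assumes the task logs have been collected as plain text and that the message has the same shape as the one quoted in the question. The function name `bad_datanodes` is made up for illustration.

```python
import re
from collections import Counter

# Matches "DatanodeInfoWithStorage[<ip>:<port>,DS-<storage id>,<storage type>]"
# as it appears in the exception quoted in the question.
DATANODE_RE = re.compile(
    r"DatanodeInfoWithStorage\[([^,\]]+),(DS-[0-9a-f-]+),(\w+)\]"
)

def bad_datanodes(log_text):
    """Count (address, storage ID) pairs named in 'are bad. Aborting' messages."""
    counts = Counter()
    for line in log_text.splitlines():
        if "are bad. Aborting" not in line:
            continue
        for addr, storage_id, _storage_type in DATANODE_RE.findall(line):
            counts[(addr, storage_id)] += 1
    return counts
```

Calling `bad_datanodes(open(path).read()).most_common(10)` on a collected log file would list the most frequently named nodes, which are the ones worth checking first.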
These drives have not actually failed on the hardware side.
We see some alerts, such as "Clock offset", flapping every minute for different nodes, along with other errors such as free space, agent status, data directory status, and frame errors. But there are none for the 3 hosts that hold the replicas.
Let's start by fixing them one by one.
1. Start the ntpd service on all nodes to fix the clock offset problem, if the service is not already started. If it is started, make sure that all the nodes refer to the same NTP server.
2. Check the space utilization for the DNs that report the "Free Space" issue. I would assume that you're reaching a certain threshold, which is causing these alerts.
3. About agent status, could you show what the actual message is for this one? Alternatively, restart the cloudera-scm-agent service on the nodes that are hitting this alert and see if the alerts go away.
4. Post the exact message for Data Directory status.
5. Could you specify more about the frame errors, like exact message or a screenshot?
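For the free-space check in step 2, a quick script run on each affected DN can show how full the data directories are. This is a minimal sketch, not what Cloudera Manager itself computes: the 90% threshold and the function name `dirs_over_threshold` are assumptions, and the directory list has to come from your own dfs.datanode.data.dir setting.

```python
import shutil

def dirs_over_threshold(data_dirs, max_used_fraction=0.90):
    """Return (path, used_fraction) for directories on filesystems fuller than the threshold."""
    flagged = []
    for path in data_dirs:
        usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
        if usage.total == 0:
            continue  # skip pseudo-filesystems that report no capacity
        used_fraction = usage.used / usage.total
        if used_fraction > max_used_fraction:
            flagged.append((path, used_fraction))
    return flagged
```

The numbers may differ slightly from what `df` reports, because reserved blocks are counted differently.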
All the issues that were showing up in Cloudera Manager are fixed. The problem of the disks being reported bad and the write aborting still exists. Are there any thresholds we are breaching?
The exception in the log snippet is raised from "com.turn.platform.cheetah.storage.dmp.analytical_profile.merge.IncrementalProfileMergerMapper.close".
Your DNs are aborting the operation, and the trace points to this class. This seems to be a custom third-party class. Kindly check with your vendor about this.
The code is in-house. The tasks don't fail consistently: this job launches 4000 containers, and only a handful of them, fewer than 60, fail. On their second attempt, they all succeed. On 1 or 2 days a month, the job completes without any failure.
I would still check with the developer as to why it fails the first time and not again. A certain parameter is being hit that we cannot determine from our end.
One more thing we noticed: whenever there is a bunch of failures, there are 1 or 2 servers in the write pipeline that had disk errors. These are mostly the last nodes in the pipeline. Shouldn't the write automatically skip a node with a disk failure and go ahead with the other 2 replicas, where it succeeded?
Editing my update:
Could you please post the DN logs in a Pastebin link: https://pastebin.com/
We can have a look at them. The exceptions given in the description seem to be a consequence of an earlier problem and hence looking at the DN logs before the mentioned exceptions should help us clarify the problem.
Also, grep your DN logs for "xceiverCount" or "exceeds the limit of concurrent xcievers" and post the results here.
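If the grep output is large, grouping the hits by hour makes it easier to see whether they line up with the task failures. A small sketch, assuming each DN log line starts with a timestamp of the form "YYYY-MM-DD HH:MM:SS"; the function name `count_xceiver_hits` is made up for illustration.

```python
from collections import Counter

# The two search strings mentioned above.
PATTERNS = ("xceiverCount", "exceeds the limit of concurrent xcievers")

def count_xceiver_hits(lines):
    """Count matching log lines per hour, keyed by the 'YYYY-MM-DD HH' prefix."""
    per_hour = Counter()
    for line in lines:
        if any(pattern in line for pattern in PATTERNS):
            per_hour[line[:13]] += 1  # first 13 characters: date plus hour
    return per_hour
```

Since an open file iterates line by line, something like `count_xceiver_hits(open(log_path))` should work on a DN log directly.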