Member since: 09-24-2015
Posts: 19
Kudos Received: 2
Solutions: 1
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 6240 | 06-22-2016 06:18 PM |
01-08-2019 03:00 PM
Hi Fawze, this is not a disk space issue. There is sufficient space on these large drives. Thanks
12-06-2018 12:19 AM
This is not due to disk space, as there is sufficient disk space even when this job is running. Thanks RK
12-06-2018 12:18 AM
Here are pastebin links with excerpts from the log files: one container log and three datanode logs from the nodes that were part of the same pipeline write operation.
- Container log: https://pastebin.com/SbMvr52W
- Node 1: https://pastebin.com/hXbNXCe1
- Node 2: https://pastebin.com/qAMfkVsg
- Node 3: https://pastebin.com/RL5W2qfp
Thanks
12-05-2018 06:10 PM
Grepping all 3 nodes involved in this operation did not match anything for XceiverCount or for "exceeds the limit of concurrent xceivers". It looks like the pastebin link has expired; could you add it again? I will post the logs for that duration. I did not see anything unusual, though. Thanks
12-02-2018 10:10 PM
One more thing we noticed: whenever there is a bunch of failures, 1 or 2 servers in the write pipeline had disk errors, and these are mostly the last nodes in the pipeline. Shouldn't the write automatically skip a node with a disk failure and go ahead with the other 2 replicas where it succeeded? Thanks RK
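For context on that last question, HDFS does expose client-side settings that control whether a write pipeline replaces a failed datanode or simply continues with the surviving replicas. The sketch below only lists those property names with illustrative values; it is not a configuration we have validated on this cluster.

```xml
<!-- hdfs-site.xml (client side) - sketch only; the values shown are illustrative assumptions -->
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
  <value>true</value>   <!-- allow replacing a failed datanode in the write pipeline -->
</property>
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
  <value>DEFAULT</value>   <!-- NEVER / DEFAULT / ALWAYS -->
</property>
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.best-effort</name>
  <value>true</value>   <!-- if replacement fails, keep writing with the remaining datanodes -->
</property>
```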
11-30-2018 10:46 AM
The code is in-house. The tasks don't fail consistently: this job launches 4000 containers, and only a handful of them, fewer than 60, fail. On their second attempt they all succeed. There are 1 or 2 days in a month when the job completes without any failure. Thanks RK
11-29-2018 10:17 PM
All the issues that were showing up in Cloudera Manager are fixed, but the "datanodes are bad, aborting" problem still exists. Are there any thresholds we are breaching? Thanks RK
11-28-2018 03:34 PM
Hi Raman, these drives have not actually failed on the hardware side. We do see some alerts, such as "Clock offset" flapping every minute on different nodes, and other errors like free space, agent status, data directory status, and frame errors, but none on the 3 hosts that have the replicas. Thanks RK
11-27-2018 02:18 PM
Hi, we have long YARN mapper tasks that run for 2 to 3 hours, and there are a few thousand of them running in parallel. Every day we see around 5 to 20 of them fail after they have progressed to 60 to 80%. They fail with an error message saying the datanodes are bad and the write is aborting. When these tasks rerun, they succeed, taking an additional 2 to 3 hours.

Hadoop version: Hadoop 2.6.0-cdh5.5.1
Number of datanodes: ~800

A few of the values we tried increasing, without any benefit:
1. increased open files
2. increased dfs.datanode.handler.count
3. increased dfs.datanode.max.xcievers
4. increased dfs.datanode.max.transfer.threads

What could cause this? The source server fails to connect to itself and to the other 2 replica servers for 3 retries, which suggests that something on the source itself might be hitting some ceiling. Any thoughts will help.

Error:

    java.io.IOException: java.io.IOException: All datanodes DatanodeInfoWithStorage[73.06.205.146:50010,DS-e037c4c3-571a-4cc3-ae3e-85d08790e188,DISK] are bad. Aborting...
        at com.turn.platform.cheetah.storage.dmp.analytical_profile.merge.IncrementalProfileMergerMapper.close(IncrementalProfileMergerMapper.java:1185)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
    Caused by: java.io.IOException: All datanodes DatanodeInfoWithStorage[50.116.205.146:50010,DS-e037c4c3-571a-4cc3-ae3e-85d08790e188,DISK] are bad. Aborting...
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1328)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1119)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:622)

Thanks Kumar
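For reference, the datanode properties in the list above live in hdfs-site.xml, and there are related socket timeouts that often come up when digging into "All datanodes ... are bad. Aborting" pipelines. The snippet below is only a rough sketch with illustrative values, not our actual settings.

```xml
<!-- hdfs-site.xml - sketch only; values are illustrative, not this cluster's settings -->

<!-- Datanode side: the thread limits we already tried raising -->
<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>8192</value>   <!-- current name for the deprecated dfs.datanode.max.xcievers; default 4096 -->
</property>
<property>
  <name>dfs.datanode.handler.count</name>
  <value>32</value>   <!-- datanode RPC handler threads -->
</property>

<!-- Client/datanode socket timeouts on the write path -->
<property>
  <name>dfs.client.socket-timeout</name>
  <value>120000</value>   <!-- ms; default 60000 -->
</property>
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>960000</value>   <!-- ms; default 480000 -->
</property>
```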
Labels:
- Apache YARN
- HDFS
06-22-2016 06:18 PM
2 Kudos
This was resolved by temporarily increasing the heap size for Cloudera Manager.