Member since: 09-24-2015
Posts: 19
Kudos Received: 2
Solutions: 1
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 6240 | 06-22-2016 06:18 PM |
01-08-2019 03:00 PM
Hi Fawze, this is not a disk space issue. There is sufficient space on these large drives. Thanks
12-06-2018 12:19 AM
This is not due to disk space, as there is sufficient disk space even when this job is running. Thanks RK
12-06-2018 12:18 AM
Here are pastebin links with excerpts from the log files: one container log and three datanode logs from the nodes that were part of the same pipeline write operation.
- Container log: https://pastebin.com/SbMvr52W
- Node 1: https://pastebin.com/hXbNXCe1
- Node 2: https://pastebin.com/qAMfkVsg
- Node 3: https://pastebin.com/RL5W2qfp
Thanks
12-05-2018 06:10 PM
Grepping all 3 nodes involved in this operation did not match anything for XceiverCount or for "exceeds the limit of concurrent xceivers". It looks like the pastebin link has expired; could you add it again? I will post the logs for that duration. I did not see anything unusual, though. Thanks
12-02-2018 10:10 PM
One more thing we noticed: whenever there is a bunch of failures, 1 or 2 servers in the write pipeline had disk errors, and these are mostly the last nodes in the pipeline. Shouldn't the write automatically skip a node with a disk failure and go ahead with the other 2 replicas where it succeeded? Thanks RK
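For context on that last question, HDFS does expose client-side settings that control whether a write pipeline replaces a failed datanode or simply continues with the surviving replicas. The sketch below only lists those property names with illustrative values; it is not a configuration we have validated on this cluster.

```xml
<!-- hdfs-site.xml (client side) - sketch only; the values shown are illustrative assumptions -->
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
  <value>true</value>   <!-- allow replacing a failed datanode in the write pipeline -->
</property>
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
  <value>DEFAULT</value>   <!-- NEVER / DEFAULT / ALWAYS -->
</property>
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.best-effort</name>
  <value>true</value>   <!-- if replacement fails, keep writing with the remaining datanodes -->
</property>
```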
11-30-2018 10:46 AM
The code is in-house. The tasks don't fail consistently: this job launches 4000 containers, and only a handful of them, fewer than 60, fail. On their second attempt they all succeed. There are 1 or 2 days in a month when the job completes without any failure. Thanks RK
11-29-2018 10:17 PM
All the issues that were showing up in Cloudera Manager are fixed, but the "datanodes are bad, aborting" problem still exists. Are there any thresholds we are breaching? Thanks RK
11-28-2018 03:34 PM
Hi Raman, these drives have not actually failed on the hardware side. We do see some alerts, such as "Clock offset" flapping every minute on different nodes, and other errors like free space, agent status, data directory status, and frame errors, but none on the 3 hosts that have the replicas. Thanks RK
11-27-2018 02:18 PM
Hi, we have long YARN mapper tasks that run for 2 to 3 hours, and there are a few thousand of them running in parallel. Every day we see around 5 to 20 of them fail after they have progressed to 60 to 80%. They fail with an error message saying the datanodes are bad and the write is aborting. When these tasks rerun, they succeed, taking an additional 2 to 3 hours.

Hadoop version: Hadoop 2.6.0-cdh5.5.1
Number of datanodes: ~800

A few of the values we tried increasing, without any benefit:
1. increased open files
2. increased dfs.datanode.handler.count
3. increased dfs.datanode.max.xcievers
4. increased dfs.datanode.max.transfer.threads

What could cause this? The source server fails to connect to itself and to the other 2 replica servers for 3 retries, which suggests that something on the source itself might be hitting some ceiling. Any thoughts will help.

Error:

    java.io.IOException: java.io.IOException: All datanodes DatanodeInfoWithStorage[73.06.205.146:50010,DS-e037c4c3-571a-4cc3-ae3e-85d08790e188,DISK] are bad. Aborting...
        at com.turn.platform.cheetah.storage.dmp.analytical_profile.merge.IncrementalProfileMergerMapper.close(IncrementalProfileMergerMapper.java:1185)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
    Caused by: java.io.IOException: All datanodes DatanodeInfoWithStorage[50.116.205.146:50010,DS-e037c4c3-571a-4cc3-ae3e-85d08790e188,DISK] are bad. Aborting...
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1328)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1119)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:622)

Thanks Kumar
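For reference, the datanode properties in the list above live in hdfs-site.xml, and there are related socket timeouts that often come up when digging into "All datanodes ... are bad. Aborting" pipelines. The snippet below is only a rough sketch with illustrative values, not our actual settings.

```xml
<!-- hdfs-site.xml - sketch only; values are illustrative, not this cluster's settings -->

<!-- Datanode side: the thread limits we already tried raising -->
<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>8192</value>   <!-- current name for the deprecated dfs.datanode.max.xcievers; default 4096 -->
</property>
<property>
  <name>dfs.datanode.handler.count</name>
  <value>32</value>   <!-- datanode RPC handler threads -->
</property>

<!-- Client/datanode socket timeouts on the write path -->
<property>
  <name>dfs.client.socket-timeout</name>
  <value>120000</value>   <!-- ms; default 60000 -->
</property>
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>960000</value>   <!-- ms; default 480000 -->
</property>
```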
Labels:
- Apache YARN
- HDFS
06-22-2016 06:18 PM
2 Kudos
This was resolved by temporarily increasing the heap size for Cloudera Manager.