
HDFS Web Server Health goes Concerning/Bad during MapReduce Job


Hi,

 

I have deployed a CDH 5.3 cluster using Sahara on an OpenStack cloud. It has 1 Cloudera Manager instance, 1 NameNode instance (YARN ResourceManager, Oozie Server, HDFS NameNode, HDFS SecondaryNameNode) and 3 DataNode instances (YARN NodeManager, HDFS DataNode).

 

When I run a TeraGen job, the HDFS web server health status on the DataNodes intermittently goes to Concerning or Bad, the HDFS canary test fails, and at the same time the job throws the following error:

 

16/04/14 08:34:13 INFO mapreduce.Job: Task Id : attempt_1460553507867_0016_m_000069_0, Status : FAILED
Error: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[192.168.201.11:50010], original=[192.168.201.11:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:981)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1047)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1194)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:945)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:496)

After some time the problematic DataNode(s) recover on their own, the job resumes, and it eventually finishes successfully, but the problem introduces delays and degrades performance.
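
The error message points at 'dfs.client.block.write.replace-datanode-on-failure.policy'. Since my cluster has exactly 3 DataNodes and dfs.replication = 3, there is no spare node for the client to swap into the write pipeline when one fails. My understanding is that the policy can be overridden per client, e.g. on the TeraGen command line; here is a sketch of what I mean (the examples jar path assumes a parcel-based CDH 5 install, and the row count and output path are just placeholders):

hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar teragen \
    -Ddfs.client.block.write.replace-datanode-on-failure.policy=NEVER \
    10000000 /user/hdfs/teragen-out

As far as I know, NEVER makes the client keep writing to the surviving nodes in the pipeline instead of failing the task, at the cost of some blocks temporarily having fewer than 3 replicas until the NameNode re-replicates them. I am not sure whether this is a proper fix or just a workaround, which is partly why I am asking here.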

Relevant configuration on the cluster:

dfs.replication = 3
dfs.blocksize = 512 MB
yarn.nodemanager.resource.memory-mb = 96 GB
yarn.nodemanager.resource.cpu-vcores = 30
yarn.scheduler.maximum-allocation-mb = 96 GB
yarn.scheduler.maximum-allocation-vcores = 30
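
For diagnosis, these are the checks I can run while the job is going (192.168.201.11 is one of my DataNodes from the log above, and 50075 is the default DataNode web UI port in CDH 5):

hdfs dfsadmin -report            # live/dead DataNodes, capacity, last contact time
hdfs fsck / -blocks -locations   # block placement and any under-replicated blocks
curl -s http://192.168.201.11:50075/jmx | head   # probe the DataNode web server directly (the component whose health goes Concerning/Bad)

The dfsadmin report shows all 3 DataNodes as live between the failures, so the outages really do seem intermittent rather than a node being down.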

 

Any ideas or pointers will be appreciated.

 

Thanks,