Member since: 05-22-2017
Posts: 15
Kudos Received: 6
Solution: 1

My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 5299 | 02-21-2018 10:22 PM |
02-21-2018 10:22 PM
As a quick solution, I added auto-recovery code to our tool, i.e. it checks the DataNode role status after cluster startup completes and runs a start command for each stopped DataNode role. It goes like this:
Welcome to
_ __
| | /| / /______ ____ ___
| |/ |/ / __/ _ `/ _ \(_-<
|__/|__/_/ \_,_/ .__/___/ version 1.0.2
/_/
starting hdfs-DATANODE-ac7041aa53e984590b7d2e27a66ae6ed
starting hdfs-DATANODE-c3a2e16a1c264acbf2b3a8cd036c8abd
starting hdfs-DATANODE-59269ad5f41a6f45c24d9971f1e45660
starting hdfs-DATANODE-47a0b595206a0616ff011606dff76d0f
waiting 30 sec.
HDFS health checks [0]
+-----------------------------------+---------------+
| NAME | SUMMARY |
+-----------------------------------+---------------+
|HDFS_BLOCKS_WITH_CORRUPT_REPLICAS | GOOD |
|HDFS_CANARY_HEALTH | BAD |
|HDFS_DATA_NODES_HEALTHY | CONCERNING |
|HDFS_FAILOVER_CONTROLLERS_HEALTHY | GOOD |
|HDFS_FREE_SPACE_REMAINING | GOOD |
|HDFS_HA_NAMENODE_HEALTH | GOOD |
|HDFS_MISSING_BLOCKS | GOOD |
|HDFS_UNDER_REPLICATED_BLOCKS | GOOD |
+-----------------------------------+---------------+
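For reference, here is a minimal sketch of the recovery step described above, assuming the Cloudera Manager REST API (v19) with placeholder host, credentials, cluster name "cluster", and HDFS service name "hdfs"; it lists the HDFS roles and issues one bulk start command for any stopped DataNode roles. This is an illustration of the idea, not our actual tool.

```python
# Hedged sketch: restart stopped DataNode roles via the Cloudera Manager REST API.
# CM host, credentials, cluster name, and service name below are placeholders.
import requests

CM_API = "http://cloudera-manager-server:7180/api/v19"
AUTH = ("admin", "admin")
ROLES_URL = f"{CM_API}/clusters/cluster/services/hdfs/roles"

def restart_stopped_datanodes():
    roles = requests.get(ROLES_URL, auth=AUTH).json()["items"]
    stopped = [r["name"] for r in roles
               if r["type"] == "DATANODE" and r["roleState"] == "STOPPED"]
    for name in stopped:
        print(f"starting {name}")
    if stopped:
        # One bulk start command for all stopped DataNode roles.
        requests.post(f"{CM_API}/clusters/cluster/services/hdfs/roleCommands/start",
                      auth=AUTH, json={"items": stopped})
    return stopped

if __name__ == "__main__":
    restart_stopped_datanodes()
```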
02-21-2018 01:54 AM
Ummm... The initial issue still happens occasionally. The cluster start command status looks like this. It's inconvenient to restart manually, so I'm going to automate the detection and recovery process :-)
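A sketch of the detection side, again assuming the CM REST API (placeholder host, credentials, and names): it polls the HDFS service's health checks (the same ones printed in the table above) and waits until HDFS_DATA_NODES_HEALTHY reports GOOD.

```python
# Hedged sketch: poll HDFS health checks after a cluster start and report problems.
# Assumes CM API v19, cluster "cluster", service "hdfs"; names are placeholders.
import time
import requests

CM_API = "http://cloudera-manager-server:7180/api/v19"
AUTH = ("admin", "admin")

def hdfs_health_checks():
    svc = requests.get(f"{CM_API}/clusters/cluster/services/hdfs", auth=AUTH).json()
    return {c["name"]: c["summary"] for c in svc.get("healthChecks", [])}

def wait_for_healthy_datanodes(retries=10, interval=30):
    for _ in range(retries):
        checks = hdfs_health_checks()
        if checks.get("HDFS_DATA_NODES_HEALTHY") == "GOOD":
            return True
        print("HDFS_DATA_NODES_HEALTHY =", checks.get("HDFS_DATA_NODES_HEALTHY"))
        time.sleep(interval)  # e.g. "waiting 30 sec." as in the tool output above
    return False
```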
02-19-2018 11:41 PM
Hi csguna, thanks for your reply. Yes, that mantra is ringing in my head. I just removed the offending directory and finished rebalancing, and the block count issue is resolved now. Cluster startup time has returned to normal. Now I'm going to see how it goes.
02-19-2018 12:56 AM
This is additional information about the cluster in question.
- CDH version: 5.13.0-1.cdh5.13.0.p0.29
- Cloudera Manager version: Cloudera Express 5.13.0
- Java version: 1.8.0_151
- NameNode HA
- DataNode x 4
- Deployed services: HBase, HDFS, Hive, Hue, Oozie, Spark, Spark2, Sqoop2, YARN, ZooKeeper
This cluster is for development purposes. We deploy the cluster on cloud (GCP VM instances) and have automated its start/stop process. Usually the cluster is started on demand via a transparent shell command, several times a day depending on workloads. This issue is rare, but we have observed it 3 times in the last two weeks, the first occurrences since the launch of the cluster last February. We have also observed a similar phenomenon with ZooKeeper service startup, which is very rare as well.
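For context, a rough sketch of how the on-demand start could be scripted against the CM REST API (placeholder host, credentials, and cluster name; not our actual shell command): issue the cluster start command and poll it until it finishes.

```python
# Hedged sketch: start the cluster via the CM REST API and wait for the command to finish.
# CM host, credentials, and cluster name are placeholders.
import time
import requests

CM_API = "http://cloudera-manager-server:7180/api/v19"
AUTH = ("admin", "admin")

def start_cluster(cluster="cluster", poll_sec=15):
    cmd = requests.post(f"{CM_API}/clusters/{cluster}/commands/start", auth=AUTH).json()
    while cmd.get("active"):
        time.sleep(poll_sec)
        cmd = requests.get(f"{CM_API}/commands/{cmd['id']}", auth=AUTH).json()
    # Note: "success" here only reflects the top-level command, which is exactly
    # why we also check DataNode role states afterwards.
    return cmd.get("success"), cmd.get("resultMessage")
```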
02-18-2018 11:03 PM
I noticed our DataNodes have too many blocks due to the many small files created by (maybe) misconfigured Spark jobs. CM gives a notice like this one: "Concerning : The DataNode has 989,835 blocks. Warning threshold: 500,000 block(s)." On cluster startup, the DataNodes spend time checking these blocks before reporting to the NameNode. The cluster startup time is now not tolerable for our daily development cycle (about 10 minutes before HDFS gets ready after cluster services startup completes). Though I'm not confident this is related to the missing-parcel issue, I'm going to resolve this warning first by asking users to remove unnecessary files.
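To find where the small files come from, a quick hedged helper (the path glob is a placeholder) that shells out to hdfs dfs -count and ranks first-level directories by file count; directories with huge file counts are the cleanup candidates.

```python
# Hedged sketch: rank HDFS directories by file count to spot small-file hotspots.
# "hdfs dfs -count" prints: DIR_COUNT  FILE_COUNT  CONTENT_SIZE  PATHNAME
import subprocess

def count_files(path_glob="/user/*"):  # placeholder path glob
    out = subprocess.run(["hdfs", "dfs", "-count", path_glob],
                         capture_output=True, text=True, check=True).stdout
    rows = []
    for line in out.splitlines():
        dirs, files, size, name = line.split(None, 3)
        rows.append((int(files), name))
    return sorted(rows, reverse=True)

if __name__ == "__main__":
    for files, name in count_files()[:20]:
        print(f"{files:>12}  {name}")
```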
02-13-2018 11:32 PM
Our CDH cluster on cloud sometimes fails to start.
Version of CM: Cloudera Express 5.13.0
Version of CDH: 5.13.0-1.cdh5.13.0.p0.29
Contrary to the CM Start Command message "All services successfully started", the cluster does not actually start. Inspecting the logs, all the DataNodes failed to start their role, showing the weird message below:
"This role requires the following additional parcels to be activated before it can start: [cdh]."
(on the CM Start Command view)
> Execute command Start on service HDFS
Successfully started HDFS service
> Start HDFS service
Successfully started service.
> Starting 14 roles on service
Successfully started service, but only 10/14 roles started.
> Execute command Start this DataNode on role DataNode (dn-x)
Failed to start role.
> Start a role
This role requires the following additional parcels to be activated before it can start: [cdh].
(the same for the other DataNodes)
There seems to be no log output on the DataNodes, and I suspect a CM issue. Parcels are distributed and activated properly, including the cdh parcel. The current workaround is to restart the cluster; on the second try, it goes well. I want to find a fundamental solution to this issue and would appreciate any helpful information. Thank you.
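Since the error claims the cdh parcel is not activated even though it is, one way to double-check from a script is to query the parcels endpoint of the CM REST API; a hedged sketch (placeholder host, credentials, and cluster name):

```python
# Hedged sketch: list parcel stages for the cluster via the CM REST API.
# The expected stage for the CDH parcel in use is ACTIVATED.
import requests

CM_API = "http://cloudera-manager-server:7180/api/v19"
AUTH = ("admin", "admin")

def list_parcels(cluster="cluster"):
    parcels = requests.get(f"{CM_API}/clusters/{cluster}/parcels",
                           auth=AUTH).json()["items"]
    for p in parcels:
        print(f"{p['product']:<12} {p['version']:<30} {p['stage']}")

if __name__ == "__main__":
    list_parcels()
```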
Labels:
- Cloudera Manager
- HDFS
05-22-2017 08:59 PM
3 Kudos
Hi, I encountered the same issue and resolved it. (CDH 5.9.1)
Cloudera Management Service > Configuration > Search "Descriptor"
Set "Descriptor Fetch Max Tries" to a larger value - 60 (default: 5). I left "Descriptor Fetch Tries Interval" at the default - 2 seconds.
Result (Host Monitor log):
2017-05-23 12:09:56,804 WARN com.cloudera.cmon.firehose.Main: No descriptor fetched from http://cloudera-manager-server:7180 on after 27 tries, sleeping...
2017-05-23 12:09:58,805 WARN com.cloudera.cmon.firehose.Main: No descriptor fetched from http://cloudera-manager-server:7180 on after 28 tries, sleeping...
2017-05-23 12:10:00,806 WARN com.cloudera.cmon.firehose.Main: No descriptor fetched from http://cloudera-manager-server:7180 on after 29 tries, sleeping...
2017-05-23 12:10:04,029 INFO com.cloudera.cmf.BasicScmProxy: Using encrypted credentials for SCM
2017-05-23 12:10:04,182 INFO com.cloudera.cmf.BasicScmProxy: Authenticated to SCM.
2017-05-23 12:10:07,595 INFO com.cloudera.cmon.firehose.Main: SCM descriptor fragments fetched successfully
The same goes for the other management services.
Nob.
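For illustration only (this is not the actual Host Monitor code), the effect of the setting boils down to a bounded retry loop: with the default of 5 tries at a 2-second interval the fetch gives up after roughly 10 seconds, while 60 tries allows up to about 2 minutes for the CM server to come up, which matches the log above.

```python
# Illustrative only: a bounded fetch-retry loop like the one the Host Monitor log suggests.
# With max_tries=5, interval_sec=2 the wait is ~10s; with max_tries=60 it is ~120s.
import time
import requests

def fetch_descriptor(url, max_tries=60, interval_sec=2):
    for attempt in range(1, max_tries + 1):
        try:
            resp = requests.get(url, timeout=5)
            if resp.ok:
                return resp
        except requests.RequestException:
            pass
        print(f"No descriptor fetched from {url} after {attempt} tries, sleeping...")
        time.sleep(interval_sec)
    raise RuntimeError(f"descriptor not available after {max_tries} tries")
```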