02-13-2018 11:32 PM
Our CDH cluster on cloud sometimes fails to start.
Version of CM: Cloudera Express 5.13.0
Version of CDH: 5.13.0-1.cdh5.13.0.p0.29
Although the CM Start Command reports "All services successfully started", the cluster does not actually start.
Inspecting the logs, all the DataNodes failed to start their roles, showing the weird message below.
"This role requires the following additional parcels to be activated before it can start: [cdh]."
(on CM Start Command view)
> Execute command Start on service HDFS
Successfully started HDFS service
> Start HDFS service
Successfully started service.
> Starting 14 roles on service
Successfully started service, but only 10/14 roles started.
> Execute command Start this DataNode on role DataNode (dn-x)
Failed to start role.
> Start a role
This role requires the following additional parcels to be activated before it can start: [cdh].
(the same for other DataNodes)
There seems to be no log output on the DataNodes, so I suspect a CM issue.
Parcels are distributed and activated properly, including the cdh parcel.
The current workaround is to restart the cluster.
On the second try, it starts fine.
I want to find a fundamental solution to this issue.
I would appreciate any helpful information. Thank you.
02-18-2018 11:03 PM
I noticed our DataNodes have too many blocks because of many small files created by (maybe) misconfigured Spark jobs.
CM shows a warning like this one.
"Concerning : The DataNode has 989,835 blocks. Warning threshold: 500,000 block(s). "
On cluster startup, the DataNodes spend time checking these blocks before reporting to the NameNode.
The cluster startup time is now not tolerable for our daily development cycle (about 10 minutes before HDFS gets ready after cluster services startup completes).
Though I'm not sure whether this is related to the missing parcel issue, I'm going to resolve this warning first by asking users to remove unnecessary files.
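To find which directories hold the small files, something like the following might help. It is only a sketch: it assumes the usual eight-column output of `hdfs dfs -ls -R` (permissions, replication, owner, group, size, date, time, path), and the 1 MiB threshold and the function names are my own choices, not anything CM or HDFS defines.

```python
import posixpath
import subprocess
from collections import Counter

# Assumed threshold: files below 1 MiB count as "small" (illustrative only).
SMALL_FILE_BYTES = 1024 * 1024


def small_files_by_dir(ls_lines, threshold=SMALL_FILE_BYTES):
    """Count files smaller than `threshold` bytes per parent directory.

    `ls_lines` are lines in the format printed by `hdfs dfs -ls -R`.
    """
    counts = Counter()
    for line in ls_lines:
        # maxsplit=7 keeps paths that contain spaces in one piece.
        fields = line.split(None, 7)
        if len(fields) < 8 or fields[0].startswith("d"):
            continue  # skip directories and non-listing lines
        try:
            size = int(fields[4])
        except ValueError:
            continue  # not a file entry in the expected format
        if size < threshold:
            counts[posixpath.dirname(fields[7])] += 1
    return counts


def scan(root="/"):
    """Run `hdfs dfs -ls -R` on `root` and return small-file counts."""
    out = subprocess.run(
        ["hdfs", "dfs", "-ls", "-R", root],
        capture_output=True, text=True, check=True,
    ).stdout
    return small_files_by_dir(out.splitlines())
```

Calling `scan("/user").most_common(20)` on a node with the HDFS client configured would list the 20 directories with the most small files, which should make it easier to tell users what to clean up.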
02-19-2018 12:56 AM - edited 02-19-2018 01:00 AM
Here is some additional information about the affected cluster.
- CDH version: 5.13.0-1.cdh5.13.0.p0.29
- Cloudera Manager version: Cloudera Express 5.13.0
- Java Version: 1.8.0_151
- NameNode HA
- DataNode * 4
- deployed services: HBase, HDFS, Hive, Hue, Oozie, Spark, Spark2, Sqoop2, YARN, ZooKeeper
This cluster is for development purposes. We deploy the cluster on cloud (GCP VM instances) and have automated the start/stop process of the cluster. Usually, the cluster is started on demand via a transparent shell command, several times a day depending on workloads.
This issue is rare, but we have observed it 3 times in the last two weeks, for the first time since the launch of the cluster last February.
We have observed a similar phenomenon with ZooKeeper service startup, which is also very rare.
02-19-2018 11:07 PM
Having too many small files in a Hadoop cluster is against its mantra:
a few large files work best in a Hadoop cluster.
I will provide the below link that explains why too many small files are not good for a Hadoop cluster.
Just curious what type of small files those are. If they are in Parquet format, there is code on GitHub that can merge those files and keep them in the cluster based on your data block size.
02-19-2018 11:41 PM
Hi, csguna. Thanks for your reply.
Yes, the mantra is ringing in my head.
I just removed the offending directory and finished the rebalance, and the block count issue is resolved now.
Cluster startup time has returned to normal.
Now I'm going to see how it goes.
02-21-2018 01:54 AM
The initial issue continues to happen occasionally.
Cluster start command status is like this.
It's so inconvenient to restart manually that I'm going to automate the detection and recovery process :-)
02-21-2018 10:22 PM
As a quick solution, I added auto-recovery code to our tool, i.e. it checks the DataNode role status after cluster startup completes and runs the start command for each stopped DataNode role.
It goes like this:
Welcome to
 _      __
| | /| / /______ ____  ___
| |/ |/ / __/ _ `/ _ \(_-<
|__/|__/_/  \_,_/ .__/___/   version 1.0.2
               /_/

starting hdfs-DATANODE-ac7041aa53e984590b7d2e27a66ae6ed
starting hdfs-DATANODE-c3a2e16a1c264acbf2b3a8cd036c8abd
starting hdfs-DATANODE-59269ad5f41a6f45c24d9971f1e45660
starting hdfs-DATANODE-47a0b595206a0616ff011606dff76d0f
waiting 30 sec.

HDFS health checks
+-----------------------------------+---------------+
| NAME                              | SUMMARY       |
+-----------------------------------+---------------+
|HDFS_BLOCKS_WITH_CORRUPT_REPLICAS  | GOOD          |
|HDFS_CANARY_HEALTH                 | BAD           |
|HDFS_DATA_NODES_HEALTHY            | CONCERNING    |
|HDFS_FAILOVER_CONTROLLERS_HEALTHY  | GOOD          |
|HDFS_FREE_SPACE_REMAINING          | GOOD          |
|HDFS_HA_NAMENODE_HEALTH            | GOOD          |
|HDFS_MISSING_BLOCKS                | GOOD          |
|HDFS_UNDER_REPLICATED_BLOCKS       | GOOD          |
+-----------------------------------+---------------+
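For anyone who wants to do the same, the recovery step can be sketched roughly as below. This is not the actual code from our tool, just a minimal illustration against the Cloudera Manager REST API. The host name, cluster name, service name, and the `v18` API version are assumptions you would need to adjust; the `roles` and `roleCommands/start` endpoints and the `roleState` field are the standard CM API ones.

```python
import base64
import json
import urllib.request

# Assumptions for illustration: adjust to your own deployment.
API = "http://cm-host:7180/api/v18"   # hypothetical CM host; API version assumed
CLUSTER = "cluster"                   # hypothetical cluster name
SERVICE = "hdfs"                      # hypothetical HDFS service name


def stopped_datanodes(roles):
    """Return the names of DataNode roles whose state is STOPPED."""
    return [
        r["name"]
        for r in roles
        if r.get("type") == "DATANODE" and r.get("roleState") == "STOPPED"
    ]


def call(method, path, user, password, body=None):
    """Send one JSON request to the CM API using HTTP basic auth."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(API + path, data=data, method=method)
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    req.add_header("Content-Type", "application/json")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def recover(user, password):
    """Start every stopped DataNode role and return the role names started."""
    base = f"/clusters/{CLUSTER}/services/{SERVICE}"
    roles = call("GET", base + "/roles", user, password)["items"]
    names = stopped_datanodes(roles)
    if names:
        # roleCommands/start takes a list of role names under "items".
        call("POST", base + "/roleCommands/start", user, password,
             {"items": names})
    return names
```

Calling `recover(user, password)` once the cluster start command has finished would issue a start for each DataNode role that is still stopped. Waiting a while before checking the HDFS health again, as in the output above, gives the DataNodes time to report to the NameNode.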