
Missing activated parcel? Starting cluster stalls due to DataNode role start failure.

Contributor

Our CDH cluster on the cloud sometimes fails to start.

Version of CM:  Cloudera Express 5.13.0
Version of CDH: 5.13.0-1.cdh5.13.0.p0.29

Contrary to the CM Start Command message "All services successfully started", the cluster does not actually start.


Inspecting the logs, all the DataNodes failed to start their roles, showing the strange message below.


"This role requires the following additional parcels to be activated before it can start: [cdh]."



(on CM Start Command view)

> Execute command Start on service HDFS
   Successfully started HDFS service

   > Start HDFS service
       Successfully started service.

   > Starting 14 roles on service
       Successfully started service, but only 10/14 roles started.

       > Execute command Start this DataNode on role DataNode (dn-x)
           Failed to start role.

           > Start a role
               This role requires the following additional parcels to be activated before it can start: [cdh].


       (the same for other DataNodes)



There seems to be no log output on the DataNodes, so I suspect a CM issue.
Parcels are distributed and activated properly, including the CDH parcel.
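
For reference, the parcel stages can be checked through the CM API; here is a minimal sketch using the cm_api Python client (the host, credentials, and cluster name below are placeholders):

# Minimal sketch: list parcel stages via the Cloudera Manager API.
# The host, credentials, and cluster name here are placeholders.
from cm_api.api_client import ApiResource

api = ApiResource('cm-host.example.com', username='admin', password='admin')
cluster = api.get_cluster('cluster')

for parcel in cluster.get_all_parcels():
    # The CDH parcel in use should report the stage ACTIVATED.
    print('%s %s: %s' % (parcel.product, parcel.version, parcel.stage))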


Our current workaround is to restart the cluster; on the second try, it starts fine.

I would like to find a fundamental solution to this issue.

I would appreciate any helpful information. Thank you.


7 REPLIES

Contributor

I noticed our DataNodes have too many blocks due to the many small files created by (possibly) misconfigured Spark jobs.
CM gives a notice like this one:


"Concerning : The DataNode has 989,835 blocks. Warning threshold: 500,000 block(s). "


On cluster startup, the DataNodes spend time scanning these blocks before reporting to the NameNode.

Now the cluster startup time is not tolerable for our daily development cycle (about 10 minutes before HDFS gets ready after the cluster services finish starting).
Though I'm not sure whether this is related to the missing-parcel issue, I'm going to resolve this warning first by asking users to remove unnecessary files.
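
To find the worst offenders, something like this rough sketch can flag directories whose files are tiny on average (the paths and the 10 MB threshold are arbitrary placeholders):

# Rough sketch: flag HDFS directories whose files are, on average, tiny.
# The paths and the 10 MB threshold are arbitrary placeholders.
import subprocess

def hdfs_count(path):
    # "hdfs dfs -count" prints: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
    out = subprocess.check_output(['hdfs', 'dfs', '-count', path],
                                  universal_newlines=True)
    dirs, files, size, _ = out.split(None, 3)
    return int(files), int(size)

for path in ['/user', '/tmp', '/data']:  # hypothetical top-level directories
    files, size = hdfs_count(path)
    if files and size / files < 10 * 1024 * 1024:
        print('%s: %d files, avg %.1f KB' % (path, files, size / files / 1024.0))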

Champion

For the DataNode block count threshold, try running the balancer and see if that fixes your problem.
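
For example, kicking off the balancer from a script can be as simple as this sketch (the 10 percent threshold is just an example value):

# Sketch: trigger an HDFS rebalance and wait for it to finish.
# The 10 percent threshold is only an example value.
import subprocess

subprocess.check_call(['hdfs', 'balancer', '-threshold', '10'])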

Contributor

Hi csguna, thanks for your reply.

Yes, the mantra is ringing in my head.

I just removed the problematic directory and finished the rebalance, and the block count issue is resolved now.
Cluster startup time has returned to normal.

Now I'm going to see how it goes.

Contributor

Ummm...

The initial issue continues to happen occasionally.

The cluster start command status looks like this:

(attached screenshot: cdh-dn-startup-failure.png)

It's inconvenient to restart manually, so I'm going to automate the detection and recovery process :-)

Contributor (accepted solution)

As a quick solution, I added auto-recovery code to our tool: after cluster startup completes, it checks the DataNode role status and runs a start command for each stopped DataNode role.

It goes like this:

Welcome to
     _      __
    | | /| / /______ ____  ___
    | |/ |/ / __/ _ `/ _ \(_-<
    |__/|__/_/  \_,_/ .__/___/   version 1.0.2
                   /_/

starting hdfs-DATANODE-ac7041aa53e984590b7d2e27a66ae6ed
starting hdfs-DATANODE-c3a2e16a1c264acbf2b3a8cd036c8abd
starting hdfs-DATANODE-59269ad5f41a6f45c24d9971f1e45660
starting hdfs-DATANODE-47a0b595206a0616ff011606dff76d0f
waiting 30 sec.
HDFS health checks [0]
+-----------------------------------+---------------+
|               NAME                |    SUMMARY    |
+-----------------------------------+---------------+
|HDFS_BLOCKS_WITH_CORRUPT_REPLICAS  |     GOOD      |
|HDFS_CANARY_HEALTH                 |      BAD      |
|HDFS_DATA_NODES_HEALTHY            |  CONCERNING   |
|HDFS_FAILOVER_CONTROLLERS_HEALTHY  |     GOOD      |
|HDFS_FREE_SPACE_REMAINING          |     GOOD      |
|HDFS_HA_NAMENODE_HEALTH            |     GOOD      |
|HDFS_MISSING_BLOCKS                |     GOOD      |
|HDFS_UNDER_REPLICATED_BLOCKS       |     GOOD      |
+-----------------------------------+---------------+
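
The recovery step itself is conceptually just the following sketch (written against the cm_api Python client; the host, credentials, and cluster/service names are placeholders, not the actual tool code):

# Sketch of the auto-recovery pass: after cluster startup, start any
# DataNode role that CM still reports as STOPPED, then wait briefly.
# The host, credentials, and cluster/service names are placeholders.
import time
from cm_api.api_client import ApiResource

api = ApiResource('cm-host.example.com', username='admin', password='admin')
hdfs = api.get_cluster('cluster').get_service('hdfs')

stopped = [r for r in hdfs.get_all_roles()
           if r.type == 'DATANODE' and r.roleState == 'STOPPED']

for role in stopped:
    print('starting %s' % role.name)

if stopped:
    for cmd in hdfs.start_roles(*[r.name for r in stopped]):
        cmd.wait()      # block until each start command finishes
    print('waiting 30 sec.')
    time.sleep(30)      # give the HDFS health checks time to settle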

Contributor

Here is some additional information about the cluster in question.

- CDH version: 5.13.0-1.cdh5.13.0.p0.29
- Cloudera Manager version: Cloudera Express 5.13.0
- Java Version: 1.8.0_151
- NameNode HA
- DataNode * 4
- deployed services: HBase, HDFS, Hive, Hue, Oozie, Spark, Spark2, Sqoop2, YARN, ZooKeeper

This cluster is for development purposes. We deploy it on the cloud (GCP VM instances) and have automated its start/stop process. Usually, the cluster is started on demand via a transparent shell command, several times a day depending on workload.
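
Conceptually, the automated start just issues the CM start command and waits for it to finish, roughly like this sketch (placeholder host, credentials, and cluster name; the real tooling differs):

# Rough sketch of the on-demand cluster start (placeholder names).
from cm_api.api_client import ApiResource

api = ApiResource('cm-host.example.com', username='admin', password='admin')
cluster = api.get_cluster('cluster')

cmd = cluster.start()   # the CM "Start" command for the whole cluster
cmd = cmd.wait()        # block until CM reports the command has finished
print('cluster start success: %s' % cmd.success)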

This issue is rare, but we have observed it 3 times in the last two weeks, for the first time since the launch of the cluster last February.
We have observed a similar phenomenon with ZooKeeper service startup, which is also very rare.

Champion

Having too many small files in a Hadoop cluster goes against its mantra: a few large files work best in a Hadoop cluster.

The link below explains why too many small files are not good for a Hadoop cluster.

https://blog.cloudera.com/blog/2009/02/the-small-files-problem/

 

Just curious what type of small files those are. If they are in Parquet format, there is code on GitHub that can merge those files and keep them in the cluster, sized according to your data block size.
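
For example, a compaction job for Parquet data can be as simple as this rough PySpark sketch (not the GitHub code mentioned above; the paths and target file count are placeholders):

# Rough PySpark sketch: compact a directory of small Parquet files into
# fewer, larger ones. The paths and the target file count are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('compact-small-files').getOrCreate()

df = spark.read.parquet('/data/events/2018-03-01')        # many small files
df.coalesce(8).write.mode('overwrite') \
    .parquet('/data/events_compacted/2018-03-01')         # a few larger files

spark.stop()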