
Problem with starting CDH cluster on AWS using Cloudera Manager

Explorer

Hi,

 

I'm running a POC on AWS using CDH 5.7.2. I have created and configured a simple environment using Cloudera Director as follows:

1 x Cloudera Manager

1 x Master

3 x Workers

1 x Gateway

All 6 instances are of the m3.xlarge instance type. The installation is smooth and straightforward using Cloudera Director. After running my jobs for the POC, I stop the cluster from Cloudera Manager and then stop the instances from the EC2 dashboard.
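For reference, the stop sequence I use, scripted end to end, looks roughly like the sketch below. It is only an illustration: the CM host, cluster name, credentials, region, and instance IDs are placeholders, and the REST API version prefix may differ for your CM release.

    # Sketch: stop all cluster services through the Cloudera Manager REST
    # API, then stop the EC2 instances with boto3. Every identifier below
    # (host, cluster name, credentials, region, instance IDs) is a placeholder.
    import requests
    import boto3

    CM = "http://cm-host:7180/api/v13"   # CM server; API version may differ
    AUTH = ("admin", "admin")            # placeholder credentials

    # Ask CM to stop every service in the cluster; poll the returned
    # command until it finishes before touching the instances.
    r = requests.post(CM + "/clusters/cluster/commands/stop", auth=AUTH)
    r.raise_for_status()
    print("CM stop command id:", r.json()["id"])

    # Only once the CM stop command has completed, stop the instances.
    ec2 = boto3.client("ec2", region_name="us-east-1")
    ec2.stop_instances(InstanceIds=["i-0123456789abcdef0"])  # placeholders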

 

When I restart the instances and the cluster, I always get the following errors, in varying order:

Bad : 659 missing blocks in the cluster. 986 total blocks in the cluster. Percentage missing blocks: 66.84%. Critical threshold: any.

 

Bad : 659 under replicated blocks in the cluster. 986 total blocks in the cluster. Percentage under replicated blocks: 66.84%. Critical threshold: 40.00%.

 

Event Server Down (I have to start it manually)

Exception while getting fetch configDefaults hash: none
java.net.ConnectException: Connection refused
Failed to publish event: SimpleEvent{attributes={STACKTRACE=[java.net.ConnectException: Connection refused
ERROR   com.cloudera.cmf.eventcatcher.server.EventCatcherService   Could not fetch descriptor after 5 tries, exiting.

 

Host Monitor Down (I have to start it manually)

 

I consistently reproduce these errors with every fresh installation I have done:

- At first, all green lights

- After stopping the cluster/instances and restarting, these errors occur

 

Is there anything wrong with the approach I use to stop and start my cluster? I've started googling around the missing-block issue and understand that it may be related to corrupted files. How can I prevent this issue from happening? Any best practices are welcome...
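For what it's worth, this is how I have been inspecting the reported blocks, wrapping the standard hdfs fsck command from Python (run on a cluster host where the HDFS client is configured, typically as the hdfs user):

    # Sketch: check filesystem health and list files with corrupt or
    # missing blocks after a restart. Requires the hdfs CLI on the PATH.
    import subprocess

    # Overall health summary: total blocks, missing and under-replicated.
    subprocess.run(["hdfs", "fsck", "/"], check=True)

    # Just the files whose blocks are reported corrupt or missing.
    subprocess.run(["hdfs", "fsck", "/", "-list-corruptfileblocks"], check=True)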

 

I've realized that I'm spending more than half of my time actually fixing the environment instead of focusing on my POC.

 

Thanks

 

 

1 ACCEPTED SOLUTION

Explorer

Hi,

 

I installed a new cluster from scratch using the m4 instance type and could not reproduce the error.

 

Thanks.


4 REPLIES

Mentor
As you can see at https://aws.amazon.com/ec2/instance-types/ and http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html#instance-store-lifetime, the m3.xlarge uses 2x "instance store" type disks, which are entirely destroyed when you stop an instance. When you bring the instance back, it no longer has any of its previously persisted data, which is not acceptable to many CM and CDH components. Your HDFS blocks would no longer be on disk, so they would be reported as missing too.

You should instead use instances that provide "EBS" storage so the data persists.
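A quick way to verify which case applies to a given instance is to inspect its root device type and block device mappings, for example with a small boto3 sketch like the one below (the instance ID and region are placeholders):

    # Sketch: check whether an instance's storage is EBS-backed.
    # Instance-store volumes are not listed in BlockDeviceMappings and
    # are lost on stop; EBS volumes persist across stop/start.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")   # placeholder region
    resp = ec2.describe_instances(InstanceIds=["i-0123456789abcdef0"])
    inst = resp["Reservations"][0]["Instances"][0]

    print("Root device type:", inst["RootDeviceType"])   # "ebs" persists
    for m in inst["BlockDeviceMappings"]:
        print(m["DeviceName"], "-> EBS volume", m["Ebs"]["VolumeId"])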

For cloud deployments we recommend using Cloudera Director to install, deploy, and run your Cloudera CM and CDH cluster instead of managing it manually, to avoid little problems such as these: https://www.cloudera.com/documentation/director/latest/topics/director_intro.html

You can also check out which instance types are recommended by Cloudera Director for CM and CDH here: https://www.cloudera.com/documentation/director/latest/topics/director_deployment_requirements.html#...

Explorer

Hi,

 

Thanks for your reply. I can definitely access the data after stopping and starting my instances. In my case, my m3.xlarge instances are backed by EBS storage: both my boot and block devices are attached to the same EBS volume. That's also what makes it possible to stop and start the instances.

 

 

Also, as you can read in my initial post, I'm using Cloudera Director and Cloudera Manager for the deployment/management of my CDH cluster.

 

At this stage, I still do not see what's causing the issues I have mentioned above.

 

Regards.

Expert Contributor

Hi,

 

Are you sure that the blocks still exist on the DataNode hosts even after rebooting the instances? By default, the location should be under /dfs/dn{1,2,...}.
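For example, a quick sketch like this, run on each DataNode host, would count the block files under those directories (assuming the default /dfs/dn* layout; adjust the glob if your dfs.data.dir setting differs):

    # Sketch: count HDFS block files under the DataNode data directories.
    # Assumes the CDH default layout /dfs/dn, /dfs/dn1, ... on this host.
    import glob
    import os

    total = 0
    for data_dir in glob.glob("/dfs/dn*"):
        for _root, _dirs, files in os.walk(data_dir):
            total += sum(1 for f in files
                         if f.startswith("blk_") and not f.endswith(".meta"))
    print("block files found:", total)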

