Cannot complete client bootstrap using m2.4xlarge

Contributor

I have been trying to create a cluster using Cloudera Director v1.0.1. I was able to take the aws.reference.conf generated by the CloudFormation "Launch Cloudera EDH" template and build a 48-worker cluster using type: c3.8xlarge, image: ami-18a23f28.

Our performance team absolutely will not allow SSDs in this cluster, so after looking around and speaking with AWS support, I chose the following for my worker nodes:

type: m2.4xlarge
image: ami-f032acc0

I first tried the RHEL 6.4 PVM image ami-b8a63b88, but it didn't work, so I switched to the AWS-suggested image ami-f032acc0, which didn't work either.

By "didn't work" I mean that the host installs succeeded, but the "Preparing instances" and Cloudera Manager (CM) agent deployment steps were stuck for 3 hours before I gave up. With the current-generation c3.8xlarge, the bootstrap completes in an hour and a half at most. I have gone through the build/terminate cycle several times.

So my question, after this long-winded data dump, is: are the older instance types that use magnetic storage supported by Cloudera Director?

I hope my description is clear. Thanks for any help.

Contributor

So I opted to use the Cloudera Director web UI with a smaller problem size. With the older-generation instance type, it just loops forever waiting for completion; the same step took 15 minutes with c3.8xlarge. Here is a snippet of the log file:

[2014-11-12 19:35:38] INFO [pipeline-thread-18] - c.c.l.p.DatabasePipelineRunner: >> HostInstall/3 [CreateClusterContext{environment=Environment{name='Cray-Product Team 48 Node AWS Cluster', provider ...
[2014-11-12 19:35:38] INFO [pipeline-thread-18] - c.c.l.p.DatabasePipelineRunner: << None{}
[2014-11-12 19:35:39] INFO [pipeline-thread-18] - c.c.l.p.DatabasePipelineRunner: >> SetStatusJob/1 [Waiting for Cloudera Manager to deploy agent on 10.0.1.165]
[2014-11-12 19:35:39] INFO [pipeline-thread-18] - com.cloudera.launchpad.pipeline.Job: Waiting for Cloudera Manager to deploy agent on 10.0.1.165
[2014-11-12 19:35:39] INFO [pipeline-thread-18] - c.c.l.p.DatabasePipelineRunner: << None{}
[2014-11-12 19:35:39] INFO [pipeline-thread-18] - c.c.l.p.DatabasePipelineRunner: >> WaitForSuccessOrRetryOnFailure/4 [CreateClusterContext{environment=Environment{name='Cray-Product Team 48 Node AWS Cluster', provider ...
[2014-11-12 19:35:40] INFO [pipeline-thread-18] - c.c.l.b.d.UnboundedWaitForApiCommand: Waiting for ApiCommand{id=1083, name=GlobalHostInstall, startTime=Wed Nov 12 19:35:37 EST 2014, endTime=null, active=true, success=null, resultMessage=null, serviceRef=null, roleRef=null, hostRef=null, parent=null}
[2014-11-12 19:35:40] INFO [pipeline-thread-15] - c.c.l.p.DatabasePipelineRunner: << None{}
[2014-11-12 19:35:40] INFO [pipeline-thread-15] - c.c.l.p.DatabasePipelineRunner: >> HostInstall/3 [CreateClusterContext{environment=Environment{name='Cray-Product Team 48 Node AWS Cluster', provider ...
[2014-11-12 19:35:41] INFO [pipeline-thread-15] - c.c.l.p.DatabasePipelineRunner: << None{}
[2014-11-12 19:35:41] INFO [pipeline-thread-15] - c.c.l.p.DatabasePipelineRunner: >> SetStatusJob/1 [Waiting for Cloudera Manager to deploy agent on 10.0.1.166]
[2014-11-12 19:35:42] INFO [pipeline-thread-15] - com.cloudera.launchpad.pipeline.Job: Waiting for Cloudera Manager to deploy agent on 10.0.1.166
[2014-11-12 19:35:42] INFO [pipeline-thread-15] - c.c.l.p.DatabasePipelineRunner: << None{}
[2014-11-12 19:35:42] INFO [pipeline-thread-15] - c.c.l.p.DatabasePipelineRunner: >> WaitForSuccessOrRetryOnFailure/4 [CreateClusterContext{environment=Environment{name='Cray-Product Team 48 Node AWS Cluster', provider ...
[2014-11-12 19:35:42] INFO [pipeline-thread-15] - c.c.l.b.d.UnboundedWaitForApiCommand: Waiting for ApiCommand{id=1085, name=GlobalHostInstall, startTime=Wed Nov 12 19:35:40 EST 2014, endTime=null, active=true, success=null, resultMessage=null, serviceRef=null, roleRef=null, hostRef=null, parent=null}
[2014-11-12 19:35:43] INFO [pipeline-thread-14] - c.c.l.p.DatabasePipelineRunner: << None{}
[2014-11-12 19:35:43] INFO [pipeline-thread-14] - c.c.l.p.DatabasePipelineRunner: >> HostInstall/3 [CreateClusterContext{environment=Environment{name='Cray-Product Team 48 Node AWS Cluster', provider ...
[2014-11-12 19:35:43] INFO [pipeline-thread-17] - c.c.l.p.DatabasePipelineRunner: << None{}
[2014-11-12 19:35:44] INFO [pipeline-thread-17] - c.c.l.p.DatabasePipelineRunner: >> HostInstall/3 [CreateClusterContext{environment=Environment{name='Cray-Product Team 48 Node AWS Cluster', provider ...
[2014-11-12 19:35:44] INFO [pipeline-thread-14] - c.c.l.p.DatabasePipelineRunner: << None{}
[2014-11-12 19:35:44] INFO [pipeline-thread-14] - c.c.l.p.DatabasePipelineRunner: >> SetStatusJob/1 [Waiting for Cloudera Manager to deploy agent on 10.0.1.167]
[2014-11-12 19:35:44] INFO [pipeline-thread-14] - com.cloudera.launchpad.pipeline.Job: Waiting for Cloudera Manager to deploy agent on 10.0.1.167
[2014-11-12 19:35:44] INFO [pipeline-thread-14] - c.c.l.p.DatabasePipelineRunner: << None{}
[2014-11-12 19:35:44] INFO [pipeline-thread-17] - c.c.l.p.DatabasePipelineRunner: << None{}
[2014-11-12 19:35:44] INFO [pipeline-thread-14] - c.c.l.p.DatabasePipelineRunner: >> WaitForSuccessOrRetryOnFailure/4 [CreateClusterContext{environment=Environment{name='Cray-Product Team 48 Node AWS Cluster', provider ...
[2014-11-12 19:35:44] INFO [pipeline-thread-17] - c.c.l.p.DatabasePipelineRunner: >> SetStatusJob/1 [Waiting for Cloudera Manager to deploy agent on 10.0.1.168]
[2014-11-12 19:35:45] INFO [pipeline-thread-17] - com.cloudera.launchpad.pipeline.Job: Waiting for Cloudera Manager to deploy agent on 10.0.1.168
[2014-11-12 19:35:45] INFO [pipeline-thread-17] - c.c.l.p.DatabasePipelineRunner: << None{}
[2014-11-12 19:35:45] INFO [pipeline-thread-14] - c.c.l.b.d.UnboundedWaitForApiCommand: Waiting for ApiCommand{id=1087, name=GlobalHostInstall, startTime=Wed Nov 12 19:35:43 EST 2014, endTime=null, active=true, success=null, resultMessage=null, serviceRef=null, roleRef=null, hostRef=null, parent=null}
[2014-11-12 19:35:45] INFO [pipeline-thread-17] - c.c.l.p.DatabasePipelineRunner: >> WaitForSuccessOrRetryOnFailure/4 [CreateClusterContext{environment=Environment{name='Cray-Product Team 48 Node AWS Cluster', provider ...
[2014-11-12 19:35:47] INFO [pipeline-thread-17] - c.c.l.b.d.UnboundedWaitForApiCommand: Waiting for ApiCommand{id=1089, name=GlobalHostInstall, startTime=Wed Nov 12 19:35:43 EST 2014, endTime=null, active=true, success=null, resultMessage=null, serviceRef=null, roleRef=null, hostRef=null, parent=null}
[2014-11-12 19:35:52] INFO [pipeline-thread-16] - c.c.l.p.DatabasePipelineRunner: << None{}
[2014-11-12 19:35:53] INFO [pipeline-thread-16] - c.c.l.p.DatabasePipelineRunner: >> HostInstall/3 [CreateClusterContext{environment=Environment{name='Cray-Product Team 48 Node AWS Cluster', provider ...
[2014-11-12 19:35:53] INFO [pipeline-thread-16] - c.c.l.p.DatabasePipelineRunner: << None{}
[2014-11-12 19:35:54] INFO [pipeline-thread-16] - c.c.l.p.DatabasePipelineRunner: >> SetStatusJob/1 [Waiting for Cloudera Manager to deploy agent on 10.0.1.169]
[2014-11-12 19:35:54] INFO [pipeline-thread-16] - com.cloudera.launchpad.pipeline.Job: Waiting for Cloudera Manager to deploy agent on 10.0.1.169
[2014-11-12 19:35:54] INFO [pipeline-thread-16] - c.c.l.p.DatabasePipelineRunner: << None{}
[2014-11-12 19:35:54] INFO [pipeline-thread-16] - c.c.l.p.DatabasePipelineRunner: >> WaitForSuccessOrRetryOnFailure/4 [CreateClusterContext{environment=Environment{name='Cray-Product Team 48 Node AWS Cluster', provider ...
[2014-11-12 19:35:54] INFO [pipeline-thread-16] - c.c.l.b.d.UnboundedWaitForApiCommand: Waiting for ApiCommand{id=1091, name=GlobalHostInstall, startTime=Wed Nov 12 19:35:52 EST 2014, endTime=null, active=true, success=null, resultMessage=null, serviceRef=null, roleRef=null, hostRef=null, parent=null}
[2014-11-12 19:35:55] INFO [pipeline-thread-18] - c.c.l.p.DatabasePipelineRunner: << None{}
[2014-11-12 19:35:55] INFO [pipeline-thread-18] - c.c.l.p.DatabasePipelineRunner: >> HostInstall/3 [CreateClusterContext{environment=Environment{name='Cray-Product Team 48 Node AWS Cluster', provider ...
[2014-11-12 19:35:56] INFO [pipeline-thread-18] - c.c.l.p.DatabasePipelineRunner: << None{}
[2014-11-12 19:35:56] INFO [pipeline-thread-18] - c.c.l.p.DatabasePipelineRunner: >> SetStatusJob/1 [Waiting for Cloudera Manager to deploy agent on 10.0.1.165]
[2014-11-12 19:35:57] INFO [pipeline-thread-18] - com.cloudera.launchpad.pipeline.Job: Waiting for Cloudera Manager to deploy agent on 10.0.1.165
[2014-11-12 19:35:57] INFO [pipeline-thread-18] - c.c.l.p.DatabasePipelineRunner: << None{}
[2014-11-12 19:35:57] INFO [pipeline-thread-18] - c.c.l.p.DatabasePipelineRunner: >> WaitForSuccessOrRetryOnFailure/4 [CreateClusterContext{environment=Environment{name='Cray-Product Team 48 Node AWS Cluster', provider ...
[2014-11-12 19:35:57] INFO [pipeline-thread-18] - c.c.l.b.d.UnboundedWaitForApiCommand: Waiting for ApiCommand{id=1093, name=GlobalHostInstall, startTime=Wed Nov 12 19:35:55 EST 2014, endTime=null, active=true, success=null, resultMessage=null, serviceRef=null, roleRef=null, hostRef=null, parent=null}
[2014-11-12 19:35:58] INFO [pipeline-thread-15] - c.c.l.p.DatabasePipelineRunner: << None{}
[2014-11-12 19:35:58] INFO [pipeline-thread-15] - c.c.l.p.DatabasePipelineRunner: >> HostInstall/3 [CreateClusterContext{environment=Environment{name='Cray-Product Team 48 Node AWS Cluster', provider ...
[2014-11-12 19:35:59] INFO [pipeline-thread-15] - c.c.l.p.DatabasePipelineRunner: << None{}
[2014-11-12 19:35:59] INFO [pipeline-thread-15] - c.c.l.p.DatabasePipelineRunner: >> SetStatusJob/1 [Waiting for Cloudera Manager to deploy agent on 10.0.1.166]
[2014-11-12 19:35:59] INFO [pipeline-thread-15] - com.cloudera.launchpad.pipeline.Job: Waiting for Cloudera Manager to deploy agent on 10.0.1.166
[2014-11-12 19:35:59] INFO [pipeline-thread-15] - c.c.l.p.DatabasePipelineRunner: << None{}
[2014-11-12 19:35:59] INFO [pipeline-thread-15] - c.c.l.p.DatabasePipelineRunner: >> WaitForSuccessOrRetryOnFailure/4 [CreateClusterContext{environment=Environment{name='Cray-Product Team 48 Node AWS Cluster', provider ...
[2014-11-12 19:36:00] INFO [pipeline-thread-15] - c.c.l.b.d.UnboundedWaitForApiCommand: Waiting for ApiCommand{id=1095, name=GlobalHostInstall, startTime=Wed Nov 12 19:35:58 EST 2014, endTime=null, active=true, success=null, resultMessage=null, serviceRef=null, roleRef=null, hostRef=null, parent=null}
[2014-11-12 19:36:00] INFO [pipeline-thread-14] - c.c.l.p.DatabasePipelineRunner: << None{}
[2014-11-12 19:36:00] INFO [pipeline-thread-14] - c.c.l.p.DatabasePipelineRunner: >> HostInstall/3 [CreateClusterContext{environment=Environment{name='Cray-Product Team 48 Node AWS Cluster', provider ...
[2014-11-12 19:36:01] INFO [pipeline-thread-14] - c.c.l.p.DatabasePipelineRunner: << None{}
[2014-11-12 19:36:02] INFO [pipeline-thread-14] - c.c.l.p.DatabasePipelineRunner: >> SetStatusJob/1 [Waiting for Cloudera Manager to deploy agent on 10.0.1.167]
[2014-11-12 19:36:02] INFO [pipeline-thread-14] - com.cloudera.launchpad.pipeline.Job: Waiting for Cloudera Manager to deploy agent on 10.0.1.167
[2014-11-12 19:36:02] INFO [pipeline-thread-14] - c.c.l.p.DatabasePipelineRunner: << None{}
[2014-11-12 19:36:02] INFO [pipeline-thread-14] - c.c.l.p.DatabasePipelineRunner: >> WaitForSuccessOrRetryOnFailure/4 [CreateClusterContext{environment=Environment{name='Cray-Product Team 48 Node AWS Cluster', provider ...
[2014-11-12 19:36:02] INFO [pipeline-thread-17] - c.c.l.p.DatabasePipelineRunner: << None{}
[2014-11-12 19:36:02] INFO [pipeline-thread-14] - c.c.l.b.d.UnboundedWaitForApiCommand: Waiting for ApiCommand{id=1097, name=GlobalHostInstall, startTime=Wed Nov 12 19:36:00 EST 2014, endTime=null, active=true, success=null, resultMessage=null, serviceRef=null, roleRef=null, hostRef=null, parent=null}
[2014-11-12 19:36:03] INFO [pipeline-thread-17] - c.c.l.p.DatabasePipelineRunner: >> HostInstall/3 [CreateClusterContext{environment=Environment{name='Cray-Product Team 48 Node AWS Cluster', provider ...
^C
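
In the snippet above, each host shows up again within seconds with a fresh GlobalHostInstall command id (10.0.1.165 is waited on with command 1083, then again with command 1093), which suggests the install is being restarted rather than progressing. A minimal sketch for summarizing a longer log, assuming only the line formats quoted in this thread (the function and pattern names are illustrative, not part of Director):

```python
import re
from collections import defaultdict

# Patterns copied from the Director log lines quoted in this thread.
WAIT_RE = re.compile(r"Waiting for Cloudera Manager to deploy agent on (\d+(?:\.\d+){3})")
CMD_RE = re.compile(r"ApiCommand\{id=(\d+), name=(\w+),.*?active=(\w+), success=(\w+)")


def summarize(lines):
    """Return (waits per host, list of (id, name, active, success) tuples)."""
    waits = defaultdict(int)
    commands = []
    for line in lines:
        # Count only the pipeline.Job line; the SetStatusJob line repeats the same message.
        if "pipeline.Job:" in line:
            m = WAIT_RE.search(line)
            if m:
                waits[m.group(1)] += 1
        # Every UnboundedWaitForApiCommand line carries the command id and state.
        m = CMD_RE.search(line)
        if m:
            cmd_id, name, active, success = m.groups()
            commands.append((int(cmd_id), name, active == "true", success))
    return dict(waits), commands
```

Pass it an open file handle for the Director log. A host whose wait count keeps climbing while the command ids keep changing is caught in the loop shown above.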

 

The web UI is stuck here:

Status
Performance-Cluster Bootstrapping
2173 / 2184
Installing Cloudera Manager agents on all instances in parallel (20 at a time)

Waiting for Cloudera Manager to deploy agent on 10.0.1.165
Waiting for Cloudera Manager to deploy agent on 10.0.1.168
Waiting for Cloudera Manager to deploy agent on 10.0.1.169
Waiting for Cloudera Manager to deploy agent on 10.0.1.166
Waiting for Cloudera Manager to deploy agent on 10.0.1.167
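
The log shows Director waiting through a class named UnboundedWaitForApiCommand, which, as the name suggests, keeps polling with no time limit. If you script your own monitoring around CM commands, a bounded wait lets a stuck agent install fail after a deadline instead of spinning for hours. A minimal sketch, assuming a hypothetical fetch_command callable that returns a dict with the ApiCommand fields seen in the log (id, name, active, success); clock and sleep are injectable only so the loop can be tested:

```python
import time


def wait_for_command(fetch_command, timeout_s, poll_s=5.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll until the command is no longer active, or raise after timeout_s seconds.

    fetch_command is a hypothetical callable returning a dict shaped like the
    ApiCommand fields in the Director log: id, name, active, success.
    """
    deadline = clock() + timeout_s
    while True:
        cmd = fetch_command()
        if not cmd["active"]:
            return cmd  # finished; the caller decides what cmd["success"] means
        if clock() >= deadline:
            raise TimeoutError(
                f"command {cmd['id']} ({cmd['name']}) still active after {timeout_s}s")
        sleep(poll_s)
```

The design choice is simply to turn "wait forever" into "wait up to N seconds, then report which command was still active".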

Contributor

Going back to the RHEL AMI I originally selected for the m2.4xlarge type (ami-b8a63b88), here is how the install goes:

Cloudera Director 1.0.1 initializing ...
Installing Cloudera Manager ...
* Starting ...... done
* Requesting an instance for Cloudera Manager ................... done
* Running custom bootstrap script on xxxxxxx ...... done
* Inspecting capabilities of xxxxxxx ............... done
* Normalizing xxxxxxx .... done
* Installing ntp (1/2) .... done
* Installing curl (2/2) ..................... done
* Mounting all instance disk drives ........ done
* Resizing instance root partition ..... done
* Rebooting 10.0.1.178 ..... done
* Waiting for 10.0.1.178 to boot ....... done
* Installing repositories for Cloudera Manager ....... done
* Installing jdk (1/4) ..... done
* Installing cloudera-manager-daemons (2/4) .... done
* Installing cloudera-manager-server (3/4) .... done
* Installing cloudera-manager-agent (4/4) ....... done
* Installing cloudera-manager-server-db-2 (1/1) .... done
* Starting embedded PostgreSQL database ....... done
* Starting Cloudera Manager server .... done
* Waiting for Cloudera Manager server to start .... done
* Configuring Cloudera Manager .... done
* Starting Cloudera Management Services ..... done
* Inspecting capabilities of 10.0.1.178 ........ done
* Done ...
Cloudera Manager ready.
Creating cluster Product-Team-AWS ...
* Starting ...... done
* Requesting 9 instance(s) in 4 group(s) ......................... done
* Preparing instances in parallel (20 at a time) ............................................................. done
* Installing Cloudera Manager agents on all instances in parallel (20 at a time) .......... done
* Creating CDH5 cluster using the new instances ... done
* Creating cluster: Product-Team-AWS ... done
* Downloading parcels: CDH-5.2.0-1.cdh5.2.0.p0.36 ... done
* Distributing parcels: CDH-5.2.0-1.cdh5.2.0.p0.36 .... done
* Activating parcels: CDH-5.2.0-1.cdh5.2.0.p0.36 .... done
* Invalid role type(s) specified. Ignored during role creation:
HUE: BEESWAX_SERVER
... done
* Creating Hive Metastore Database ... done
* Waiting on First Run command .... done
* Cloudera Manager 'First Run' command execution failed: Failed to perform First Run of services. ...

Here are the final errors:

[2014-11-12 21:10:44] ERROR [pipeline-thread-1] - c.c.l.p.DatabasePipelineRunner: Attempt to execute job failed
com.cloudera.launchpad.pipeline.UnrecoverablePipelineError: Cloudera Manager 'First Run' command execution failed: Failed to perform First Run of services.
at com.cloudera.launchpad.bootstrap.deployment.UnboundedWaitForApiCommand.run(UnboundedWaitForApiCommand.java:86) ~[launchpad-bootstrap-1.0.1.jar!/:1.0.1]
at com.cloudera.launchpad.bootstrap.deployment.UnboundedWaitForApiCommand.run(UnboundedWaitForApiCommand.java:46) ~[launchpad-bootstrap-1.0.1.jar!/:1.0.1]
at com.cloudera.launchpad.pipeline.job.Job3.runUnchecked(Job3.java:32) ~[launchpad-pipeline-1.0.1.jar!/:1.0.1]
at com.cloudera.launchpad.pipeline.DatabasePipelineRunner$1.call(DatabasePipelineRunner.java:229) ~[launchpad-pipeline-database-1.0.1.jar!/:1.0.1]
at com.github.rholder.retry.AttemptTimeLimiters$NoAttemptTimeLimit.call(AttemptTimeLimiters.java:78) [guava-retrying-1.0.6.jar!/:na]
at com.github.rholder.retry.Retryer.call(Retryer.java:110) [guava-retrying-1.0.6.jar!/:na]
at com.cloudera.launchpad.pipeline.DatabasePipelineRunner.attemptMultipleJobExecutionsWithRetries(DatabasePipelineRunner.java:213) [launchpad-pipeline-database-1.0.1.jar!/:1.0.1]
at com.cloudera.launchpad.pipeline.DatabasePipelineRunner.run(DatabasePipelineRunner.java:132) [launchpad-pipeline-database-1.0.1.jar!/:1.0.1]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_71]
at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_71]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_71]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_71]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_71]
[2014-11-12 21:10:44] ERROR [pipeline-thread-1] - c.c.l.p.DatabasePipelineRunner: Encountered an unrecoverable error
com.cloudera.launchpad.pipeline.UnrecoverablePipelineError: Cloudera Manager 'First Run' command execution failed: Failed to perform First Run of services.
at com.cloudera.launchpad.bootstrap.deployment.UnboundedWaitForApiCommand.run(UnboundedWaitForApiCommand.java:86) ~[launchpad-bootstrap-1.0.1.jar!/:1.0.1]
at com.cloudera.launchpad.bootstrap.deployment.UnboundedWaitForApiCommand.run(UnboundedWaitForApiCommand.java:46) ~[launchpad-bootstrap-1.0.1.jar!/:1.0.1]
at com.cloudera.launchpad.pipeline.job.Job3.runUnchecked(Job3.java:32) ~[launchpad-pipeline-1.0.1.jar!/:1.0.1]
at com.cloudera.launchpad.pipeline.DatabasePipelineRunner$1.call(DatabasePipelineRunner.java:229) ~[launchpad-pipeline-database-1.0.1.jar!/:1.0.1]
at com.github.rholder.retry.AttemptTimeLimiters$NoAttemptTimeLimit.call(AttemptTimeLimiters.java:78) ~[guava-retrying-1.0.6.jar!/:na]
at com.github.rholder.retry.Retryer.call(Retryer.java:110) ~[guava-retrying-1.0.6.jar!/:na]
at com.cloudera.launchpad.pipeline.DatabasePipelineRunner.attemptMultipleJobExecutionsWithRetries(DatabasePipelineRunner.java:213) ~[launchpad-pipeline-database-1.0.1.jar!/:1.0.1]
at com.cloudera.launchpad.pipeline.DatabasePipelineRunner.run(DatabasePipelineRunner.java:132) ~[launchpad-pipeline-database-1.0.1.jar!/:1.0.1]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_71]
at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_71]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_71]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_71]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_71]
[2014-11-12 21:10:44] INFO [pipeline-thread-1] - c.c.l.p.s.PipelineRepositoryService: Pipeline '58692a48-c684-4b4e-8d24-1c0103501d67': RUNNING -> ERROR
[2014-11-12 21:10:45] INFO [pipeline-thread-1] - c.c.l.d.ClusterRepositoryService: Cluster 'Product-Team-AWS': BOOTSTRAP_FAILED -> BOOTSTRAP_FAILED
[2014-11-12 21:10:45] INFO [Thread-2] - o.s.c.a.AnnotationConfigApplicationContext: Closing org.springframework.context.annotation.AnnotationConfigApplicationContext@3274ae84: startup date [Wed Nov 12 20:40:47 EST 2014]; root of context hierarchy
[2014-11-12 21:10:45] INFO [Thread-2] - o.s.o.j.LocalContainerEntityManagerFactoryBean: Closing JPA EntityManagerFactory for persistence unit 'default'

Master Collaborator

There is something concerning in the output:

* Invalid role type(s) specified. Ignored during role creation:
HUE: BEESWAX_SERVER
... done 

Can you please remove BEESWAX_SERVER from the list of roles and retry?
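
Director reported "Invalid role type(s) specified. Ignored during role creation" and carried on, so a bad role name only surfaces in the output. A minimal sketch of checking a role request before launching, assuming the roles are represented as a hypothetical {service: [role types]} dict and that you supply the allow-list yourself from the role types your CM/CDH version supports (this is not Director's own parser):

```python
def split_roles(requested, valid_by_service):
    """Split a {service: [role types]} request into (accepted, rejected) dicts."""
    accepted, rejected = {}, {}
    for service, roles in requested.items():
        valid = valid_by_service.get(service, set())
        accepted[service] = [r for r in roles if r in valid]
        bad = [r for r in roles if r not in valid]
        if bad:
            rejected[service] = bad  # report these instead of silently dropping them
    return accepted, rejected


# Hypothetical allow-list: fill it in for the CDH release you are deploying.
VALID = {"HUE": {"HUE_SERVER"}}
```

Calling split_roles({"HUE": ["HUE_SERVER", "BEESWAX_SERVER"]}, VALID) flags BEESWAX_SERVER before any instances are requested.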

Master Collaborator

Regarding the "First Run" failure: there should be another log message above that explains why the command failed and during which step. In the Cloudera Manager UI you should be able to see more details in the command history.
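
"First Run" is a parent command made up of many sub-steps, so the useful error is usually on a failed child rather than on the top-level "Failed to perform First Run of services." message. A minimal sketch for finding the deepest failed steps, assuming the command has been fetched as JSON from the CM REST API (for example GET /api/vN/commands/{id}) and that sub-commands appear under "children": {"items": [...]}; that layout is an assumption to check against your CM API version:

```python
def failed_steps(cmd, path=()):
    """Yield (path, resultMessage) for each failed command with no failed children."""
    path = path + (cmd.get("name", "?"),)
    children = cmd.get("children", {}).get("items", [])
    failed_children = [c for c in children if c.get("success") is False]
    # A failed command with no failed children is where the real error lives.
    if cmd.get("success") is False and not failed_children:
        yield path, cmd.get("resultMessage")
    for child in failed_children:
        yield from failed_steps(child, path)
```

Each yielded path reads like the drill-down you would otherwise do by hand in the CM command history.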

Contributor

I was running the CLI for this smaller case. I will try re-creating it in the web UI to get more information on what is failing. Yes, that BEESWAX error is new. I do have the dump file, but since we don't have a support contract, I guess there is no way to open a case to attach it to. Thanks, I will try your suggestion in the web UI.

Contributor

The web UI page didn't seem to have any history other than "canceled" and "failed due to suspended task", with no trace to see. Management wants me to punt on Director and go the build-it-yourself route without wizards.

Thanks anyway for the help. Time constraints will not let me follow up.

Contributor

As a final follow-up, I ended up coming back to Cloudera Director for my cluster build. It is simply a better, more polished way to create a cluster than my DIY stitching together of instances.

I was able to use the CLI to build a 48-worker-node cluster (55 instances in total, since I added an extra instance to run the third ZooKeeper service). I used cc2.8xlarge for the worker instances.

This cluster will not be a 24/7 configuration. It will be up for at most 4 hours at a time to run performance suites, stopped and started on demand, and retired after next week.

Thanks for the assistance, and for a really useful product for a novice like me. It was an enlightening experience.

Master Collaborator

That's great! Any improvement ideas for future releases? How could we improve the overall experience of using the product?

Contributor

I don't have much to offer as far as improvements. I would like to see a way to reset or clean up after failed updates or other staging failures.

Case in point: when I did my final build, I used a smaller number of workers in the bootstrap stage to make sure a working cluster would come up, counting on the update to bring in the second half of the required workers. During the parcel distribution phase, my final 24 nodes got stuck at 60% for a few hours. I went into Cloudera Manager and tweaked the setting for the number of instances updated in parallel with the CDH packages, and in CM the package state went to 100%, activated.

However, the CLI on the Cloudera Director node running the update never recovered, so I pressed Ctrl-C to stop it. I was able to verify that the cluster had all my instances, and I manually updated the roles on the new workers.

However, now when I go to the Cloudera Director node and run "cloudera-director status aws.reference.conf" to get that nice printout, Director remembers that the last run was terminated in the update state. Instead of giving me the status, it seems to want to continue the update to completion.

It would be nice to be able to clear that state so that I could run the status command when I wanted to.

Another enhancement would be for the update to assign the roles designated in the conf file, but I gather there are reasons it was decided not to automate this. Having to go through CM to add 3 roles across a large cluster can be tricky.

That's all I have; just minor nitpicks. I am not sure how folks plan to use this in production, since I only used it for a single cluster. It will be something I keep in mind if I have to build DIY clusters for myself or my company.

I would imagine that for folks who want to script building CM clusters on AWS for dev teams, this is a great way to get repeatable configs out to various groups within their company for whatever projects their developers are working on. Once you have the process down, I doubt the nits I pointed out above would matter much, but you did ask.