Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Cloudbreak failing to create cluster in AWS: Salt Orchestrator Failing

avatar
Guru

This is a single node "cluster" with most of the HDP stack installed on a single 32GB node. I am able to log into the node via SSH and into Ambari UI. I can see that all of the service are up and running and I am able to use service like Hive, Hbase, Atlas, ect. The deployment seems to be successfull but Cloudbreak eventually shows the cluster creation attempt as failed. In the logs it trie sthe following 90 times:

2016-10-19 20:39:53,954 [containerBootstrapBuilderExecutor-9] call:68 WARN  c.s.c.o.OrchestratorBootstrapRunner - [owner:46c1f915-0f53-4567-b233-de54ed271274] [type:STACK] [id:6] [name:credit-fraud-demo] Orchestrator component SaltJobIdTracker failed to start, retrying [89/90] Elapsed time: 752 ms; Reason: Job: JobId{jobId='20161019202401136474'} is running currently, waiting for next polling attempt, additional info: SaltJobIdTracker{saltJobRunner=HighStateChecker{BaseSaltJobRunner{target=[10.0.41.157], jid=JobId{jobId='20161019202401136474'}, jobState=IN_PROGRESS}}}
cloudbreak_1   | 2016-10-19 20:40:03,954 [containerBootstrapBuilderExecutor-9] isExitNeeded:87 DEBUG c.s.c.o.OrchestratorBootstrapRunner - [owner:46c1f915-0f53-4567-b233-de54ed271274] [type:STACK] [id:6] [name:credit-fraud-demo] exitCriteriaModel: ClusterDeletionBasedExitCriteriaModel{stackId=6, clusterId=6}, exitCriteria: com.sequenceiq.cloudbreak.core.bootstrap.service.ClusterDeletionBasedExitCriteria@25e33b4f
cloudbreak_1   | 2016-10-19 20:40:03,954 [containerBootstrapBuilderExecutor-9] isExitNeeded:19 DEBUG c.s.c.c.b.s.ClusterDeletionBasedExitCriteria - [owner:46c1f915-0f53-4567-b233-de54ed271274] [type:STACK] [id:6] [name:credit-fraud-demo] Check isExitNeeded for model: ClusterDeletionBasedExitCriteriaModel{stackId=6, clusterId=6}
cloudbreak_1   | 2016-10-19 20:40:03,954 [containerBootstrapBuilderExecutor-9] isExitNeeded:90 DEBUG c.s.c.o.OrchestratorBootstrapRunner - [owner:46c1f915-0f53-4567-b233-de54ed271274] [type:STACK] [id:6] [name:credit-fraud-demo] isExitNeeded: false
cloudbreak_1   | 2016-10-19 20:40:03,955 [containerBootstrapBuilderExecutor-9] call:59 INFO  c.s.c.o.OrchestratorBootstrapRunner - [owner:46c1f915-0f53-4567-b233-de54ed271274] [type:STACK] [id:6] [name:credit-fraud-demo] Calling orchestrator bootstrap: SaltJobIdTracker, additional info: SaltJobIdTracker{saltJobRunner=HighStateChecker{BaseSaltJobRunner{target=[10.0.41.157], jid=JobId{jobId='20161019202401136474'}, jobState=IN_PROGRESS}}}
cloudbreak_1   | 2016-10-19 20:40:03,955 [containerBootstrapBuilderExecutor-9] call:43 INFO  c.s.c.o.s.p.SaltJobIdTracker - [owner:46c1f915-0f53-4567-b233-de54ed271274] [type:STACK] [id:6] [name:credit-fraud-demo] Job: 20161019202401136474 is running currently checking the current state.
cloudbreak_1   | 2016-10-19 20:40:04,830 [containerBootstrapBuilderExecutor-9] run:99 INFO  c.s.c.o.s.c.SaltConnector - [owner:46c1f915-0f53-4567-b233-de54ed271274] [type:STACK] [id:6] [name:credit-fraud-demo] Salt run response: RunningJobsResponse{result=[{20161019202401136474={Function=state.highstate, Returned=[], Target-type=glob, Running=[{ip-10-0-41-157.ec2.internal=27848}], User=saltuser, Arguments=[], Target=*}}]}
cloudbreak_1   | 2016-10-19 20:40:04,831 [containerBootstrapBuilderExecutor-9] jobIsRunning:124 INFO  c.s.c.o.s.s.SaltStates - [owner:46c1f915-0f53-4567-b233-de54ed271274] [type:STACK] [id:6] [name:credit-fraud-demo] Active salt jobs: RunningJobsResponse{result=[{20161019202401136474={Function=state.highstate, Returned=[], Target-type=glob, Running=[{ip-10-0-41-157.ec2.internal=27848}], User=saltuser, Arguments=[], Target=*}}]}
cloudbreak_1   | 2016-10-19 20:40:04,831 [containerBootstrapBuilderExecutor-9] checkIsFinished:66 INFO  c.s.c.o.s.p.SaltJobIdTracker - [owner:46c1f915-0f53-4567-b233-de54ed271274] [type:STACK] [id:6] [name:credit-fraud-demo] Job: 20161019202401136474 is running currently, waiting for next polling attempt.
cloudbreak_1   | 2016-10-19 20:40:04,831 [containerBootstrapBuilderExecutor-9] call:68 WARN  c.s.c.o.OrchestratorBootstrapRunner - [owner:46c1f915-0f53-4567-b233-de54ed271274] [type:STACK] [id:6] [name:credit-fraud-demo] Orchestrator component SaltJobIdTracker failed to start, retrying [90/90] Elapsed time: 876 ms; Reason: Job: JobId{jobId='20161019202401136474'} is running currently, waiting for next polling attempt, additional info: SaltJobIdTracker{saltJobRunner=HighStateChecker{BaseSaltJobRunner{target=[10.0.41.157], jid=JobId{jobId='20161019202401136474'}, jobState=IN_PROGRESS}}}

After the 90th attempt:

cloudbreak_1   | 2016-10-19 20:40:04,831 [containerBootstrapBuilderExecutor-9] call:78 ERROR c.s.c.o.OrchestratorBootstrapRunner - [owner:46c1f915-0f53-4567-b233-de54ed271274] [type:STACK] [id:6] [name:credit-fraud-demo] Orchestrator component failed to start in 90 attempts: com.sequenceiq.cloudbreak.orchestrator.exception.CloudbreakOrchestratorFailedException: Job: JobId{jobId='20161019202401136474'} is running currently, waiting for next polling attempt
cloudbreak_1   | 2016-10-19 20:40:04,846 [reactorDispatcher-89] executeRecipes:300 ERROR c.s.c.o.s.SaltOrchestrator - [owner:46c1f915-0f53-4567-b233-de54ed271274] [type:STACK] [id:6] [name:credit-fraud-demo] Error occurred during recipe execution
cloudbreak_1   | java.util.concurrent.ExecutionException: com.sequenceiq.cloudbreak.orchestrator.exception.CloudbreakOrchestratorFailedException: Job: JobId{jobId='20161019202401136474'} is running currently, waiting for next polling attempt
cloudbreak_1   |       	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
cloudbreak_1   |       	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
cloudbreak_1   |       	at com.sequenceiq.cloudbreak.orchestrator.salt.SaltOrchestrator.runNewService(SaltOrchestrator.java:274)
cloudbreak_1   |       	at com.sequenceiq.cloudbreak.orchestrator.salt.SaltOrchestrator.executeRecipes(SaltOrchestrator.java:294)
cloudbreak_1   |       	at com.sequenceiq.cloudbreak.orchestrator.salt.SaltOrchestrator.postInstallRecipes(SaltOrchestrator.java:233)
cloudbreak_1   |       	at com.sequenceiq.cloudbreak.service.cluster.flow.OrchestratorRecipeExecutor.postInstall(OrchestratorRecipeExecutor.java:77)
cloudbreak_1   |       	at com.sequenceiq.cloudbreak.service.cluster.flow.RecipeEngine.executePostInstall(RecipeEngine.java:94)
cloudbreak_1   |       	at com.sequenceiq.cloudbreak.service.cluster.flow.AmbariClusterConnector.buildAmbariCluster(AmbariClusterConnector.java:223)
cloudbreak_1   |       	at com.sequenceiq.cloudbreak.core.cluster.AmbariClusterCreationService.buildAmbariCluster(AmbariClusterCreationService.java:27)
cloudbreak_1   |       	at com.sequenceiq.cloudbreak.reactor.handler.cluster.InstallClusterHandler.accept(InstallClusterHandler.java:35)
cloudbreak_1   |       	at com.sequenceiq.cloudbreak.reactor.handler.cluster.InstallClusterHandler.accept(InstallClusterHandler.java:18)
cloudbreak_1   |       	at reactor.bus.EventBus$3.accept(EventBus.java:317)
cloudbreak_1   |       	at reactor.bus.EventBus$3.accept(EventBus.java:310)
cloudbreak_1   |       	at reactor.bus.routing.ConsumerFilteringRouter.route(ConsumerFilteringRouter.java:72)
cloudbreak_1   |       	at reactor.bus.routing.TraceableDelegatingRouter.route(TraceableDelegatingRouter.java:51)
cloudbreak_1   |       	at reactor.bus.EventBus.accept(EventBus.java:591)
cloudbreak_1   |       	at reactor.bus.EventBus.accept(EventBus.java:63)
cloudbreak_1   |       	at reactor.core.dispatch.AbstractLifecycleDispatcher.route(AbstractLifecycleDispatcher.java:160)
cloudbreak_1   |       	at reactor.core.dispatch.MultiThreadDispatcher$MultiThreadTask.run(MultiThreadDispatcher.java:74)
cloudbreak_1   |       	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
cloudbreak_1   |       	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
cloudbreak_1   |       	at java.lang.Thread.run(Thread.java:745)
cloudbreak_1   | Caused by: com.sequenceiq.cloudbreak.orchestrator.exception.CloudbreakOrchestratorFailedException: Job: JobId{jobId='20161019202401136474'} is running currently, waiting for next polling attempt
cloudbreak_1   |       	at com.sequenceiq.cloudbreak.orchestrator.salt.poller.SaltJobIdTracker.call(SaltJobIdTracker.java:54)
cloudbreak_1   |       	at com.sequenceiq.cloudbreak.orchestrator.OrchestratorBootstrapRunner.call(OrchestratorBootstrapRunner.java:60)
cloudbreak_1   |       	at com.sequenceiq.cloudbreak.orchestrator.OrchestratorBootstrapRunner.call(OrchestratorBootstrapRunner.java:14)
cloudbreak_1   |       	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
cloudbreak_1   |       	... 3 common frames omitted
cloudbreak_1   | 2016-10-19 20:40:04,847 [reactorDispatcher-89] buildAmbariCluster:238 ERROR c.s.c.s.c.f.AmbariClusterConnector - [owner:46c1f915-0f53-4567-b233-de54ed271274] [type:STACK] [id:6] [name:credit-fraud-demo] Error while building the Ambari cluster. Message com.sequenceiq.cloudbreak.orchestrator.exception.CloudbreakOrchestratorFailedException: java.util.concurrent.ExecutionException: com.sequenceiq.cloudbreak.orchestrator.exception.CloudbreakOrchestratorFailedException: Job: JobId{jobId='20161019202401136474'} is running currently, waiting for next polling attempt, throwable: {}
cloudbreak_1   | com.sequenceiq.cloudbreak.core.CloudbreakException: com.sequenceiq.cloudbreak.orchestrator.exception.CloudbreakOrchestratorFailedException: java.util.concurrent.ExecutionException: com.sequenceiq.cloudbreak.orchestrator.exception.CloudbreakOrchestratorFailedException: Job: JobId{jobId='20161019202401136474'} is running currently, waiting for next polling attempt
cloudbreak_1   |       	at com.sequenceiq.cloudbreak.service.cluster.flow.OrchestratorRecipeExecutor.postInstall(OrchestratorRecipeExecutor.java:80)
cloudbreak_1   |       	at com.sequenceiq.cloudbreak.service.cluster.flow.RecipeEngine.executePostInstall(RecipeEngine.java:94)
cloudbreak_1   |       	at com.sequenceiq.cloudbreak.service.cluster.flow.AmbariClusterConnector.buildAmbariCluster(AmbariClusterConnector.java:223)
cloudbreak_1   |       	at com.sequenceiq.cloudbreak.core.cluster.AmbariClusterCreationService.buildAmbariCluster(AmbariClusterCreationService.java:27)
cloudbreak_1   |       	at com.sequenceiq.cloudbreak.reactor.handler.cluster.InstallClusterHandler.accept(InstallClusterHandler.java:35)
cloudbreak_1   |       	at com.sequenceiq.cloudbreak.reactor.handler.cluster.InstallClusterHandler.accept(InstallClusterHandler.java:18)
cloudbreak_1   |       	at reactor.bus.EventBus$3.accept(EventBus.java:317)
cloudbreak_1   |       	at reactor.bus.EventBus$3.accept(EventBus.java:310)
cloudbreak_1   |       	at reactor.bus.routing.ConsumerFilteringRouter.route(ConsumerFilteringRouter.java:72)
cloudbreak_1   |       	at reactor.bus.routing.TraceableDelegatingRouter.route(TraceableDelegatingRouter.java:51)
cloudbreak_1   |       	at reactor.bus.EventBus.accept(EventBus.java:591)
cloudbreak_1   |       	at reactor.bus.EventBus.accept(EventBus.java:63)
cloudbreak_1   |       	at reactor.core.dispatch.AbstractLifecycleDispatcher.route(AbstractLifecycleDispatcher.java:160)
cloudbreak_1   |       	at reactor.core.dispatch.MultiThreadDispatcher$MultiThreadTask.run(MultiThreadDispatcher.java:74)
cloudbreak_1   |       	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
cloudbreak_1   |       	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
cloudbreak_1   |       	at java.lang.Thread.run(Thread.java:745)
cloudbreak_1   | Caused by: com.sequenceiq.cloudbreak.orchestrator.exception.CloudbreakOrchestratorFailedException: java.util.concurrent.ExecutionException: com.sequenceiq.cloudbreak.orchestrator.exception.CloudbreakOrchestratorFailedException: Job: JobId{jobId='20161019202401136474'} is running currently, waiting for next polling attempt
cloudbreak_1   |       	at com.sequenceiq.cloudbreak.orchestrator.salt.SaltOrchestrator.executeRecipes(SaltOrchestrator.java:301)
cloudbreak_1   |       	at com.sequenceiq.cloudbreak.orchestrator.salt.SaltOrchestrator.postInstallRecipes(SaltOrchestrator.java:233)
cloudbreak_1   |       	at com.sequenceiq.cloudbreak.service.cluster.flow.OrchestratorRecipeExecutor.postInstall(OrchestratorRecipeExecutor.java:77)
cloudbreak_1   |       	... 16 common frames omitted
cloudbreak_1   | Caused by: java.util.concurrent.ExecutionException: com.sequenceiq.cloudbreak.orchestrator.exception.CloudbreakOrchestratorFailedException: Job: JobId{jobId='20161019202401136474'} is running currently, waiting for next polling attempt
cloudbreak_1   |       	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
cloudbreak_1   |       	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
cloudbreak_1   |       	at com.sequenceiq.cloudbreak.orchestrator.salt.SaltOrchestrator.runNewService(SaltOrchestrator.java:274)
cloudbreak_1   |       	at com.sequenceiq.cloudbreak.orchestrator.salt.SaltOrchestrator.executeRecipes(SaltOrchestrator.java:294)
cloudbreak_1   |       	... 18 common frames omitted
cloudbreak_1   | Caused by: com.sequenceiq.cloudbreak.orchestrator.exception.CloudbreakOrchestratorFailedException: Job: JobId{jobId='20161019202401136474'} is running currently, waiting for next polling attempt
cloudbreak_1   |       	at com.sequenceiq.cloudbreak.orchestrator.salt.poller.SaltJobIdTracker.call(SaltJobIdTracker.java:54)
cloudbreak_1   |       	at com.sequenceiq.cloudbreak.orchestrator.OrchestratorBootstrapRunner.call(OrchestratorBootstrapRunner.java:60)
cloudbreak_1   |       	at com.sequenceiq.cloudbreak.orchestrator.OrchestratorBootstrapRunner.call(OrchestratorBootstrapRunner.java:14)
cloudbreak_1   |       	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
cloudbreak_1   |       	... 3 common frames omitted
cloudbreak_1   | 2016-10-19 20:40:04,847 [reactorDispatcher-89] accept:70 DEBUG c.s.c.c.f.Flow2Handler - [owner:46c1f915-0f53-4567-b233-de54ed271274] [type:STACK] [id:6] [name:credit-fraud-demo] flow control event arrived: key: INSTALLCLUSTERFAILED, flowid: ae4d5d90-b13d-4b82-a0d4-0b0a9c75f0ac, payload: com.sequenceiq.cloudbreak.reactor.api.event.cluster.InstallClusterFailed@3af981e

In Cloudbreak UI:

Ambari cluster could not be created. Reason: com.sequenceiq.cloudbreak.orchestrator.exception.CloudbreakOrchestratorFailedException: java.util.concurrent.ExecutionException: com.sequenceiq.cloudbreak.orchestrator.exception.CloudbreakOrchestratorFailedException: Job: JobId{jobId='20161019202401136474'} is running currently, waiting for next polling attempt

I am using the default fully open network config for aws.

1 ACCEPTED SOLUTION

avatar
Expert Contributor

I think I have found the issue. The debug: True is set in /etc/salt/master.d/custom.conf

....
rest_cherrypy:
  ....
  port: 3080
  ....
  debug: True

Because of this flag the salt-api prints out more useful info for debugging, but it has the side effect that it will reload itself when the underlying code is changed: https://docs.saltstack.com/en/latest/ref/netapi/all/salt.netapi.rest_cherrypy.html

I saw the "Restarting because ..." messages in the Salt master log while the Ambari was installing the packages for HDP (e.g. in my case the email/mime/__init__.py was touched and triggered the restart, which was not handled very gracefully by cherrypy and caused the port collision.

2016-10-21 21:25:30,016 [cherrypy.error                           ][INFO    ][3454] [21/Oct/2016:21:25:30] ENGINE Restarting because /usr/lib64/python2.6/email/mime/__init__.py changed.
2016-10-21 21:25:30,016 [cherrypy.error                           ][INFO    ][3454] [21/Oct/2016:21:25:30] ENGINE Stopped thread 'Autoreloader'.
2016-10-21 21:25:30,017 [cherrypy.error                           ][INFO    ][3454] [21/Oct/2016:21:25:30] ENGINE Bus STOPPING
2016-10-21 21:25:30,027 [cherrypy.error                           ][INFO    ][3454] [21/Oct/2016:21:25:30] ENGINE HTTP Server cherrypy._cpwsgi_server.CPWSGIServer(('0.0.0.0', 3080)) shut down
2

I was thinking about a workaround, but I was not able to figure out any what you can apply. I will try to make a fix for it.

Attila

View solution in original post

13 REPLIES 13

avatar
Contributor

Is there anything I can do to avoid this? I've re-deployed versions rc-1.7 & rc-1.8 but the result is the same. I'm not able to provision clusters right now. Thanks!

avatar
Expert Contributor

@Vadim Vaks, @Miguel Lucero it has just been fixed in 1.6.1-rc.27. You can update to this version with

cbd update rc-1.6
cbd regenerate
cbd restart

Could you try it out, please?

avatar
Contributor

Thank you! I gave this a shot but unfortunately the result is the same. I have opened a ticket with hwx directly.

avatar
Expert Contributor

Ok, thanks.