Created 10-19-2016 08:58 PM
This is a single node "cluster" with most of the HDP stack installed on a single 32GB node. I am able to log into the node via SSH and into Ambari UI. I can see that all of the service are up and running and I am able to use service like Hive, Hbase, Atlas, ect. The deployment seems to be successfull but Cloudbreak eventually shows the cluster creation attempt as failed. In the logs it trie sthe following 90 times:
2016-10-19 20:39:53,954 [containerBootstrapBuilderExecutor-9] call:68 WARN c.s.c.o.OrchestratorBootstrapRunner - [owner:46c1f915-0f53-4567-b233-de54ed271274] [type:STACK] [id:6] [name:credit-fraud-demo] Orchestrator component SaltJobIdTracker failed to start, retrying [89/90] Elapsed time: 752 ms; Reason: Job: JobId{jobId='20161019202401136474'} is running currently, waiting for next polling attempt, additional info: SaltJobIdTracker{saltJobRunner=HighStateChecker{BaseSaltJobRunner{target=[10.0.41.157], jid=JobId{jobId='20161019202401136474'}, jobState=IN_PROGRESS}}} cloudbreak_1 | 2016-10-19 20:40:03,954 [containerBootstrapBuilderExecutor-9] isExitNeeded:87 DEBUG c.s.c.o.OrchestratorBootstrapRunner - [owner:46c1f915-0f53-4567-b233-de54ed271274] [type:STACK] [id:6] [name:credit-fraud-demo] exitCriteriaModel: ClusterDeletionBasedExitCriteriaModel{stackId=6, clusterId=6}, exitCriteria: com.sequenceiq.cloudbreak.core.bootstrap.service.ClusterDeletionBasedExitCriteria@25e33b4f cloudbreak_1 | 2016-10-19 20:40:03,954 [containerBootstrapBuilderExecutor-9] isExitNeeded:19 DEBUG c.s.c.c.b.s.ClusterDeletionBasedExitCriteria - [owner:46c1f915-0f53-4567-b233-de54ed271274] [type:STACK] [id:6] [name:credit-fraud-demo] Check isExitNeeded for model: ClusterDeletionBasedExitCriteriaModel{stackId=6, clusterId=6} cloudbreak_1 | 2016-10-19 20:40:03,954 [containerBootstrapBuilderExecutor-9] isExitNeeded:90 DEBUG c.s.c.o.OrchestratorBootstrapRunner - [owner:46c1f915-0f53-4567-b233-de54ed271274] [type:STACK] [id:6] [name:credit-fraud-demo] isExitNeeded: false cloudbreak_1 | 2016-10-19 20:40:03,955 [containerBootstrapBuilderExecutor-9] call:59 INFO c.s.c.o.OrchestratorBootstrapRunner - [owner:46c1f915-0f53-4567-b233-de54ed271274] [type:STACK] [id:6] [name:credit-fraud-demo] Calling orchestrator bootstrap: SaltJobIdTracker, additional info: SaltJobIdTracker{saltJobRunner=HighStateChecker{BaseSaltJobRunner{target=[10.0.41.157], jid=JobId{jobId='20161019202401136474'}, jobState=IN_PROGRESS}}} cloudbreak_1 | 2016-10-19 20:40:03,955 [containerBootstrapBuilderExecutor-9] call:43 INFO c.s.c.o.s.p.SaltJobIdTracker - [owner:46c1f915-0f53-4567-b233-de54ed271274] [type:STACK] [id:6] [name:credit-fraud-demo] Job: 20161019202401136474 is running currently checking the current state. cloudbreak_1 | 2016-10-19 20:40:04,830 [containerBootstrapBuilderExecutor-9] run:99 INFO c.s.c.o.s.c.SaltConnector - [owner:46c1f915-0f53-4567-b233-de54ed271274] [type:STACK] [id:6] [name:credit-fraud-demo] Salt run response: RunningJobsResponse{result=[{20161019202401136474={Function=state.highstate, Returned=[], Target-type=glob, Running=[{ip-10-0-41-157.ec2.internal=27848}], User=saltuser, Arguments=[], Target=*}}]} cloudbreak_1 | 2016-10-19 20:40:04,831 [containerBootstrapBuilderExecutor-9] jobIsRunning:124 INFO c.s.c.o.s.s.SaltStates - [owner:46c1f915-0f53-4567-b233-de54ed271274] [type:STACK] [id:6] [name:credit-fraud-demo] Active salt jobs: RunningJobsResponse{result=[{20161019202401136474={Function=state.highstate, Returned=[], Target-type=glob, Running=[{ip-10-0-41-157.ec2.internal=27848}], User=saltuser, Arguments=[], Target=*}}]} cloudbreak_1 | 2016-10-19 20:40:04,831 [containerBootstrapBuilderExecutor-9] checkIsFinished:66 INFO c.s.c.o.s.p.SaltJobIdTracker - [owner:46c1f915-0f53-4567-b233-de54ed271274] [type:STACK] [id:6] [name:credit-fraud-demo] Job: 20161019202401136474 is running currently, waiting for next polling attempt. cloudbreak_1 | 2016-10-19 20:40:04,831 [containerBootstrapBuilderExecutor-9] call:68 WARN c.s.c.o.OrchestratorBootstrapRunner - [owner:46c1f915-0f53-4567-b233-de54ed271274] [type:STACK] [id:6] [name:credit-fraud-demo] Orchestrator component SaltJobIdTracker failed to start, retrying [90/90] Elapsed time: 876 ms; Reason: Job: JobId{jobId='20161019202401136474'} is running currently, waiting for next polling attempt, additional info: SaltJobIdTracker{saltJobRunner=HighStateChecker{BaseSaltJobRunner{target=[10.0.41.157], jid=JobId{jobId='20161019202401136474'}, jobState=IN_PROGRESS}}}
After the 90th attempt:
cloudbreak_1 | 2016-10-19 20:40:04,831 [containerBootstrapBuilderExecutor-9] call:78 ERROR c.s.c.o.OrchestratorBootstrapRunner - [owner:46c1f915-0f53-4567-b233-de54ed271274] [type:STACK] [id:6] [name:credit-fraud-demo] Orchestrator component failed to start in 90 attempts: com.sequenceiq.cloudbreak.orchestrator.exception.CloudbreakOrchestratorFailedException: Job: JobId{jobId='20161019202401136474'} is running currently, waiting for next polling attempt cloudbreak_1 | 2016-10-19 20:40:04,846 [reactorDispatcher-89] executeRecipes:300 ERROR c.s.c.o.s.SaltOrchestrator - [owner:46c1f915-0f53-4567-b233-de54ed271274] [type:STACK] [id:6] [name:credit-fraud-demo] Error occurred during recipe execution cloudbreak_1 | java.util.concurrent.ExecutionException: com.sequenceiq.cloudbreak.orchestrator.exception.CloudbreakOrchestratorFailedException: Job: JobId{jobId='20161019202401136474'} is running currently, waiting for next polling attempt cloudbreak_1 | at java.util.concurrent.FutureTask.report(FutureTask.java:122) cloudbreak_1 | at java.util.concurrent.FutureTask.get(FutureTask.java:192) cloudbreak_1 | at com.sequenceiq.cloudbreak.orchestrator.salt.SaltOrchestrator.runNewService(SaltOrchestrator.java:274) cloudbreak_1 | at com.sequenceiq.cloudbreak.orchestrator.salt.SaltOrchestrator.executeRecipes(SaltOrchestrator.java:294) cloudbreak_1 | at com.sequenceiq.cloudbreak.orchestrator.salt.SaltOrchestrator.postInstallRecipes(SaltOrchestrator.java:233) cloudbreak_1 | at com.sequenceiq.cloudbreak.service.cluster.flow.OrchestratorRecipeExecutor.postInstall(OrchestratorRecipeExecutor.java:77) cloudbreak_1 | at com.sequenceiq.cloudbreak.service.cluster.flow.RecipeEngine.executePostInstall(RecipeEngine.java:94) cloudbreak_1 | at com.sequenceiq.cloudbreak.service.cluster.flow.AmbariClusterConnector.buildAmbariCluster(AmbariClusterConnector.java:223) cloudbreak_1 | at com.sequenceiq.cloudbreak.core.cluster.AmbariClusterCreationService.buildAmbariCluster(AmbariClusterCreationService.java:27) cloudbreak_1 | at com.sequenceiq.cloudbreak.reactor.handler.cluster.InstallClusterHandler.accept(InstallClusterHandler.java:35) cloudbreak_1 | at com.sequenceiq.cloudbreak.reactor.handler.cluster.InstallClusterHandler.accept(InstallClusterHandler.java:18) cloudbreak_1 | at reactor.bus.EventBus$3.accept(EventBus.java:317) cloudbreak_1 | at reactor.bus.EventBus$3.accept(EventBus.java:310) cloudbreak_1 | at reactor.bus.routing.ConsumerFilteringRouter.route(ConsumerFilteringRouter.java:72) cloudbreak_1 | at reactor.bus.routing.TraceableDelegatingRouter.route(TraceableDelegatingRouter.java:51) cloudbreak_1 | at reactor.bus.EventBus.accept(EventBus.java:591) cloudbreak_1 | at reactor.bus.EventBus.accept(EventBus.java:63) cloudbreak_1 | at reactor.core.dispatch.AbstractLifecycleDispatcher.route(AbstractLifecycleDispatcher.java:160) cloudbreak_1 | at reactor.core.dispatch.MultiThreadDispatcher$MultiThreadTask.run(MultiThreadDispatcher.java:74) cloudbreak_1 | at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) cloudbreak_1 | at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) cloudbreak_1 | at java.lang.Thread.run(Thread.java:745) cloudbreak_1 | Caused by: com.sequenceiq.cloudbreak.orchestrator.exception.CloudbreakOrchestratorFailedException: Job: JobId{jobId='20161019202401136474'} is running currently, waiting for next polling attempt cloudbreak_1 | at com.sequenceiq.cloudbreak.orchestrator.salt.poller.SaltJobIdTracker.call(SaltJobIdTracker.java:54) cloudbreak_1 | at com.sequenceiq.cloudbreak.orchestrator.OrchestratorBootstrapRunner.call(OrchestratorBootstrapRunner.java:60) cloudbreak_1 | at com.sequenceiq.cloudbreak.orchestrator.OrchestratorBootstrapRunner.call(OrchestratorBootstrapRunner.java:14) cloudbreak_1 | at java.util.concurrent.FutureTask.run(FutureTask.java:266) cloudbreak_1 | ... 3 common frames omitted cloudbreak_1 | 2016-10-19 20:40:04,847 [reactorDispatcher-89] buildAmbariCluster:238 ERROR c.s.c.s.c.f.AmbariClusterConnector - [owner:46c1f915-0f53-4567-b233-de54ed271274] [type:STACK] [id:6] [name:credit-fraud-demo] Error while building the Ambari cluster. Message com.sequenceiq.cloudbreak.orchestrator.exception.CloudbreakOrchestratorFailedException: java.util.concurrent.ExecutionException: com.sequenceiq.cloudbreak.orchestrator.exception.CloudbreakOrchestratorFailedException: Job: JobId{jobId='20161019202401136474'} is running currently, waiting for next polling attempt, throwable: {} cloudbreak_1 | com.sequenceiq.cloudbreak.core.CloudbreakException: com.sequenceiq.cloudbreak.orchestrator.exception.CloudbreakOrchestratorFailedException: java.util.concurrent.ExecutionException: com.sequenceiq.cloudbreak.orchestrator.exception.CloudbreakOrchestratorFailedException: Job: JobId{jobId='20161019202401136474'} is running currently, waiting for next polling attempt cloudbreak_1 | at com.sequenceiq.cloudbreak.service.cluster.flow.OrchestratorRecipeExecutor.postInstall(OrchestratorRecipeExecutor.java:80) cloudbreak_1 | at com.sequenceiq.cloudbreak.service.cluster.flow.RecipeEngine.executePostInstall(RecipeEngine.java:94) cloudbreak_1 | at com.sequenceiq.cloudbreak.service.cluster.flow.AmbariClusterConnector.buildAmbariCluster(AmbariClusterConnector.java:223) cloudbreak_1 | at com.sequenceiq.cloudbreak.core.cluster.AmbariClusterCreationService.buildAmbariCluster(AmbariClusterCreationService.java:27) cloudbreak_1 | at com.sequenceiq.cloudbreak.reactor.handler.cluster.InstallClusterHandler.accept(InstallClusterHandler.java:35) cloudbreak_1 | at com.sequenceiq.cloudbreak.reactor.handler.cluster.InstallClusterHandler.accept(InstallClusterHandler.java:18) cloudbreak_1 | at reactor.bus.EventBus$3.accept(EventBus.java:317) cloudbreak_1 | at reactor.bus.EventBus$3.accept(EventBus.java:310) cloudbreak_1 | at reactor.bus.routing.ConsumerFilteringRouter.route(ConsumerFilteringRouter.java:72) cloudbreak_1 | at reactor.bus.routing.TraceableDelegatingRouter.route(TraceableDelegatingRouter.java:51) cloudbreak_1 | at reactor.bus.EventBus.accept(EventBus.java:591) cloudbreak_1 | at reactor.bus.EventBus.accept(EventBus.java:63) cloudbreak_1 | at reactor.core.dispatch.AbstractLifecycleDispatcher.route(AbstractLifecycleDispatcher.java:160) cloudbreak_1 | at reactor.core.dispatch.MultiThreadDispatcher$MultiThreadTask.run(MultiThreadDispatcher.java:74) cloudbreak_1 | at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) cloudbreak_1 | at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) cloudbreak_1 | at java.lang.Thread.run(Thread.java:745) cloudbreak_1 | Caused by: com.sequenceiq.cloudbreak.orchestrator.exception.CloudbreakOrchestratorFailedException: java.util.concurrent.ExecutionException: com.sequenceiq.cloudbreak.orchestrator.exception.CloudbreakOrchestratorFailedException: Job: JobId{jobId='20161019202401136474'} is running currently, waiting for next polling attempt cloudbreak_1 | at com.sequenceiq.cloudbreak.orchestrator.salt.SaltOrchestrator.executeRecipes(SaltOrchestrator.java:301) cloudbreak_1 | at com.sequenceiq.cloudbreak.orchestrator.salt.SaltOrchestrator.postInstallRecipes(SaltOrchestrator.java:233) cloudbreak_1 | at com.sequenceiq.cloudbreak.service.cluster.flow.OrchestratorRecipeExecutor.postInstall(OrchestratorRecipeExecutor.java:77) cloudbreak_1 | ... 16 common frames omitted cloudbreak_1 | Caused by: java.util.concurrent.ExecutionException: com.sequenceiq.cloudbreak.orchestrator.exception.CloudbreakOrchestratorFailedException: Job: JobId{jobId='20161019202401136474'} is running currently, waiting for next polling attempt cloudbreak_1 | at java.util.concurrent.FutureTask.report(FutureTask.java:122) cloudbreak_1 | at java.util.concurrent.FutureTask.get(FutureTask.java:192) cloudbreak_1 | at com.sequenceiq.cloudbreak.orchestrator.salt.SaltOrchestrator.runNewService(SaltOrchestrator.java:274) cloudbreak_1 | at com.sequenceiq.cloudbreak.orchestrator.salt.SaltOrchestrator.executeRecipes(SaltOrchestrator.java:294) cloudbreak_1 | ... 18 common frames omitted cloudbreak_1 | Caused by: com.sequenceiq.cloudbreak.orchestrator.exception.CloudbreakOrchestratorFailedException: Job: JobId{jobId='20161019202401136474'} is running currently, waiting for next polling attempt cloudbreak_1 | at com.sequenceiq.cloudbreak.orchestrator.salt.poller.SaltJobIdTracker.call(SaltJobIdTracker.java:54) cloudbreak_1 | at com.sequenceiq.cloudbreak.orchestrator.OrchestratorBootstrapRunner.call(OrchestratorBootstrapRunner.java:60) cloudbreak_1 | at com.sequenceiq.cloudbreak.orchestrator.OrchestratorBootstrapRunner.call(OrchestratorBootstrapRunner.java:14) cloudbreak_1 | at java.util.concurrent.FutureTask.run(FutureTask.java:266) cloudbreak_1 | ... 3 common frames omitted cloudbreak_1 | 2016-10-19 20:40:04,847 [reactorDispatcher-89] accept:70 DEBUG c.s.c.c.f.Flow2Handler - [owner:46c1f915-0f53-4567-b233-de54ed271274] [type:STACK] [id:6] [name:credit-fraud-demo] flow control event arrived: key: INSTALLCLUSTERFAILED, flowid: ae4d5d90-b13d-4b82-a0d4-0b0a9c75f0ac, payload: com.sequenceiq.cloudbreak.reactor.api.event.cluster.InstallClusterFailed@3af981e
In Cloudbreak UI:
Ambari cluster could not be created. Reason: com.sequenceiq.cloudbreak.orchestrator.exception.CloudbreakOrchestratorFailedException: java.util.concurrent.ExecutionException: com.sequenceiq.cloudbreak.orchestrator.exception.CloudbreakOrchestratorFailedException: Job: JobId{jobId='20161019202401136474'} is running currently, waiting for next polling attempt
I am using the default fully open network config for aws.
Created 10-21-2016 11:01 PM
I think I have found the issue. The debug: True is set in /etc/salt/master.d/custom.conf
.... rest_cherrypy: .... port: 3080 .... debug: True
Because of this flag the salt-api prints out more useful info for debugging, but it has the side effect that it will reload itself when the underlying code is changed: https://docs.saltstack.com/en/latest/ref/netapi/all/salt.netapi.rest_cherrypy.html
I saw the "Restarting because ..." messages in the Salt master log while the Ambari was installing the packages for HDP (e.g. in my case the email/mime/__init__.py was touched and triggered the restart, which was not handled very gracefully by cherrypy and caused the port collision.
2016-10-21 21:25:30,016 [cherrypy.error ][INFO ][3454] [21/Oct/2016:21:25:30] ENGINE Restarting because /usr/lib64/python2.6/email/mime/__init__.py changed. 2016-10-21 21:25:30,016 [cherrypy.error ][INFO ][3454] [21/Oct/2016:21:25:30] ENGINE Stopped thread 'Autoreloader'. 2016-10-21 21:25:30,017 [cherrypy.error ][INFO ][3454] [21/Oct/2016:21:25:30] ENGINE Bus STOPPING 2016-10-21 21:25:30,027 [cherrypy.error ][INFO ][3454] [21/Oct/2016:21:25:30] ENGINE HTTP Server cherrypy._cpwsgi_server.CPWSGIServer(('0.0.0.0', 3080)) shut down 2
I was thinking about a workaround, but I was not able to figure out any what you can apply. I will try to make a fix for it.
Attila
Created 10-24-2016 04:33 PM
Is there anything I can do to avoid this? I've re-deployed versions rc-1.7 & rc-1.8 but the result is the same. I'm not able to provision clusters right now. Thanks!
Created 10-24-2016 09:47 PM
@Vadim Vaks, @Miguel Lucero it has just been fixed in 1.6.1-rc.27. You can update to this version with
cbd update rc-1.6 cbd regenerate cbd restart
Could you try it out, please?
Created 10-25-2016 03:13 PM
Thank you! I gave this a shot but unfortunately the result is the same. I have opened a ticket with hwx directly.
Created 10-26-2016 01:26 PM
Ok, thanks.