Created on 12-12-2017 01:34 PM - edited 09-16-2022 05:37 AM
I am trying (desperately now, after 3 days) to provision a vanilla install of Director, Manager and a cluster on AWS. Director is up and running fine, but when I try to create a Manager (and my first cluster), the bootstrap fails at the end of configuring the Manager, reporting that the install failed after 5 attempts. I have exhaustively reviewed the application.log on the Director along with the Server and Agent logs on the Manager. The failure occurs when trying to deploy the Agent to the Manager (FROM the Manager).
The logs are showing me VERY little as to the cause of this.
Agent Log Errors:
[root@ip-192-168-58-68 cloudera-scm-agent]# tail -f cloudera-scm-agent.log | grep ERROR
[12/Dec/2017 16:15:42 +0000] 14032 MainThread downloader ERROR Failed rack peer update: [Errno 111] Connection refused
[12/Dec/2017 16:15:42 +0000] 14032 MainThread downloader ERROR Failed rack peer update: [Errno 111] Connection refused
[12/Dec/2017 16:15:49 +0000] 14032 Monitor-HostMonitor throttling_logger ERROR Timeout with args ['chronyc', 'sources']
[12/Dec/2017 16:15:49 +0000] 14032 Monitor-HostMonitor throttling_logger ERROR Failed to collect NTP metrics
[12/Dec/2017 16:18:53 +0000] 15347 Monitor-HostMonitor throttling_logger ERROR Timeout with args ['chronyc', 'sources']
[12/Dec/2017 16:18:53 +0000] 15347 Monitor-HostMonitor throttling_logger ERROR Failed to collect NTP metrics
[12/Dec/2017 16:19:46 +0000] 15347 DnsResolutionMonitor throttling_logger ERROR Timeout with args ['/usr/java/jdk1.7.0_67-cloudera/bin/java', '-classpath', '/usr/share/cmf/lib/agent-5.13.1.jar', 'com.cloudera.cmon.agent.DnsTest']
[12/Dec/2017 16:19:46 +0000] 15347 DnsResolutionMonitor throttling_logger ERROR Failed to run DnsTest.
Cloudera Director Application Log:
[2017-12-12 21:10:56.222 +0000] ERROR [port-forwarding-[38267:192.168.58.68:7180]] - - - - - net.schmizz.concurrent.Promise: <<chan#29 / open>> woke to: Opening `direct-tcpip` channel failed: Connection refused
[2017-12-12 21:18:46.531 +0000] ERROR [p-bd15744ec532-DefaultBootstrapDeploymentJob] c52f762e-479e-42c8-be1c-1ee4c0e1aa54 POST /api/v10/environments/cloudera1/deployments com.cloudera.launchpad.bootstrap.cluster.BootstrapClouderaManagerAgent$WaitForSuccessOrRetryOnFailure - c.c.l.b.c.BootstrapClouderaManagerAgent: Command GlobalHostInstall with ID 22 failed after 5 tries. Details: ApiCommand{id=22, name=GlobalHostInstall, startTime=Tue Dec 12 21:16:25 UTC 2017, endTime=Tue Dec 12 21:18:45 UTC 2017, active=false, success=false, resultMessage=Command completed with 0/1 successful subcommands, serviceRef=null, roleRef=null, hostRef=null, parent=null}
[2017-12-12 21:18:46.531 +0000] ERROR [p-bd15744ec532-DefaultBootstrapDeploymentJob] c52f762e-479e-42c8-be1c-1ee4c0e1aa54 POST /api/v10/environments/cloudera1/deployments com.cloudera.launchpad.bootstrap.cluster.BootstrapClouderaManagerAgent$WaitForSuccessOrRetryOnFailure - c.c.l.pipeline.util.PipelineRunner: Attempt to execute job failed
[2017-12-12 21:18:46.532 +0000] ERROR [p-bd15744ec532-DefaultBootstrapDeploymentJob] c52f762e-479e-42c8-be1c-1ee4c0e1aa54 POST /api/v10/environments/cloudera1/deployments com.cloudera.launchpad.bootstrap.cluster.BootstrapClouderaManagerAgent$WaitForSuccessOrRetryOnFailure - c.c.l.p.DatabasePipelineRunner: Encountered an unrecoverable error: JobError{jobClassName=com.cloudera.launchpad.bootstrap.cluster.BootstrapClouderaManagerAgent$WaitForSuccessOrRetryOnFailure, jobArguments=[DeploymentContext{environment=Environment{name='cloudera1', provider=InstanceProviderConfig{type='aws'}, credentials=SshCredentials{username='lavastorm', hasPassword=false, hasPrivateKey=true, hasPassphrase=false, port=22, hostKeyFingerprint=Optional.absent(), bastionHost=Optional.absent()}}, deployment=Deployment{name='ClouderaManager', hostname='192.168.58.68', port=7180, username='admin', tlsEnabled=false, tlsConfigurationProperties={}, managerInstance=Optional.of(PluggableComputeInstance{ipAddress=192.168.58.68, delegate=null, hostEndpoints=[HostEndpoint{hostAddressString='192.168.58.68', hostAddress=Optional.of(/192.168.58.68)}, HostEndpoint{hostAddressString='ip-192-168-58-68.ec2.internal', hostAddress=Optional.absent()}, HostEndpoint{hostAddressString='52.90.79.90', hostAddress=Optional.of(/52.90.79.90)}, HostEndpoint{hostAddressString='ec2-52-90-79-90.compute-1.amazonaws.com', hostAddress=Optional.absent()}]} Instance{virtualInstance=VirtualInstance{id='1dcfc459-bb3e-4812-be95-e0e7eebc2e23', template=InstanceTemplate{name='cloudera-Non-Spot', type='m4.large', image='ami-185a260e', rackId='/default', bootstrapScriptsArePresent=false, config={subnetId=subnet-b0e8de9d, ebsOptimized=false, tenancy=default, rootVolumeSizeGB=25, ebsVolumeCount=1, enableEbsEncryption=false, blockDurationMinutes=60, rootVolumeType=gp2, ebsVolumeSizeGiB=25, useSpotInstances=false, ebsVolumeType=gp2, securityGroupsIds=sg-d37be5b7, spotBidUSDPerHr=0.1229}, tags={}, normalizeInstance=true, 
sshUsername=Optional.of(ec2-user), sshHostKeyRetrievalType=NONE}}, capabilities=Optional.of(Capabilities{operatingSystemType=REDHAT_COMPATIBLE, operatingSystemVersion=REDHAT_COMPATIBLE_7, virtualizationType=HARDWARE_ASSISTED, packageManager=Optional.of(YUM), javaVendor=Optional.absent(), javaVersion=Optional.absent(), pythonVersion=Optional.of(2.7.5), passwordlessSudoEnabled=true, selinuxEnabled=true, iptablesEnabled=false, dnsConfigured=true, fqdn=Optional.of(ip-192-168-58-68.lavastorm.com), clouderaManagerAgentInstalled=false, customScriptPaths={PREPARE_UNMOUNTED_VOLUMES=/var/lib/cloudera-director-plugins/aws-provider-1.4.4/etc/prepare_unmounted_volumes}}), cmHostId=Optional.absent(), cmHostUrl=Optional.absent(), hostKeyFingerprints=[], validationConditions=[], state=InstanceState{status=RUNNING, lastReported=2017-12-12T20:54:55.130Z, lastChecked=2017-12-12T20:54:55.130Z}}), createdExternalDatabases=[], repository='Optional.absent()', repositoryKeyUrl='Optional.absent()', enableEnterpriseTrial=Optional.of(false), unlimitedJce=Optional.absent(), krbAdminUsername='Optional.absent()', javaInstallationStrategy='AUTO', tunnelingRequired=false, cmVersion=Optional.absent()}}, PluggableComputeInstance{ipAddress=192.168.58.68, delegate=null, hostEndpoints=[HostEndpoint{hostAddressString='192.168.58.68', hostAddress=Optional.of(/192.168.58.68)}, HostEndpoint{hostAddressString='ip-192-168-58-68.ec2.internal', hostAddress=Optional.absent()}, HostEndpoint{hostAddressString='52.90.79.90', hostAddress=Optional.of(/52.90.79.90)}, HostEndpoint{hostAddressString='ec2-52-90-79-90.compute-1.amazonaws.com', hostAddress=Optional.absent()}]} Instance{virtualInstance=VirtualInstance{id='1dcfc459-bb3e-4812-be95-e0e7eebc2e23', template=InstanceTemplate{name='cloudera-Non-Spot', type='m4.large', image='ami-185a260e', rackId='/default', bootstrapScriptsArePresent=false, config={subnetId=subnet-b0e8de9d, ebsOptimized=false, tenancy=default, rootVolumeSizeGB=25, ebsVolumeCount=1, 
enableEbsEncryption=false, blockDurationMinutes=60, rootVolumeType=gp2, ebsVolumeSizeGiB=25, useSpotInstances=false, ebsVolumeType=gp2, securityGroupsIds=sg-d37be5b7, spotBidUSDPerHr=0.1229}, tags={}, normalizeInstance=true, sshUsername=Optional.of(ec2-user), sshHostKeyRetrievalType=NONE}}, capabilities=Optional.of(Capabilities{operatingSystemType=REDHAT_COMPATIBLE, operatingSystemVersion=REDHAT_COMPATIBLE_7, virtualizationType=HARDWARE_ASSISTED, packageManager=Optional.of(YUM), javaVendor=Optional.absent(), javaVersion=Optional.absent(), pythonVersion=Optional.of(2.7.5), passwordlessSudoEnabled=true, selinuxEnabled=true, iptablesEnabled=false, dnsConfigured=true, fqdn=Optional.of(ip-192-168-58-68.lavastorm.com), clouderaManagerAgentInstalled=false, customScriptPaths={PREPARE_UNMOUNTED_VOLUMES=/var/lib/cloudera-director-plugins/aws-provider-1.4.4/etc/prepare_unmounted_volumes}}), cmHostId=Optional.of(004e3678-a159-4358-bbab-1a389ed09e2a), cmHostUrl=Optional.absent(), hostKeyFingerprints=[], validationConditions=[], state=InstanceState{status=RUNNING, lastReported=2017-12-12T20:54:55.130Z, lastChecked=2017-12-12T20:54:55.130Z}}, Optional.of(true), true, 5, 22], jobContext=JobContext{callCountAtThisStackLevel=0, pipelineHandle='b393c4ec-2791-464f-a1bb-bd15744ec532', callStack=CallStack{items=[Item{className='com.cloudera.launchpad.api.jobs.DefaultBootstrapDeploymentJob', callCount=7}, Item{className='com.cloudera.launchpad.bootstrap.deployment.BootstrapClouderaManager', callCount=11}, Item{className='com.cloudera.launchpad.bootstrap.deployment.BootstrapClouderaManager.InstallManagementServices', callCount=2}, Item{className='com.cloudera.launchpad.bootstrap.cluster.BootstrapClouderaManagerAgent', callCount=0}], size=4, parent=Optional.absent()}, stackLevel=4}, errorInfo=ErrorInfo{code=CM_AGENT_INSTALLATION_FAIL, properties={instanceIpAddress=192.168.58.68, retryCount=5}, causes=[]}}
[2017-12-12 21:18:46.532 +0000] ERROR [p-bd15744ec532-DefaultBootstrapDeploymentJob] c52f762e-479e-42c8-be1c-1ee4c0e1aa54 POST /api/v10/environments/cloudera1/deployments com.cloudera.launchpad.bootstrap.cluster.BootstrapClouderaManagerAgent$WaitForSuccessOrRetryOnFailure - c.c.l.p.DatabasePipelineRunner: Pipeline 'b393c4ec-2791-464f-a1bb-bd15744ec532' failed
[2017-12-12 21:18:46.538 +0000] INFO  [p-bd15744ec532-DefaultBootstrapDeploymentJob] c52f762e-479e-42c8-be1c-1ee4c0e1aa54 POST /api/v10/environments/cloudera1/deployments com.cloudera.launchpad.bootstrap.cluster.BootstrapClouderaManagerAgent$WaitForSuccessOrRetryOnFailure - c.c.l.p.s.PipelineRepositoryService: Pipeline 'b393c4ec-2791-464f-a1bb-bd15744ec532': RUNNING -> ERROR
[2017-12-12 21:18:48.692 +0000] ERROR [p-e84bb5055363-DefaultBootstrapClusterJob] db1ee82c-f036-40c7-87aa-51e51f50c18b POST /api/v10/environments/cloudera1/deployments/ClouderaManager/clusters com.cloudera.launchpad.api.jobs.DefaultBootstrapClusterJob$WaitUntilDeploymentIsReady - c.c.l.pipeline.util.PipelineRunner: Attempt to execute job failed
[2017-12-12 21:18:48.692 +0000] ERROR [p-e84bb5055363-DefaultBootstrapClusterJob] db1ee82c-f036-40c7-87aa-51e51f50c18b POST /api/v10/environments/cloudera1/deployments/ClouderaManager/clusters com.cloudera.launchpad.api.jobs.DefaultBootstrapClusterJob$WaitUntilDeploymentIsReady - c.c.l.p.DatabasePipelineRunner: Encountered an unrecoverable error: JobError{jobClassName=com.cloudera.launchpad.api.jobs.DefaultBootstrapClusterJob$WaitUntilDeploymentIsReady, jobArguments=[Environment{name='cloudera1', provider=InstanceProviderConfig{type='aws'}, credentials=SshCredentials{username='lavastorm', hasPassword=false, hasPrivateKey=true, hasPassphrase=false, port=22, hostKeyFingerprint=Optional.absent(), bastionHost=Optional.absent()}}, {}, DeploymentTemplate{name='ClouderaManager', managerVirtualInstance=Optional.of(VirtualInstance{id='1dcfc459-bb3e-4812-be95-e0e7eebc2e23', template=InstanceTemplate{name='cloudera-Non-Spot', type='m4.large', image='ami-185a260e', rackId='/default', bootstrapScriptsArePresent=false, config={subnetId=subnet-b0e8de9d, ebsOptimized=false, tenancy=default, rootVolumeSizeGB=25, ebsVolumeCount=1, enableEbsEncryption=false, blockDurationMinutes=60, rootVolumeType=gp2, ebsVolumeSizeGiB=25, useSpotInstances=false, ebsVolumeType=gp2, securityGroupsIds=sg-d37be5b7, spotBidUSDPerHr=0.1229}, tags={}, normalizeInstance=true, sshUsername=Optional.of(ec2-user), sshHostKeyRetrievalType=NONE}}), externalDatabaseTemplates={}, externalDatabases={}, configs={}, externalAccounts={}, hostname='Optional.absent()', port=Optional.absent(), username='Optional.of(admin)', tlsEnabled=Optional.absent(), tlsConfigurationProperties={}, repository='Optional.absent()', repositoryKeyUrl='Optional.absent()', enableEnterpriseTrial=Optional.of(false), unlimitedJce=Optional.absent(), krbAdminUsername='Optional.absent()', javaInstallationStrategy='AUTO', licenseIsPresent=false, billingIdIsPresent=false, numberOfPostCreateScripts=0, csds=[]}, ClusterTemplate{name='Cluster1', 
productVersions={CDH=5}, services=[HDFS, HIVE, HUE, OOZIE, SPARK_ON_YARN, YARN, ZOOKEEPER], servicesConfigs={}, virtualInstanceGroups={masters=VirtualInstanceGroup{name='masters', virtualInstances=[VirtualInstance{id='b2e0fec6-6218-4399-9561-a3173e8b8371', template=InstanceTemplate{name='cloudera-template1', type='m4.large', image='ami-02e98f78', rackId='/default', bootstrapScriptsArePresent=false, config={subnetId=subnet-b0e8de9d, ebsOptimized=false, tenancy=default, rootVolumeSizeGB=25, ebsVolumeCount=1, enableEbsEncryption=false, rootVolumeType=gp2, instanceNamePrefix=director, ebsVolumeSizeGiB=25, useSpotInstances=false, ebsVolumeType=gp2, securityGroupsIds=sg-d37be5b7}, tags={}, normalizeInstance=true, sshUsername=Optional.of(centos), sshHostKeyRetrievalType=NONE}}], serviceTypeToRoleTypes={HIVE=[HIVEMETASTORE, HIVESERVER2], HDFS=[NAMENODE, SECONDARYNAMENODE, BALANCER], OOZIE=[OOZIE_SERVER], HUE=[HUE_SERVER], ZOOKEEPER=[SERVER], YARN=[RESOURCEMANAGER, JOBHISTORY], SPARK_ON_YARN=[SPARK_YARN_HISTORY_SERVER]}, roleTypesConfigs={}, minCount=1}, workers=VirtualInstanceGroup{name='workers', virtualInstances=[VirtualInstance{id='cba2dea6-9788-470d-acf5-1b32f53340c2', template=InstanceTemplate{name='cloudera-template1', type='m4.large', image='ami-02e98f78', rackId='/default', bootstrapScriptsArePresent=false, config={subnetId=subnet-b0e8de9d, ebsOptimized=false, tenancy=default, rootVolumeSizeGB=25, ebsVolumeCount=1, enableEbsEncryption=false, rootVolumeType=gp2, instanceNamePrefix=director, ebsVolumeSizeGiB=25, useSpotInstances=false, ebsVolumeType=gp2, securityGroupsIds=sg-d37be5b7}, tags={}, normalizeInstance=true, sshUsername=Optional.of(centos), sshHostKeyRetrievalType=NONE}}], serviceTypeToRoleTypes={HDFS=[DATANODE], YARN=[NODEMANAGER]}, roleTypesConfigs={}, minCount=1}, gateway=VirtualInstanceGroup{name='gateway', virtualInstances=[VirtualInstance{id='9780aea8-589a-448e-857f-a0b0b22dd18d', template=InstanceTemplate{name='cloudera-template1', type='m4.large', 
image='ami-02e98f78', rackId='/default', bootstrapScriptsArePresent=false, config={subnetId=subnet-b0e8de9d, ebsOptimized=false, tenancy=default, rootVolumeSizeGB=25, ebsVolumeCount=1, enableEbsEncryption=false, rootVolumeType=gp2, instanceNamePrefix=director, ebsVolumeSizeGiB=25, useSpotInstances=false, ebsVolumeType=gp2, securityGroupsIds=sg-d37be5b7}, tags={}, normalizeInstance=true, sshUsername=Optional.of(centos), sshHostKeyRetrievalType=NONE}}], serviceTypeToRoleTypes={HDFS=[GATEWAY], HIVE=[GATEWAY], YARN=[GATEWAY], SPARK_ON_YARN=[GATEWAY]}, roleTypesConfigs={}, minCount=1}}, externalDatabaseTemplates={}, externalDatabases={}, parcelRepositories=[http://archive.cloudera.com/cdh5/parcels/5.13/, http://archive.cloudera.com/kafka/parcels/3.0/], restartClusterOnUpdate=false, redeployClientConfigsOnUpdate=false, numberOfInstancePostCreateScripts=0, numberOfPostCreateScripts=0, numberOfPreTerminateScripts=0, migrations=0}], jobContext=JobContext{callCountAtThisStackLevel=0, pipelineHandle='3d0e4664-41ab-47b5-9119-e84bb5055363', callStack=CallStack{items=[Item{className='com.cloudera.launchpad.api.jobs.DefaultBootstrapClusterJob', callCount=8}], size=1, parent=Optional.absent()}, stackLevel=1}, errorInfo=ErrorInfo{code=CLUSTER_DEPLOYMENT_IN_WRONG_STAGE, properties={currentStage=BOOTSTRAP_FAILED, deploymentName=ClouderaManager, environmentName=cloudera1}, causes=[]}}
[2017-12-12 21:18:48.693 +0000] ERROR [p-e84bb5055363-DefaultBootstrapClusterJob] db1ee82c-f036-40c7-87aa-51e51f50c18b POST /api/v10/environments/cloudera1/deployments/ClouderaManager/clusters com.cloudera.launchpad.api.jobs.DefaultBootstrapClusterJob$WaitUntilDeploymentIsReady - c.c.l.p.DatabasePipelineRunner: Pipeline '3d0e4664-41ab-47b5-9119-e84bb5055363' failed
[2017-12-12 21:18:48.698 +0000] INFO  [p-e84bb5055363-DefaultBootstrapClusterJob] db1ee82c-f036-40c7-87aa-51e51f50c18b POST /api/v10/environments/cloudera1/deployments/ClouderaManager/clusters com.cloudera.launchpad.api.jobs.DefaultBootstrapClusterJob$WaitUntilDeploymentIsReady - c.c.l.p.s.PipelineRepositoryService: Pipeline '3d0e4664-41ab-47b5-9119-e84bb5055363': RUNNING -> ERROR
It seems the Manager installs, and I can even see the Director log attempting to connect to the Agent and failing (whilst the Agent is installing), then connecting on port 7180. I really have no idea why this is failing and have tried EVERY solution I have found, including checking that hostname and hostname -f match.
This is in AWS running Director 2.6.1. Deploying into an existing VPC and Subnet. No SELinux/IPTables on the Manager Box. The issues seem to be between the Agent and the Manager on the same server.
Any advice would be greatly appreciated.
Cheers
Andy
Created 01-02-2018 02:40 AM
This issue was down to a missing reverse DNS lookup zone for the subnet the Cloudera environment was being deployed to. Once the reverse lookup zone was created and the correct entries were added, everything succeeded.
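For anyone hitting the same thing, a forward/reverse lookup round trip is a quick way to spot a missing reverse zone. The following is only a rough sketch in Python, not anything Director or Cloudera Manager runs itself. The function name check_dns_roundtrip and its injectable resolver parameters are made up for illustration. It resolves a hostname to an IP, resolves that IP back to a name, and reports whether the two names agree.

```python
import socket


def check_dns_roundtrip(hostname,
                        resolve=socket.gethostbyname,
                        reverse=socket.gethostbyaddr):
    """Resolve hostname -> IP -> hostname and report whether the names agree.

    The resolver functions are parameters (a hypothetical convenience) so the
    check can be exercised without a live DNS server.
    """
    result = {"hostname": hostname, "ip": None, "reverse_name": None, "ok": False}
    try:
        result["ip"] = resolve(hostname)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        result["error"] = "forward lookup failed: %s" % exc
        return result
    try:
        # gethostbyaddr returns (primary_name, alias_list, ip_list)
        result["reverse_name"] = reverse(result["ip"])[0]
    except OSError as exc:  # socket.herror is a subclass of OSError
        result["error"] = "reverse lookup failed: %s" % exc
        return result
    # Compare case-insensitively and ignore a trailing dot on either name.
    result["ok"] = (result["reverse_name"].lower().rstrip(".")
                    == hostname.lower().rstrip("."))
    return result


# Example use on the Manager host (performs real DNS lookups):
#   print(check_dns_roundtrip(socket.getfqdn()))
```

With no reverse zone in place, the second lookup would be expected to fail or to return a different name (for example an .ec2.internal name), so "ok" comes back False.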
Created 12-19-2017 10:09 AM
Andy,
The best place to look is the agent install logs and the agent logs.
/tmp/scm_prepare_node.<Unique ID>
/var/log/cloudera-scm-agent
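If the agent log is noisy, it can help to count the distinct ERROR messages rather than tailing it. Below is a minimal sketch in Python that assumes the line layout seen in the agent log excerpt earlier in this thread (timestamp in brackets, PID, thread, module, level, message). The regular expression and the summarize_errors helper are illustrative only, not part of any Cloudera tooling.

```python
import re
from collections import Counter

# Assumed layout, based on the excerpt in this thread:
# [12/Dec/2017 16:15:42 +0000] 14032 MainThread downloader ERROR <message>
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?P<pid>\d+)\s+(?P<thread>\S+)\s+"
    r"(?P<module>\S+)\s+(?P<level>[A-Z]+)\s+(?P<msg>.*)$"
)


def summarize_errors(lines):
    """Count ERROR entries, keyed by (module, message)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        # Lines that do not fit the assumed layout are skipped.
        if match and match.group("level") == "ERROR":
            counts[(match.group("module"), match.group("msg"))] += 1
    return counts


# Example use (the log file name inside the directory is an assumption):
#   with open("/var/log/cloudera-scm-agent/cloudera-scm-agent.log") as fh:
#       for (module, msg), n in summarize_errors(fh).most_common():
#           print(n, module, msg)
```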
You should also check that your security group allows full access from other cluster instances (e.g., from other instances in the same security group).
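One rough way to test the security group from another instance is a plain TCP connect against the Manager. The sketch below uses Python's standard socket module. The helper names port_open and check_ports are made up for illustration. Port 7180 is the Manager port shown in the Director log above, while 7182 is assumed here to be the port agents use to reach the server, so adjust the list to match your configuration.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, or name not resolved
        return False


def check_ports(host, ports, probe=port_open):
    """Probe each port and return a {port: reachable} mapping."""
    return {port: probe(host, port) for port in ports}


# Example use from another instance in the same security group.
# 7182 is an assumption; 192.168.58.68 is the Manager address from the logs:
#   print(check_ports("192.168.58.68", [7180, 7182]))
```

A False result only shows that the connection did not succeed. It does not distinguish a blocked security group from a service that is simply not listening yet.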
It also looks like you are using custom DNS, but I still see the .ec2.internal addresses in the HostEndpoint list. If you've set up your DHCP Option Set to point to your own DNS server then you should disable DNS Hostnames and/or DNS Resolution on your VPC.
David