HDC cloud failed on WaitCondition

Expert Contributor

I am trying to launch HDC cloud on AWS. I have tried three or four times and it fails consistently.

  • Using an existing VPC and subnet
  • Using an existing RDS database
  • Using an existing EC2 key pair
  • The stack launches the initial EC2 instance, but I was unable to log in to that instance to see the logs. The stack waits for 10 hours on the wait condition below and then fails.
    "InstanceWaitCondition" : {
       "Type" : "AWS::CloudFormation::WaitCondition",
       "DependsOn" : "Cloudbreak",
       "Properties" : {
          "Handle"  : { "Ref" : "InstanceWaitHandle" },
          "Timeout" : 36000
       }
    },
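
The 10-hour wait matches the Timeout value, which CloudFormation interprets in seconds for a WaitCondition. A minimal sketch of the conversion, with the fragment from the post wrapped in braces so that it parses as JSON (the function name is only illustrative):

```python
import json

# Fragment copied from the post above, wrapped in braces so it is valid JSON.
FRAGMENT = """
{
  "InstanceWaitCondition": {
    "Type": "AWS::CloudFormation::WaitCondition",
    "DependsOn": "Cloudbreak",
    "Properties": {
      "Handle": {"Ref": "InstanceWaitHandle"},
      "Timeout": 36000
    }
  }
}
"""


def wait_timeout_hours(template_fragment: str, logical_id: str) -> float:
    """Return the WaitCondition timeout in hours (Timeout is given in seconds)."""
    resources = json.loads(template_fragment)
    seconds = int(resources[logical_id]["Properties"]["Timeout"])
    return seconds / 3600


if __name__ == "__main__":
    print(wait_timeout_hours(FRAGMENT, "InstanceWaitCondition"))
```

So the stack is not hanging for an arbitrary time: it waits until the instance signals the wait handle or the 36000 seconds run out, whichever comes first.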


I am not sure how to debug this and find the root cause. Any input is highly appreciated.

14 REPLIES

Re: HDC cloud failed on WaitCondition

Expert Contributor

hdc-test.pdf

Attaching the CloudFormation status page from the AWS console.

Re: HDC cloud failed on WaitCondition

Hi @Anandha L Ranganathan

Do your networking settings allow the RDS to communicate with the cloud controller? See

https://docs.hortonworks.com/HDPDocuments/HDCloudAWS/HDCloudAWS-1.16.0/bk_hdcloud-aws/content/troubl...

There should be a more explicit error message if the connection to RDS could not be established, but I am just making sure.

Which version of HDCloud are you using?

Re: HDC cloud failed on WaitCondition

Expert Contributor
-----> main entry point
-----> restoring motd
-----> retrieving metadata
-----> retrieving region
-----> get profile attribute from cfn metadata of logical resource: Cloudbreak
-----> public ip is: ec2-34-212-149-137.us-west-2.compute.amazonaws.com
-----> using hostname
-----> using existing VPC: true
-----> described internet gateway: igw-ee79d68b
-----> fill Profile
-----> wait for docker ...
-----> check RDS-----> using RDS: hdc-test.cluster-czvrt6ojpbos.us-west-2.rds.amazonaws.com:3306
-----> checking RDS connectivity
/var/lib/cloud/instance/scripts/part-001: line 287: 3: required
-----> installation failed: ERROR: command 'declare host=${1:?required} port=${2:?required} user=${3:?required} password=${4:?required} dbname=${5:?required}' exited with status: 1 line: 1
Error signaling CloudFormation: [Errno None] ('Connection aborted.', gaierror(-3, 'Temporary failure in name resolution'))
  • It seems to have a problem connecting to RDS. The RDS instance and the EC2 instance are in the same VPC, so I am not sure what is missing. Do the security groups for RDS need to be modified?
  • We are using HDC version 1.16.0.
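
One detail in the log above is worth reading closely. In bash, `${3:?required}` aborts with the message `3: required` when the third positional argument is unset or empty, so the `line 287: 3: required` line says which argument of the RDS check was missing. A small sketch that maps the message back to the variable name, using only the two log lines quoted above (the helper name is hypothetical):

```python
import re

# Both strings are copied from the log output in the post above.
LOG_ERROR = "/var/lib/cloud/instance/scripts/part-001: line 287: 3: required"
DECLARE_CMD = (
    "declare host=${1:?required} port=${2:?required} user=${3:?required} "
    "password=${4:?required} dbname=${5:?required}"
)


def missing_parameter(error_line: str, declare_cmd: str) -> str:
    """Return the variable whose positional argument triggered ':?required'."""
    # Map each positional index (1, 2, ...) to the variable it should fill.
    names = {
        int(position): name
        for name, position in re.findall(r"(\w+)=\$\{(\d+):\?", declare_cmd)
    }
    match = re.search(r": (\d+): required\s*$", error_line)
    if match is None:
        raise ValueError("no '<n>: required' message found")
    return names[int(match.group(1))]


if __name__ == "__main__":
    print(missing_parameter(LOG_ERROR, DECLARE_CMD))
```

Under this reading, the host and port were passed but the database user was empty when the check ran. Bash stops at the first failing expansion, so the log says nothing about the password or database name. This is separate from the name-resolution failure reported on the last log line.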

Re: HDC cloud failed on WaitCondition

Hi @Anandha L Ranganathan

See step 4 in https://docs.hortonworks.com/HDPDocuments/HDCloudAWS/HDCloudAWS-1.16.0/bk_hdcloud-aws/content/advanc... in case these guidelines help.

I always (1) use the same VPC for RDS and the cloud controller, and (2) initially set the inbound rule on the security group to "0.0.0.0/0" just to avoid any connection errors (since we don't know the IP address of the cloud controller at the point when we create RDS and define the security group settings).

What do you think @Tamas Bihari ?

Re: HDC cloud failed on WaitCondition

Rising Star

Hi @Anandha L Ranganathan,

From the last error, I think @Dominika Bialek is right: the HDC controller could not connect to the RDS service because of network or security rule limitations.

On the other hand, could you please try the templates at https://aws.amazon.com/marketplace/pp/B01LXOQBOU?qid=1499963967598&sr=0-2&ref_=srh_res_product_title ? From the attached PDF and the logs in the last comment, it looks like you are still on version 1.14.4.

You could also check from the controller that the RDS service is reachable by running the following commands.

Check the domain name:

nslookup hdc-test.cluster-czvrt6ojpbos.us-west-2.rds.amazonaws.com

Check the port is open on the specified machine:

telnet hdc-test.cluster-czvrt6ojpbos.us-west-2.rds.amazonaws.com 3306
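
If nslookup or telnet is not installed on the controller, the same two checks can be sketched with Python's standard socket module. This is only an illustration of the idea, not part of the HDC tooling; the function name and the 5-second timeout are assumptions.

```python
import socket


def check_endpoint(host: str, port: int, timeout: float = 5.0) -> dict:
    """Resolve host, then try a TCP connection to host:port."""
    result = {"resolved": [], "port_open": False, "error": None}
    try:
        # Step 1: name resolution (what nslookup checks).
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        result["error"] = f"name resolution failed: {exc}"
        return result
    result["resolved"] = sorted({info[4][0] for info in infos})
    try:
        # Step 2: TCP connect (what telnet checks).
        with socket.create_connection((host, port), timeout=timeout):
            result["port_open"] = True
    except OSError as exc:
        result["error"] = f"connection failed: {exc}"
    return result


# Example call, to be run on the controller instance:
# check_endpoint("hdc-test.cluster-czvrt6ojpbos.us-west-2.rds.amazonaws.com", 3306)
```

An empty "resolved" list points at DNS, while a resolved address with "port_open" set to False points at security groups, network ACLs or routing.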

Br, Tamas

Re: HDC cloud failed on WaitCondition

Rising Star

Hi @Anandha L Ranganathan

Could you please check what @Dominika Bialek has recommended?

On the other hand, if you create a new deployment through the CloudFormation wizard, please set Options -> Advanced -> "Roll back on failure" to false. CloudFormation then won't roll back the resources when something fails, and you will be able to SSH to your instance and check the deployment logs by running "cbd logs" in the folder "/var/lib/cloudbreak-deployment". If you create an HDC deployment with these settings, please attach those logs along with the output of the "cbd ps" command.
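
For stacks created through the API instead of the wizard, the equivalent of the wizard option is the DisableRollback flag of CloudFormation's CreateStack call. A rough sketch with boto3 follows. The helper names are hypothetical, and the CAPABILITY_IAM entry is an assumption (the HDC template may or may not require it); the template's own parameters have to be supplied by the caller.

```python
def stack_request(stack_name: str, template_url: str, parameters: dict) -> dict:
    """Build create_stack arguments with rollback on failure turned off."""
    return {
        "StackName": stack_name,
        "TemplateURL": template_url,
        # Keep failed resources around so the instance can be inspected.
        "DisableRollback": True,
        # Assumption: needed only if the template creates IAM resources.
        "Capabilities": ["CAPABILITY_IAM"],
        "Parameters": [
            {"ParameterKey": key, "ParameterValue": value}
            for key, value in parameters.items()
        ],
    }


def launch(stack_name: str, template_url: str, parameters: dict, region: str):
    """Create the stack; requires boto3 and AWS credentials."""
    import boto3  # third-party, imported lazily so the helper above runs without it

    client = boto3.client("cloudformation", region_name=region)
    return client.create_stack(**stack_request(stack_name, template_url, parameters))
```

A stack created this way stays in the CREATE_FAILED state after an error and has to be deleted manually once debugging is done.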

Thanks,

Tamas

Re: HDC cloud failed on WaitCondition

Expert Contributor

Thanks @Dominika Bialek and @Tamas Bihari for your feedback.

We have identified the root cause: the instance is unable to resolve DNS names.

  • We were unable to ping other AWS instances in the same VPC, or to traceroute to any public website.
[root@ip-172-17-245-9 ~]# traceroute google.com
google.com: Temporary failure in name resolution


  • We added nameserver 8.8.8.8 to /etc/resolv.conf and were then able to reach the external world, but we are still unable to ping other AWS instances.
[root@ip-172-17-245-9 ~]# traceroute google.com
traceroute to google.com (216.58.193.78), 30 hops max, 60 byte packets
 1  * * *
 2  ec2-50-112-0-108.us-west-2.compute.amazonaws.com (50.112.0.108)  18.287 ms ec2-50-112-0-106.us-west-2.compute.amazonaws.com (50.112.0.106)  15.908 ms  15.802 ms


  • We are using an existing VPC and subnet. We use the same CIDR (defined by our IT team) for every instance we launch with CloudFormation.
  • We tested the same thing on other AWS instances and everything works fine there; we can also telnet to the RDS instances (Postgres DB) from them. But we are unable to telnet from instances launched by this CF template, even though they have the same nameserver in /etc/resolv.conf.
  • This is the CF template we are using: https://s3.amazonaws.com/awsmp-fulfillment-cf-templates-prod/571fb43d-99f6-4182-8166-61c477473f09.18094323-91c0-4666-9c99-75891fb64424.template
  • Is there a way to see the history of the template, and the issues raised against it, in the source repository? Any other pointers are highly appreciated.

Re: HDC cloud failed on WaitCondition

Rising Star

Hi @Anandha L Ranganathan,

Maybe the best way to debug this issue is to create a deployment with Options -> Advanced -> "Rollback on failure" set to false. In that case the deployment will need to be deleted manually in the AWS CloudFormation service once debugging is finished. This way you can check the applied CF template and the created events and resources in the CloudFormation service.

As far as I can see in the referenced template, we only use references to resources when creating the CloudFormation template, except for the public route. That rule is dedicated to allowing outgoing connections from the created cbd instance. But in your specific network setup that part is probably not working as we expected, so please check the route table and its rules. I guess there may be rules that block outgoing connections, or CloudFormation may be referencing the wrong gateway and route table in your VPC.

    "VPC" : {
      "Type" : "AWS::EC2::VPC",
      "Properties" : {
        "CidrBlock" : "10.0.0.0/16",
        "EnableDnsSupport" : "true",
        "EnableDnsHostnames" : "true",
        "Tags" : [
          { "Key" : "Application", "Value" : { "Ref" : "AWS::StackId" } }
        ]
      }
    },


    "PublicSubnet" : {
      "Type" : "AWS::EC2::Subnet",
      "Properties" : {
        "MapPublicIpOnLaunch" : true,
        "VpcId" : { "Ref" : "VPC" },
        "CidrBlock" : "10.0.0.0/24",
        "Tags" : [
          { "Key" : "Application", "Value" : { "Ref" : "AWS::StackId" } }
        ]
      }
    },


    "InternetGateway" : {
      "Type" : "AWS::EC2::InternetGateway",
      "Properties" : {
        "Tags" : [
          { "Key" : "Application", "Value" : { "Ref" : "AWS::StackId" } }
        ]
      }
    },


    "AttachGateway" : {
       "Type" : "AWS::EC2::VPCGatewayAttachment",
       "Properties" : {
         "VpcId" : { "Ref" : "VPC" },
         "InternetGatewayId" : { "Ref" : "InternetGateway" }
       }
    },


    "PublicRouteTable" : {
      "Type" : "AWS::EC2::RouteTable",
      "Properties" : {
        "VpcId" : { "Ref" : "VPC" },
        "Tags" : [
          { "Key" : "Application", "Value" : { "Ref" : "AWS::StackId" } }
        ]
      }
    },


    "PublicRoute" : {
      "Type" : "AWS::EC2::Route",
      "DependsOn" : [ "PublicRouteTable", "AttachGateway" ],
      "Properties" : {
        "RouteTableId" : { "Ref" : "PublicRouteTable" },
        "DestinationCidrBlock" : "0.0.0.0/0",
        "GatewayId" : { "Ref" : "InternetGateway" }
      }
    },


    "PublicSubnetRouteTableAssociation" : {
      "Type" : "AWS::EC2::SubnetRouteTableAssociation",
      "Properties" : {
        "SubnetId" : { "Ref" : "PublicSubnet" },
        "RouteTableId" : { "Ref" : "PublicRouteTable" }
      }
    },
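
The chain to verify in the fragment above is: subnet -> route table association -> route table -> 0.0.0.0/0 route -> internet gateway. A minimal sketch that walks the template's resources and reports where each subnet's default route points (the function name is illustrative, and RESOURCES is trimmed to the relevant entries from the fragment):

```python
# Trimmed copy of the routing resources from the template fragment above.
RESOURCES = {
    "PublicRoute": {
        "Type": "AWS::EC2::Route",
        "Properties": {
            "RouteTableId": {"Ref": "PublicRouteTable"},
            "DestinationCidrBlock": "0.0.0.0/0",
            "GatewayId": {"Ref": "InternetGateway"},
        },
    },
    "PublicSubnetRouteTableAssociation": {
        "Type": "AWS::EC2::SubnetRouteTableAssociation",
        "Properties": {
            "SubnetId": {"Ref": "PublicSubnet"},
            "RouteTableId": {"Ref": "PublicRouteTable"},
        },
    },
}


def _ref(value):
    """Unwrap {"Ref": "X"} to "X"; leave literal IDs unchanged."""
    return value["Ref"] if isinstance(value, dict) and "Ref" in value else value


def default_route_targets(resources: dict) -> dict:
    """Map each associated subnet to the gateway of its 0.0.0.0/0 route (or None)."""
    gateways = {}
    for resource in resources.values():
        if resource.get("Type") == "AWS::EC2::Route":
            props = resource["Properties"]
            if props.get("DestinationCidrBlock") == "0.0.0.0/0":
                gateways[_ref(props["RouteTableId"])] = _ref(props.get("GatewayId"))
    targets = {}
    for resource in resources.values():
        if resource.get("Type") == "AWS::EC2::SubnetRouteTableAssociation":
            props = resource["Properties"]
            targets[_ref(props["SubnetId"])] = gateways.get(_ref(props["RouteTableId"]))
    return targets


if __name__ == "__main__":
    print(default_route_targets(RESOURCES))
```

This only inspects the template. With an existing VPC and subnet, the route table actually associated with the subnet has to be checked in the VPC console or through the EC2 API, since it is not created by these resources.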



Re: HDC cloud failed on WaitCondition

Expert Contributor

@Tamas Bihari How do I see the applied CF template? I turned off the rollback option to debug. Do you know where the template is stored on the instance?

Another point: in our AWS test account the VPC CIDR is 172.16.0.0/x, but in your example above the CIDR block is 10.0.0.0/16. Could that be causing the problem with the public gateway?
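
One way to compare the address ranges involved is Python's ipaddress module. The sketch below takes the instance address from the shell prompt earlier in the thread (ip-172-17-245-9) and tests it against a few networks. The Docker entry is an assumption: 172.17.0.0/16 is Docker's default bridge range, but the thread does not confirm whether the controller's Docker daemon uses that default, so an overlap is only something to investigate.

```python
import ipaddress

# Candidate networks; labels are illustrative.
NETWORKS = {
    "template example VPC": "10.0.0.0/16",
    "RFC 1918 172.16.0.0/12 block": "172.16.0.0/12",
    "Docker default bridge (assumption)": "172.17.0.0/16",
}


def hostname_to_ip(hostname: str) -> ipaddress.IPv4Address:
    """Convert an EC2-style private hostname such as 'ip-172-17-245-9' to an address."""
    if not hostname.startswith("ip-"):
        raise ValueError(f"unexpected hostname format: {hostname}")
    return ipaddress.IPv4Address(hostname[len("ip-"):].replace("-", "."))


def containing_networks(address: ipaddress.IPv4Address, networks: dict) -> list:
    """Return the labels of all networks that contain the address."""
    return [
        label
        for label, cidr in networks.items()
        if address in ipaddress.ip_network(cidr)
    ]


if __name__ == "__main__":
    address = hostname_to_ip("ip-172-17-245-9")
    print(address, containing_networks(address, NETWORKS))
```

The instance address is outside the template's 10.0.0.0/16 example, as expected for an existing VPC, and inside 172.17.0.0/16. If the Docker bridge on the controller does use that range, traffic to neighbouring VPC addresses could be routed to the bridge instead of the VPC, which would be worth ruling out with "ip route" on the instance.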