HDC cloud failed on WaitCondition

Expert Contributor

I am trying to launch HDC cloud on AWS. I have tried three or four times and it has failed consistently.

  • Using existing VPC and subnet
  • Using existing RDS
  • Using existing key pair.
  • It launches the initial EC2 instance, but I was unable to log in to that instance to see the logs. The stack waits for 10 hours and then fails.
    "InstanceWaitCondition" : {
       "Type" : "AWS::CloudFormation::WaitCondition",
       "DependsOn" : "Cloudbreak",
       "Properties" : {
          "Handle"  : { "Ref" : "InstanceWaitHandle" },
          "Timeout" : 36000
       }
    },


I am not sure how to debug the problem and find the root cause. Any input is highly appreciated.

14 REPLIES

Expert Contributor

hdc-test.pdf

Attaching the CloudFormation status page from the AWS console.

Hi @Anandha L Ranganathan

Do your networking settings allow the RDS to communicate with the cloud controller? See

https://docs.hortonworks.com/HDPDocuments/HDCloudAWS/HDCloudAWS-1.16.0/bk_hdcloud-aws/content/troubl...

There should be a more explicit error message if the connection to RDS could not be established, but I am just making sure.

Which version of HDCloud are you using?

Expert Contributor
-----> main entry point
-----> restoring motd
-----> retrieving metadata
-----> retrieving region
-----> get profile attribute from cfn metadata of logical resource: Cloudbreak
-----> public ip is: ec2-34-212-149-137.us-west-2.compute.amazonaws.com
-----> using hostname
-----> using existing VPC: true
-----> described internet gateway: igw-ee79d68b
-----> fill Profile
-----> wait for docker ...
-----> check RDS-----> using RDS: hdc-test.cluster-czvrt6ojpbos.us-west-2.rds.amazonaws.com:3306
-----> checking RDS connectivity
/var/lib/cloud/instance/scripts/part-001: line 287: 3: required
-----> installation failed: ERROR: command 'declare host=${1:?required} port=${2:?required} user=${3:?required} password=${4:?required} dbname=${5:?required}' exited with status: 1 line: 1
Error signaling CloudFormation: [Errno None] ('Connection aborted.', gaierror(-3, 'Temporary failure in name resolution'))
  • It seems to have a problem connecting to RDS. The RDS instance and the EC2 instance are in the same VPC, so I am not sure what is missing. Do the security groups need to be modified for RDS?
  • We are using HDC version 1.16.0.
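
A note on reading the log above: in bash, `${3:?required}` aborts with the message `3: required` when the third positional argument is unset or empty. In the `declare` statement shown, the third argument is `user`, so the script appears to have stopped on a missing RDS user name rather than on a failed connection. The `Temporary failure in name resolution` line is a separate error raised while signaling CloudFormation. A minimal Python analogue of that check, as a sketch only (the function name and the sample values are hypothetical, not taken from the Cloudbreak scripts):

```python
def first_missing_argument(args, names=("host", "port", "user", "password", "dbname")):
    """Return (position, name) of the first unset or empty argument, or None.

    Mirrors bash's ${N:?required}, which fails on an unset OR empty value.
    Positions are 1-based, like bash positional parameters.
    """
    for position, name in enumerate(names, start=1):
        value = args[position - 1] if position <= len(args) else None
        if value is None or value == "":
            return position, name
    return None


# Hypothetical call in which the user name was passed as an empty string.
print(first_missing_argument(["db.example.internal", "3306", "", "secret", "cbdb"]))
# prints (3, 'user')
```

If this reading is right, it may be worth checking that the RDS user name parameter was filled in when the stack was created.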

Hi @Anandha L Ranganathan

See step 4 in https://docs.hortonworks.com/HDPDocuments/HDCloudAWS/HDCloudAWS-1.16.0/bk_hdcloud-aws/content/advanc... in case these guidelines help.

I always (1) use the same VPC for RDS and cloud controller and (2) I initially set the Inbound rule on the security group to "0.0.0.0/0" just to avoid any connection errors (since we don't know the IP address of the cloud controller at the point when we create RDS and define the security group settings).

What do you think @Tamas Bihari ?

Rising Star

Hi @Anandha L Ranganathan,

From the last error, I think @Dominika Bialek is right and the HDC Controller could not connect to the RDS service due to network or security rule limitations.

On the other hand, could you please try the https://aws.amazon.com/marketplace/pp/B01LXOQBOU?qid=1499963967598&sr=0-2&ref_=srh_res_product_title templates? From the attached PDF and the logs in the last comment, it looks like you are still on version 1.14.4.

You could also check from the controller that the RDS service is reachable by running the following commands.

Check the domain name:

nslookup hdc-test.cluster-czvrt6ojpbos.us-west-2.rds.amazonaws.com

Check the port is open on the specified machine:

telnet hdc-test.cluster-czvrt6ojpbos.us-west-2.rds.amazonaws.com 3306

Br, Tamas
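
The two checks above (name resolution, then an open TCP port) can also be scripted, which helps when comparing several instances. This is a minimal sketch using only the Python standard library. The function name and the returned labels are hypothetical, and it assumes Python 3 is available on the instance:

```python
import socket


def check_endpoint(host, port, timeout=5.0):
    """Return 'dns-failure', 'port-unreachable', or 'ok' for host:port."""
    try:
        # Same question nslookup answers: does the name resolve at all?
        socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns-failure"
    try:
        # Same question telnet answers: does a TCP connection succeed?
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return "port-unreachable"
```

For example, `check_endpoint("hdc-test.cluster-czvrt6ojpbos.us-west-2.rds.amazonaws.com", 3306)` run on the controller separates a DNS problem from a blocked port.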

Rising Star

Hi @Anandha L Ranganathan

Could you please check what @Dominika Bialek has recommended?

On the other hand, if you create a new deployment through the CloudFormation wizard, please set Options -> Advanced -> "Roll back on failure" to false. CloudFormation then won't roll back the resources when something fails, and you will be able to SSH to your instance and check the deployment logs by running "cbd logs" in the folder "/var/lib/cloudbreak-deployment". If you have created an HDC deployment with this setting, please attach those logs along with the output of the "cbd ps" command.

Thanks,

Tamas

Expert Contributor

Thanks @Dominika Bialek and @Tamas Bihari for your feedback.

We have identified the root cause: it seems the instance is unable to resolve DNS names.

  • We were unable to ping other AWS instances in the same VPC, and traceroute to public websites also failed.
[root@ip-172-17-245-9 ~]# traceroute google.com
google.com: Temporary failure in name resolution


  • We added nameserver 8.8.8.8 to /etc/resolv.conf and were then able to reach the external world, but we are still unable to ping other AWS instances.
 [root@ip-172-17-245-9 ~]# traceroute google.com
traceroute to google.com (216.58.193.78), 30 hops max, 60 byte packets
 1  * * *
 2  ec2-50-112-0-108.us-west-2.compute.amazonaws.com (50.112.0.108)  18.287 ms ec2-50-112-0-106.us-west-2.compute.amazonaws.com (50.112.0.106)  15.908 ms  15.802 ms


  • We are using an existing VPC and subnet. We use the same CIDR (defined by our IT team) to launch any instance using CloudFormation.
  • We tested the same in other AWS instances and everything works fine. From those instances we are also able to telnet to the RDS instances (Postgres DB), but we are unable to telnet from instances launched by CF, even though they have the same nameserver in /etc/resolv.conf.
  • This is the CF template we are using: https://s3.amazonaws.com/awsmp-fulfillment-cf-templates-prod/571fb43d-99f6-4182-8166-61c477473f09.18094323-91c0-4666-9c99-75891fb64424.template
  • Is there a way to see the template history and the issues raised against that template in the source repository? Any other pointers are highly appreciated.
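
When comparing /etc/resolv.conf between a working instance and one launched by CF, a small script can make differences easier to spot than reading the files by eye. The following is only a sketch: the function names are hypothetical, and the sample file contents (including the 172.16.0.2 resolver address) are made up for illustration.

```python
def nameservers(resolv_conf_text):
    """Return the nameserver addresses in resolv.conf-style text, in file order."""
    servers = []
    for raw in resolv_conf_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith(("#", ";")):
            continue
        fields = line.split()
        if fields[0] == "nameserver" and len(fields) >= 2:
            servers.append(fields[1])
    return servers


def only_in_one(text_a, text_b):
    """Return nameservers that appear in exactly one of the two files, sorted."""
    return sorted(set(nameservers(text_a)) ^ set(nameservers(text_b)))


# Hypothetical file contents; 172.16.0.2 is a made-up VPC resolver address.
working = "search us-west-2.compute.internal\nnameserver 172.16.0.2\n"
failing = "# edited by hand\nnameserver 172.16.0.2\nnameserver 8.8.8.8\n"
print(only_in_one(working, failing))
# prints ['8.8.8.8']
```

An empty result means both files list the same resolvers, which would point away from resolv.conf and toward routing between the instance and the resolver.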

Rising Star

Hi @Anandha L Ranganathan,

Maybe the best way to debug this issue is to create a deployment with the Options -> Advanced -> Rollback on failure option set to false. In this case the deployment can be deleted manually in the CloudFormation service of AWS after debugging has finished. This way you can check the applied CF template and the created events and resources in the CloudFormation service.

As far as I can see in the referenced template, we only use references to resources when creating the CloudFormation template, except for the public route. That rule is dedicated to allowing outgoing connections from the created cbd instance. But probably in your specific network setup that part is not working as we expected, so please check the route table and its rules. I guess there could be rules that block the outgoing connections, or the CloudFormation template references the wrong gateway and route table in your VPC.

    "VPC" : {
      "Type" : "AWS::EC2::VPC",
      "Properties" : {
        "CidrBlock" : "10.0.0.0/16",
        "EnableDnsSupport" : "true",
        "EnableDnsHostnames" : "true",
        "Tags" : [
          { "Key" : "Application", "Value" : { "Ref" : "AWS::StackId" } }
        ]
      }
    },


    "PublicSubnet" : {
      "Type" : "AWS::EC2::Subnet",
      "Properties" : {
        "MapPublicIpOnLaunch" : true,
        "VpcId" : { "Ref" : "VPC" },
        "CidrBlock" : "10.0.0.0/24",
        "Tags" : [
          { "Key" : "Application", "Value" : { "Ref" : "AWS::StackId" } }
        ]
      }
    },


    "InternetGateway" : {
      "Type" : "AWS::EC2::InternetGateway",
      "Properties" : {
        "Tags" : [
          { "Key" : "Application", "Value" : { "Ref" : "AWS::StackId" } }
        ]
      }
    },


    "AttachGateway" : {
       "Type" : "AWS::EC2::VPCGatewayAttachment",
       "Properties" : {
         "VpcId" : { "Ref" : "VPC" },
         "InternetGatewayId" : { "Ref" : "InternetGateway" }
       }
    },


    "PublicRouteTable" : {
      "Type" : "AWS::EC2::RouteTable",
      "Properties" : {
        "VpcId" : { "Ref" : "VPC" },
        "Tags" : [
          { "Key" : "Application", "Value" : { "Ref" : "AWS::StackId" } }
        ]
      }
    },


    "PublicRoute" : {
      "Type" : "AWS::EC2::Route",
      "DependsOn" : [ "PublicRouteTable", "AttachGateway" ],
      "Properties" : {
        "RouteTableId" : { "Ref" : "PublicRouteTable" },
        "DestinationCidrBlock" : "0.0.0.0/0",
        "GatewayId" : { "Ref" : "InternetGateway" }
      }
    },


    "PublicSubnetRouteTableAssociation" : {
      "Type" : "AWS::EC2::SubnetRouteTableAssociation",
      "Properties" : {
        "SubnetId" : { "Ref" : "PublicSubnet" },
        "RouteTableId" : { "Ref" : "PublicRouteTable" }
      }
    },
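
To check the same wiring in the template you are actually running, the `Resources` section can be inspected programmatically. The sketch below assumes the template has been loaded into a Python dict (for example with `json.load`). The function names are hypothetical, and it only reads the template; it does not query the live VPC.

```python
def _ref(value):
    """Return the logical name behind {"Ref": name}, or the value itself."""
    return value.get("Ref") if isinstance(value, dict) else value


def default_route_report(resources):
    """Map each associated subnet to the gateway of its 0.0.0.0/0 route, or None."""
    # First pass: which route tables have a default route, and via which gateway?
    gateways = {}
    for res in resources.values():
        props = res.get("Properties", {})
        if (res.get("Type") == "AWS::EC2::Route"
                and props.get("DestinationCidrBlock") == "0.0.0.0/0"):
            gateways[_ref(props.get("RouteTableId"))] = _ref(props.get("GatewayId"))
    # Second pass: which subnets are associated with those route tables?
    report = {}
    for res in resources.values():
        if res.get("Type") == "AWS::EC2::SubnetRouteTableAssociation":
            props = res.get("Properties", {})
            table = _ref(props.get("RouteTableId"))
            report[_ref(props.get("SubnetId"))] = gateways.get(table)
    return report
```

For the fragment above this should report `PublicSubnet` as routed through `InternetGateway`. A subnet mapped to `None` has an association but no default route in the template.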



Expert Contributor

@Tamas Bihari How do I see the applied CF template? I turned off the rollback option to debug. Do you know where the template is stored on the instance?

Another point: in our AWS test account the VPC CIDR is 172.16.0.0/x, but in your example above the CIDR block is 10.0.0.0/16. Could that be causing the public gateway problem?
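
On the CIDR question, it can help to see which address ranges the failing instance actually falls into. The sketch below uses the address from the shell prompt earlier in the thread (`ip-172-17-245-9`). The list of candidate ranges is an assumption made for illustration. In particular, Docker's default bridge network is 172.17.0.0/16 and the install log mentions Docker, but whether this image keeps that default is not confirmed anywhere in the thread.

```python
import ipaddress


def containing_networks(address, networks):
    """Return the names of the networks that contain the given IP address."""
    ip = ipaddress.ip_address(address)
    return [name for name, cidr in networks.items()
            if ip in ipaddress.ip_network(cidr)]


candidates = {
    "template example VPC": "10.0.0.0/16",      # from the template fragment in the thread
    "private 172.16/12 range": "172.16.0.0/12",
    # Assumption: only relevant if the image keeps Docker's default bridge range.
    "Docker default bridge": "172.17.0.0/16",
}
print(containing_networks("172.17.245.9", candidates))
# prints ['private 172.16/12 range', 'Docker default bridge']
```

If the subnet does overlap a range used locally on the instance, that could be worth ruling out as a cause of the routing symptoms, for example by comparing the output of `ip route` on the instance with the VPC subnet.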

Explorer

The public gateway has to be added to the routing table. Check your routing tables and see if it exists (it probably won't be created by the template if you are using an existing VPC). Please see the "Enabling Internet Access" section of this AWS documentation page: http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Internet_Gateway.html

Other things that may help:

Egress only gateway: http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/egress-only-internet-gateway.html

DNS resolution: http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/vpc-dns.html

Explorer

@Anandha L Ranganathan You can find the CF template on the Cloudformation view if you select a stack and choose the template tab. Just to be sure: Did you try to launch instances in the same vpc / subnet as the instances launched by the cloudbreak? From these instances were you able to telnet to RDS instances? Were you able to ping external world and other aws instances in the same vpc / subnet?

Explorer

It looks like there is definitely a routing issue and/or a network ACL that needs to be added or changed. It may be worth asking AWS support why hosts in different subnets are having trouble communicating. I’ve run into similar situations in AWS when manually setting up services to talk to EMR clusters. Once the cause is found, find out how to set this in your template for future use (I am not that familiar with configuring the templates yet).

@Anandha L Ranganathan Were you able to resolve the issue? I see in the other post that you managed to get to the create cluster stage? What did you do to solve the problem?

Expert Contributor

@Dominika Bialek I am still unable to launch Cloudbreak in the Oregon region. I was able to launch Cloudbreak successfully in Virginia, but our Virginia region doesn't have a proper VPC and subnet setup. In the Oregon region we have multiple subnets (application, public, and mgmt). I tried all three subnets but failed to install Cloudbreak in each. We checked other systems in the same region and availability zone side by side, and we haven't found any differences between the instances launched by Cloudbreak and our own systems. I checked with our IT security team, and everything looks good in the routing table, subnet, and other components. We couldn't figure out the problem.