
Altus Director 6.2: Failure to connect to Azure instance with public IP during deployment

Rising Star

We need to enable public IPs on the master nodes due to an Azure requirement when adding instances to a load balancer pool. Outside of this requirement, our cluster is accessed via private IPs only.

Within the cluster bootstrap file, we enable public IPs for the master instance groups (a sketch of the relevant template is at the end of this post). During deployment, Director throws an error similar to the following for some, but not all, master nodes:

', errorInfo=ErrorInfo{code=INSTANCE_SSH_PORT_UNAVAILABLE, properties={sshServiceEndpoints=[BaseServiceEndpoint{hostEndpoint=HostEndpoint{hostAddressString='10.0.14.15', hostAddress=Optional.of(/10.0.14.15)}, port=Optional.absent(), url=Optional.absent()}, BaseServiceEndpoint{hostEndpoint=HostEndpoint{hostAddressString='52.232.245.168', hostAddress=Optional.of(/52.232.245.168)}, port=Optional.absent(), url=Optional.absent()}]}, causes=[]}}

[2019-07-10 02:49:45.507 +0000] INFO [p-d4ed238466ce-WaitForSshToSucceed] 5f79b1c9-b7d6-41d0-a933-0cfff86cf6e4 POST /api/d6.2/environments/azuresb/deployments/azuresb-1/clusters com.cloudera.launchpad.bootstrap.WaitForServersUntilTime - com.cloudera.launchpad.bootstrap.WaitForServersUntilTime: Waiting until 2019-07-10T03:09:45.038Z for an accessible port on endpoints [10.0.14.15:22, 40.79.57.122:22]

From the logs, it looks like Director should try both IPs. I confirmed that I can SSH to the instance's private IP from Director, which leads me to believe it tried to SSH via the public IP, failed, and never tried the private IP. As I mentioned, this does not happen for all masters, yet all masters are configured identically, with both a private and a public IP.

Can someone confirm whether both IPs need to be accessible from the Director server for the installation to continue?
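
For reference, the relevant instance template in our bootstrap file looks roughly like this (a sketch with placeholder values; the other required Azure plugin properties are omitted):

instances {
    master {
        type: STANDARD_DS13_V2
        image: cloudera-centos-74-latest
        # The property we toggle: gives each master a public IP so it can be
        # added to the load balancer pool. Workers keep the default of No.
        publicIP: Yes
    }
}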

5 REPLIES

Expert Contributor

You are correct: Director will attempt to connect via each IP in the list, preferring the private IP. You should be fine if only one of the IPs is accessible. Director should cycle through the IPs until the configured timeout, which defaults to 20 minutes after instance allocation.
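
For illustration, the wait amounts to something like the following sketch (our own Python rendering of the idea, not Director's actual code): race a deadline against TCP connection attempts to each candidate endpoint in turn.

import socket
import time

def wait_for_ssh(endpoints, timeout_minutes=20, retry_interval=10):
    # Illustrative only: probe every candidate endpoint until one accepts a
    # TCP connection on its SSH port, or the overall deadline passes.
    deadline = time.time() + timeout_minutes * 60
    while time.time() < deadline:
        for host, port in endpoints:
            try:
                # A completed TCP handshake is enough to call the port open.
                with socket.create_connection((host, port), timeout=5):
                    return host, port
            except OSError:
                continue  # this endpoint failed; fall through to the next
        time.sleep(retry_interval)
    raise TimeoutError("no SSH endpoint became reachable before the deadline")

# The private and public IPs from the log excerpt above:
wait_for_ssh([("10.0.14.15", 22), ("52.232.245.168", 22)])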

Is the cluster failing to bootstrap? There are some natural places where we would expect connectivity to fail in a transient manner, for example when the VM is first allocated but doesn't yet respond to SSH, or while the VM is rebooting. If you are seeing failures that prevent the cluster from bootstrapping (or even cause individual instances to fail), we would be interested in seeing the log files to investigate.

Rising Star

Thanks for your reply.

Bootstrap did fail in this case, which surprised me, given that during the retry window I could SSH to the instance's private IP from Director. Also, Director had no problem connecting to the other two master nodes on which I enabled a public IP. The log entries I provided repeat over and over until the retries are exhausted.

Expert Contributor

It would be helpful for us to see the logs. There should be additional INFO-level logging for the individual connection attempts ("Attempting connection to <endpoint>").

Rising Star

I'm in the middle of another deployment where I see random failures reported by Director when attempting to connect to port 22 of cluster instances during bootstrap. I'm seeing this for deployments where I haven't enabled public IPs, so it's not limited to public-IP-enabled deployments. Each time, it's only one or two of the instances, and each time I can connect to the instance from the Director server just fine, even during the retry window while Director still claims it cannot connect. I'm not sure why Director thinks it can't connect when I clearly can from the same server. I've confirmed that forward and reverse DNS resolution is working.

I basically have to deploy several times before I hit the sweet spot and the deployment completes. This kind of inconsistent behavior has been systemic with our Azure deployments; I've experienced none of it with AWS deployments. I'll attempt to collect more logs.
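
In the meantime, a small probe loop along these lines (a sketch; the host is a placeholder for one of the instances Director reports as unreachable), run from the Director server during bootstrap, would timestamp every success and failure on port 22 and help corroborate transient drops:

import socket
import time
from datetime import datetime, timezone

HOST, PORT = "10.0.14.15", 22  # placeholder: an instance Director says it cannot reach

while True:
    stamp = datetime.now(timezone.utc).isoformat()
    try:
        with socket.create_connection((HOST, PORT), timeout=5):
            print(f"{stamp} OK   {HOST}:{PORT}")
    except OSError as exc:
        # Timestamped failures here would line up with Director's retries.
        print(f"{stamp} FAIL {HOST}:{PORT} ({exc})")
    time.sleep(2)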

Rising Star (Accepted Solution)

We discovered that there were sporadic network issues in the tunnel between Azure and AWS (our Director instance is in AWS). Our assumption is that these were causing transient connection issues between Director and the Azure instances. We're declaring the issue solved for now, as we are no longer experiencing it.