
Connection failed: [Errno 111] Connection refused...

Contributor

So, I'm attempting a 'start from scratch' install of HDP 2.6.4 on GCP. After 3+ days, I was finally able to get GCP into a state where I could get the code installed on my instances.

Then came the moment of truth as I logged into Ambari... it was a sea of RED, lit up like a Christmas tree! 30+ alerts, and the only services running were HDFS, ZooKeeper, and Flume.

Digging into the Resource Manager, I see a bit of a recurring theme:

Connection failed: [Errno 111] Connection refused to <machine_name>:<port>

At first, I thought it was simply because I hadn't opened up those ports in the GCP firewall, so I added them. But I'm still encountering the errors.

Any ideas where I've gone wrong?

(Two screenshots attached: 74482-screen-shot-2018-05-17-at-92119-pm.png and 74481-screen-shot-2018-05-17-at-92047-pm.png)

21 REPLIES

Master Mentor

@Mike Wong

Yes, the major source of your problem seems to be connectivity. Just to validate, can you walk through the steps you executed, or share the documentation you used? Members tend to skip basic steps that are vital.

Could you also dig into and share the logs for the affected components in /var/log/{component}/*.log?
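
For example, a quick sweep like this will surface recent errors across the component logs (a minimal sketch; the directory names are examples and will vary with what you installed):

# Print the last few ERROR/FATAL lines from each component's logs
# (directory names are examples; adjust to the components you installed)
for dir in /var/log/hadoop /var/log/hadoop-yarn /var/log/oozie /var/log/hive /var/log/zookeeper; do
  [ -d "$dir" ] || continue
  echo "=== $dir ==="
  grep -ih 'ERROR\|FATAL' "$dir"/*.log 2>/dev/null | tail -n 5
done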

Contributor

@Geoffrey Shelton Okot

Here's the documentation I used to prepare and install my cluster: https://docs.hortonworks.com/HDPDocuments/Ambari-2.6.1.0/bk_ambari-installation/content/ch_Getting_R...

Contributor

Here are some of the errors for the respective components:

Oozie-

Stack trace for the error was (for debug purposes):
--------------------------------------
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/oozie/share/lib/lib_20180518045451/oozie/jackson-databind-2.4.4.jar could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and 2 node(s) are excluded in this operation.

HiveServer2-

raise WebHDFSCallException(err_msg, result_dict)
resource_management.libraries.providers.hdfs_resource.WebHDFSCallException: Execution of 'curl -sS -L -w '%{http_code}' -X PUT --data-binary @/usr/hdp/2.6.4.0-91/hive/hive.tar.gz -H 'Content-Type: application/octet-stream' 'http://hdp.c.my-project-1519895027175.internal:50070/webhdfs/v1/hdp/apps/2.6.4.0-91/hive/hive.tar.gz?op=CREATE&user.name=hdfs&overwrite=True&permission=444'' returned status_code=403. 
{
  "RemoteException": {
    "exception": "IOException", 
    "javaClassName": "java.io.IOException", 
    "message": "File /hdp/apps/2.6.4.0-91/hive/hive.tar.gz could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and 2 node(s) are excluded in this operation.\n\tat 

Spark2 History Server-

raise WebHDFSCallException(err_msg, result_dict)
resource_management.libraries.providers.hdfs_resource.WebHDFSCallException: Execution of 'curl -sS -L -w '%{http_code}' -X PUT --data-binary @/tmp/spark2/spark2-hdp-yarn-archive.tar.gz -H 'Content-Type: application/octet-stream' 'http://hdp.c.my-project-1519895027175.internal:50070/webhdfs/v1/hdp/apps/2.6.4.0-91/spark2/spark2-hdp-yarn-archive.tar.gz?op=CREATE&user.name=hdfs&overwrite=True&permission=444'' returned status_code=403. 
{
  "RemoteException": {
    "exception": "IOException", 
    "javaClassName": "java.io.IOException", 
    "message": "File /hdp/apps/2.6.4.0-91/spark2/spark2-hdp-yarn-archive.tar.gz could only be replicated to 0 nodes instead of minReplication (=1).  There are 3 datanode(s) running and 3 node(s) are excluded in this operation.\n\tat 

Zeppelin Notebook-

raise WebHDFSCallException(err_msg, result_dict)
resource_management.libraries.providers.hdfs_resource.WebHDFSCallException: Execution of 'curl -sS -L -w '%{http_code}' -X PUT --data-binary @/usr/hdp/current/zeppelin-server/interpreter/spark/dep/zeppelin-spark-dependencies_2.11-0.7.3.2.6.4.0-91.jar -H 'Content-Type: application/octet-stream' 'http://hdp.c.my-project-1519895027175.internal:50070/webhdfs/v1/apps/zeppelin/zeppelin-spark-dependencies_2.11-0.7.3.2.6.4.0-91.jar?op=CREATE&user.name=hdfs&overwrite=True&permission=444'' returned status_code=403. 
{
  "RemoteException": {
    "exception": "IOException", 
    "javaClassName": "java.io.IOException", 
    "message": "File /apps/zeppelin/zeppelin-spark-dependencies_2.11-0.7.3.2.6.4.0-91.jar could only be replicated to 0 nodes instead of minReplication (=1).  There are 3 datanode(s) running and 3 node(s) are excluded in this operation.\n\tat 

Contributor

@Mike Wong

Please check your /etc/hosts on all nodes, and also verify whether SELinux is disabled using the getenforce command.
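
If it helps, a sweep like this (run from the Ambari node; the hostnames are placeholders) checks both at once:

# Compare /etc/hosts and the SELinux mode across all nodes
# (hostnames are placeholders; substitute your own)
for host in node1 node2 node3 node4; do
  echo "=== $host ==="
  ssh "$host" 'cat /etc/hosts; getenforce'
done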

Master Mentor

@Mike Wong

Can you describe your cluster OS and node count? Is passwordless SSH working?

Your DataNodes have failed to register with the NameNode; this is a network problem.

Is your DNS working? Can you validate it?

The hostnames in /etc/hosts shouldn't be mixed case! Use lowercase FQDNs.
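
A quick way to validate this on each node (the HDP install docs use the same check; both commands should print the same lowercase FQDN, matching /etc/hosts — the python one-liner is Python 2 syntax, as on RHEL 7):

# Both should print the node's lowercase FQDN
hostname -f
python -c 'import socket; print socket.getfqdn()'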

Contributor

@Sparsh, these four lines are present in the /etc/hosts file on all four of my nodes:

35.231.154.250 hdp.c.my-project-1519895027175.internal # Added by Mike Wong
35.231.170.209 slave1.c.my-project-1519895027175.internal #Added by Mike Wong
35.231.220.224 slave2.c.my-project-1519895027175.internal #Added by Mike Wong
35.229.111.57 slave3.c.my-project-1519895027175.internal #Added by Mike Wong

getenforce returned Enforcing on three of my nodes. I've since run 'setenforce 0', so now all four nodes are Permissive.

Contributor

@Geoffrey Shelton Okot

Cluster OS = RHEL 7 VMs on GCP, four nodes total. Yes, passwordless SSH is working.

I can successfully ping from each node to the other three, via both IP address and FQDN.

Contributor

@Mike Wong

I'd recommend disabling SELinux and rebooting the machines.

After that, look into the HDFS logs and make sure HDFS is up with no alerts. Try restarting HDFS; all the other services should come up after that.
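
If you'd rather script the restart than click through the UI, it can also be done through Ambari's REST API (a sketch; admin:admin, ambari-host, and the cluster name mycluster are placeholders for your own values):

# Stop HDFS (state INSTALLED = stopped), then start it again
curl -u admin:admin -H 'X-Requested-By: ambari' -X PUT \
  -d '{"RequestInfo":{"context":"Stop HDFS"},"Body":{"ServiceInfo":{"state":"INSTALLED"}}}' \
  'http://ambari-host:8080/api/v1/clusters/mycluster/services/HDFS'

curl -u admin:admin -H 'X-Requested-By: ambari' -X PUT \
  -d '{"RequestInfo":{"context":"Start HDFS"},"Body":{"ServiceInfo":{"state":"STARTED"}}}' \
  'http://ambari-host:8080/api/v1/clusters/mycluster/services/HDFS'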

Master Mentor

@Mike Wong

Ping is not enough. I recommend checking the settings below on all the nodes in the cluster:

Disable and stop firewalld

systemctl disable firewalld
systemctl stop firewalld

Status of Firewalld

systemctl status firewalld

Check DNS

The routing table should show the correct gateway (GW):

# route -n
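
Name resolution itself can be checked with forward and reverse lookups (the FQDN below is the one from this thread; repeat with each node's own name and IP):

# Forward lookup: should return the node's IP
getent hosts hdp.c.my-project-1519895027175.internal

# Reverse lookup: should return the FQDN (<node_ip> is a placeholder)
nslookup <node_ip>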

Disable SELinux (the anchored pattern below avoids rewriting the comment lines in the file, which also contain the word "enforcing"; a reboot is needed for the change to take effect)

sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config

Then retry
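
Once those are done, it's also worth confirming the actual service ports are reachable from every node, since ping only proves ICMP works. For example, with the default HDP 2.x ports (8020 = NameNode RPC, 50010 = DataNode data transfer; hostnames are the ones from this thread):

# NameNode RPC port, checked from a worker node
nc -vz hdp.c.my-project-1519895027175.internal 8020

# DataNode data-transfer port, checked from the NameNode host
nc -vz slave1.c.my-project-1519895027175.internal 50010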