Created on 05-18-2018 01:29 AM - edited 08-17-2019 10:44 PM
So, I'm attempting a 'start from scratch' install of HDP 2.6.4 on GCP. After 3+ days, I was finally able to get GCP into a state where I could get the bits installed on my instances.
Then the moment of truth as I logged into Ambari... it was a sea of RED, lit up like a Christmas tree! 30+ alerts, and the only services running were HDFS, ZooKeeper, and Flume.
Digging into the Resource Manager, there seems to be a bit of a recurring theme:
Connection failed: [Errno 111] Connection refused to <machine_name>:<port>
At first, I thought it was simply because I hadn't opened up those ports in the GCP firewall, so I added them. But I'm still encountering the errors.
Any ideas where I've gone wrong?
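For reference, 'Connection refused' usually means the connection reached the host but nothing was listening on that port (or a firewall actively rejected it), rather than packets being silently dropped. A quick way to see whether anything is actually bound to one of the flagged ports on the machine named in the alert (the port below is a placeholder, not a real value):

ss -tlnp | grep ':<port>'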
Created 05-18-2018 06:15 AM
Yes, the major source of your problem seems to be connectivity. Just to validate, can you walk through the steps you executed or the documentation you used? Members tend to skip basic steps that are vital.
Could you also share or dig into the logs for the affected components in /var/log/{component}/*.log?
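Something along these lines should surface the recent errors quickly (just a sketch; substitute the actual component directory names you find under /var/log):

ls /var/log
grep -iE 'error|fatal' /var/log/<component>/*.log | tail -n 50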
Created 05-18-2018 06:56 AM
@Geoffrey Shelton Okot
Here's the documentation I used to prepare and install my cluster: https://docs.hortonworks.com/HDPDocuments/Ambari-2.6.1.0/bk_ambari-installation/content/ch_Getting_R...
Created 05-18-2018 07:07 AM
Here are some of the errors for the respective components:
Oozie-
Stack trace for the error was (for debug purposes): -------------------------------------- org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/oozie/share/lib/lib_20180518045451/oozie/jackson-databind-2.4.4.jar could only be replicated to 0 nodes instead of minReplication (=1). There are 2 datanode(s) running and 2 node(s) are excluded in this operation.
HiveServer2-
raise WebHDFSCallException(err_msg, result_dict) resource_management.libraries.providers.hdfs_resource.WebHDFSCallException: Execution of 'curl -sS -L -w '%{http_code}' -X PUT --data-binary @/usr/hdp/2.6.4.0-91/hive/hive.tar.gz -H 'Content-Type: application/octet-stream' 'http://hdp.c.my-project-1519895027175.internal:50070/webhdfs/v1/hdp/apps/2.6.4.0-91/hive/hive.tar.gz?op=CREATE&user.name=hdfs&overwrite=True&permission=444'' returned status_code=403. { "RemoteException": { "exception": "IOException", "javaClassName": "java.io.IOException", "message": "File /hdp/apps/2.6.4.0-91/hive/hive.tar.gz could only be replicated to 0 nodes instead of minReplication (=1). There are 2 datanode(s) running and 2 node(s) are excluded in this operation.\n\tat
Spark2 History Server-
raise WebHDFSCallException(err_msg, result_dict) resource_management.libraries.providers.hdfs_resource.WebHDFSCallException: Execution of 'curl -sS -L -w '%{http_code}' -X PUT --data-binary @/tmp/spark2/spark2-hdp-yarn-archive.tar.gz -H 'Content-Type: application/octet-stream' 'http://hdp.c.my-project-1519895027175.internal:50070/webhdfs/v1/hdp/apps/2.6.4.0-91/spark2/spark2-hdp-yarn-archive.tar.gz?op=CREATE&user.name=hdfs&overwrite=True&permission=444'' returned status_code=403. { "RemoteException": { "exception": "IOException", "javaClassName": "java.io.IOException", "message": "File /hdp/apps/2.6.4.0-91/spark2/spark2-hdp-yarn-archive.tar.gz could only be replicated to 0 nodes instead of minReplication (=1). There are 3 datanode(s) running and 3 node(s) are excluded in this operation.\n\tat
Zeppelin Notebook-
raise WebHDFSCallException(err_msg, result_dict) resource_management.libraries.providers.hdfs_resource.WebHDFSCallException: Execution of 'curl -sS -L -w '%{http_code}' -X PUT --data-binary @/usr/hdp/current/zeppelin-server/interpreter/spark/dep/zeppelin-spark-dependencies_2.11-0.7.3.2.6.4.0-91.jar -H 'Content-Type: application/octet-stream' 'http://hdp.c.my-project-1519895027175.internal:50070/webhdfs/v1/apps/zeppelin/zeppelin-spark-dependencies_2.11-0.7.3.2.6.4.0-91.jar?op=CREATE&user.name=hdfs&overwrite=True&permission=444'' returned status_code=403. { "RemoteException": { "exception": "IOException", "javaClassName": "java.io.IOException", "message": "File /apps/zeppelin/zeppelin-spark-dependencies_2.11-0.7.3.2.6.4.0-91.jar could only be replicated to 0 nodes instead of minReplication (=1). There are 3 datanode(s) running and 3 node(s) are excluded in this operation.\n\tat
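All three failures boil down to the same HDFS write error, so it's probably worth checking how the NameNode itself sees the DataNodes (a sketch, assuming an HDP-default layout and running as the hdfs user):

sudo -u hdfs hdfs dfsadmin -report                     # live vs. dead DataNodes as registered with the NameNode
sudo -u hdfs hdfs getconf -confKey dfs.hosts.exclude   # sanity-check whether an exclude file is configured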
Created 05-18-2018 07:12 AM
Please check your /etc/hosts on all nodes, and also verify that SELinux is disabled using the getenforce command.
Created 05-18-2018 07:41 AM
Can you describe your cluster OS and node count? Is passwordless SSH working?
Your data nodes have failed and are not registered with the NameNode; this is a network problem.
Is your DNS working? Can you validate it?
Your hostnames in /etc/hosts shouldn't be mixed case! Use lowercase FQDNs.
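Something like this, run on every node, would confirm the hostname/DNS side (a sketch; swap in each node's own FQDN and the IP returned by the forward lookup):

hostname -f                                            # should print the node's lowercase FQDN
getent hosts hdp.c.my-project-1519895027175.internal   # forward lookup via /etc/hosts or DNS
getent hosts <ip_from_the_lookup_above>                # reverse lookup should return the same FQDN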
Created 05-18-2018 08:23 AM
@Sparsh, these four lines are present in the /etc/hosts file on all four of my nodes:
35.231.154.250 hdp.c.my-project-1519895027175.internal    # Added by Mike Wong
35.231.170.209 slave1.c.my-project-1519895027175.internal # Added by Mike Wong
35.231.220.224 slave2.c.my-project-1519895027175.internal # Added by Mike Wong
35.229.111.57 slave3.c.my-project-1519895027175.internal  # Added by Mike Wong
getenforce returned Enforcing on three of my nodes. I've since switched SELinux to permissive using 'setenforce 0'. Now all four nodes report Permissive.
Created 05-18-2018 08:25 AM
Cluster OS = RHEL7 VMs on GCP, four nodes total. Yes, passwordless SSH is working.
I can ping from each node to the other three nodes via IP address and FQDN successfully
Created 05-18-2018 08:42 AM
I'd recommend disabling SELinux and rebooting the machines.
Created 05-18-2018 02:21 PM
Ping is not enough. I recommend checking the settings below on all the nodes in the cluster (a port-level check is also sketched at the end):
Disable and stop firewalld
systemctl disable firewalld
systemctl stop firewalld
Check the status of firewalld
systemctl status firewalld
Check DNS and the routing table
The default route should show the correct gateway (GW)
route -n
Disable SELinux
sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config
Then reboot the nodes so the SELinux change takes effect, and retry.
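One more check that goes beyond ping: test the actual service ports from one node to another. The ports below assume HDP defaults (8020 for NameNode RPC, 50070 for the NameNode web UI / WebHDFS, as seen in the errors above); the bash /dev/tcp trick avoids depending on nc or telnet being installed:

timeout 3 bash -c 'echo > /dev/tcp/hdp.c.my-project-1519895027175.internal/8020' && echo "8020 open" || echo "8020 closed or unreachable"
timeout 3 bash -c 'echo > /dev/tcp/hdp.c.my-project-1519895027175.internal/50070' && echo "50070 open" || echo "50070 closed or unreachable"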