Support Questions

mike_w_wong · ‎05-18-2018

So, I'm attempting a 'start from scratch' install of HDP 2.6.4 on GCP. After 3+ days, I was finally able to get GCP into a state where I could get the code installed on my instances.

Then came the moment of truth as I logged into Ambari... it was a sea of RED, lit up like a Christmas tree! 30+ alerts, and the only services running were HDFS, ZooKeeper, and Flume.

Digging into the Resource Manager, there seems to be a bit of a recurring theme:

Connection failed: [ERRNO 111] Connection refused to <machine_name>:<port>

At first, I thought it was simply because I hadn't opened up those ports in the GCP firewall, so I added them. But I'm still encountering the errors.

Any ideas where I've gone wrong?
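
For anyone reproducing this: errno 111 is ECONNREFUSED on Linux, which usually means the connection attempt reached the host but nothing was listening on that port. A firewall that silently drops packets tends to show up as a timeout instead. A minimal Python sketch to tell the two apart from another node; `probe_port` is a hypothetical helper name, not part of Ambari or HDP:

```python
import errno
import socket


def probe_port(host, port, timeout=3.0):
    """Classify one TCP connection attempt.

    Returns 'open', 'refused', 'timeout', or 'error: <detail>'.
    """
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # ECONNREFUSED: the host answered, but no listener accepted the port.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout (packets may be dropped).
        return "timeout"
    except OSError as exc:
        # Anything else, e.g. name resolution failure or no route to host.
        return "error: " + errno.errorcode.get(exc.errno, str(exc))
```

Calling it with the machine name and port from the Ambari alert, for example `probe_port("<machine_name>", <port>)`, shows which case applies. A result of "refused" would point at the service not running (or listening on a different interface) rather than at the GCP firewall.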

mike_w_wong · ‎05-18-2018

All of this looks OK.

Firewalld status-

● firewalld.service - firewalld - dynamic firewall daemon
   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
   Active: inactive (dead)
     Docs: man:firewalld(1)

May 15 15:09:17 localhost systemd[1]: Starting firewalld - dynamic firewall daemon...
May 15 15:09:19 localhost systemd[1]: Started firewalld - dynamic firewall daemon.
May 15 21:03:24 slave3 systemd[1]: Stopping firewalld - dynamic firewall daemon...
May 15 21:03:25 slave3 systemd[1]: Stopped firewalld - dynamic firewall daemon.

route -n

Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         10.142.0.1      0.0.0.0         UG    100    0        0 eth0
10.142.0.1      0.0.0.0         255.255.255.255 UH    100    0        0 eth0
10.142.0.5      0.0.0.0         255.255.255.255 UH    100    0        0 eth0

Shelton · ‎05-18-2018

@Mike Wong

Can you paste the whole /etc/hosts file? I am interested in the first 4 lines!

You should have some lines like

127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4 
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6

Please confirm

mike_w_wong · ‎05-18-2018

@Geoffrey Shelton Okot

Whole /etc/hosts file-

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
10.142.0.2 hdp.c.my-project-1519895027175.internal hdp  # Added by Google
169.254.169.254 metadata.google.internal  # Added by Google
35.231.154.250 hdp.c.my-project-1519895027175.internal # Added by Mike Wong
35.231.170.209 slave1.c.my-project-1519895027175.internal #Added by Mike Wong
35.231.220.224 slave2.c.my-project-1519895027175.internal #Added by Mike Wong
35.229.111.57 slave3.c.my-project-1519895027175.internal #Added by Mike Wong
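
One thing that can be checked mechanically in a file like this is whether any hostname is mapped to more than one address. Here hdp.c.my-project-1519895027175.internal appears on both the 10.142.0.2 line and the 35.231.154.250 line. Whether that matters for this cluster is not established in this thread; the sketch below only reports such names. `hostnames_with_multiple_ips` is a hypothetical helper name, and loopback entries are skipped on the assumption that localhost having both an IPv4 and an IPv6 address is normal:

```python
import ipaddress
from collections import defaultdict


def hostnames_with_multiple_ips(hosts_text):
    """Return {hostname: [ip, ...]} for names mapped to more than one address."""
    seen = defaultdict(list)
    for line in hosts_text.splitlines():
        # Drop trailing comments, then split into address and names.
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue
        ip, names = fields[0], fields[1:]
        try:
            if ipaddress.ip_address(ip).is_loopback:
                continue  # ignore 127.0.0.1 / ::1 entries
        except ValueError:
            continue  # first field is not a valid IP address
        for name in names:
            if ip not in seen[name]:
                seen[name].append(ip)
    return {name: ips for name, ips in seen.items() if len(ips) > 1}


if __name__ == "__main__":
    with open("/etc/hosts") as fh:
        for name, ips in hostnames_with_multiple_ips(fh.read()).items():
            print(name, "->", ", ".join(ips))
```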

Shelton · ‎05-18-2018

@Mike Wong

The entries look correct 🙂

Can you try to start YARN manually?

su -l yarn -c "/usr/hdp/current/hadoop-yarn-resourcemanager/sbin/yarn-daemon.sh start resourcemanager"

mike_w_wong · ‎05-18-2018

Hmmm, when I try to start RM, I'm getting this-

2018-05-18 20:26:11,400 INFO  recovery.ZKRMStateStore (ZKRMStateStore.java:runWithRetries(1227)) - Exception while executing a ZK operation.
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /rmstore
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$1.run(ZKRMStateStore.java:326)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$1.run(ZKRMStateStore.java:322)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1174)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1207)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createRootDir(ZKRMStateStore.java:336)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createRootDirRecursively(ZKRMStateStore.java:1311)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.startInternal(ZKRMStateStore.java:303)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.serviceStart(RMStateStore.java:598)
        at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:593)
        at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1008)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1049)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1045)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1045)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1085)
        at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1229)
2018-05-18 20:26:11,400 INFO  recovery.ZKRMStateStore (ZKRMStateStore.java:runWithRetries(1230)) - Retrying operation on ZK. Retry no. 203
2018-05-18 20:26:11,471 INFO  zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(1019)) - Opening socket connection to server slave1.c.my-project-1519895027175.internal/10.142.0.3:2181. Will not attempt to authenticate using SASL (unknown error)
2018-05-18 20:26:11,472 INFO  zookeeper.ClientCnxn (ClientCnxn.java:primeConnection(864)) - Socket connection established, initiating session, client: /10.142.0.3:52748, server: slave1.c.my-project-1519895027175.internal/10.142.0.3:2181 
2018-05-18 20:26:11,472 INFO  zookeeper.ClientCnxn (ClientCnxn.java:run(1142)) - Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect

Shelton · ‎05-18-2018

@Mike Wong

I hope this is not production!

1) Stop Resource Manager

2) Connect to the ZK server from /usr/hdp/2.x.x/zookeeper. The output below is from a cluster; your output could look slightly different depending on the installed components.

./bin/zkCli.sh
[zk: localhost:2181(CONNECTED) 0] ls /
[registry, cluster, brokers, storm, zookeeper, infra-solr, hbase-unsecure, admin, isr_change_notification, templeton-hadoop, hiveserver2, controller_epoch, druid, rmstore, ambari-metrics-cluster, consumers, config]
[zk: localhost:2181(CONNECTED) 1] ls /rmstore
[ZKRMStateRoot]
[zk: localhost:2181(CONNECTED) 2]

3) Remove the znode for RM -- rmr /rmstore

[zk: localhost:2181(CONNECTED) 2] rmr /rmstore

4) Restart YARN

abrahamfikire · ‎03-18-2019

You are the best.

mike_w_wong · ‎05-18-2018

1. I'm guessing my RM is already stopped

2. When I try to launch the ZooKeeper CLI (./bin/zkCli.sh), I'm getting the following-

Connecting to localhost:2181
2018-05-18 23:18:06,066 - INFO  [main:Environment@100] - Client environment:zookeeper.version=3.4.6-91--1, built on 01/04/2018 10:34 GMT
2018-05-18 23:18:06,068 - INFO  [main:Environment@100] - Client environment:host.name=slave1.c.my-project-1519895027175.internal
2018-05-18 23:18:06,068 - INFO  [main:Environment@100] - Client environment:java.version=1.8.0_112
2018-05-18 23:18:06,070 - INFO  [main:Environment@100] - Client environment:java.vendor=Oracle Corporation
2018-05-18 23:18:06,070 - INFO  [main:Environment@100] - Client environment:java.home=/usr/jdk64/jdk1.8.0_112/jre
2018-05-18 23:18:06,070 - INFO  [main:Environment@100] - Client environment:java.class.path=/usr/hdp/2.6.4.0-91/zookeeper/bin/.......
.
.
.

2018-05-18 23:18:06,094 - INFO  [main-SendThread(localhost:2181):ClientCnxn$SendThread@1019] - Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
Welcome to ZooKeeper!
.
.
.

2018-05-18 23:18:10,760 - INFO  [main-SendThread(localhost:2181):ClientCnxn$SendThread@1019] - Opening socket connection to server localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to authenticate using SASL (unknown error)
2018-05-18 23:18:10,761 - INFO  [main-SendThread(localhost:2181):ClientCnxn$SendThread@864] - Socket connection established, initiating session, client: /0:0:0:0:0:0:0:1:33932, server: localhost/0:0:0:0:0:0:0:1:2181
2018-05-18 23:18:10,761 - INFO  [main-SendThread(localhost:2181):ClientCnxn$SendThread@1142] - Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
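
The last line above says the socket was established but the server appears to have closed it before a session was set up. One way to look at the server directly, without zkCli.sh, is ZooKeeper's four-letter commands: a plain TCP client sends e.g. `ruok` to the client port and a healthy 3.4.x server replies `imok`, while `srvr` returns server details or a message that the instance is not currently serving requests. A minimal Python sketch, assuming the four-letter commands are enabled on this server; `zk_four_letter` is a hypothetical helper name:

```python
import socket


def zk_four_letter(host, port=2181, command="ruok", timeout=3.0):
    """Send a ZooKeeper four-letter command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            # The server closes the connection after replying,
            # so an empty read marks the end of the response.
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")
```

For example, `zk_four_letter("localhost", 2181, "srvr")` run on the ZooKeeper host. An empty reply would be consistent with the server closing the socket as in the log; the ZooKeeper server log on that node is then the next place to look.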

Shelton · ‎05-19-2018

@Mike Wong

Yes, the status shows stopped. Did you complete the previous steps and restart the RM manually?

mike_w_wong · ‎05-21-2018

@Geoffrey Shelton Okot

I tried to connect to the Zookeeper server, but I'm getting the above error. (closing socket connection...)

Cloudera Community

Connection failed: [Errno 111] Connection refused...