Created on 05-18-2018 01:29 AM - edited 08-17-2019 10:44 PM
So, I'm attempting a 'start from scratch' install of HDP 2.6.4 on GCP. After 3+ days, I was finally able to get GCP into a state where I could get the code installed on my instances.
Then came the moment of truth as I logged into Ambari... it was a sea of RED, lit up like a Christmas tree! 30+ alerts, and the only services running were HDFS, ZooKeeper, and Flume.
Digging into the Resource Manager, there seems to be a bit of a recurring theme:
Connection failed: [ERRNO 111] Connection refused to <machine_name>:<port>
At first, I thought it was simply because I hadn't opened up those ports in the GCP firewall, so I added them. But I'm still encountering the errors.
Any ideas where I've gone wrong?
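For anyone debugging the same symptom, here is a minimal sketch for telling "refused" apart from "timed out". On Linux, errno 111 is ECONNREFUSED, which usually means the host was reachable but nothing accepted the connection on that port. A firewall that silently drops packets tends to show up as a timeout instead. The function names below are made up for illustration, and the sketch assumes bash (for /dev/tcp) and the coreutils timeout command.

```shell
#!/usr/bin/env bash
# Map the exit status of a connection attempt to a short label.
classify_connect_status() {
  case $1 in
    0)   echo "open" ;;               # something accepted the connection
    124) echo "timeout" ;;            # timeout(1) gave up: packets likely dropped
    *)   echo "refused-or-error" ;;   # e.g. ECONNREFUSED (errno 111) or a DNS failure
  esac
}

# Try a TCP connection to host:port, waiting at most a few seconds.
# Usage: check_port <host> <port> [timeout_seconds]
check_port() {
  local host=$1 port=$2 secs=${3:-3}
  # /dev/tcp/<host>/<port> is a bash feature that opens a TCP socket.
  timeout "$secs" bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
  classify_connect_status $?
}
```

Calling check_port with the machine name and port from the alert should print one of the three labels. If it prints "refused-or-error", it is worth running ss -ltn on the target host to see whether the service is listening on that port at all.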
Created 05-18-2018 03:36 PM
All of this looks OK.
Firewalld status-
● firewalld.service - firewalld - dynamic firewall daemon
   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
   Active: inactive (dead)
     Docs: man:firewalld(1)
May 15 15:09:17 localhost systemd[1]: Starting firewalld - dynamic firewall daemon...
May 15 15:09:19 localhost systemd[1]: Started firewalld - dynamic firewall daemon.
May 15 21:03:24 slave3 systemd[1]: Stopping firewalld - dynamic firewall daemon...
May 15 21:03:25 slave3 systemd[1]: Stopped firewalld - dynamic firewall daemon.
route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         10.142.0.1      0.0.0.0         UG    100    0        0 eth0
10.142.0.1      0.0.0.0         255.255.255.255 UH    100    0        0 eth0
10.142.0.5      0.0.0.0         255.255.255.255 UH    100    0        0 eth0
Created 05-18-2018 05:31 PM
Can you paste the whole /etc/hosts file? I am interested in the first 4 lines!
You should have some lines like
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
Please confirm
Created 05-18-2018 07:37 PM
Whole /etc/hosts file-
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
10.142.0.2 hdp.c.my-project-1519895027175.internal hdp # Added by Google
169.254.169.254 metadata.google.internal # Added by Google
35.231.154.250 hdp.c.my-project-1519895027175.internal # Added by Mike Wong
35.231.170.209 slave1.c.my-project-1519895027175.internal #Added by Mike Wong
35.231.220.224 slave2.c.my-project-1519895027175.internal #Added by Mike Wong
35.229.111.57 slave3.c.my-project-1519895027175.internal #Added by Mike Wong
Created 05-18-2018 07:57 PM
The entries look correct 🙂
Can you try to start YARN manually?
su -l yarn -c "/usr/hdp/current/hadoop-yarn-resourcemanager/sbin/yarn-daemon.sh start resourcemanager"
Created 05-18-2018 08:32 PM
Hmmm, when I try to start RM, I'm getting this-
2018-05-18 20:26:11,400 INFO recovery.ZKRMStateStore (ZKRMStateStore.java:runWithRetries(1227)) - Exception while executing a ZK operation.
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /rmstore
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$1.run(ZKRMStateStore.java:326)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$1.run(ZKRMStateStore.java:322)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1174)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1207)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createRootDir(ZKRMStateStore.java:336)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createRootDirRecursively(ZKRMStateStore.java:1311)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.startInternal(ZKRMStateStore.java:303)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.serviceStart(RMStateStore.java:598)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:593)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1008)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1049)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1045)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1045)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1085)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1229)
2018-05-18 20:26:11,400 INFO recovery.ZKRMStateStore (ZKRMStateStore.java:runWithRetries(1230)) - Retrying operation on ZK. Retry no. 203
2018-05-18 20:26:11,471 INFO zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(1019)) - Opening socket connection to server slave1.c.my-project-1519895027175.internal/10.142.0.3:2181. Will not attempt to authenticate using SASL (unknown error)
2018-05-18 20:26:11,472 INFO zookeeper.ClientCnxn (ClientCnxn.java:primeConnection(864)) - Socket connection established, initiating session, client: /10.142.0.3:52748, server: slave1.c.my-project-1519895027175.internal/10.142.0.3:2181
2018-05-18 20:26:11,472 INFO zookeeper.ClientCnxn (ClientCnxn.java:run(1142)) - Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
Created 05-18-2018 08:45 PM
I hope this is not production!
1) Stop Resource Manager
2) Connect to the ZK server using zkCli.sh in /usr/hdp/2.x.x/zookeeper. The output below is from a cluster; your output could look slightly different depending on the installed components.
./bin/zkCli.sh
[zk: localhost:2181(CONNECTED) 0] ls /
[registry, cluster, brokers, storm, zookeeper, infra-solr, hbase-unsecure, admin, isr_change_notification, templeton-hadoop, hiveserver2, controller_epoch, druid, rmstore, ambari-metrics-cluster, consumers, config]
[zk: localhost:2181(CONNECTED) 1] ls /rmstore
[ZKRMStateRoot]
[zk: localhost:2181(CONNECTED) 2]
3) Remove the znode for RM -- rmr /rmstore
[zk: localhost:2181(CONNECTED) 2] rmr /rmstore
4) Restart YARN
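The steps above can be sketched as a small dry-run helper that only prints the commands, so they can be reviewed before anything is executed. Several things here are assumptions: reset_rm_state_plan is a made-up name, the zkCli.sh path is taken from the 2.6.4.0-91 classpath shown later in this thread, and localhost:2181 is the address used in the session above. Keep in mind that /rmstore holds the ResourceManager's saved state, so removing it discards any application state the RM would otherwise recover.

```shell
#!/usr/bin/env bash
# Dry-run helper: prints the commands for the steps above instead of running them.
# ZK_CLI and ZK_SERVER are assumptions; override them for your cluster.
RM_DAEMON="/usr/hdp/current/hadoop-yarn-resourcemanager/sbin/yarn-daemon.sh"
ZK_CLI="${ZK_CLI:-/usr/hdp/2.6.4.0-91/zookeeper/bin/zkCli.sh}"
ZK_SERVER="${ZK_SERVER:-localhost:2181}"

reset_rm_state_plan() {
  # 1) stop the ResourceManager
  echo "su -l yarn -c \"$RM_DAEMON stop resourcemanager\""
  # 2-3) remove the RM state znode without opening an interactive session
  echo "$ZK_CLI -server $ZK_SERVER rmr /rmstore"
  # 4) start the ResourceManager again
  echo "su -l yarn -c \"$RM_DAEMON start resourcemanager\""
}
```

Running reset_rm_state_plan prints three lines that can be pasted one at a time once they look right for your environment.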
Created 03-18-2019 12:00 PM
You are the best.
Created on 05-18-2018 11:12 PM - edited 08-17-2019 10:43 PM
1. I'm guessing my RM is already stopped
2. When I try to launch the ZooKeeper CLI (./bin/zkCli.sh), I'm getting the following-
Connecting to localhost:2181
2018-05-18 23:18:06,066 - INFO [main:Environment@100] - Client environment:zookeeper.version=3.4.6-91--1, built on 01/04/2018 10:34 GMT
2018-05-18 23:18:06,068 - INFO [main:Environment@100] - Client environment:host.name=slave1.c.my-project-1519895027175.internal
2018-05-18 23:18:06,068 - INFO [main:Environment@100] - Client environment:java.version=1.8.0_112
2018-05-18 23:18:06,070 - INFO [main:Environment@100] - Client environment:java.vendor=Oracle Corporation
2018-05-18 23:18:06,070 - INFO [main:Environment@100] - Client environment:java.home=/usr/jdk64/jdk1.8.0_112/jre
2018-05-18 23:18:06,070 - INFO [main:Environment@100] - Client environment:java.class.path=/usr/hdp/2.6.4.0-91/zookeeper/bin/.......
. . .
2018-05-18 23:18:06,094 - INFO [main-SendThread(localhost:2181):ClientCnxn$SendThread@1019] - Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
Welcome to ZooKeeper!
. . .
2018-05-18 23:18:10,760 - INFO [main-SendThread(localhost:2181):ClientCnxn$SendThread@1019] - Opening socket connection to server localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to authenticate using SASL (unknown error)
2018-05-18 23:18:10,761 - INFO [main-SendThread(localhost:2181):ClientCnxn$SendThread@864] - Socket connection established, initiating session, client: /0:0:0:0:0:0:0:1:33932, server: localhost/0:0:0:0:0:0:0:1:2181
2018-05-18 23:18:10,761 - INFO [main-SendThread(localhost:2181):ClientCnxn$SendThread@1142] - Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
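In a log like this the TCP connection is established and then closed by the server, which suggests something is listening on 2181 but is not accepting sessions. One possible way to narrow that down is ZooKeeper's "stat" four-letter command, which in the 3.4 line replies either with server details including a "Mode:" line or with a message that the instance is not currently serving requests. The sketch below is hedged accordingly: zk_stat and zk_verdict are made-up helper names, and it assumes bash (for /dev/tcp) and the coreutils timeout command.

```shell
#!/usr/bin/env bash
# Send the "stat" four-letter command to a ZooKeeper server and print the raw reply.
# Usage: zk_stat [host] [port]
zk_stat() {
  local host=${1:-localhost} port=${2:-2181}
  # The server closes the socket after replying, so cat returns on its own;
  # timeout guards against a server that never answers.
  timeout 5 bash -c "exec 3<>/dev/tcp/$host/$port; printf stat >&3; cat <&3" 2>/dev/null
}

# Reduce a raw "stat" reply to a one-word verdict.
zk_verdict() {
  case $1 in
    "")                        echo "no-reply" ;;     # nothing came back at all
    *"not currently serving"*) echo "not-serving" ;;  # process is up but not serving (for example, no quorum)
    *"Mode: "*)                echo "serving" ;;      # leader, follower, or standalone
    *)                         echo "unknown" ;;
  esac
}
```

For example, zk_verdict "$(zk_stat localhost 2181)" prints a single word. A "not-serving" or "no-reply" result would point at the ZooKeeper servers themselves rather than at YARN, and the ZooKeeper server logs on each node would be the next place to look.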
Created 05-19-2018 06:49 AM
Yes, the status shows stopped. Did you do the previous steps and restart the RM manually?
Created 05-21-2018 03:17 AM
I tried to connect to the Zookeeper server, but I'm getting the above error. (closing socket connection...)