Ambari 2.4.1 - All is ok single node, All is broken dual node :/

Hi mates,

I'm new to Ambari and am testing it for my company as a possible future management and monitoring platform for our Hadoop stack.

Everything worked fine on a single node (I tested several Zeppelin tutorials), but things fell apart when I installed on two Vagrant-driven VMs ("master1" and "agent1"; I ping-tested them against each other and they can communicate).

Please note that the "master1" VM has 24 GB of RAM and 4 vCPUs, and the "agent1" VM has 16 GB of RAM and 1 vCPU. Both VMs run on the same Linux workstation. Master1 has both a client and a server installed; agent1 has only the client installed.

I installed HDFS 2.7.3, YARN 2.7.3, MapReduce2 2.7.3, ZooKeeper 3.4.6 and Ambari Metrics 0.1.0 on this "mini" cluster.

The YARN ResourceManager, which worked perfectly on a single node, no longer works. I searched the web but have not yet found how to track the problem down. Can anyone help?

Regards.

Details below :

Installation :

b.1 => Got several warnings on installation :

On Master1: App Timeline Server Start, History Server Start, ResourceManager Start, SNameNode Start, Metrics Collector Start, NodeManager Start, Grafana Start

On Agent1: Check ZooKeeper, Check HDFS, NodeManager Start, Check Ambari Metrics, Check YARN, Check MapReduce2

Log files :

First of all, the log file paths given for the warnings in the "Install, Start and Test" section don't exist:

Master: /var/lib/ambari-agent/data/output-*.txt => * ranges from 37 to 43, but the last file is 36

Agent: /var/lib/ambari-agent/data/output-*.txt => * ranges from 4 to 48, but the last file is 36

Run: Ambari reports HDFS, MapReduce2, ZooKeeper and Ambari Metrics as running fine, but not the YARN ResourceManager, which is reported as stopped after several minutes. In fact, its logs have looked like this ever since it started:

2016-10-03 14:52:56,572 INFO zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(1019)) - Opening socket connection to server master1.localdomain/192.168.0.50:2181. Will not attempt to authenticate using SASL (unknown error)
2016-10-03 14:52:56,573 INFO zookeeper.ClientCnxn (ClientCnxn.java:primeConnection(864)) - Socket connection established to master1.localdomain/192.168.0.50:2181, initiating session
2016-10-03 14:52:56,574 INFO zookeeper.ClientCnxn (ClientCnxn.java:run(1142)) - Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
2016-10-03 14:52:56,675 INFO recovery.ZKRMStateStore (ZKRMStateStore.java:runWithRetries(1227)) - Exception while executing a ZK operation.
2016-10-03 14:51:16,158 INFO recovery.ZKRMStateStore (ZKRMStateStore.java:runWithRetries(1230)) - Retrying operation on ZK. Retry no. 695
2016-10-03 14:51:16,584 INFO zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(1019)) - Opening socket connection to server master1.localdomain/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
2016-10-03 14:51:16,584 INFO zookeeper.ClientCnxn (ClientCnxn.java:primeConnection(864)) - Socket connection established to master1.localdomain/127.0.0.1:2181, initiating session
2016-10-03 14:51:16,585 INFO zookeeper.ClientCnxn (ClientCnxn.java:run(1142)) - Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
2016-10-03 14:51:16,910 INFO zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(1019)) - Opening socket connection to server agent1.localdomain/192.168.0.51:2181. Will not attempt to authenticate using SASL (unknown error)
2016-10-03 14:51:16,911 INFO zookeeper.ClientCnxn (ClientCnxn.java:primeConnection(864)) - Socket connection established to agent1.localdomain/192.168.0.51:2181, initiating session
2016-10-03 14:51:16,912 INFO zookeeper.ClientCnxn (ClientCnxn.java:run(1142)) - Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
[root@master1 yarn]#
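
One detail in the ResourceManager log above is easy to miss: master1.localdomain is contacted at 192.168.0.50 in one attempt and at 127.0.0.1 in another. A rough Python sketch for spotting that kind of entry in a saved log could look like the following. The function name `loopback_targets` and the regular expression are illustrative assumptions, not part of Ambari or ZooKeeper; the sample lines are copied from the log above.

```python
import ipaddress
import re

# Matches the "host/ip:port" target that the ZooKeeper client prints when it
# opens a connection. The pattern is an assumption based on the log lines above.
TARGET = re.compile(
    r"Opening socket connection to server (\S+?)/(\d+\.\d+\.\d+\.\d+):(\d+)"
)


def loopback_targets(log_text):
    """Return unique (hostname, ip) pairs where a named host maps to loopback."""
    hits = []
    for host, ip, _port in TARGET.findall(log_text):
        is_loopback = ipaddress.ip_address(ip).is_loopback
        # A host literally called "localhost" on 127.0.0.1 is expected; skip it.
        if is_loopback and not host.startswith("localhost"):
            if (host, ip) not in hits:
                hits.append((host, ip))
    return hits


sample = """\
2016-10-03 14:52:56,572 INFO zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(1019)) - Opening socket connection to server master1.localdomain/192.168.0.50:2181. Will not attempt to authenticate using SASL (unknown error)
2016-10-03 14:51:16,584 INFO zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(1019)) - Opening socket connection to server master1.localdomain/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
2016-10-03 14:51:16,910 INFO zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(1019)) - Opening socket connection to server agent1.localdomain/192.168.0.51:2181. Will not attempt to authenticate using SASL (unknown error)
"""

print(loopback_targets(sample))
```

Any pair it reports is only a hint that a cluster hostname resolves to a loopback address somewhere; it does not by itself say which file or service introduced the mapping.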

Logs on the Zookeeper side :

2016-10-03 14:51:16,056 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /192.168.0.50:34266
2016-10-03 14:51:16,056 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
2016-10-03 14:51:16,056 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /192.168.0.50:34266 (no session established for client)
2016-10-03 14:51:16,584 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /127.0.0.1:39082
2016-10-03 14:51:16,585 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
2016-10-03 14:51:16,585 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /127.0.0.1:39082 (no session established for client)

1 ACCEPTED SOLUTION

Hi,

After a very long search on the web for answers to help me debug YARN not starting, I wondered whether this whole mess might be a network configuration problem.

The fact is, I wasn't even able to find the right startup logs for YARN. The only sign I got was the web UI being unavailable.

So I reinstalled the whole cluster, this time putting more services on the slave VMs, hoping that another component would fail to start and give me explicit logs.

It worked.

The explanation: regardless of the configuration you use in the Vagrantfile, Vagrant adds a line to /etc/hosts that maps the hostname you defined for the machine to the localhost address. Evidently, several components of the stack, at one step or another of the setup process, use the first IP address listed in the hosts file rather than the hostname. That explains why everything worked on a single node and why there was so much trouble in multi-node: some services or logs were being accessed through the 127.0.0.1 address when they were actually on other machines.

The workaround is to provision an inline shell command like this in the Vagrantfile: "sudo sed -i'' '/^127.0.0.1\\t#{hostname}\\t#{name}$/d' /etc/hosts", and only then install and set up the Ambari server. Trying to correct this afterwards amounts to checking every configuration file of the stack by hand.
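
To check what such a cleanup would do before touching a real machine, the same idea can be tried on a copy of the hosts file contents. The Python sketch below is only an illustration under stated assumptions: the function name `strip_loopback_aliases`, the sample hosts entries, and the short name "master1" are hypothetical (only the hostnames and the 192.168.0.50 address appear in the logs above). It works on text and never writes to /etc/hosts. Unlike the sed one-liner, it keeps other names, such as localhost, that share a line with the machine's hostname.

```python
import ipaddress


def _is_loopback(token):
    """True if token parses as an IP address in a loopback range."""
    try:
        return ipaddress.ip_address(token).is_loopback
    except ValueError:
        return False


def strip_loopback_aliases(hosts_text, names):
    """Remove the given names from loopback lines of a hosts file's text.

    Lines that end up with no names left are dropped; all others are kept.
    """
    names = set(names)
    kept = []
    for line in hosts_text.splitlines():
        body, sep, comment = line.partition("#")  # split off any comment
        fields = body.split()
        if len(fields) >= 2 and _is_loopback(fields[0]) and names & set(fields[1:]):
            remaining = [n for n in fields[1:] if n not in names]
            if not remaining:
                continue  # the whole line only mapped the machine to loopback
            line = "\t".join([fields[0]] + remaining)
            if sep:
                line += " #" + comment
        kept.append(line)
    return "".join(entry + "\n" for entry in kept)


# Hypothetical hosts content of the kind the post describes.
sample = (
    "127.0.0.1\tmaster1.localdomain\tmaster1\n"
    "127.0.0.1\tlocalhost\n"
    "192.168.0.50\tmaster1.localdomain\tmaster1\n"
)

cleaned = strip_loopback_aliases(sample, ["master1.localdomain", "master1"])
print(cleaned, end="")
```

With the sample above, the first line is dropped and the other two are left untouched, so the hostname is then only associated with the 192.168.0.50 address.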

This ticket can be closed.

2 REPLIES 2

Please note that the VMs can ping each other, and that port 8080 on master1 is bound to port 8080 on the host so I can use the UI.