Explorer
Posts: 9
Registered: ‎10-18-2016

Cloudera-scm-agent Using RegionServer Port

The RegionServer service stopped on one of our nodes. I attempted to start the role again on that node and got this error message:

 

Caused by: java.io.IOException: Problem binding to /0.0.0.0:60020 : Address already in use. To switch ports use the 'hbase.regionserver.port' configuration property.
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.<init>(RSRpcServices.java:930)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.createRpcServices(HRegionServer.java:653)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.<init>(HRegionServer.java:537)
        ... 10 more
Caused by: java.net.BindException: Address already in use

Cloudera-scm-agent is supposed to be listening on port 9000, but when I checked netstat I saw its process was listening on port 60020. I had to restart cloudera-scm-agent, and then I could start the RegionServer again.
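
For anyone who wants to reproduce the check outside of HBase, here is a minimal Python 3 sketch that tries to bind a TCP port and reports whether the kernel refuses with "Address already in use". The function name `port_is_free` is made up for this post, and the sketch only shows that the port is taken, not which process holds it (netstat is still needed for that).

```python
import errno
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a plain TCP bind to (host, port) succeeds.

    Hypothetical helper for illustration; it is not part of any
    Cloudera or HBase tooling.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
    except OSError as exc:
        # EADDRINUSE is the errno behind java.net.BindException's
        # "Address already in use" message.
        if exc.errno == errno.EADDRINUSE:
            return False
        raise
    finally:
        sock.close()
    return True


if __name__ == "__main__":
    # 60020 is the RegionServer port from the stack trace above.
    print("60020 free:", port_is_free(60020))
```

If this reports the port as taken while the role is stopped, something else on the host still holds it.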

 

Why would cloudera-scm-agent switch to listening on the wrong port? I checked /etc/cloudera-scm-agent/config.ini and verified that port 9000 is configured there. This isn't the first time it has happened, either. I've had the same thing happen to Datanodes.

 

If it matters, right before the service shut down I see these errors in the logs:

2016-11-11 21:50:22,942 ERROR org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache: ShortCircuitCache(0x7a197256): failed to release short-circuit shared memory slot Slot(slotIdx=114, shm=DfsClientShm(ddc3b8e3cd58475c872a4ecb254c32ea)) by sending ReleaseShortCircuitAccessRequestProto to /var/run/hdfs-sockets/dn.  Closing shared memory segment.
java.io.IOException: ERROR_INVALID: there is no shared memory segment registered with shmId ddc3b8e3cd58475c872a4ecb254c32ea

and

2016-11-11 21:50:24,395 ERROR org.apache.hadoop.hdfs.DFSClient: Failed to close inode 180549916
java.net.ConnectException: Connection refused

I think this is why the RegionServer service stopped, which is a different problem, but I don't understand why cloudera-scm-agent would have started using its port.

 

 

Explorer
Posts: 9
Registered: ‎10-18-2016

Re: Cloudera-scm-agent Using RegionServer Port

Update:

 

This is happening on Datanodes too when I restart that role. Here are my observations so far:

Prior to shutdown, I checked processes and verified that cmf-agent was running:

root     14814     1 10 Oct14 ?        3-15:19:42 python2.6 /usr/lib64/cmf/agent/build/env/bin/cmf-agent --package_dir /usr/lib64/cmf/service --agent_dir /var/run/cloudera-scm-agent --lib_dir /var/lib/cloudera-scm-agent --logfile /var/log/cloudera-scm-agent/cloudera-scm-agent.log --daemon --comm_name cmf-agent --pidfile /var/run/cloudera-scm-agent/cloudera-scm-agent.pid
Posts: 998
Topics: 1
Kudos: 249
Solutions: 126
Registered: ‎04-22-2014

Re: Cloudera-scm-agent Using RegionServer Port

The agent's listening port is configured in the /etc/cloudera-scm-agent/config.ini file.  The default is 9000:

 

# listening_port=9000

 

Unless you configure the agent to listen on another port, it will listen on 9000, and the port cannot change while the agent is running.
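
As a rough illustration of that default, here is a minimal Python 3 sketch that reads the setting the same way an INI parser would: a commented-out `# listening_port=9000` line is ignored, so the value falls back to 9000. The function name `agent_listening_port` is hypothetical, and the `[General]` section name is an assumption about the file layout; this is not the agent's own code.

```python
import configparser

DEFAULT_LISTENING_PORT = 9000  # default stated above


def agent_listening_port(config_text, section="General"):
    """Return the configured listening port, or the default if unset.

    Hypothetical helper; `section` is an assumption about how
    config.ini is organised.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    # Lines starting with '#' are comments, so a commented-out
    # listening_port does not count as configured.
    return parser.getint(section, "listening_port",
                         fallback=DEFAULT_LISTENING_PORT)


if __name__ == "__main__":
    sample = "[General]\n# listening_port=9000\n"
    print(agent_listening_port(sample))
```

To check a real host, pass in the contents of /etc/cloudera-scm-agent/config.ini instead of the sample string.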

 

Could you show the steps you take that lead you to believe that the agent is usurping the other roles' ports?

 

It sounds more likely that the supervisor reported that the process had stopped, so the agent relayed the information to Cloudera Manager.  The process, though, may not have stopped.

I'd check for errors/information in the agent log (/var/log/cloudera-scm-agent/cloudera-scm-agent.log) when you tried stopping or restarting the roles.

 

 

Explorer
Posts: 9
Registered: ‎10-18-2016

Re: Cloudera-scm-agent Using RegionServer Port

Thanks for your response. I was actually in the middle of posting a follow-up because I was restarting Datanodes one by one and noticed the same thing happening. Here are the steps I took to observe the behavior the second time it happened:

 

- ps to see what pid cloudera-scm-agent was using

root     14814     1 10 Oct14 ?        3-15:19:42 python2.6 /usr/lib64/cmf/agent/build/env/bin/cmf-agent --package_dir /usr/lib64/cmf/service --agent_dir /var/run/cloudera-scm-agent --lib_dir /var/lib/cloudera-scm-agent --logfile /var/log/cloudera-scm-agent/cloudera-scm-agent.log --daemon --comm_name cmf-agent --pidfile /var/run/cloudera-scm-agent/cloudera-scm-agent.pid

 

- run netstat to see what was listening on the Datanode port, 50020

 netstat -tapn | grep 50020
tcp        0      0 10.192.168.1.22:50020         0.0.0.0:*                   LISTEN      16876/java

 

- Restart the Datanode role for the host in Cloudera Manager. The role stops successfully, but doesn't start up again. The error message is:

java.net.BindException: Problem binding to [datanode3.companynet:50020] java.net.BindException: Address already in use;

 

- On the Datanode, run netstat for port 50020 again and notice that a different process, python2.6, which has the same pid as cloudera-scm-agent, is now listening

netstat -tapn | grep 50020
tcp     1430      0 192.168.1.22:50020         192.168.1.31:7180             CLOSE_WAIT  14814/python2.6

 

- Restart cloudera-scm-agent, and then start the Datanode from Cloudera Manager. Everything is happy again. 

 

I read http://blog.cloudera.com/blog/2013/07/how-does-cloudera-manager-work/ and learned some useful info. Looking at the various pids, I now understand that the Datanode java process I see is the child of supervisord, launched by python. So if I'm understanding correctly, what I thought was a port hijacking is actually the Datanode JVM being shut down, but supervisord not shutting down the app itself properly, which is why port 50020 (in this example) is showing in CLOSE_WAIT. Is this something that I can control in my agent or role configuration? Am I on the right track?

Posts: 998
Topics: 1
Kudos: 249
Solutions: 126
Registered: ‎04-22-2014

Re: Cloudera-scm-agent Using RegionServer Port

Wow, this is kind of wacky.

 

First, I'd check to see if there is a Datanode running with "ps". If there isn't one running, the supervisor did its job of killing the process.

 

Let's look more closely at this:

 

netstat -tapn | grep 50020
tcp 1430 0 192.168.1.22:50020 192.168.1.31:7180 CLOSE_WAIT 14814/python2.6

 

We no longer see any listening port, but we still see an open connection.  Here is what is confusing to me: the "remote" host/port is typically Cloudera Manager's (192.168.1.31:7180); however, as you say, the PID of the process using the socket is the agent's.  I'm trying to come up with a theory of how the connection could be from CM while the agent now owns the connection on the 192.168.1.22:50020 socket.
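
To make the difference between a listener and a leftover connection explicit, here is a minimal Python 3 sketch that splits a `netstat -tapn` line into its columns. The names `SocketEntry`, `parse_netstat_line`, and `is_listener` are made up for illustration, and the column order is assumed to match the Linux output pasted above.

```python
from collections import namedtuple

# Hypothetical record type for one row of `netstat -tapn` output.
SocketEntry = namedtuple(
    "SocketEntry",
    "proto local_addr local_port remote_addr remote_port state pid program",
)


def parse_netstat_line(line):
    """Split one TCP row into named fields (assumed column layout)."""
    fields = line.split(None, 6)
    if len(fields) != 7:
        raise ValueError("unexpected netstat line: %r" % line)
    proto, _recv_q, _send_q, local, remote, state, owner = fields
    local_addr, local_port = local.rsplit(":", 1)
    remote_addr, remote_port = remote.rsplit(":", 1)
    pid, _, program = owner.strip().partition("/")
    return SocketEntry(
        proto, local_addr, int(local_port),
        remote_addr, remote_port,  # remote port stays a string: may be "*"
        state,
        int(pid) if pid.isdigit() else None,  # "-" when the owner is hidden
        program,
    )


def is_listener(entry, port):
    """True only for a socket in LISTEN state on the given local port."""
    return entry.state == "LISTEN" and entry.local_port == port


if __name__ == "__main__":
    line = ("tcp     1430      0 192.168.1.22:50020         "
            "192.168.1.31:7180             CLOSE_WAIT  14814/python2.6")
    entry = parse_netstat_line(line)
    print(entry.state, entry.pid, is_listener(entry, 50020))
    # prints: CLOSE_WAIT 14814 False
```

Run against the line above, the entry is owned by PID 14814 but is not a listener on 50020; it is an established-then-half-closed connection whose local port happens to be 50020.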

 

I don't have an answer at this time; I'll need to think more about this.  Wild stuff.

Explorer
Posts: 9
Registered: ‎10-18-2016

Re: Cloudera-scm-agent Using RegionServer Port

Thanks for confirming that I'm not nuts. :)