java.net.BindException: Cannot assign requested address
Labels: HDFS
Created 07-24-2018 03:37 AM
Hello,
We are seeing the error below for some job failures:
+++++++
INFO - java.net.BindException: Problem binding to [hostname/IP:0] java.net.BindException: Cannot assign requested address;
+++++++
As per the Apache wiki:
++++++++++
If the port is "0", then the OS is looking for any free port -so the port-in-use and port-below-1024 problems are highly unlikely to be the cause of the problem. Hostname confusion and network setup are the likely causes.
++++++++++
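To see the port-0 behaviour the wiki describes, here is a minimal Java sketch (my own illustration, not taken from the failing job): binding with port 0 makes the OS hand out any free ephemeral port, which is why port-in-use and privileged-port problems are ruled out.

import java.net.InetSocketAddress;
import java.net.Socket;

public class AnyFreePort {
    public static void main(String[] args) throws Exception {
        try (Socket s = new Socket()) {
            // Port 0 asks the OS for any currently free ephemeral port,
            // much like the client-side bind shown in the error above.
            s.bind(new InetSocketAddress(0));
            System.out.println("OS assigned port " + s.getLocalPort());
        }
    }
}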
The workflow job scheduler's hostname is the one mentioned in the error above, and this happens during an HDFS command execution step. Any idea why this is happening?
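One quick check for the hostname-confusion case the wiki mentions: does the scheduler host's name resolve to an address the machine actually owns? A minimal sketch (the host name below is a placeholder for the one in the error):

import java.net.InetAddress;
import java.net.NetworkInterface;

public class HostnameCheck {
    public static void main(String[] args) throws Exception {
        String host = "scheduler_hostname"; // placeholder: use the host from the error
        InetAddress addr = InetAddress.getByName(host);
        System.out.println(host + " resolves to " + addr.getHostAddress());
        // If this prints false on the machine reporting the error, any bind
        // to that address will fail with "Cannot assign requested address".
        System.out.println("Owned by a local interface: "
                + (NetworkInterface.getByInetAddress(addr) != null));
    }
}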
Created 07-24-2018 03:40 AM
INFO - 2018-07-24 05:16:24,511 INFO [main] retry.RetryInvocationHandler (RetryInvocationHandler.java:invoke(148)) - Exception while invoking getFileInfo of class ClientNamenodeProtocolTranslatorPB over namenode_host/IP after 1 fail over attempts. Trying to fail over immediately.
INFO - java.net.BindException: Problem binding to [scheduler_hostname/IP:0] java.net.BindException: Cannot assign requested address; For more details see: http://wiki.apache.org/hadoop/BindException
INFO - at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
INFO - at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
INFO - at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
INFO - at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
INFO - at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:720)
INFO - at org.apache.hadoop.ipc.Client.call(Client.java:1476)
INFO - at org.apache.hadoop.ipc.Client.call(Client.java:1409)
INFO - at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
INFO - at com.sun.proxy.$Proxy16.getFileInfo(Unknown Source)
INFO - at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:771)
INFO - at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
INFO - at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
INFO - at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
INFO - at java.lang.reflect.Method.invoke(Method.java:606)
INFO - at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
INFO - at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
INFO - at com.sun.proxy.$Proxy17.getFileInfo(Unknown Source)
INFO - at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2123)
INFO - at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1253)
INFO - at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1249)
INFO - at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
INFO - at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1249)
INFO - at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1417)
INFO - Caused by: java.net.BindException: Cannot assign requested address
INFO - at sun.nio.ch.Net.connect0(Native Method)
INFO - at sun.nio.ch.Net.connect(Net.java:465)
INFO - at sun.nio.ch.Net.connect(Net.java:457)
INFO - at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:670)
INFO - at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:192)
INFO - at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
INFO - at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
INFO - at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:615)
INFO - at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:714)
INFO - at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:376)
INFO - at org.apache.hadoop.ipc.Client.getConnection(Client.java:1525)
INFO - at org.apache.hadoop.ipc.Client.call(Client.java:1448)
INFO - ... 17 more
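For reference, the "Cannot assign requested address" at the bottom of the trace is what the JVM raises when a socket is asked to use a local address the machine does not own. A minimal repro sketch (192.0.2.1 is a reserved TEST-NET-1 address that no local interface should carry):

import java.net.InetSocketAddress;
import java.net.Socket;

public class BindFailureRepro {
    public static void main(String[] args) throws Exception {
        try (Socket s = new Socket()) {
            // No local NIC owns 192.0.2.1 (TEST-NET-1, reserved for
            // documentation), so this bind fails with
            // java.net.BindException: Cannot assign requested address
            s.bind(new InetSocketAddress("192.0.2.1", 0));
        }
    }
}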
Created 07-29-2018 07:54 PM
A common cause is misuse of client software: either excessive connections are created because the code does not use a shared connection pool, or connections leak because they are never closed. It can also stem from lower-level problems with socket closure, such as the FIN stage of TCP not being processed correctly, which causes the OS to hold the port open for an extended period while it waits for the final close to complete.
Are you perhaps executing a lot of concurrent programs on your cluster, or using a multi-threaded app that builds a new network client (for HDFS, etc.) in each thread?
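If it is the per-thread-client pattern, note that Hadoop's FileSystem.get() already caches and shares one instance per (URI, configuration, user), so threads can reuse a single client rather than each constructing their own. A rough sketch of the intended usage:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SharedFsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // FileSystem.get() returns a cached instance shared by all callers
        // with the same URI/conf/user, so threads reuse one client instead
        // of each opening a fresh connection.
        FileSystem fs = FileSystem.get(conf);
        System.out.println(fs.exists(new Path("/tmp")));
        // Avoid calling fs.close() from individual threads: it closes the
        // cached instance for every other user of it as well.
    }
}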
When you experience this, you could run an lsof check on the host of the failing task to find which PID(s) occupy most of the client-side ephemeral ports, and whether there is a pattern to their destinations. That can help pinpoint where the problem lies and which of the above categories it falls into.
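One way to script that check (a sketch assuming lsof is installed; -nP keeps hosts and ports numeric, -iTCP restricts the output to TCP sockets):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

public class LsofTally {
    public static void main(String[] args) throws Exception {
        // Equivalent of running `lsof -nP -iTCP` and tallying by process.
        Process p = new ProcessBuilder("lsof", "-nP", "-iTCP").start();
        Map<String, Integer> countByPid = new HashMap<>();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            r.readLine(); // skip the header row
            for (String line; (line = r.readLine()) != null; ) {
                String[] f = line.trim().split("\\s+");
                if (f.length > 1) {
                    // f[0]=COMMAND, f[1]=PID: count sockets per process
                    countByPid.merge(f[0] + "/" + f[1], 1, Integer::sum);
                }
            }
        }
        // Print the ten busiest COMMAND/PID pairs.
        countByPid.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(10)
                .forEach(e -> System.out.println(e.getValue() + "\t" + e.getKey()));
    }
}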
