Support Questions

Is anyone else having errors with Netty using Storm?

Contributor

I'm getting errors from my spout that look like:

2015-12-03 08:30:57.425 b.s.m.n.Client [ERROR] discarding 1 messages because the Netty client to Netty-Client-storm03/XXX.XXX.XXX.XXX:6700 is being closed

Is this an error with Netty or something to do with my stack? We're on HDP 2.3.2.

Adam

1 ACCEPTED SOLUTION

Newer versions of Storm use the Netty client library instead of the older ZeroMQ transport for passing messages between the worker processes on the Supervisor nodes. These connections are critical, especially when a Storm executor task dies and a new task has to be launched to take the place of the dead one. Storm tasks can die for several reasons: OutOfMemory errors, RuntimeExceptions, errors accessing storage, etc. Storm provides auto-recovery and fault tolerance through Nimbus, which monitors the running executors and uses heartbeats to determine whether each task is alive or dead.

The Netty client is supposed to discard any pending messages after a timeout, and that appears to be what is happening in this case. Normally this Netty error is a symptom of some other root cause. Can you check the Storm logs (under /var/log/storm) for that root cause, perhaps an OutOfMemory error in one of the topology-specific worker logs for ports 6700 or 6701?

It is normal for Storm to re-establish Netty connections in this scenario. Problems re-connecting can be caused by using the wrong hostname or IP address (NIC) for Nimbus, which is set by the "nimbus.host" parameter (or "nimbus.seeds", depending on your version of Storm). Make sure you configure Storm Nimbus and the Supervisors to use the same hostname/IP address.
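
For reference, a minimal storm.yaml sketch of that setting (the hostname is a placeholder, and only one of the two keys applies to any given Storm version); the same value should appear on the Nimbus node and on every Supervisor:

nimbus.host: "nimbus01.example.com"
# Storm 1.x and later use a list of seed hosts instead:
# nimbus.seeds: ["nimbus01.example.com"]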

11 REPLIES

Explorer

Another thing to check is that the local hostname matches the name resolved by DNS. If for some reason they don't, and you can't modify DNS, you can override the local hostname for the storm node by setting `storm.local.hostname` in the YAML config for that node.
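
For example, a one-line sketch of that override in the affected node's storm.yaml (the hostname below is a placeholder for whatever name the other nodes actually resolve):

storm.local.hostname: "storm03.example.com"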

Contributor

We are fairly certain it is not a DNS issue, based on the other traffic on the cluster and the fact that the name resolution is visible in the error message itself:

2015-12-04 06:26:01.814 b.s.m.n.Client [ERROR] discarding 1 messages because the Netty client to Netty-Client-lnxvmhdpspXX.smrcy.com/XX.XX.XXX.XX:6703 is being closed

Contributor

This is turning out to be difficult to diagnose because the UI is not reporting the error and the topology is not re-establishing the links. The spout reports that it is emitting tuples, but the first bolt in the chain does not report receiving them.

As Taylor said, make sure you have resolvable DNS hostnames for all Storm supervisor nodes. Second, make sure you have wired your first Bolt to the Spout via some type of grouping, such as a fields grouping or shuffle grouping, through the TopologyBuilder:

// Storm 0.10 (HDP 2.3.x) packages; Storm 1.x uses org.apache.storm instead
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.testing.TestWordSpout;

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("words", new TestWordSpout(), 10);
// ExclamationBolt is the user-defined bolt (as in the storm-starter ExclamationTopology example)
builder.setBolt("exclaim1", new ExclamationBolt(), 3)
        .shuffleGrouping("words");
builder.setBolt("exclaim2", new ExclamationBolt(), 2)
        .shuffleGrouping("exclaim1");

Contributor

All of the spouts and bolts are properly connected to each other, and during normal operation the topology passes messages along. However, after an error on one of the bolts, tuples leave one bolt headed for the next in the chain but are never received.

Please make sure you are running Storm Supervisors on each node where you want to run workers, and that those Supervisors are up and running. You can monitor the health of the supervisors through Ambari: click the Services tab and select Storm. If any supervisor is down, restart it. Also, based on your initial post, if you have just added new Storm nodes (supervisors), make sure the required worker ports (6700, 6701, etc.) are open and available so that Netty connections can reach each node.
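
As a point of reference, the worker ports a Supervisor offers are whatever is listed under supervisor.slots.ports in its storm.yaml; the sketch below shows the usual default slots (adjust it to the number of workers you run per node, and allow these ports through the firewall between all Storm nodes):

supervisor.slots.ports:
    - 6700
    - 6701
    - 6702
    - 6703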

Explorer

We can also enable verbose GC logging in the worker child opts and look at the GC logs to see whether the workers are running out of memory:

-Xloggc:/var/log/hadoop/$USER/gc.log-`date +'%Y%m%d%H%M'` -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
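
For example, a sketch of how flags like these might be wired into storm.yaml (the heap size and log path are illustrative; %ID% is substituted by Storm with the worker's port, so each worker writes its own GC log):

worker.childopts: "-Xmx768m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/var/log/storm/gc-worker-%ID%.log"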

@Adam Doyle Storm uses the Netty library to pass tuples directly between bolts and spouts, rather than going through a message queue. Therefore, when a Storm worker dies, the upstream bolts need to re-establish their connections in order to resume the stream. Nimbus, which detects the lost worker, will start a new instance of the crashed worker on the same or a different Storm node. If the upstream bolt cannot establish a connection to the new instance of the downstream bolt before the retry timeout, it will no longer pass tuples downstream. This is the auto-recovery mechanism in Storm; it depends on the reconnection logic, which in turn depends on proper network infrastructure (DNS resolution, firewall rules, etc.). Is there anything different about your new Storm (supervisor) nodes that would make them hard to reach over Netty?
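
If the reconnect attempts seem to give up too quickly, the Netty retry behaviour can also be tuned in storm.yaml; the values below are purely illustrative, not recommendations:

storm.messaging.netty.max_retries: 30
storm.messaging.netty.min_wait_ms: 100
storm.messaging.netty.max_wait_ms: 1000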