How many times have we heard that a problem can't be solved until it is fully understood? Yet all of us, however experienced and technically skilled, still tend to jump straight into the solution space, bypassing key steps of the scientific approach. We don't have the patience and, quite often, the business is on our back to deliver a quick fix. We just want to get it done! We find ourselves like the squirrel in Ice Age, chasing the acorn on experience and minimal experimentation alone, dreaming that it will fall into our lap. This article re-emphasizes a repeatable approach to tuning Storm topologies: understand the problem, form a hypothesis, test that hypothesis, collect the data, analyze the data, and draw conclusions based on facts rather than on gut feel. Hopefully it helps you catch more acorns!
Understand the functional problem the topology solves; quite often the best tuning is to execute a step differently in the workflow for the same end result, for example applying a filter first and then executing the step, rather than executing the step every time and paying the penalty every time. You can also prioritize success, or learn to lose in order to win. This is also the time to get a sense of the business SLA and forecasted growth. Document the SLA per topology. Try to understand the reasons behind the SLA and identify opportunities for trade-offs.
Gather data using Storm UI and other tools specific to the environment. Storm UI is, most of the time, your first go-to tool for tuning.
Another tool to use is Storm's built-in Metrics-Collecting API, available since 0.9.x. It lets you embed metrics directly in your topology code. You may have tuned the topology, deployed it to production, everybody is happy, and two days later the issue comes back. Wouldn't it be nice to have metrics embedded in your code, already focused on the usual suspects, e.g. external calls to a web service or a SQL lookup against a data store, specific customers, or some other strong business entity that, due to the data and the workflow, can become the bottleneck?
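As an illustration only (not code from the article), here is a minimal sketch of a bolt that registers two built-in metrics around an external lookup. It assumes Storm 1.x package names (0.9.x used backtype.storm.*); the metric names, field name, and the callExternalService placeholder are hypothetical.

```java
import org.apache.storm.metric.api.CountMetric;
import org.apache.storm.metric.api.MeanReducer;
import org.apache.storm.metric.api.ReducedMetric;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;

import java.util.Map;

public class LookupBolt extends BaseRichBolt {
    private transient CountMetric lookupCount;        // how many external calls were made
    private transient ReducedMetric lookupLatencyMs;  // mean latency of those calls
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        // Report to the registered metrics consumers every 60 seconds.
        lookupCount = context.registerMetric("lookup-count", new CountMetric(), 60);
        lookupLatencyMs = context.registerMetric("lookup-latency-ms",
                new ReducedMetric(new MeanReducer()), 60);
    }

    @Override
    public void execute(Tuple tuple) {
        long start = System.currentTimeMillis();
        // callExternalService is a placeholder for the web-service or SQL lookup
        // that tends to be the usual suspect.
        callExternalService(tuple.getStringByField("key"));
        lookupCount.incr();
        lookupLatencyMs.update(System.currentTimeMillis() - start);
        collector.ack(tuple);
    }

    private void callExternalService(String key) {
        // real lookup goes here
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // no downstream output in this sketch
    }
}
```

To actually see the numbers, a metrics consumer has to be registered in the topology config, e.g. conf.registerMetricsConsumer(LoggingMetricsConsumer.class, 1), which writes the values to the workers' metrics log.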
In a Storm topology, you have spouts, bolts, and workers that can each play a role in your topology's challenges.
Learn about topology status, uptime, number of workers, executors, and tasks, high-level stats across four time windows, spout statistics, bolt statistics, the visualization of the spouts and bolts and how they are connected, the flow of tuples between all of the streams, and all configurations for the topology.
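As a reference point for where those Storm UI numbers come from, here is a minimal, hypothetical sketch (Storm 1.x API; LookupBolt is the metrics sketch above, TestWordSpout ships with Storm): workers come from Config.setNumWorkers, executors from the parallelism hint, and tasks from setNumTasks.

```java
import org.apache.storm.Config;
import org.apache.storm.testing.TestWordSpout;
import org.apache.storm.topology.TopologyBuilder;

public class ParallelismExample {
    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();

        // 2 executors (threads) for the spout.
        builder.setSpout("word-spout", new TestWordSpout(), 2);

        // 4 executors running 8 tasks for the bolt (2 tasks per executor);
        // these are the per-component executor/task counts Storm UI displays.
        builder.setBolt("lookup-bolt", new LookupBolt(), 4)
               .setNumTasks(8)
               .shuffleGrouping("word-spout");

        // 2 worker processes (JVMs) for the whole topology.
        Config conf = new Config();
        conf.setNumWorkers(2);

        // Submit with StormSubmitter or run in a LocalCluster as appropriate.
    }
}
```

The counts here are illustrative only; the point of the tuning loop is to derive your own numbers from measurement, not to copy someone else's.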
These methods can be applied over and over until the SLAs are met. The effort is high, so automating steps is desirable.
When interacting with external services (such as a SOA service, database, or filesystem), it's easy to ratchet parallelism up to a level at which limits in that external service keep your capacity from going any higher. Before you start tuning parallelism in a topology that interacts with the outside world, be positive you have good metrics on that service. Otherwise you could keep turning up the parallelism on a bolt to the point where it brings the external service to its knees, crippling it under a mass of traffic it was never designed to handle.
Let's talk about one of the greatest enemies of fast code: latency. There is latency in accessing local memory, in accessing the hard drive, and in accessing another system over the network. Different interactions have different levels of latency, and understanding the latency in your system is one of the keys to tuning your topology. You'll usually get fairly consistent response times, and then all of a sudden those response times will vary wildly for any number of reasons.
As the topology interacts with external services, let's be smarter about latency. We don't always need to increase parallelism and allocate more resources. We may discover after investigation that the variance is driven by a particular customer or induced by an exceptional event, e.g. elections, the Olympics, etc. Certain records are simply going to be slower to look up.
Sometimes one of those “fast” lookups ends up taking longer. One strategy that has worked well with services exhibiting this problem is to perform the initial lookup attempt with a hard ceiling on the timeout (try various ceilings); if that fails, send the tuple to a less parallelized instance of the same bolt that allows a longer timeout. The end result is that the time to process a large number of messages goes down. Your mileage will vary with this strategy, so test it before you use it.
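A minimal sketch of that idea, assuming Storm 1.x APIs: the bolt tries the lookup with a hard ceiling and, on timeout, re-routes the tuple to a "slow" stream instead of blocking the fast path. The stream name, field names, timeout value, and lookupService placeholder are all hypothetical.

```java
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

import java.util.Map;
import java.util.concurrent.*;

public class FastLookupBolt extends BaseRichBolt {
    public static final String SLOW_STREAM = "slow";
    private static final long TIMEOUT_MS = 50;  // the hard ceiling; tune it experimentally

    private transient ExecutorService executor;
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.executor = Executors.newSingleThreadExecutor();
    }

    @Override
    public void execute(Tuple tuple) {
        String key = tuple.getStringByField("key");
        Future<String> result = executor.submit(() -> lookupService(key));
        try {
            // Fast path: the lookup answered within the hard ceiling.
            collector.emit(tuple, new Values(key, result.get(TIMEOUT_MS, TimeUnit.MILLISECONDS)));
        } catch (TimeoutException e) {
            result.cancel(true);
            // Slow path: hand the tuple off to the less parallelized slow-lookup bolt.
            collector.emit(SLOW_STREAM, tuple, new Values(key));
        } catch (InterruptedException | ExecutionException e) {
            collector.reportError(e);
            collector.fail(tuple);
            return;
        }
        collector.ack(tuple);
    }

    // Placeholder for the real web-service or database call.
    private String lookupService(String key) { return key; }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("key", "value"));
        declarer.declareStream(SLOW_STREAM, new Fields("key"));
    }
}
```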
As part of your tuning strategy, you could end up breaking a bolt into two: one for fast lookups and one for slow lookups. At the very least, let what goes fast go fast; don't back it up behind something slow. Treat the slow ones differently and make sure they stay the exception. You may even decide to process them differently and still get some value out of that data. It is up to you to determine what matters more: accepting an overall bottleneck, marking the slow records for later processing, or disregarding the ones that produce the bottleneck. Maybe a batch approach for those is more efficient. This is a design consideration and a reasonable trade-off most of the time. Are you willing to accept an occasional loss of fidelity and still hit the SLA, because that is what matters most to your business? Is perfection helping or killing your business?
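To make the split concrete, here is a hypothetical wiring sketch that pairs with the FastLookupBolt above. The spout, the SlowLookupBolt (same logic, longer timeout), the parallelism hints, and the worker count are illustrative, not recommendations; derive yours from measurement and from the external service's limits.

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.testing.TestWordSpout;
import org.apache.storm.topology.TopologyBuilder;

public class LookupTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("events", new TestWordSpout(), 2);

        // Fast path: high parallelism, hard timeout ceiling inside the bolt.
        builder.setBolt("fast-lookup", new FastLookupBolt(), 16)
               .shuffleGrouping("events");

        // Slow path: only the tuples that timed out, handled by a few executors
        // with a longer timeout so they don't back up the fast path.
        builder.setBolt("slow-lookup", new SlowLookupBolt(), 2)
               .shuffleGrouping("fast-lookup", FastLookupBolt.SLOW_STREAM);

        Config conf = new Config();
        conf.setNumWorkers(4);
        StormSubmitter.submitTopology("lookup-topology", conf, builder.createTopology());
    }
}
```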
Reference: Storm Applied: Strategies for Real-Time Event Processing by Sean T. Allen, Matthew Jankowski, and Peter Pathirana