The release of version 1.0 marks another major milestone for Storm. Since becoming an Apache project in September 2013, much work has gone into maturing the feature set and improving performance by reworking or tweaking various core components.
In this blog we take a look at the performance improvements in Storm since its incubation into Apache. To quantify them, we compare the performance numbers of Storm v0.9.0.1, the last pre-Apache release, with those of the most recent release, Storm v1.0.1. Storm v0.9.0.1 has also been used as a reference point for performance comparisons against Heron.
Given recent efforts to benchmark Storm “at scale”, here we examine performance from a different angle: we narrow the focus to a few specific core areas of Storm, using a collection of simple topologies. To contain the scope, we limit ourselves to Storm core (i.e. no Trident).
Each topology was given at least 4 minutes of “warm-up” execution before measurements were taken. Subsequently, after a minimum of 10 minutes, metrics were captured from the Web UI for the last 10-minute window. The captured numbers have been rounded for readability. In all cases ACKing was enabled, with 1 ACKer bolt executor. Throughput (i.e. tuples/sec) was calculated by dividing the total ACK count for the 10-minute window by 600.
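As a concrete example of that arithmetic, a run that records 52.2 million ACKs in the window would be reported as 52,200,000 / 600 = 87,000 tuples/sec (illustrative figures).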
Due to some backward incompatibilities (mostly namespace changes) in Storm, two versions of the topologies were written, one for each Storm version. As a general principle we avoided configuration tweaks to tune performance and stayed with default values. The only config setting we applied was to raise the worker's max heap size to 8GB, to ensure memory was not a limiting factor.
Hardware:
All nodes had the following configuration:
Spout Emit Speed:
Here we measure how fast a single spout can emit tuples.
Topology: This is the simplest topology. It consists of a ConstSpout that repeatedly emits the string “some data” and no bolts. Spout parallelism is set to 1, so there is only one instance of the spout executing. Here we measure the number of emits per second. Latency is not relevant as there are no bolts.
Topology Code:
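The original source is not reproduced here; the sketch below shows how such a topology could be wired up against the v1.0.1 API (the v0.9.0.1 variant differs mainly in the old backtype.storm.* package names). The ConstSpout body, its message-id anchoring, and the SpoutEmitSpeedTopo wrapper are illustrative stand-ins, not the original benchmark code:

```java
import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class SpoutEmitSpeedTopo {

    // Minimal stand-in for the ConstSpout described above: emits the same
    // string over and over. The message-id anchoring is an assumption, made
    // so the ACKing path is exercised in the later tests.
    public static class ConstSpout extends BaseRichSpout {
        private final String value;
        private SpoutOutputCollector collector;
        private long msgId = 0;

        public ConstSpout(String value) { this.value = value; }

        @Override
        public void open(Map conf, TopologyContext ctx, SpoutOutputCollector c) {
            this.collector = c;
        }

        @Override
        public void nextTuple() {
            collector.emit(new Values(value), ++msgId); // anchored emit
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer d) {
            d.declare(new Fields("str"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // A single spout executor and no bolts: pure emit speed.
        builder.setSpout("constSpout", new ConstSpout("some data"), 1);

        Config conf = new Config();
        conf.setNumWorkers(1);  // everything in one worker process
        conf.setNumAckers(1);   // 1 ACKer bolt executor, as described above
        // The only tuning applied in these tests: an 8GB worker heap.
        conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS, "-Xmx8g");

        StormSubmitter.submitTopology("SpoutEmitSpeed", conf, builder.createTopology());
    }
}
```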
Measurements:
v0.9.0.1:
Emit rate: 108k tuples/sec
v1.0.1:
Emit rate: 3.2 million tuples/sec
Messaging Speed (Intra-worker):
The goal is to measure the speed at which tuples can be transferred between a spout and a bolt running within the same worker process.
Topology: Consists of a ConstSpout that repeatedly emits the string “some data” and a DevNull bolt that ACKs and discards every incoming tuple. The spout, bolt and acker were given 1 executor each. The spout and bolt both ran within the same worker.
Topology Code:
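Again a sketch rather than the original source, reusing the ConstSpout stand-in above and adding a minimal DevNull-style bolt. The class bodies and the localOrShuffleGrouping choice are assumptions:

```java
import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;

public class IntraWorkerTopo {

    // Minimal stand-in for the DevNull bolt described above:
    // ACKs every incoming tuple and discards it.
    public static class DevNullBolt extends BaseRichBolt {
        private OutputCollector collector;

        @Override
        public void prepare(Map conf, TopologyContext ctx, OutputCollector c) {
            this.collector = c;
        }

        @Override
        public void execute(Tuple tuple) {
            collector.ack(tuple); // ACK, then drop the tuple
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer d) {
            // emits nothing
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // One executor each for spout and bolt (plus 1 acker from the config below).
        builder.setSpout("constSpout", new SpoutEmitSpeedTopo.ConstSpout("some data"), 1);
        builder.setBolt("devNullBolt", new DevNullBolt(), 1)
               .localOrShuffleGrouping("constSpout"); // grouping is an assumption

        Config conf = new Config();
        conf.setNumWorkers(1); // single worker: spout-to-bolt messaging stays in-process
        conf.setNumAckers(1);  // (plus the 8GB worker heap setting shown in the first sketch)
        StormSubmitter.submitTopology("IntraWorkerMsgSpeed", conf, builder.createTopology());
    }
}
```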
Measurements:
v0.9.0.1:
Throughput: 87k tuples/sec, Latency: 16 ms
v1.0.1:
Throughput: 233k tuples/sec, Latency: 3.4 ms
Messaging Speed (Inter-worker 1):
The goal is to measure the speed at which tuples are transferred between a spout and a bolt running in two separate worker processes on the same machine.
Topology: Same topology as the one used for intra-worker messaging, but with the spout and bolt run on two separate workers on the same host. The bolt and the acker were observed to be running on the same worker.
Topology Code:
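In code, the only change from the intra-worker sketch is the worker count; Storm's default even scheduler then spreads the three executors (spout, bolt and acker) across the two processes, which is consistent with the bolt and acker being observed in the same worker:

```java
// Same wiring as the intra-worker sketch; two worker processes on one host.
conf.setNumWorkers(2);
```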
Measurements:
v0.9.0.1:
Throughput: 48k tuples/sec, Latency: 170 ms
v1.0.1:
Throughput: 287k tuples/sec, Latency: 8 ms
Messaging Speed (Inter-worker 2):
The goal is to measure the speed at which tuples are transferred when the spout, bolt and acker are all running in separate worker processes on the same machine.
Topology: Same topology as the one used for intra-worker messaging, but with the spout, bolt and acker run on three separate workers on the same host.
Topology Code:
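The corresponding one-line change to the sketch:

```java
// Three worker processes on one host: the spout, bolt and acker
// executors each land in their own process.
conf.setNumWorkers(3);
```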
Measurements:
v0.9.0.1:
Throughput: 43k tuples/sec, Latency: 116 ms
v1.0.1:
Throughput: 292k tuples/sec, Latency: 8.6 ms
Messaging Speed (Inter-host 1):
The goal is to measure the speed at which tuples are transferred between a spout and a bolt running in two separate worker processes on two different machines.
Topology: Same topology as the one used for intra-worker messaging, but with the spout and bolt run on two separate workers on two different hosts. The bolt and the acker were observed to be running on the same worker.
Topology Code:
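The topology config matches Inter-worker 1; the cross-host placement comes from the cluster itself. Assuming each supervisor offers a single worker slot (an assumption about the test setup, not something stated above):

```java
// Two workers as before; with one worker slot per supervisor,
// the two workers necessarily land on different machines.
conf.setNumWorkers(2);
```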
Measurements:
v0.9.0.1:
Throughput: 48k tuples/sec, Latency: 845 ms
v1.0.1:
Throughput: 316k tuples/sec, Latency: 13.3 ms
Messaging Speed (Inter-host 2):
Here we measure the speed at which tuples are transferred when the spout, bolt and acker are all running in separate worker processes on three different machines.
Topology: Again the same topology as Inter-host 1, but this time the acker ran on a separate host.
Topology Code:
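And the three-host variant, under the same one-slot-per-supervisor assumption:

```java
// Three workers, one per host: every hop (spout -> bolt, bolt -> acker)
// now crosses the network.
conf.setNumWorkers(3);
```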
Measurements:
v0.9.0.1:
Throughput: 50k tuples/sec, Latency: 1,700 ms
v1.0.1:
Throughput: 303k tuples/sec, Latency: 7.4 ms
Summary of throughput (tuples/sec):

| Storm version | Spout Emit | MSGing IntraWorker | MSGing InterWorker 1 | MSGing InterWorker 2 | MSGing InterHost 1 | MSGing InterHost 2 |
| --- | --- | --- | --- | --- | --- | --- |
| v0.9.0.1 | 108,000 | 87,000 | 48,000 | 43,000 | 48,000 | 50,000 |
| v1.0.1 | 3,200,000 | 233,000 | 287,000 | 292,000 | 316,000 | 303,000 |
Summary of latency (ms):

| Storm version | MSGing IntraWorker | MSGing InterWorker 1 | MSGing InterWorker 2 | MSGing InterHost 1 | MSGing InterHost 2 |
| --- | --- | --- | --- | --- | --- |
| v0.9.0.1 | 16 | 170 | 116 | 845 | 1,700 |
| v1.0.1 | 3 | 8 | 9 | 13 | 7 |