In this article, we’ll discuss tuning repository settings to achieve the highest possible throughput in MiNiFi C++. We’ll begin by benchmarking the SSD that we have. Based on reported numbers, the SSD in this 15” MacBook Pro is rated at 676 MB/s sequential write and 728 MB/s sequential read. That is far from the fastest drive available, but our goal with this post is to achieve the highest possible throughput with the greatest guarantees. We’ll try different variations with and without volatile stores.
Our benchmark shows that we are achieving near-optimal speeds across various runs.
We’ll create a MiNiFi C++ processor named ThroughputMeasure [1] that will be placed at the end of our graph to measure the number of bytes our flow outputs. In our case, we’ll generate a flow file and simply measure the output. This won’t give us a completely accurate measure; however, it will be useful for tuning our setup.
The first thing we need to be cognizant of is having enough space in the connection queue from the GenerateFlowFile processor to the ThroughputMeasure processor to avoid backpressure. Therefore, we will set the max queue size to 0 (which makes it limitless) and the max queue data size to 1 GB. You may play with this setting to get a relative measurement; the higher you go, the more likely you are to avoid starving the ThroughputMeasure processor [1][2].
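For illustration, the flow and its connection might look like the following config.yml fragment. This is only a sketch: the connection keys assume the standard MiNiFi YAML schema, the GenerateFlowFile properties are placeholders, and the ThroughputMeasure class assumes the custom processor from [1] is registered under that name.

Processors:
    - name: GenerateFlowFile
      class: org.apache.nifi.processors.standard.GenerateFlowFile
      scheduling strategy: TIMER_DRIVEN
      scheduling period: 0 sec
      Properties:
          # assumed property value; adjust to taste
          File Size: 1 kB
    - name: ThroughputMeasure
      class: ThroughputMeasure
      scheduling strategy: TIMER_DRIVEN
      scheduling period: 0 sec
      # assuming the processor exposes a success relationship
      auto-terminated relationships list:
          - success
Connections:
    - name: GenerateFlowFile/success/ThroughputMeasure
      source name: GenerateFlowFile
      source relationship name: success
      destination name: ThroughputMeasure
      # 0 removes the object-count limit; the data-size cap keeps memory bounded
      max work queue size: 0
      max work queue data size: 1 GB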
The next step is to begin changing max bytes and counts for all repositories. You may use the following values as I did:
nifi.volatile.repository.options.provenance.max.bytes=102439390
nifi.volatile.repository.options.flowfile.max.bytes=102439309
nifi.volatile.repository.options.content.max.bytes=1023999999
nifi.volatile.repository.options.content.max.count=500000
nifi.volatile.repository.options.content.minimal.locking=false
The reason we’re using these values as the baseline is to increase the maximum bytes our volatile repositories may hold. In a subsequent post, I’ll discuss the process for arriving at these numbers. Lowering them will naturally reduce the maximum amount of data a volatile repository may consume, but it will also dramatically lower the throughput; in this post we’re interested in achieving the highest possible throughput. The final option above, "content.minimal.locking", disables a feature of the volatile content repository that is better suited to higher concurrent access. The feature, titled "minimal locking", favors CAS operations over mutex-based locking and therefore performs better under higher concurrency. For the purposes of this test we know we won't have elevated concurrency, so disabling this feature works to our benefit.
For the repositories themselves, we have several options:
| Repo Name | Repo Type | Storage Type | Description |
| --- | --- | --- | --- |
| NoOpRepository | no-op | n/a | A no-op repository that validates all puts and removals. Can be useful for testing. |
| FlowFileRepository | flowfile | RocksDB (disk) | Stores FlowFile records in a RocksDB database; a WAL is used to improve reliability and performance. |
| ProvenanceRepository | provenance | RocksDB (disk) | Stores provenance events in a RocksDB database; a WAL is used to improve reliability and performance. |
| VolatileFlowFileRepository | flowfile | memory | Stores FlowFile records in memory, without persisting to disk. |
| VolatileProvenanceRepository | provenance | memory | Stores provenance events in memory, without persisting to disk. |
| VolatileContentRepository | content | memory | Stores content in memory, without persisting to disk. |
| DatabaseContentRepository | content | RocksDB (disk) | Stores content in a RocksDB database, reducing the number of inodes on disk. |
| FileSystemRepository | content | disk | Stores content on disk, in individual content files. |
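To select among these implementations, the repository classes are set in minifi.properties. As a sketch (the key names here are assumed from the MiNiFi C++ documentation; verify them against your build), an all-volatile configuration would look something like:

# illustrative only; verify key and class names against your MiNiFi C++ release
nifi.flowfile.repository.class.name=VolatileFlowFileRepository
nifi.provenance.repository.class.name=VolatileProvenanceRepository
nifi.content.repository.class.name=VolatileContentRepository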
In measuring performance we can safely eliminate the NoOpRepository as an option, since it provides no functional benefit. You can use it if you wish not to maintain provenance; however, for the purposes of the rest of this article we'll be maintaining our traditional flow of operations.
The test plan is to run a series of writes with a clean repository for each configuration. We'll treat the flow file and provenance repository choices as a single combined option and intersect the results with the type of content repository. We'll run each configuration for an hour and report the median throughput.
| Content Repo | FlowFile + Provenance | FlowFile + Volatile Provenance | Volatile FlowFile + Provenance | Volatile FlowFile + Volatile Provenance |
| --- | --- | --- | --- | --- |
| FileSystem | 667 MB/s | 650 MB/s | 656 MB/s | 628 MB/s |
| VolatileContent | 524 MB/s | 510 MB/s | 1027 MB/s | 1058 MB/s |
| DatabaseContent | 118 MB/s | 132 MB/s | 150 MB/s | 202 MB/s |
The results above show that the FileSystem content repository responds at roughly equivalent speeds across configurations. The VolatileContentRepository was slower when using a non-volatile flow file repository. This is because the traditional flow file repository removes entries in batches: within ten seconds, the VolatileContentRepository would cause backpressure due to insufficient space while waiting on the FlowFile repository to clear flow files. In the case of the VolatileFlowFileRepository, we see higher throughput, as would be expected with a volatile repository.
The DatabaseContentRepository performed poorly across the board. This is likely because of the manner in which it flushes the write-ahead log: since we must guarantee the content, we enforce a flush on every write. Where the DatabaseContentRepository thrives is on file systems with limited resources. When we pit it against the FileSystemRepository on a Raspberry Pi, we see vastly different results:
| Content Repo | FlowFile + Provenance | FlowFile + Volatile Provenance | Volatile FlowFile + Provenance | Volatile FlowFile + Volatile Provenance |
| --- | --- | --- | --- | --- |
| FileSystem | Fail | Fail | 32 MB/s | 31 MB/s |
| DatabaseContent | 33 MB/s | 32 MB/s | 33 MB/s | 32 MB/s |
The results above are quite intriguing. The throughput of our normal content repository was limited by the speed of our SSD. Configuring all repos to be volatile, while beneficial, may not be the most desirable choice given the environmental conditions. In the example here we are quite limited by the amount of data that GenerateFlowFile produces; if we change our data source to a trove of already-formed data, we may see different results. In my next article I will discuss using this and employing more automated ways of controlling throughput.
[1] https://gist.github.com/phrocker/e7814920b5a724ace5016aa2ccd1ba7a
[2] https://github.com/phrocker/nifi-minifi-cpp/tree/ThroughputMeasure