In this article, we’ll discuss tuning repository settings to achieve the highest possible throughput in MiNiFi C++. We’ll begin by benchmarking the SSD that we have. Based on reported numbers, the SSD in this 15” MacBook Pro is rated at 676 MB/s sequential write and 728 MB/s sequential read. That is far from the fastest drive available, but our goal with this post is to achieve the highest possible throughput with the greatest guarantees. We’ll try different variations with and without volatile stores.
Our benchmark shows that we are achieving near-optimal speeds across various runs.
We’ll create a MiNiFi C++ processor named ThroughputMeasure [1] that will be placed at the end of our graph to measure the number of bytes our flow outputs. In our case, we’ll generate a flow file and simply measure the output. This won’t give us a completely accurate measure; however, it will be useful for tuning our setup.
The first thing we need to be cognizant of is having enough space in the connection queue from the GenerateFlowFile processor to the ThroughputMeasure processor to avoid backpressure. Therefore, we will set the max queue size to 0 (which makes it limitless) and the max queue data size to 1 GB. You may play with this setting to get a relative measurement; the higher you go, the more likely you are to avoid starving the ThroughputMeasure processor [1][2].
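For illustration, the flow and its connection might look like the following config.yml fragment. This is only a sketch: the connection keys assume the standard MiNiFi YAML schema, the GenerateFlowFile properties are placeholders, and the ThroughputMeasure class assumes the custom processor from [1] is registered under that name.

Processors:
    - name: GenerateFlowFile
      class: org.apache.nifi.processors.standard.GenerateFlowFile
      scheduling strategy: TIMER_DRIVEN
      scheduling period: 0 sec
      Properties:
          # assumed property value; adjust to taste
          File Size: 1 kB
    - name: ThroughputMeasure
      class: ThroughputMeasure
      scheduling strategy: TIMER_DRIVEN
      scheduling period: 0 sec
      # assuming the processor exposes a success relationship
      auto-terminated relationships list:
          - success
Connections:
    - name: GenerateFlowFile/success/ThroughputMeasure
      source name: GenerateFlowFile
      source relationship name: success
      destination name: ThroughputMeasure
      # 0 removes the object-count limit; the data-size cap keeps memory bounded
      max work queue size: 0
      max work queue data size: 1 GB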
The next step is to begin changing max bytes and counts for all repositories. You may use the following values as I did:
nifi.volatile.repository.options.provenance.max.bytes=102439390
nifi.volatile.repository.options.flowfile.max.bytes=102439309
nifi.volatile.repository.options.content.max.bytes=1023999999
nifi.volatile.repository.options.content.max.count=500000
nifi.volatile.repository.options.content.minimal.locking=false
The reason we’re using these values as the baseline is to increase the maximum bytes our volatile repositories may hold. In a subsequent post, I’ll discuss the process for arriving at these numbers. Lowering them will naturally reduce the maximum amount of data a volatile repository may consume, but it will also dramatically lower the throughput; in this post we’re interested in achieving the highest possible throughput. The final option above, "content.minimal.locking", disables a feature of the volatile content repository that is better suited to higher concurrent access. The feature, titled "minimal locking", favors CAS operations over mutex-based locking and therefore performs better under higher concurrency. For the purposes of this test we know we won't have elevated concurrency, so disabling this feature works to our benefit.
For the repositories themselves, we have several options:
| Repo Name | Repo Type | Storage Type | Description |
| --- | --- | --- | --- |
| NoOpRepository | no-op | n/a | A no-op repository that validates all puts and removals. Can be useful for testing. |
| FlowFileRepository | flowfile | RocksDB (disk) | Stores FlowFile records in a RocksDB database; a WAL is used to improve reliability and performance. |
| ProvenanceRepository | provenance | RocksDB (disk) | Stores provenance events in a RocksDB database; a WAL is used to improve reliability and performance. |
| VolatileFlowFileRepository | flowfile | memory | Stores FlowFile records in memory, without persisting to disk. |
| VolatileProvenanceRepository | provenance | memory | Stores provenance events in memory, without persisting to disk. |
| VolatileContentRepository | content | memory | Stores content in memory, without persisting to disk. |
| DatabaseContentRepository | content | RocksDB (disk) | Stores content in a RocksDB database, reducing the number of inodes on disk. |
| FileSystemRepository | content | disk | Stores content on disk, in individual content files. |
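To select among these implementations, the repository classes are set in minifi.properties. As a sketch (the key names here are assumed from the MiNiFi C++ documentation; verify them against your build), an all-volatile configuration would look something like:

# illustrative only; verify key and class names against your MiNiFi C++ release
nifi.flowfile.repository.class.name=VolatileFlowFileRepository
nifi.provenance.repository.class.name=VolatileProvenanceRepository
nifi.content.repository.class.name=VolatileContentRepository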
In measuring performance we can safely eliminate the NoOpRepository as an option, since it provides no functional benefit. You can use it if you wish not to maintain provenance; however, for the purposes of the rest of this article we'll be maintaining our traditional flow of operations.
The test plan is to run a series of writes with a clean repository for each configuration. We'll treat the flow file and provenance repository choices as a single combined option and intersect the results with the type of content repository. We'll run each configuration for an hour and report the median throughput.
| Content Repo | FlowFile + Provenance | FlowFile + Volatile Provenance | Volatile FlowFile + Provenance | Volatile FlowFile + Volatile Provenance |
| --- | --- | --- | --- | --- |
| FileSystem | 667 MB/s | 650 MB/s | 656 MB/s | 628 MB/s |
| VolatileContent | 524 MB/s | 510 MB/s | 1027 MB/s | 1058 MB/s |
| DatabaseContent | 118 MB/s | 132 MB/s | 150 MB/s | 202 MB/s |
The results above show that the FileSystem content repository responds at roughly equivalent speeds across configurations. The VolatileContentRepository was slower when using a non-volatile flow file repository. This is because the traditional flow file repository removes entries in batches: within ten seconds, the VolatileContentRepository would cause backpressure due to insufficient space while waiting on the FlowFile repository to clear flow files. In the case of the VolatileFlowFileRepository, we see higher throughput, as would be expected with a volatile repository.
The DatabaseContentRepository performed poorly across the board. This is likely because of the manner in which it flushes the write-ahead log: since we must guarantee the content, we enforce a flush on every write. Where the DatabaseContentRepository thrives is on file systems with limited resources. When we pit it against the FileSystemRepository on a Raspberry Pi, we see vastly different results:
| Content Repo | FlowFile + Provenance | FlowFile + Volatile Provenance | Volatile FlowFile + Provenance | Volatile FlowFile + Volatile Provenance |
| --- | --- | --- | --- | --- |
| FileSystem | Fail | Fail | 32 MB/s | 31 MB/s |
| DatabaseContent | 33 MB/s | 32 MB/s | 33 MB/s | 32 MB/s |
The results above are quite intriguing. The throughput of our normal content repository was limited by the speed of our SSD. Configuring all repos to be volatile, while beneficial, may not be the most desirable choice given the environmental conditions. In the example here we are quite limited by the amount of data that GenerateFlowFile produces; if we change our data source to a trove of already-formed data, we may see different results. In my next article I will discuss using this and employing more automated ways of controlling throughput.
[1] https://gist.github.com/phrocker/e7814920b5a724ace5016aa2ccd1ba7a
[2] https://github.com/phrocker/nifi-minifi-cpp/tree/ThroughputMeasure