
In this article, we’ll discuss tuning repository settings to achieve the highest possible throughput in MiNiFi C++. We’ll begin by benchmarking the SSD we have available. Based on reported numbers, the SSD in this 15” MacBook Pro is rated at 676 MB/s sequential write and 728 MB/s sequential read. That is far from the fastest available, but our goal in this post is to achieve the highest possible throughput with the greatest guarantees. We’ll try different variations with and without volatile stores.


Our benchmark shows that we are achieving near optimal speeds across various runs.

[Figure: screenshot of the SSD benchmark results]

We’ll create a MiNiFi C++ processor named ThroughputMeasure [1] that will be placed at the end of our graph to measure the bytes output by our flow. In our case, we’ll generate flow files with GenerateFlowFile and simply measure the output. This won’t give us a completely accurate measure; however, it will be useful for tuning our setup.

The first thing to be aware of is that the connection queue from the GenerateFlowFile processor to the ThroughputMeasure processor needs enough space to avoid backpressure. Therefore, we will set the max queue size to 0 (which makes it limitless) and the max queue data size to 1 GB. You may adjust this setting to get a relative measurement. The higher you go, the more likely you are to avoid starving the ThroughputMeasure processor [1] [2].
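The queue limits described above can be sketched as a small check. This is only an illustration of the semantics, not MiNiFi's implementation: the function name `is_backpressured` is hypothetical, and treating 1 GB as a decimal gigabyte is an assumption.

```python
def is_backpressured(queued_count, queued_bytes, max_count, max_bytes):
    """Return True when a connection should stop accepting flow files.

    A limit of 0 is treated as "no limit", matching the article's
    description of a max queue size of 0.
    """
    count_full = max_count > 0 and queued_count >= max_count
    bytes_full = max_bytes > 0 and queued_bytes >= max_bytes
    return count_full or bytes_full


ONE_GB = 1_000_000_000  # assumption: decimal gigabyte

# Millions of small flow files, but only half the byte budget used.
print(is_backpressured(5_000_000, 500_000_000, 0, ONE_GB))

# Few flow files, but the byte budget is exhausted.
print(is_backpressured(10, ONE_GB, 0, ONE_GB))
```

With the count limit disabled, only the data size can trigger backpressure, which is why raising the data size limit reduces the chance of starving the measuring processor.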

Configuring MiNiFi C++

The next step is to begin changing max bytes and counts for all repositories. You may use the following values as I did:

nifi.volatile.repository.options.provenance.max.bytes=102439390

nifi.volatile.repository.options.flowfile.max.bytes=102439309

nifi.volatile.repository.options.content.max.bytes=1023999999

nifi.volatile.repository.options.content.max.count=500000

nifi.volatile.repository.options.content.minimal.locking=false

We use these values as the baseline because they increase the maximum bytes in our volatile repositories. In a subsequent post, I’ll discuss the process for arriving at these numbers. Lowering them will naturally reduce the maximum amount of data a volatile repository may consume, but will also dramatically lower the throughput. In this post we’re interested in achieving the highest possible throughput.

The final option above, "content.minimal.locking", when set to false, disables a feature of the volatile content repository that is better suited to higher concurrent access. The feature, titled "minimal locking", favors CAS operations over mutex-based locking and performs better under higher concurrency. For the purposes of this test we know we won't have elevated concurrency, so disabling this feature is to our benefit.
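As a quick sanity check on how much memory these settings allow, the byte limits above can be summed. The following is a minimal sketch; the helper names `parse_properties` and `volatile_byte_budget` are hypothetical and are not part of MiNiFi.

```python
# The property values are copied from the article as-is.
CONFIG = """
nifi.volatile.repository.options.provenance.max.bytes=102439390
nifi.volatile.repository.options.flowfile.max.bytes=102439309
nifi.volatile.repository.options.content.max.bytes=1023999999
nifi.volatile.repository.options.content.max.count=500000
nifi.volatile.repository.options.content.minimal.locking=false
"""


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def volatile_byte_budget(props):
    """Sum every *.max.bytes limit to get the combined byte ceiling."""
    return sum(int(v) for k, v in props.items() if k.endswith(".max.bytes"))


props = parse_properties(CONFIG)
total = volatile_byte_budget(props)
print(f"{total} bytes = {total / 2**30:.2f} GiB")
# prints "1228878698 bytes = 1.14 GiB"
```

This is the upper bound the three volatile repositories may hold together, so the host needs at least that much free memory on top of the agent's normal footprint.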

With repositories we have several options.

Repo Name | Repo Type | Storage Type | Description
NoOpRepository | no-op | n/a | Enables the repository to be no-op, validating all puts and removals. Can be useful for testing.
FlowFileRepository | flowfile | RocksDB (disk) | Stores FlowFile records in a RocksDB database; a WAL is used for improving reliability and performance.
ProvenanceRepository | provenance | RocksDB | Stores provenance events in a RocksDB database; a WAL is used for improving reliability and performance.
VolatileFlowFileRepository | flowfile | memory | Stores flow file records in memory, without persisting to disk.
VolatileProvenanceRepository | provenance | memory | Stores provenance events in memory, without persisting to disk.
VolatileContentRepository | content | memory | Stores content in memory, without persisting to disk.
DatabaseContentRepository | content | RocksDB | Stores content in a RocksDB database, reducing the number of inodes on disk.
FileSystemRepository | content | disk | Stores content on disk, in individual content files.

In measuring performance we can safely eliminate NoOpRepository as an option, since it provides no functional benefit. You can use it if you do not wish to maintain provenance; however, for the rest of this article we'll look at maintaining our traditional flow of operations.

The test plan is to run a series of writes, starting with a clean repository for each configuration. We'll treat each combination of flow file and provenance repositories (persistent or volatile) as a single option and cross it with the type of content repository. We'll run each configuration for an hour and report the median throughput.
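Reporting the median described above can be sketched as follows. The sample values are hypothetical and are not taken from the actual runs. The one-second sampling interval and the use of decimal megabytes are also assumptions.

```python
from statistics import median


def throughput_mb_per_s(byte_counts, interval_s=1.0):
    """Convert bytes observed per sampling interval into MB/s (10**6 bytes)."""
    return [b / interval_s / 1_000_000 for b in byte_counts]


# Hypothetical per-second byte counts, including one stalled interval.
samples = [650_000_000, 667_000_000, 640_000_000, 120_000_000, 671_000_000]

rates = throughput_mb_per_s(samples)
print(median(rates))  # prints 650.0
```

The median is less sensitive than the mean to an occasional stalled interval, such as the 120 MB/s sample here.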

Results

Repo Types | FlowFile, Provenance | FlowFile, Volatile Provenance | Volatile FlowFile, Provenance | Volatile FlowFile, Volatile Provenance
FileSystem | 667 MB/s | 650 MB/s | 656 MB/s | 628 MB/s
VolatileContent | 524 MB/s | 510 MB/s | 1027 MB/s | 1058 MB/s
DatabaseContent | 118 MB/s | 132 MB/s | 150 MB/s | 202 MB/s

The results above show that the FileSystem repository performs at roughly equivalent speeds regardless of the flow file and provenance repositories used. The VolatileContentRepository was slower when paired with a non-volatile flow file repository. This is because the traditional flow file repository removes entries in batches. Within ten seconds, the VolatileContent repository would cause backpressure due to insufficient space while waiting on the FlowFile repository to clear flow files. With the Volatile FlowFile repository we see higher throughput, as would be expected with a volatile repository.
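The effect of batched removal on a bounded content store can be illustrated with a toy model. This is a sketch under strong assumptions, not a simulation of MiNiFi: the function `simulate` and every number in it are hypothetical. Space is assumed to be released only when a purge runs.

```python
def simulate(capacity, write_per_tick, purge_every, ticks):
    """Return total bytes accepted by a bounded store over `ticks` ticks.

    Writes are accepted only while there is free space; all used space
    is released whenever a purge runs (every `purge_every` ticks).
    """
    used = 0
    written = 0
    for tick in range(1, ticks + 1):
        accepted = min(write_per_tick, capacity - used)
        used += accepted
        written += accepted
        if tick % purge_every == 0:
            used = 0  # purge releases everything held so far
    return written


immediate = simulate(capacity=1000, write_per_tick=300, purge_every=1, ticks=10)
batched = simulate(capacity=1000, write_per_tick=300, purge_every=5, ticks=10)
print(immediate, batched)  # prints "3000 2000"
```

In the batched case the store fills before the purge arrives, so the writer stalls for part of each cycle. That is the same shape of behavior as the backpressure described above.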

The DatabaseContent repo performed poorly across the board. This is likely because of the manner in which it flushes the write-ahead log. Since we must guarantee the content, we enforce flushing on every write. Where the DatabaseContent repository thrives is on file systems with limited resources. When we pit it against the FileSystem repository on a Raspberry Pi, we see vastly different results:

Repo Types | FlowFile, Provenance | FlowFile, Volatile Provenance | Volatile FlowFile, Provenance | Volatile FlowFile, Volatile Provenance
FileSystem | Fail | Fail | 32 MB/s | 31 MB/s
DatabaseContent | 33 MB/s | 32 MB/s | 33 MB/s | 32 MB/s
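The cost of flushing on every write, mentioned above for the DatabaseContent repository, can be illustrated with plain files. This is only an analogy and an assumption about where the time goes. It does not use RocksDB, and `write_records` is a hypothetical helper.

```python
import os
import tempfile
import time


def write_records(path, records, sync_each):
    """Write records to `path` and return the elapsed seconds.

    With sync_each=True, every record is forced to stable storage
    before the next one is written; otherwise one sync happens at the end.
    """
    start = time.perf_counter()
    with open(path, "wb") as f:
        for record in records:
            f.write(record)
            if sync_each:
                f.flush()
                os.fsync(f.fileno())
        f.flush()
        os.fsync(f.fileno())
    return time.perf_counter() - start


records = [b"x" * 4096 for _ in range(200)]  # hypothetical payload

with tempfile.TemporaryDirectory() as tmp:
    synced_path = os.path.join(tmp, "synced.bin")
    batched_path = os.path.join(tmp, "batched.bin")
    synced = write_records(synced_path, records, sync_each=True)
    batched = write_records(batched_path, records, sync_each=False)
    synced_size = os.path.getsize(synced_path)
    batched_size = os.path.getsize(batched_path)

print(f"sync every write: {synced:.4f}s, sync once: {batched:.4f}s")
```

Both runs write identical data; only the durability guarantee per record differs. The timings depend entirely on the device and file system, so no expected output is shown.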

Analysis

The results above are quite intriguing. The throughput of our normal content repository was limited by the speed of our SSD. Configuring all repositories to be volatile, while beneficial for throughput, may not be desirable depending on the environment. In the example here we are quite limited by the amount of data that GenerateFlowFile produces. If we change our data source to a trove of already formed data, we may see different results. In my next article I will discuss doing this and employing more automated ways of controlling throughput.

[1] https://gist.github.com/phrocker/e7814920b5a724ace5016aa2ccd1ba7a

[2] https://github.com/phrocker/nifi-minifi-cpp/tree/ThroughputMeasure
