Created 06-20-2023 05:54 AM
I have been using multiple record-oriented processors (ConvertRecord, UpdateRecord, etc.) in various parts of my flow. For example, my UpdateRecord processor takes about 16 seconds to read a 30 MB flowfile, add some fields to each record, and write the data out as Parquet.
I want to improve performance such that this takes a lot less time.
The infrastructure I am currently working on is a 2-node e2-standard-4 cluster in GCP running CentOS 7.
Each instance has 4 vCPUs and 16 GB RAM, and each repository (content, provenance, flowfile) is on its own SSD persistent disk.
A lot of the NiFi configs are still at their defaults, so what would anyone recommend, from either an infrastructure or a NiFi standpoint, to improve performance on these processors?
Created 06-21-2023 06:21 AM
@drewski7 Your actual bottleneck with that initial cluster size is the limited cores and RAM. No matter what you do with concurrency, run schedule, or processor config, every processor is still limited by the total cores and JVM heap across 2 machines. The total number of nodes itself is a limitation too. Ideally you want a primary node at the top of the flow pushing flowfiles down to 2, 3, 4, 5+ nodes to distribute the workload. That division of the workload is where NiFi shines and where you see massive throughput.
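For what it's worth, a minimal sketch of that pattern (the processor names are placeholders, not your actual flow): schedule the source processor on the primary node only and let a load-balanced connection spread the flowfiles across the cluster:

  ListSFTP (or any list/ingest source)   ->  Scheduling tab: Execution = Primary node
  Connection to the next processor       ->  Connection settings: Load Balance Strategy = Round robin
  UpdateRecord / ConvertRecord           ->  runs on every node, each working only its share of the queue

With that in place, adding nodes actually adds throughput, because each node only processes the flowfiles that were balanced to it.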
Created 06-20-2023 06:42 AM
@drewski7 This blog is a great place to start:
https://blog.cloudera.com/benchmarking-nifi-performance-and-scalability/
That said, here are some docs that get specific about sizing recommendations:
https://docs.cloudera.com/cfm/2.1.1/nifi-sizing/topics/cfm-sizing-recommendations.html
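Since you mentioned most configs are still at their defaults, one quick thing worth checking alongside those docs is the JVM heap in conf/bootstrap.conf; the stock values are small for record-heavy flows. A hedged example for a 16 GB node (exact sizes depend on what else runs on the box, and you want to leave plenty of RAM for the OS page cache):

  # conf/bootstrap.conf (the shipped defaults are around 512m)
  java.arg.2=-Xms4g
  java.arg.3=-Xmx4g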
Created 06-21-2023 05:19 AM
@steven-matison - Thanks for the response.
If we were to scope it to just the UpdateRecord processor, for example, are there any things from an infrastructure or configuration standpoint you know of that would make it more efficient, assuming I can't scale up or tune processor concurrency?
Created 06-21-2023 05:58 AM
@drewski7,
UpdateRecord works as fast as you design it to work 🙂
For example, using UpdateRecord, I manage to generate 6 columns on a FlowFile with more than 200k lines in less than 7 seconds. For Avro files of 100 MB, doing pretty much the same thing takes around 15-20 seconds.
If you are using UpdateRecord to generate 100+ columns, and each of these columns uses a lookup to check something else or applies many functions across multiple columns, it is normal for it to take a long time to process. Besides that, if you are running UpdateRecord on flowfiles with millions of rows, again, it will take longer to process.
So, in order to make your flow faster, you first need to identify where the bottleneck is. First things first, check the type of file you are reading and the type of file you are writing; each type has its pluses and minuses. Next, I suggest you take a look at the number of rows in each flowfile --> processing 1M rows is slower than processing 500k rows. Afterwards, you should check the functions you are applying in UpdateRecord and see if you can optimize them in any way.
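One small helper for the row-count check: record-oriented processors like ConvertRecord and UpdateRecord write a record.count attribute on the outgoing flowfile, so you can eyeball batch sizes (or route on them) without opening the files. A sketch, with an arbitrary 500k threshold, using RouteOnAttribute:

  RouteOnAttribute dynamic property:
    large_batch = ${record.count:gt(500000)}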
Created 06-21-2023 07:14 AM
@cotopaul - It's taking in JSON and writing out Parquet, and only doing literal value replacements (i.e. adding 5 fields to each record). Three of those fields just add attribute values and literal values to each record, and the other two do minor date manipulation (i.e. converting dates to epoch).
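For reference, a sketch of what that UpdateRecord configuration might look like (the field names, attribute names, and date format below are made up for illustration, and I am assuming the two date fields are converted to epoch in place so that field.value holds the original date string):

  Record Reader:               JsonTreeReader
  Record Writer:               ParquetRecordSetWriter
  Replacement Value Strategy:  Literal Value

  /source_system  = ${source.system}        (flowfile attribute, hypothetical name)
  /ingest_type    = batch                   (literal value)
  /file_name      = ${filename}             (standard flowfile attribute)
  /created_at     = ${field.value:toDate("yyyy-MM-dd'T'HH:mm:ss"):toNumber()}
  /updated_at     = ${field.value:toDate("yyyy-MM-dd'T'HH:mm:ss"):toNumber()}

toDate() and toNumber() are standard Expression Language functions; toNumber() on a date gives milliseconds since epoch, so divide by 1000 if you need seconds.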
Created 06-21-2023 07:46 AM
@drewski7, in that case, have a look at @steven-matison 's answer, because that is the solution to your problem.