Created 06-20-2023 05:54 AM
I have been using multiple record-oriented processors (ConvertRecord, UpdateRecord, etc.) in various parts of my flow. For example, my UpdateRecord processor takes about 16 seconds to read a 30 MB flowfile, add some fields to each record, and write the data out as Parquet.
I want to improve performance such that this takes a lot less time.
The infrastructure I am currently working on is a 2-node e2-standard-4 cluster in GCP running CentOS 7.
Each instance has 4 vCPUs and 16 GB RAM, and each repository (content, provenance, flowfile) is on its own SSD persistent disk.
A lot of the NiFi configs are still at their defaults, so what would anyone recommend, from either an infrastructure or a NiFi standpoint, to improve performance on these processors?
Created 06-21-2023 06:21 AM
@drewski7 Your actual bottleneck with that initial cluster size is the limited cores and RAM. No matter what you do with concurrency, run schedule, or processor config, every processor is still limited by the total cores and JVM heap across 2 machines. The total number of nodes itself is a limitation too. Ideally you want a primary node at the top of the flow pushing flowfiles down to 2, 3, 4, 5+ nodes to distribute the workload. That division of the workload is where NiFi shines and where you see massive throughput.
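For what it's worth, a minimal sketch of that pattern (the processor names are placeholders, not your actual flow): schedule the source processor on the primary node only and let a load-balanced connection spread the flowfiles across the cluster:

  ListSFTP (or any list/ingest source)   ->  Scheduling tab: Execution = Primary node
  Connection to the next processor       ->  Connection settings: Load Balance Strategy = Round robin
  UpdateRecord / ConvertRecord           ->  runs on every node, each working only its share of the queue

With that in place, adding nodes actually adds throughput, because each node only processes the flowfiles that were balanced to it.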
Created 06-20-2023 06:42 AM
@drewski7 This blog is a great place to start:
https://blog.cloudera.com/benchmarking-nifi-performance-and-scalability/
That said, here are some docs that get specific about sizing recommendations:
https://docs.cloudera.com/cfm/2.1.1/nifi-sizing/topics/cfm-sizing-recommendations.html
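Since you mentioned most configs are still at their defaults, one quick thing worth checking alongside those docs is the JVM heap in conf/bootstrap.conf; the stock values are small for record-heavy flows. A hedged example for a 16 GB node (exact sizes depend on what else runs on the box, and you want to leave plenty of RAM for the OS page cache):

  # conf/bootstrap.conf (the shipped defaults are around 512m)
  java.arg.2=-Xms4g
  java.arg.3=-Xmx4g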
Created 06-21-2023 05:19 AM
@steven-matison - Thanks for the response.
If we were to scope it to just the UpdateRecord processor, for example, are there any things from an infrastructure or configuration standpoint you know of that would make it more efficient, assuming I can't scale up or tune processor concurrency?
Created 06-21-2023 05:58 AM
@drewski7,
UpdateRecord works as fast as you design it to work 🙂
For example, using UpdateRecord, I manage to generate 6 columns on a FlowFile with more than 200k lines in less than 7 seconds. For Avro files of 100 MB, doing pretty much the same thing takes around 15-20 seconds.
If you are using UpdateRecord to generate 100+ columns, and each of these columns uses a lookup to check something else or applies many functions across multiple columns, it is normal for it to take a long time to process. Besides that, if you are running UpdateRecord on flowfiles with millions of rows, again, it will take longer to process.
So, in order to make your flow faster, you first need to identify where the bottleneck is. First things first, check the type of file you are reading and the type of file you are writing; each type has its pluses and minuses. Next, I suggest you take a look at the number of rows in each flowfile --> processing 1M rows is slower than processing 500k rows. Afterwards, you should check the functions you are applying in UpdateRecord and see if you can optimize them in any way.
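One small helper for the row-count check: record-oriented processors like ConvertRecord and UpdateRecord write a record.count attribute on the outgoing flowfile, so you can eyeball batch sizes (or route on them) without opening the files. A sketch, with an arbitrary 500k threshold, using RouteOnAttribute:

  RouteOnAttribute dynamic property:
    large_batch = ${record.count:gt(500000)}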
Created 06-21-2023 07:14 AM
@cotopaul - It's taking in JSON and writing out Parquet, and only doing literal value replacements (i.e. adding 5 fields to each record). Three of those fields just add attribute values and literal values to each record, and the other two do minor date manipulation (i.e. converting dates to epoch).
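For reference, a sketch of what that UpdateRecord configuration might look like (the field names, attribute names, and date format below are made up for illustration, and I am assuming the two date fields are converted to epoch in place so that field.value holds the original date string):

  Record Reader:               JsonTreeReader
  Record Writer:               ParquetRecordSetWriter
  Replacement Value Strategy:  Literal Value

  /source_system  = ${source.system}        (flowfile attribute, hypothetical name)
  /ingest_type    = batch                   (literal value)
  /file_name      = ${filename}             (standard flowfile attribute)
  /created_at     = ${field.value:toDate("yyyy-MM-dd'T'HH:mm:ss"):toNumber()}
  /updated_at     = ${field.value:toDate("yyyy-MM-dd'T'HH:mm:ss"):toNumber()}

toDate() and toNumber() are standard Expression Language functions; toNumber() on a date gives milliseconds since epoch, so divide by 1000 if you need seconds.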
Created 06-21-2023 07:46 AM
@drewski7, in that case, have a look at @steven-matison 's answer, because that is the solution to your problem.