Created 12-06-2022 02:29 AM
Hi All,
I am running a 4 node NiFi cluster with each node having 8 cores and 32 GB of RAM. It was running fine until I introduced the QueryRecord processor to process a large volume of metric data (14 GB per minute). The CPU now consistently remains above 85% and memory above 90% (too much buffering). Can someone let me know how to diagnose the issue and find the root cause?
Thanks in advance
Created 12-06-2022 12:48 PM
@Onkar_Gagre
1. What is the CPU and Memory usage of your NiFi instances when the QueryRecord processor is stopped?
2. How is your QueryRecord processor configured to include scheduling and concurrent task configurations?
What other processors were introduced as part of this new dataflow?
3. What does disk I/O look like while this processor is running?
NiFi documentation does not mention any CPU or Memory specific resource considerations when using this processor.
Thanks,
Matt
Created 12-07-2022 03:16 AM
1. What is the CPU and Memory usage of your NiFi instances when the QueryRecord processor is stopped?
This drops almost straight down from 90% to 20%. Even the buffered RAM is somewhat released.
2. How is your QueryRecord processor configured to include scheduling and concurrent task configurations?
We have a ConsumeKafkaRecord processor feeding data into QueryRecord. The data volume is huge but the individual messages are small, so using a demarcator I am batching multiple records into a single FlowFile and forwarding it to QueryRecord; based on the select statements, the matching records are then routed on to a PublishKafkaRecord processor.
ConsumeKafka is running with 32 threads in total (8 threads per node), whereas QueryRecord is running with 160 threads (40 threads per node) to keep up with the incoming data. Both have Run Duration set to 0 (always running) and Run Schedule set to 0.
What other processors were introduced as part of this new dataflow?
ConsumeKafkaRecord, PublishKafkaRecord
3. What does disk I/O look like while this processor is running?
Disk I/O given below for each individual node.
Node 1
Linux 3.10.0-1160.6.1.el7.x86_64 () 07/12/22 _x86_64_ (8 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
27.61 0.00 4.18 0.51 0.00 67.69
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdc 0.00 20.11 45.92 86.58 1942.35 27986.13 451.74 0.08 0.62 1.92 1.00 0.2012 2.67
sda 0.00 0.31 9.64 11.27 430.75 136.20 54.22 0.01 0.26 0.90 7.92 0.2176 0.46
sdb 0.00 0.00 0.03 0.00 0.66 0.00 51.43 0.00 0.40 0.40 0.00 0.2447 0.00
dm-0 0.00 0.00 4.55 0.14 189.37 1.18 81.32 0.00 1.01 0.86 5.78 0.3576 0.17
dm-1 0.00 0.00 0.02 0.00 0.58 0.00 51.80 0.00 0.41 0.41 0.00 0.2282 0.00
dm-2 0.00 0.00 45.92 106.70 1942.27 27986.13 392.19 0.02 0.14 0.02 0.19 0.1751 2.67
dm-3 0.00 0.00 0.01 0.00 0.19 0.00 42.84 0.00 0.58 0.49 1.75 0.3694 0.00
dm-4 0.00 0.00 0.04 0.01 6.58 0.69 257.40 0.00 7.83 1.71 29.56 0.5251 0.00
dm-5 0.00 0.00 4.12 10.31 200.81 126.45 45.37 0.09 6.34 0.95 8.49 0.1599 0.23
dm-6 0.00 0.00 0.88 0.66 31.04 4.32 45.87 0.00 1.18 0.81 1.67 0.3023 0.05
dm-7 0.00 0.00 0.01 0.00 0.15 0.00 54.79 0.00 0.45 0.44 1.28 0.2212 0.00
dm-8 0.00 0.00 0.03 0.46 1.00 3.56 18.70 0.00 1.51 0.79 1.56 0.3988 0.02
sdd 0.00 0.00 1.80 0.04 82.56 0.89 91.01 0.00 1.03 1.02 1.67 0.0830 0.02
dm-9 0.00 0.00 1.79 0.04 82.51 0.89 90.96 0.00 1.03 1.02 1.65 0.0832 0.02
Node 2
Linux 3.10.0-1160.6.1.el7.x86_64 () 07/12/22 _x86_64_ (8 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
10.02 0.00 2.96 0.45 0.00 86.56
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.29 0.71 7.09 27.72 88.67 29.85 0.06 7.18 0.89 7.81 0.2198 0.17
sdc 0.00 0.82 22.34 28.81 831.68 12659.33 527.55 0.00 0.02 1.21 2.53 0.9838 5.03
sdb 0.03 0.06 0.03 0.03 0.26 0.37 18.50 0.00 1.39 0.40 2.41 1.0156 0.01
dm-0 0.00 0.00 0.39 0.13 16.01 1.10 66.36 0.00 2.00 0.82 5.58 0.4171 0.02
dm-1 0.00 0.00 0.06 0.09 0.24 0.37 8.04 0.00 3.17 0.47 4.92 0.4512 0.01
dm-2 0.00 0.00 22.33 29.62 831.66 12659.33 519.31 0.03 0.58 1.21 0.10 0.9686 5.03
dm-3 0.00 0.00 0.00 0.00 0.01 0.00 4.10 0.00 0.39 0.24 1.46 0.3310 0.00
dm-4 0.00 0.00 0.01 0.01 2.01 0.73 245.81 0.00 15.63 1.69 31.50 0.5168 0.00
dm-5 0.00 0.00 0.20 6.17 6.61 79.26 26.97 0.05 8.47 1.13 8.71 0.1882 0.12
dm-6 0.00 0.00 0.09 0.63 2.49 4.07 18.37 0.00 1.52 0.66 1.64 0.2323 0.02
dm-7 0.00 0.00 0.00 0.00 0.01 0.00 3.40 0.00 0.24 0.23 1.25 0.2337 0.00
dm-8 0.00 0.00 0.01 0.45 0.27 3.50 16.53 0.00 1.24 0.62 1.24 0.3516 0.02
sdd 0.00 0.00 0.00 0.03 0.21 0.70 47.43 0.00 1.39 0.91 1.46 0.5485 0.00
dm-9 0.00 0.00 0.00 0.04 0.19 0.70 45.75 0.00 1.44 1.38 1.44 0.5514 0.00
Node 3
Linux 3.10.0-1160.6.1.el7.x86_64 () 07/12/22 _x86_64_ (8 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
73.23 0.00 6.84 0.13 0.00 19.81
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdc 0.00 81.50 39.85 312.01 3142.78 100006.63 586.31 0.13 0.36 2.56 0.08 0.2139 7.52
sda 0.01 0.37 25.76 7.96 1161.80 106.70 75.25 0.04 1.12 1.08 1.27 0.3697 1.25
sdd 0.00 0.02 2.67 0.29 104.08 6.11 74.36 0.00 1.12 1.14 0.95 0.3594 0.11
sdb 2.61 2.90 1.76 1.12 17.94 16.09 23.66 0.01 2.29 1.54 3.46 1.1372 0.33
dm-0 0.00 0.00 13.96 0.21 609.03 1.80 86.21 0.01 0.99 0.99 1.16 0.4736 0.67
dm-1 0.00 0.00 4.36 4.02 17.88 16.09 8.11 0.03 3.04 1.47 4.74 0.3903 0.33
dm-2 0.00 0.00 2.66 0.31 104.02 6.11 74.22 0.00 1.13 1.15 0.95 0.3598 0.11
dm-3 0.00 0.00 0.03 0.00 0.56 0.00 38.32 0.00 0.80 0.80 0.84 0.7141 0.00
dm-4 0.00 0.00 0.07 0.01 9.15 0.37 229.15 0.00 3.96 4.13 2.73 1.7046 0.01
dm-5 0.00 0.00 9.15 6.92 459.28 96.14 69.14 0.02 1.25 1.19 1.32 0.2486 0.40
dm-6 0.00 0.00 2.45 0.70 77.04 4.76 51.90 0.00 1.04 1.05 1.04 0.4814 0.15
dm-7 0.00 0.00 0.02 0.00 0.54 0.00 43.71 0.00 0.29 0.29 0.74 0.2028 0.00
dm-8 0.00 0.00 0.13 0.48 3.51 3.64 23.29 0.00 0.97 1.21 0.91 0.4330 0.03
dm-9 0.00 0.00 39.84 393.51 3142.72 100006.63 476.06 0.26 0.61 2.58 0.41 0.1758 7.62
Node 4
Linux 3.10.0-1160.6.1.el7.x86_64 () 07/12/22 _x86_64_ (8 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
20.01 0.00 3.07 0.28 0.00 76.64
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdc 0.00 12.92 48.20 62.41 1739.08 20099.77 394.88 0.07 0.66 1.06 0.35 0.3109 3.44
sdb 0.00 0.00 0.02 0.00 0.67 0.00 54.01 0.00 0.30 0.30 0.00 0.2148 0.00
sda 0.00 0.31 5.91 6.95 320.94 87.50 63.51 0.02 1.57 1.88 1.31 0.3440 0.44
dm-0 0.00 0.00 3.80 0.12 199.60 1.08 102.36 0.01 1.64 1.64 1.57 0.5978 0.23
dm-1 0.00 0.00 0.02 0.00 0.59 0.00 51.80 0.00 0.28 0.28 0.00 0.1954 0.00
dm-2 0.00 0.00 48.20 75.33 1739.00 20099.77 353.57 0.05 0.32 1.38 0.63 0.2787 3.44
dm-3 0.00 0.00 0.00 0.00 0.07 0.00 34.24 0.00 0.95 0.95 0.95 0.8181 0.00
dm-4 0.00 0.00 0.03 0.01 4.49 0.60 239.72 0.00 4.34 4.26 4.60 1.1573 0.00
dm-5 0.00 0.00 1.38 6.04 79.01 78.31 42.40 0.01 1.58 2.54 1.36 0.2008 0.15
dm-6 0.00 0.00 0.69 0.66 35.55 4.15 58.90 0.00 1.39 1.80 0.95 0.4093 0.06
dm-7 0.00 0.00 0.00 0.00 0.06 0.00 50.53 0.00 0.36 0.35 1.05 0.2573 0.00
dm-8 0.00 0.00 0.02 0.43 0.49 3.36 17.44 0.00 0.95 2.32 0.90 0.4003 0.02
sdd 0.00 0.00 1.78 0.03 125.98 0.68 140.36 0.00 2.60 2.61 1.99 0.2107 0.04
dm-9 0.00 0.00 1.78 0.03 125.95 0.68 140.00 0.00 2.60 2.61 1.98 0.2099 0.04
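(For reference, the per-node reports above are extended iostat output, collected with something like the following from the sysstat package, run on each node while the flow is active:)

# Extended per-device statistics plus a CPU summary
iostat -x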
Created 12-09-2022 01:28 PM
Let's take a look at concurrent tasks here...
You have an 8 core machine.
You have a ConsumeKafka configured with 8 concurrent tasks on each of 4 nodes. I hope this means your Kafka topic has 32 partitions, because that processor creates a consumer group with the 8 consumers from each node as part of that consumer group. Kafka will only assign one consumer from a consumer group to a given partition. So having more consumers than partitions gains you nothing, but it can cause performance issues due to rebalances.
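If you are not sure of the partition count, you can check it with the Kafka CLI (a generic Kafka check, not a NiFi feature; the broker address and topic name below are placeholders, and older Kafka releases use --zookeeper instead of --bootstrap-server):

# Shows the partition count for the topic; ideally it matches the total
# number of consumers in the group (8 concurrent tasks x 4 nodes = 32)
kafka-topics.sh --bootstrap-server <broker>:9092 --describe --topic <your-topic>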
Then you have a QueryRecord with 40 concurrent tasks per node. Each allocated thread across your entire dataflow needs time on the CPU. So just between these two processors alone, you are scheduling up to 48 concurrent threads per node that must be handled by only 8 cores. Based on your description of data volume, it sounds like there is a lot of CPU wait when you enable this processor, as each thread only gets a fraction of time on the CPU and therefore takes longer to complete its task. It sounds like you need more cores to handle your dataflow, and this is not necessarily an issue specific to the use of the QueryRecord processor.
While you may be setting concurrent tasks too high for your system on the QueryRecord processor, the scheduled threads come from the Max Timer Driven Thread pool set in your NiFi. The default is 10, and I assume you increased this to accommodate the concurrent tasks you have been assigning to your individual processors. The general starting recommendation for the Max Timer Driven Thread pool setting is 2 to 4 times the number of cores on your node. So with an 8 core machine, that recommendation would be 16 - 32. The decision/ability to set it even higher is all about your dataflow behavior along with your data volumes. It requires you to monitor CPU usage and adjust the pool size in small increments. Once the CPU is maxed out, there is not much more we can do without adding more CPU.
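As a rough way to do that monitoring (assuming Linux nodes with the sysstat package installed; these are generic OS-level checks, not NiFi features):

# Number of cores on this node (the 2-4x guidance is per node, not per cluster)
nproc

# Load averages; sustained load well above the core count means threads are queuing for CPU
uptime

# Per-core utilization sampled every 5 seconds
mpstat -P ALL 5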
If you found that the provided solution(s) assisted you with your query, please take a moment to login and click Accept as Solution below each response that helped.
Thank you,
Matt
Created 12-11-2022 10:28 PM
Thanks Matt for your inputs.
Regarding the consumer group mentioned above, I understand there will be 32 concurrent tasks fetching the data from Kafka; however, the consumer group ID will remain the same, and as long as the consumer group ID remains the same across the nodes, the ingestion load will be distributed equally to each node.
Regarding the max timer driven threads configured, this is set to 128, which is exactly 4 times the total number of cores across all nodes (8*4=32).
Regarding CPU wait time, can you please elaborate on how we can efficiently reduce wait time to improve CPU usage, or is increasing the number of cores the only viable option here?
Thanks in advance,
Onkar
Created 12-12-2022 06:01 AM
@Onkar_Gagre
The Max Timer Driven Thread pool setting is applied to each node individually. NiFi nodes configured as a cluster are expected to be running on the same hardware configuration. The guidance of 2 to 4 times the number of cores as a starting point is based on the cores of a single node in the cluster, not on the cumulative cores across all NiFi cluster nodes.

You can only reduce wait time by reducing load on the CPU. In most cases, threads given out to NiFi processors execute for only milliseconds. But some processors operating against the content can take several seconds or much longer, depending on the function of the processor and/or the size of the content. When the CPU is saturated, these threads will take even longer to complete, since the CPU is giving time to each active thread. Knowing that only 8 threads at a time per node can actually execute concurrently, a thread only gets a short duration of time before giving way to another. The pauses in between are the CPU wait time, as queued threads wait for their turn to execute. So reducing the Max Timer Driven Thread count (a restart is required for a reduction to take effect) would reduce the maximum number of threads sent to the CPU concurrently, which would reduce CPU wait time. Of course, that means less concurrency in NiFi.

Sometimes you can reduce CPU usage through different flow designs, which is a much bigger discussion than can be handled efficiently via the community forums. Other times, your dataflow simply needs more CPU to handle the volumes and rates you are looking to achieve. CPU and Disk I/O are the biggest causes of slowed data processing.
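As a rough illustration of that wait (again a generic Linux check, not NiFi specific), something like vmstat shows how many threads are queued for a core at any moment:

# Sample every 5 seconds; the 'r' column is the run queue -- threads that are
# runnable but waiting for a core. Values consistently above the core count
# (8 here) indicate CPU contention and longer thread completion times.
vmstat 5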
If you found that the provided solution(s) assisted you with your query, please take a moment to login and click Accept as Solution below each response that helped.
Thank you,
Matt