Created 03-24-2025 04:22 AM
Hello everyone,
I have developed an Apache NiFi flow that ingests streaming data, converts it to JSON, merges the records, and then stores them in HDFS. However, I have observed a couple of issues:
File Structure: After the merge operation, many files retain their original structure rather than being merged as expected.
Performance: Increasing the number of bins along with the minimum and maximum record counts and bin sizes causes the merge process to become significantly slower. Conversely, reducing these parameters results in files that mostly retain their original structure.
Could anyone provide insights or recommendations on how to resolve these issues?
Thank you.
My current configuration that causes the flow to go to the original
The flow
The configuration that never shows merged records
With this configuration the merge takes forever; I didn't see any output
Created 03-24-2025 08:26 AM
Hello @MattWho @Shelton @SAMSAL do you have any insights here? Thanks!
Regards,
Diana Torres
Created 03-24-2025 12:20 PM
@MarinaM
Suggested reading:
https://community.cloudera.com/t5/Community-Articles/How-to-address-JVM-OutOfMemory-errors-in-NiFi/t...
I'll start with this observation you shared:
"My current configuration that cause the flow to go the original"
The Merge processors will always send all input FlowFiles to the "Original" relationship when a bin is merged. The "Merged" relationship is where he newly produced FlowFiles will be output with the contents of the merged records.
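If you do not want those pre-merge FlowFiles continuing anywhere downstream, auto-terminate the "original" relationship in the processor's configuration. As a rough sketch, the same change can be made through the NiFi REST API; the base URL and processor UUID below are placeholders, the processor must be stopped before its configuration is changed, and authentication against a secured instance is omitted:

```python
# Sketch only: auto-terminate MergeRecord's "original" relationship via the NiFi REST API
# so pre-merge FlowFiles are dropped instead of queuing up behind the processor.
# Placeholders: NIFI_API base URL and PROCESSOR_ID; no authentication shown.
import requests

NIFI_API = "http://localhost:8080/nifi-api"        # placeholder: unsecured instance
PROCESSOR_ID = "replace-with-merge-processor-uuid" # placeholder: your MergeRecord id

# NiFi uses optimistic locking, so fetch the current revision before updating.
entity = requests.get(f"{NIFI_API}/processors/{PROCESSOR_ID}").json()

update = {
    "revision": entity["revision"],
    "component": {
        "id": PROCESSOR_ID,
        "config": {
            # FlowFiles routed to "original" are dropped once this takes effect.
            "autoTerminatedRelationships": ["original"],
        },
    },
}
resp = requests.put(f"{NIFI_API}/processors/{PROCESSOR_ID}", json=update)
resp.raise_for_status()
```

The same setting is available in the UI through the processor's configuration dialog under Relationships (older versions list it under Settings as "Automatically Terminate Relationships").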
Let's break down your two different configurations to understand what each is doing:
- Here you are parsing the inbound FlowFiles and allocating their content to bins (each bin will contain only "like" records; a "like" record/FlowFile is one that has exactly the same schema as another). At each scheduled execution a thread is requested that reads from the inbound connection queue and allocates FlowFiles to one or more bins. Judging by your screenshots, you appear to have numerous concurrent tasks configured on this processor. This is where several different scenarios can play out: if the schemas are not truly identical, each distinct schema ends up in its own bin, and once the Maximum Number of Bins is reached the oldest bin is forced out even if it holds only a single FlowFile, producing output that still looks like the original files.
Now looking at the alternate configuration:
- Here you have increased the minimum number of records, minimum bin size, maximum bin size, and number of bins (see the sketch below).
With those larger thresholds each bin takes far longer to satisfy its minimums, and a bin that never reaches them will not merge at all unless a Max Bin Age forces it out, which matches the "takes forever with no output" behaviour you described.
My guess is that your smaller configuration is actually working: the bins are merging, and the files that appear un-merged are simply the inputs being routed to the "original" relationship.
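To make those knobs concrete, here is a rough sketch of the kind of configurations being compared, written using the MergeRecord property names as they appear in the UI; the values are placeholders, not the ones from your screenshots:

```python
# Rough sketch only: MergeRecord properties (UI display names) with placeholder values.
# The two dicts mirror the "small" and "large" setups discussed above.
small_config = {
    "Merge Strategy": "Bin-Packing Algorithm",
    "Minimum Number of Records": 1_000,
    "Maximum Number of Records": 10_000,
    "Minimum Bin Size": "1 MB",
    "Maximum Bin Size": "64 MB",
    "Maximum Number of Bins": 5,
    "Max Bin Age": "2 min",  # forces a bin out even when the minimums are never reached
}

large_config = {
    **small_config,
    "Minimum Number of Records": 100_000,  # much harder for any bin to satisfy...
    "Minimum Bin Size": "256 MB",
    "Maximum Bin Size": "1 GB",
    "Maximum Number of Bins": 50,
    # ...so without a Max Bin Age some bins may never merge at all.
}
```

The property worth double-checking in both cases is Max Bin Age, since it is the only setting that guarantees a bin will eventually merge even when the minimums are never met.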
Hopefully this helps you understand how this processor works and helps you take a closer look at your inbound FlowFiles and their schemas to see if they are each unique.
Please help our community grow. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped.
Thank you,
Matt
Created 03-26-2025 01:55 AM
Dear Matt,
Thank you for your response; it was incredibly helpful.
Based on your feedback and the link you shared, I have revised my design. Below are the key points regarding my implementation:
All my data follows the same schema. Before merging, I convert the data to a JSON format to ensure consistency.
There are no failures occurring during the merge process.
The merge operation successfully produces a merged file.
However, I am still receiving the original output, which I need to eliminate.
Following the documentation, I have adjusted my design to include multiple consecutive merge steps, as illustrated in the attached diagram. Given the large volume of flow data, I have allocated high system specifications to optimise performance.
Additionally, my heap size is currently set to 42 GB.
Would increasing it further be beneficial?
To mitigate swap-related issues, I have raised the swap threshold to 300,000 (from the 20,000 default), since I previously experienced rapid memory fill-up. Could you advise on any potential side effects of this change?
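For reference, these two settings live in the standard configuration files; the bootstrap argument numbering below follows a default install and can differ between versions:

```
# conf/bootstrap.conf -- JVM maximum heap (the 42 GB mentioned above)
java.arg.3=-Xmx42g

# conf/nifi.properties -- per-connection swap threshold, raised from the 20,000 default
nifi.queue.swap.threshold=300000
```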
Looking forward to your insights.
Created 03-26-2025 10:12 PM
Appreciate your response.
Created 04-01-2025 08:47 AM
@MarinaM Has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future. Thanks.
Regards,
Diana Torres