Member since: 11-30-2021
Posts: 2
Kudos Received: 0
Solutions: 0
08-01-2022 10:41 PM
@Hafiz Has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future.
12-02-2021 08:19 AM
@Hafiz The ExtractText processor evaluates a Java regular expression containing one or more capture groups against the inbound FlowFile's content. It then creates FlowFile attributes named after the processor's dynamic property, with values taken from the capture groups of that Java regular expression.

Above would result in FlowFiles with attributes like:

Things to keep in mind:

SplitText takes the inbound FlowFile and splits it into many FlowFiles. If you are producing a lot of splits from a single source FlowFile, it will have an impact on NiFi's heap usage during that process. As each split FlowFile is created, the FlowFile attributes/metadata for each produced FlowFile (split) are held in heap memory. Once all splits are created, all of the produced split FlowFiles are committed to the downstream relationship. Once on the relationship, NiFi can then swap them out of heap memory as needed. NiFi does this to avoid data duplication. Let's say a split is in progress and NiFi dies. Since nothing has been committed to a downstream relationship yet, when NiFi is brought back online, it will reprocess the original FlowFile.

You can reduce heap usage by splitting your source file multiple times if it is large (more than 20,000 - 50,000 splits). For example, split by every 5,000 lines in the first SplitText and then by every 1 line in a second SplitText.

While NiFi does not hold FlowFile content in heap memory (although some processors will load content into heap in order to operate on it), FlowFile attributes/metadata are held in heap memory. So the more attributes/metadata that exist on a FlowFile, the more heap that FlowFile is going to use.

FlowFiles are held in connections between processor components. NiFi has a swap threshold that is applied per connection. The default is to produce swap files that contain 10,000 FlowFiles each (these swap files hold FlowFile attributes/metadata and not content, since content is not normally held in heap).
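To make the two-stage split suggestion above more concrete, here is a rough sketch in Python (not NiFi code) that counts how many splits a single split step has to create before it can commit them downstream. The function name `largest_uncommitted_batch` and the 100,000-line input are made up for illustration, and the model is deliberately simplified: it assumes each SplitText run handles one FlowFile at a time and ignores concurrent tasks and FlowFiles already queued in connections.

```python
from typing import List


def largest_uncommitted_batch(total_lines: int, stage_sizes: List[int]) -> int:
    """Return the largest number of splits any single split step creates
    before committing, for a chain of line-based split stages.

    stage_sizes holds the "lines per split" setting of each stage, in order.
    Simplified model (an assumption, not NiFi internals): each stage handles
    one inbound FlowFile at a time and holds all of its splits until commit.
    """
    peak = 0
    inbound = [total_lines]  # line counts of the FlowFiles entering a stage
    for size in stage_sizes:
        outbound = []
        for line_count in inbound:
            full, rest = divmod(line_count, size)
            # Line counts of the splits produced from this one FlowFile.
            produced = [size] * full + ([rest] if rest else [])
            peak = max(peak, len(produced))
            outbound.extend(produced)
        inbound = outbound
    return peak


# One SplitText at 1 line per split: all 100,000 splits are created at once.
print(largest_uncommitted_batch(100_000, [1]))        # 100000

# Two SplitText processors (5,000 lines, then 1 line): at most 5,000 at once.
print(largest_uncommitted_batch(100_000, [5000, 1]))  # 5000
```

The total number of FlowFiles produced is the same either way. What changes in this model is how many attribute/metadata sets have to sit in heap before a commit lets NiFi start swapping.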
The default swap threshold set in the nifi.properties file (nifi.queue.swap.threshold) is 20,000. This means the first swap file for a connection is generated when that connection reaches 20,000 queued FlowFiles on one node (in a multi-node NiFi cluster, swap is per node and not across all nodes).

Just keep the above in mind when designing dataflows where you are splitting/merging, creating a lot of FlowFile attributes, or creating FlowFile attributes with large values.

If you found this response assisted with your query, please take a moment to log in and click on "Accept as Solution" below this post.

Thank you,
Matt