Incorrect fragment.count in NiFi

Expert Contributor

I have a flow in NiFi which splits a file into individual lines, inserts those lines into a database and, once they have all been inserted, updates a control table. The control table should only update after every line has been inserted. To achieve this, fragment.index is compared to fragment.count: if these are equal, then I know that every line has been processed and we can move on to updating the control table.

However, recently some of our files failed to update the control table. I wrote the attributes of the flow files to disk, and they show something that confuses me: the number of flow files that come out of the SplitText processor is 66430, which matches the number of lines in the file. However, the fragment.count attribute is 66443.

Does anybody know why fragment.count would be incorrect, and how I can fix this?

1 ACCEPTED SOLUTION

Master Mentor

@Mark Heydenrych

By default, the SplitText processor does not emit FlowFiles whose content is just a blank line. This behavior is controlled by the "Remove Trailing Newlines" property. The fragment.count attribute is set based on the total number of fragments in the original FlowFile's content. The fragment.index attribute is a one-up number assigned to each FlowFile emitted. So in your case, I suspect that your original FlowFile's content contained 66,443 lines, 13 of which were blank lines that were not emitted.

If you change "Remove Trailing Newlines" to "false", your emitted count will match your fragment.count.

Thanks,

Matt
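
To make the mismatch described above concrete, here is a minimal Python sketch. It is not NiFi code: the function name `split_text_sketch` and the dictionary layout are hypothetical, and the sketch simply assumes the behavior explained in the answer (every line is counted for fragment.count, but blank lines are not emitted and fragment.index only advances for emitted fragments).

```python
def split_text_sketch(content: str) -> list[dict]:
    """Imitate the described behavior: count all lines, emit only non-blank ones."""
    lines = content.split("\n")
    if lines and lines[-1] == "":
        # A final newline terminates the last line; it does not start a new one.
        lines.pop()

    fragment_count = len(lines)  # includes blank lines
    fragments = []
    for line in lines:
        if line.rstrip("\r") == "":
            continue  # blank line: counted above, but never emitted
        fragments.append({
            "fragment.index": len(fragments) + 1,  # one-up number per emitted fragment
            "fragment.count": fragment_count,
            "content": line,
        })
    return fragments


if __name__ == "__main__":
    frags = split_text_sketch("a\nb\n\nc\n\n")
    print(len(frags), frags[-1]["fragment.index"], frags[-1]["fragment.count"])
    # The "last fragment" test used in the flow never fires under these assumptions:
    print(any(f["fragment.index"] == f["fragment.count"] for f in frags))
```

Under these assumptions, no emitted fragment ever has fragment.index equal to fragment.count when the input contains blank lines, which would explain why the control table update is never triggered.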

Master Mentor

@Mark Heydenrych

I opened an Apache Jira ticket requesting a change to this behavior:

https://issues.apache.org/jira/browse/NIFI-4156

If you found that this answer addressed your question, please mark the answer as accepted.

Thank you, Matt

Expert Contributor

Hi Matt

I switched "Remove Trailing Newlines" to false and got the number of fragments to 66443 as you suggested. This is a little confusing to me, as when I check the original file the number of lines is 66430. However, your point is 100% correct. Thank you for opening the Jira request. While I wait for this, do you know of any useful workaround I can use in the meantime to get the number of actually emitted fragments? It would be slower, but would it be possible, after the split, to merge the fragments (which would now include no newlines) and split them again?

Thanks, Mark

Master Mentor

@Mark Heydenrych

You may be able to use the ReplaceText processor to remove those blank lines from your input FlowFile's content before the SplitText processor.

I did a little test that worked for me using the following configuration:

[Screenshot: ReplaceText processor configuration]

This evaluates your FlowFile line by line and replaces the line return (\n) with nothing on any line that starts with a line return. This effectively removes that blank line. After that, my SplitText reported the correct fragment.count when I split the file.

Thanks,

Matt
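
The exact ReplaceText settings are only visible in the screenshot, so they are not reproduced here. As a rough illustration of the idea described above (dropping the line return on any line that starts with one), here is a small Python sketch. The pattern `^\r?\n` and the helper name `remove_blank_lines` are assumptions made for this example, not the values from the original configuration.

```python
import re

# Assumed pattern: a line that begins with its own line terminator is empty.
# The optional \r is an extra assumption to cover Windows-style line endings.
BLANK_LINE = re.compile(r"^\r?\n", re.MULTILINE)


def remove_blank_lines(content: str) -> str:
    """Drop every empty line so that only lines with content remain."""
    return BLANK_LINE.sub("", content)


if __name__ == "__main__":
    original = "a\n\nb\n\n\nc\n"
    cleaned = remove_blank_lines(original)
    # Count lines before and after the clean-up.
    print(len(original.splitlines()), len(cleaned.splitlines()))
```

With the blank lines removed before the split, the total line count and the number of non-blank lines are the same, so the two attributes should agree again.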