Created on 10-31-2018 12:55 PM - edited 08-17-2019 05:45 AM
NiFi's content repository holds on to a content claim until no active FlowFile anywhere on the canvas still references that claim. This can leave very old or very large claims in the content repo, using up space. Something as small as a single 0 byte FlowFile sitting in some queue may prevent a multi-gigabyte content claim from being removed. The intent of this article is to help users understand how to find those active FlowFiles and clear them from their dataflow.
-
There is no simple UI feature or NiFi rest-api endpoint users can use to return information linking FlowFiles to existing content claims; however, all is not lost. You can add an additional indexed field to your provenance configuration so that it starts indexing the "ContentClaimIdentifier" associated with each FlowFile event generated. This gives you the ability to use Provenance to identify all the FlowFiles associated with a specific claim in the content repo.
-
To add this, simply add "ContentClaimIdentifier" to the list of existing indexed fields in the nifi.properties file. Here is the specific property you will be editing:
-
nifi.provenance.repository.indexed.fields=EventType, FlowFileUUID, Filename, ProcessorID, Relationship, ContentClaimIdentifier
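If you manage nifi.properties with scripts rather than by hand, the same edit can be sketched in Python. This is only an illustration, not a NiFi tool: the function name add_indexed_field is made up for this example, and it works on the text of the file so you can review the result before writing anything back.

```python
PROPERTY = "nifi.provenance.repository.indexed.fields"


def add_indexed_field(properties_text, field="ContentClaimIdentifier"):
    """Return the properties text with `field` appended to the indexed fields.

    Hypothetical helper for illustration; it only rewrites the one property
    line and leaves every other line (including comments) untouched.
    """
    out = []
    for line in properties_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == PROPERTY:
            fields = [f.strip() for f in value.split(",") if f.strip()]
            if field not in fields:  # do nothing if it is already listed
                fields.append(field)
            line = PROPERTY + "=" + ", ".join(fields)
        out.append(line)
    return "\n".join(out) + "\n"


# Example on an in-memory snippet instead of a real nifi.properties file.
sample = (
    "nifi.provenance.repository.indexed.fields="
    "EventType, FlowFileUUID, Filename, ProcessorID, Relationship\n"
)
print(add_indexed_field(sample), end="")
```

Running the function a second time on its own output changes nothing, so it is safe to re-run.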
-
NOTE: a restart of NiFi will need to occur before indexing of this new field will begin. NiFi will not go back and re-index existing events. It will only start adding this new indexed field to events created from this point forward.
-
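For the content repo search described next, a short script can help shortlist claim files worth investigating. The following is a minimal sketch, assuming only that claims are ordinary files somewhere under your content repository directory; the function name, the size and age thresholds, and the example path are placeholders chosen for this illustration.

```python
import os
import time


def find_suspect_claims(repo_dir, min_bytes=1024 ** 3, min_age_days=30, now=None):
    """Return (path, size_bytes, age_days) for files that are large OR old.

    Hypothetical helper: it walks every file under repo_dir and makes no
    assumption about how the repository is laid out internally.
    """
    now = time.time() if now is None else now
    results = []
    for root, _dirs, files in os.walk(repo_dir):
        for name in files:
            path = os.path.join(root, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # the file may have been removed while scanning
            age_days = (now - st.st_mtime) / 86400
            if st.st_size >= min_bytes or age_days >= min_age_days:
                results.append((path, st.st_size, age_days))
    # Largest files first, since those hold the most disk space.
    results.sort(key=lambda r: r[1], reverse=True)
    return results


if __name__ == "__main__":
    # "./content_repository" is an assumed location; use your own path.
    for path, size, age in find_suspect_claims("./content_repository"):
        print(f"{os.path.basename(path)}\t{size} bytes\t{age:.1f} days old")
```

The printed filename is the value you would then paste into the provenance search.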
You can then search your content repo for any content claim files that are of concern (for example, very old or very large claims). Simply copy the claim number (the filename of the claim) and search for it via a provenance query. This gives you a list of FlowFile events all tied to the same claim.
-
In my example, the provenance query returned three FlowFiles.
I would then need to look at the lineage of each of those FlowFiles to see if any of them made it to a DROP event (a DROP event means the FlowFile is no longer anywhere in your dataflows). Lineage can be displayed by clicking the "show lineage" icon to the right of any event in the list.
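If you have many events to go through, the "which FlowFiles never reached a DROP event" check can be sketched in a few lines of Python. This assumes you have already copied the event details into a list of dictionaries by hand or by some export of your own; the keys used here (flowFileUuid, eventType, eventTime, componentId) are hypothetical names for this example and not a NiFi API. It also treats each FlowFile UUID on its own, so it does not follow clones or child FlowFiles the way the lineage graph does.

```python
from collections import defaultdict


def flowfiles_without_drop(events):
    """Map each FlowFile UUID that has no DROP event to its latest event.

    `events` is an iterable of dicts with the hypothetical keys
    'flowFileUuid', 'eventType', 'eventTime' and 'componentId'.
    """
    by_uuid = defaultdict(list)
    for event in events:
        by_uuid[event["flowFileUuid"]].append(event)

    still_active = {}
    for uuid, evs in by_uuid.items():
        if any(e["eventType"] == "DROP" for e in evs):
            continue  # this FlowFile has left the dataflow
        # The most recent event points at the component to look for on the canvas.
        still_active[uuid] = max(evs, key=lambda e: e["eventTime"])
    return still_active


# Made-up sample data: FlowFile "a" was dropped, FlowFile "b" was not.
sample_events = [
    {"flowFileUuid": "a", "eventType": "CREATE", "eventTime": 1, "componentId": "p1"},
    {"flowFileUuid": "a", "eventType": "DROP", "eventTime": 2, "componentId": "p2"},
    {"flowFileUuid": "b", "eventType": "CREATE", "eventTime": 1, "componentId": "p1"},
    {"flowFileUuid": "b", "eventType": "ATTRIBUTES_MODIFIED", "eventTime": 3, "componentId": "p3"},
]
for uuid, last in flowfiles_without_drop(sample_events).items():
    print(uuid, last["eventType"], last["componentId"])
```

The component ID of the last event for each remaining FlowFile is what the following steps use to locate it on the canvas.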
-
Once you have found one or more FlowFiles with no DROP event, get the details of the last event by right-clicking on it.
From those details you can collect the Component ID of the processor that produced that event.
-
-
Go back to your main canvas and search on that component ID. Your FlowFile will be located in one of the outbound relationships to that component.
-
-
In my case, the FlowFile was found in the "success" relationship of an UpdateAttribute processor.
This connection also contains many other FlowFiles. If I only want to purge this one FlowFile, I need to insert a RouteOnAttribute processor into this flow and configure it to route only the FlowFile with a specific FlowFile UUID, which I can also get from the provenance event details above.
-
The property added to the RouteOnAttribute processor would be as simple as:
Property: purge
Value: ${uuid:equals('8297d10c-f6ca-4843-9593-320e5b265dd6')}
-
"purge" becomes the new relationship which you can then just auto-terminate.
The "unmatched" relationship will get routed on in your flow and will contain every other FlowFile from this connection.
-
*** Please feel free to post any comments to this article if you have questions or see anything that needs to be added/clarified.
-
Thank you,
Matt
Created on 02-13-2025 12:47 PM
How did you know which fields could be indexed?
According to the official documentation this field doesn't even exist.
Thanks!
Created on 02-13-2025 01:22 PM
@OfekRo1 
I looked at the StandardProvenanceEventRecord source on GitHub, plus I know many of the open source contributors. 😀
https://github.com/rdblue/incubator-nifi/blob/master/commons/data-provenance-utils/src/main/java/org...
You're welcome, and thank you for being part of the community!
