Support Questions

D5ha · ‎10-25-2022

When the NiFi ExecuteSQL processor is running, we can see the size of the content_repository location increasing.

While 3 ExecuteSQL processors were running, I executed du -sh ./content_repository/* | grep 'G' twice. The output is below:

1st time
5G ./content_repository/1000
4G ./content_repository/1009
6G ./content_repository/824
2nd time
8G ./content_repository/1000
6G ./content_repository/1009
8G ./content_repository/824

My question is: is there any way to identify the specific content_repository location for each processor?

Azhar_Shaikh · ‎10-26-2022

Hello @D5ha

We have a community article that explains NiFi's content repository.

https://community.cloudera.com/t5/Community-Articles/Understanding-how-NiFi-s-Content-Repository-Arc...

An easy way to get hold of your file is through NiFi Data Provenance.

https://docs.cloudera.com/HDPDocuments/HDF3/HDF-3.1.0/bk_getting-started-with-apache-nifi/content/da...


ask_bill_brooks · ‎10-28-2022

@D5ha To specifically answer your question:

is there any way to identify the specific content_repository location for each processor?

No, there is no way to do this.

MattWho · ‎10-28-2022

@D5ha

Not all processors write to the content repository, and the content of a FlowFile is never modified in the content repository after it is created. Once a FlowFile is created in NiFi, it exists as is until terminated. A NiFi FlowFile consists of two parts: the FlowFile attributes (metadata about the FlowFile, including details about the location of the FlowFile's content in the content_repository) and the FlowFile content itself. When a downstream processor modifies the content of a FlowFile, what really happens is that new content is written to a new content claim in the content_repository; the original content remains unchanged.

From what you shared, you appear to have just one content_repository. Within that single content_repository, NiFi creates many subdirectories. It does this for better indexing and seeking, because a user's dataflow(s) may hold a massive number of content claims.
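
A minimal Python sketch of the measurement from the question, summing file sizes per subdirectory of a single content repository. The function name section_sizes is hypothetical, and the totals are apparent file sizes, so they can differ somewhat from the disk-block figures that du reports.

```python
from pathlib import Path


def section_sizes(repo_dir):
    """Return {section_name: total_bytes} for each sub-directory of repo_dir."""
    sizes = {}
    for section in Path(repo_dir).iterdir():
        if not section.is_dir():
            continue
        total = 0
        # Walk everything below the section, including nested directories.
        for path in section.rglob("*"):
            try:
                if path.is_file():
                    total += path.stat().st_size
            except FileNotFoundError:
                # NiFi may remove a claim while the scan is running; skip it.
                continue
        sizes[section.name] = total
    return sizes
```

Calling section_sizes("./content_repository") twice and comparing the results shows which sections grew, but it says nothing about which processor wrote the data.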

It is also very important to understand that a content claim in the content_repository can hold the content for one or more FlowFiles. It is not always one content claim per FlowFile's content. It is also very possible to have multiple queued FlowFiles pointing to the exact same content claim and offset (the exact same content). This happens when your dataflow clones a FlowFile (for example, by routing the same outbound relationship from a processor multiple times). So you should never manually delete claims from any content repository, as you may delete the content for multiple FlowFiles.
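
To illustrate why deleting a claim file is dangerous, here is a small Python sketch that groups FlowFiles by the claim they point at. The record layout (dicts with flowfile_uuid, container, section, identifier, offset, and size keys) and the sample values are hypothetical; they stand in for details you would copy by hand from provenance events, not for any NiFi API.

```python
from collections import defaultdict


def flowfiles_per_claim(records):
    """Group FlowFile UUIDs by the (container, section, identifier) claim they use."""
    claims = defaultdict(list)
    for rec in records:
        key = (rec["container"], rec["section"], rec["identifier"])
        claims[key].append(rec["flowfile_uuid"])
    return dict(claims)


# Hypothetical sample: two FlowFiles share one claim at different offsets,
# and a clone points at the same claim and offset as its parent.
records = [
    {"flowfile_uuid": "ff-1", "container": "default", "section": "824",
     "identifier": "claim-a", "offset": 0, "size": 100},
    {"flowfile_uuid": "ff-2", "container": "default", "section": "824",
     "identifier": "claim-a", "offset": 100, "size": 50},
    {"flowfile_uuid": "ff-1-clone", "container": "default", "section": "824",
     "identifier": "claim-a", "offset": 0, "size": 100},
    {"flowfile_uuid": "ff-3", "container": "default", "section": "1000",
     "identifier": "claim-b", "offset": 0, "size": 10},
]

shared = {k: v for k, v in flowfiles_per_claim(records).items() if len(v) > 1}
```

In this sample, removing the single file behind claim-a would destroy the content of three FlowFiles at once.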

That being said, you can use data provenance to locate the content_repository (Container), the subdirectory (Section), the content claim filename (Identifier), the byte at which the content begins in that claim (Offset), and the number of bytes from the offset to the end of the content in the claim (Size).

Right-click on a processor and select "View data provenance" from the displayed context menu.

This lists all FlowFiles processed by this processor for which provenance still holds index data.

Click the Show Lineage icon (it looks like 3 connected circles) to the far right of a FlowFile. You can right-click on "clone" and "join" events to find and expand any parent FlowFiles in the lineage. The event dot created for the processor on which you selected "View data provenance" is colored red in the lineage graph.

Each white circle is a different FlowFile. Clicking on a white circle highlights the dataflow path for that FlowFile. Right-clicking on an event like "create" and selecting "view details" shows everything that is known about that FlowFile (this includes a tab about the "content"):

Container corresponds to the following property in the nifi.properties file:
nifi.content.repository.directory.default=
Section corresponds to the subdirectory within the above content repository path.
Identifier is the content claim filename.
Offset is the byte at which the content for this FlowFile begins within that Identifier.
Size is the number of bytes from the Offset to the end of that FlowFile's content in the Identifier.
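
Put together, those five values are enough to read one FlowFile's bytes out of a claim. The following is a minimal, read-only Python sketch, assuming the claim file still sits at <container directory>/<Section>/<Identifier> and that the repository content is stored as plain bytes (for example, not encrypted). The function name read_claim_content is hypothetical.

```python
from pathlib import Path


def read_claim_content(container_dir, section, identifier, offset, size):
    """Return the `size` bytes starting at `offset` in the given claim file.

    container_dir is the directory configured for the Container in
    nifi.properties; section, identifier, offset and size come from the
    content tab of the provenance event details.
    """
    claim_path = Path(container_dir) / str(section) / str(identifier)
    # Open read-only in binary mode; never write to or delete a claim file.
    with open(claim_path, "rb") as claim:
        claim.seek(offset)
        data = claim.read(size)
    if len(data) != size:
        raise ValueError(
            f"expected {size} bytes at offset {offset}, got {len(data)}"
        )
    return data
```

Because one claim file can hold the content of several FlowFiles, reading only the Offset and Size range is what isolates a single FlowFile's content.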

I also wrote an article on how to index the Content Identifier. Indexing that field allows you to take a content claim and then search for it in your data provenance to find all FlowFile(s) that pointed at it. You can then view the details of those FlowFile(s) to see the full content claim details as above:
https://community.cloudera.com/t5/Community-Articles/How-to-determine-which-FlowFiles-are-associated...


Thank you,

Matt

Is there any way to identify the content storage location used by a specific processor?