Member since: 07-30-2019
Posts: 3118
Kudos Received: 1558
Solutions: 907
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 138 | 12-13-2024 10:58 AM |
 | 293 | 12-05-2024 06:38 AM |
 | 235 | 11-22-2024 05:50 AM |
 | 212 | 11-19-2024 10:30 AM |
 | 193 | 11-14-2024 01:03 PM |
03-17-2017
03:03 PM
1 Kudo
@Mohammed El Moumni
Here is one possible dataflow design that can be used to make sure both FlowFiles in a pair end up on the same node after being distributed via the Remote Process Group (RPG). While it requires adding 5 additional processors to your flow, the overhead is relatively light since you are dealing with very small FlowFiles all the way up to the FetchFile processor. You are still only fetching the ~700 MB of content after cluster distribution. Thanks, Matt
03-17-2017
01:59 PM
@mayki wogno If you are running a secured NiFi cluster, make sure all of your nodes have been granted the "modify the data" access policy for those connections (or the containing process group if the connections are inheriting policies). As an authenticated and authorized user, when you make a request while logged in to one node, that request is replicated to the other nodes. So the purge of data is being done on your behalf by the node you are currently logged in to. Authorizing your nodes to modify the data should allow you to empty the queue successfully. Another option is to temporarily set the FlowFile Expiration on the connection to 1 sec so that NiFi purges the queue itself. Just don't forget to change it back to avoid data loss when you don't want purging to occur. Of course, as Bryan noted, you can always stop NiFi and delete everything in the FlowFile and content repositories to purge all data from your dataflow, but that may not always be desired. Thanks, Matt
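For reference, the same queue purge can also be requested programmatically. The sketch below is only an illustration, assuming a NiFi 1.x REST API at http://localhost:8080, an already-authenticated and authorized caller, and a placeholder <connection-id> for the connection whose queue you want to empty.

```python
# Minimal sketch: empty a connection's queue through the NiFi REST API.
# Assumes NiFi 1.x, an authorized caller, and a placeholder <connection-id>.
import time
import requests

base = "http://localhost:8080/nifi-api"
conn_id = "<connection-id>"

# Submit a drop request against the connection's queue.
drop = requests.post(f"{base}/flowfile-queues/{conn_id}/drop-requests").json()["dropRequest"]

# Poll until NiFi reports the drop request has finished, then remove the request.
while not drop["finished"]:
    time.sleep(1)
    drop = requests.get(
        f"{base}/flowfile-queues/{conn_id}/drop-requests/{drop['id']}"
    ).json()["dropRequest"]

requests.delete(f"{base}/flowfile-queues/{conn_id}/drop-requests/{drop['id']}")
print("FlowFiles dropped:", drop.get("dropped"))
```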
03-17-2017
01:23 PM
1 Kudo
@Joshua Adeleke With NiFi dataflows it is typical to see large numbers of open file handles and user processes because of the concurrent thread operation supported by its many components. In many cases you will find that the default ulimits for both open files and processes fall short of what most dataflows need. I recommend setting these values to 50000 out of the gate. Depending on the volume and complexity of your dataflow(s), you may find you need to set them even higher. The default of 1024 is almost always guaranteed to be an issue.

/etc/security/limits.conf

* hard nproc 50000
* soft nproc 50000
* hard nofile 50000
* soft nofile 50000

/etc/security/limits.d/90-nproc.conf

* hard nproc 50000
* soft nproc 50000

Thanks, Matt
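If you want to confirm the limits the running NiFi process actually picked up after making these changes, a quick check like the one below works; this is just a sketch, assuming a Linux host where `pgrep -f org.apache.nifi` matches the NiFi JVM.

```python
# Minimal sketch: print the open-file and process limits of the running NiFi JVM.
# Assumes Linux and that "pgrep -f org.apache.nifi" matches the NiFi process.
import subprocess

pid = subprocess.check_output(["pgrep", "-f", "org.apache.nifi"]).split()[0].decode()
with open(f"/proc/{pid}/limits") as limits:
    for line in limits:
        if "Max open files" in line or "Max processes" in line:
            print(line.rstrip())
```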
03-17-2017
01:13 PM
@mayki wogno What type of downstream processor is the queue you are trying to empty connected to? Some processors, such as MergeContent, take ownership of FlowFiles in the incoming queue while they are running. MergeContent assigns FlowFiles on its incoming queue(s) to bins, and you will not be able to clear the queue of any FlowFiles that are currently assigned to a bin. If you stop the processor downstream of your queue (the processor must show no running threads), can you then successfully empty the queue? Thanks, Matt
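If you prefer to script that check rather than watch the UI, the sketch below stops a processor and waits for its active thread count to reach zero before you attempt to empty the queue. It is only an illustration, assuming a recent NiFi 1.x REST API, an authorized caller, and a placeholder <processor-id>.

```python
# Minimal sketch: stop a downstream processor and wait for its threads to finish.
# Assumes NiFi 1.x at http://localhost:8080 and a placeholder <processor-id>.
import time
import requests

base = "http://localhost:8080/nifi-api"
proc_id = "<processor-id>"

# Stopping a component requires its current revision (optimistic locking).
proc = requests.get(f"{base}/processors/{proc_id}").json()
requests.put(
    f"{base}/processors/{proc_id}/run-status",
    json={"revision": proc["revision"], "state": "STOPPED"},
).raise_for_status()

# Wait until the processor reports no active threads; the queue can then be emptied.
while True:
    status = requests.get(f"{base}/processors/{proc_id}").json()["status"]
    if status["aggregateSnapshot"]["activeThreadCount"] == 0:
        break
    time.sleep(1)
```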
03-16-2017
03:50 PM
@mayki wogno How are you executing your provenance query? Are you selecting "Data Provenance" from the hamburger menu in the upper right of the UI, or are you selecting "Data provenance" from the context menu that appears when you right-click your ListHDFS processor? The former performs a global provenance query across all your dataflows unless you add a filter, while triggering a provenance query through a specific component's context menu adds a filter based on that component's UUID. Thanks, Matt
03-16-2017
02:06 PM
1 Kudo
@Thangarajan Pannerselvam
If your GetFTP processor is configured with "Delete Original" set to false, every time this processor runs it will pull all the files it finds, including those pulled in the last run of the GetFTP processor. The ListFTP processor, unlike GetFTP, maintains state. So if you replace your GetFTP with both ListFTP and FetchFTP processors, you will not see the same files pulled twice unless the timestamps on the files on the FTP server are updated. Thanks, Matt
03-15-2017
01:05 PM
@mayki wogno Same question as this thread:
https://community.hortonworks.com/questions/88962/nifi-processor-not-the-most-up-to-date.html
03-15-2017
12:43 PM
1 Kudo
@nyakkanti FlowFiles consist of FlowFile attributes and FlowFile content.

- FlowFile attributes are kept in heap during processing and persisted to the FlowFile repository.
- FlowFile content is kept in claims within the content repository.

A claim is moved to archive once there no longer exist any active FlowFiles anywhere in your dataflow pointing at it. Archiving is enabled by default but can be disabled in the nifi.properties file:

nifi.content.repository.archive.enabled=true

If you disable archiving, the claim is purged from NiFi's content repository rather than being archived. What is important to understand is how claims work. By default, per the nifi.properties file, a claim can contain up to 100 FlowFiles or 10 MB of data (whichever is reached first). So a claim will not be purged until every piece of content in that claim has completed processing. As long as just one piece of content in that claim is still referenced, the entire claim will still exist in the content repository. As far as FlowFile attributes are concerned, they are persisted in NiFi provenance based on the retention configured in the nifi.properties file. You can perform provenance searches within NiFi to return FlowFile history and look at the attributes of those FlowFiles at any point in their lineage. Thanks, Matt
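For reference, the claim thresholds mentioned above also come from nifi.properties; assuming default NiFi 1.x property names, the relevant entries look like this:

nifi.content.claim.max.appendable.size=10 MB
nifi.content.claim.max.flow.files=100
nifi.content.repository.archive.enabled=true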
03-15-2017
12:31 PM
@mayki wogno Are you issuing commands against the REST API, or are you making a change within the UI when this occurs? This sounds like multiple changes being made against the same component at the same time. Each component has a revision number so that two people can't change the exact same component at the same time. So when a second change is submitted using the same revision as the first (successful) request, you get these responses. Two ways this can occur:

1. Two authenticated users are making a change to the configuration of the same processor. User 1 hits apply and that change is applied. User 2 then hits apply and a conflict response is returned by the first node that receives the request.

2. Multiple REST API calls are being made against the same component without updating the revision number in the subsequent calls.

As far as the node going down: do you mean you lose the UI and have to refresh the browser, or does the cluster completely go down, forcing you to restart nodes to get them to rejoin the cluster? Thanks, Matt
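To illustrate the revision mechanics for REST API callers: every update must echo the revision returned by the most recent read of that component, and submitting a stale revision produces exactly this kind of conflict response. A rough sketch, assuming NiFi 1.x at http://localhost:8080 and a placeholder <processor-id>:

```python
# Minimal sketch: update a processor using its current revision (optimistic locking).
# Assumes NiFi 1.x and a placeholder <processor-id>; a stale revision yields 409 Conflict.
import requests

base = "http://localhost:8080/nifi-api"
proc_id = "<processor-id>"

# Read the component first to obtain its current revision.
proc = requests.get(f"{base}/processors/{proc_id}").json()

update = {
    "revision": proc["revision"],  # must match NiFi's latest revision for this component
    "component": {
        "id": proc_id,
        "config": {"schedulingPeriod": "5 sec"},  # example configuration change
    },
}
resp = requests.put(f"{base}/processors/{proc_id}", json=update)
print(resp.status_code)  # 200 on success; 409 if the revision was already consumed
```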
03-14-2017
06:25 PM
1 Kudo
@Raj B
Thank you... Sometimes the most important piece of information is in the fine details. The other giveaway that it was clustered was that both FlowFiles in that queue had the same position, "1". Two FlowFiles in the same queue on the same node cannot occupy the same position.