
NiFi incremental ingest?

Rising Star

I have some external REST APIs that I have to query for data periodically using InvokeHTTP. I'd like to pass the date through which I last extracted data as a query argument, so that I only retrieve the incremental changes. What are the best practices for doing this with NiFi? Should I:

* Use an external database table to update/query the last date?

* Is there a different built-in mechanism I can use to accomplish this?

Currently, I'm just using ${now():toNumber():minus(86400000):format('yyyy-MM-dd')} to get yesterday's date and passing that to the REST API, but this isn't reliable: if my daily load fails one day, the next run will silently skip that day's data.
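
To make the problem concrete, here is a minimal sketch of the watermark pattern I am after (plain Python rather than NiFi, just for illustration; the state file, API URL, and `since` parameter are placeholders, not part of my actual flow):

```python
# Illustrative only: the API URL, 'since' parameter, and state file are assumptions.
import json
import os
import urllib.request
from datetime import date, timedelta

STATE_FILE = "last_success.json"            # persisted watermark (placeholder)
API_URL = "https://example.com/records"     # placeholder external REST API


def read_last_success():
    """Return the last date successfully loaded, or yesterday if no state exists yet."""
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)["last_success"]
    return (date.today() - timedelta(days=1)).isoformat()


def write_last_success(day):
    with open(STATE_FILE, "w") as f:
        json.dump({"last_success": day}, f)


def run_daily_load():
    since = read_last_success()
    # Ask only for records changed after the last *successful* load, so a failed
    # day is retried on the next run instead of being silently skipped.
    with urllib.request.urlopen(f"{API_URL}?since={since}") as resp:
        data = resp.read()
    # ... deliver `data` to its destination here ...
    write_last_success(date.today().isoformat())  # advance the watermark only after success
```

In NiFi terms the question is where read_last_success / write_last_success should live: an external database table, a cache, or some other mechanism.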


Rising Star

Great, thanks! I will play with this. Is there a way to know when the whole workflow is complete? The last step in my workflow writes the data to a file, but it doesn't always arrive all at once; some items may still be waiting in one of the queues. Any suggestions?

Master Guru

Hit refresh and look at data provenance.

You can see the counts in the queues if things are still processing.

New Contributor

Hi

According to https://community.hortonworks.com/questions/103459/clarifications-on-state-management-within-nifi-pr... and my own research:

I understand that DistributedMapCache is not actually distributed; the cache server runs on a single node, and if that node fails the data is gone. It is also a cache server, so it has an eviction strategy. It does offer a persistence directory option, but that does not solve the availability problem. It may be fine for storing temporary state, but for long-term persistent state we should rather rely on ZooKeeper because of its distributed nature. Unfortunately, I could not find any processor for putting data into ZooKeeper. Other options would be to use a database or distributed storage like HDFS, S3, etc.

Please correct me if I am wrong anywhere.

PS: I have the same use case, where I want to get data from an API and store the time up to which I have already requested the data.
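
Since there is no processor that writes to ZooKeeper, the write would have to happen from something like ExecuteScript or an external job. A minimal sketch with the Python kazoo client (the ZooKeeper hosts and znode path below are assumptions, not an existing NiFi feature):

```python
# Sketch only: quorum address and znode path are placeholders.
from kazoo.client import KazooClient

ZK_HOSTS = "zk1:2181,zk2:2181,zk3:2181"     # placeholder quorum
WATERMARK_PATH = "/ingest/my_api/last_success"


def read_watermark(zk):
    if zk.exists(WATERMARK_PATH):
        value, _stat = zk.get(WATERMARK_PATH)
        return value.decode("utf-8")
    return None


def write_watermark(zk, day):
    # create the znode (and parents) on first use, otherwise overwrite it
    if zk.exists(WATERMARK_PATH):
        zk.set(WATERMARK_PATH, day.encode("utf-8"))
    else:
        zk.create(WATERMARK_PATH, day.encode("utf-8"), makepath=True)


zk = KazooClient(hosts=ZK_HOSTS)
zk.start()
try:
    print(read_watermark(zk))
    write_watermark(zk, "2017-10-01")
finally:
    zk.stop()
```

The same two functions could just as easily be backed by a database table or an object in HDFS/S3.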

Rising Star

@Harsh Choudhary Agreed. I came to the conclusion that the DistributedMapCache is too flaky to keep track of important things. We've seen it fail mysteriously several times and have since changed all our processes to use a database.
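
A rough sketch of what such a watermark table can look like (self-contained with sqlite3 just so it runs as-is; table and column names are assumptions, and in a real NiFi flow ExecuteSQL / PutSQL against a shared database would play these two roles):

```python
# Sketch only: table/column names are assumptions; sqlite3 keeps the example self-contained.
import sqlite3

conn = sqlite3.connect("watermarks.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS watermark (
           source       TEXT PRIMARY KEY,   -- one row per API / feed
           last_success TEXT NOT NULL       -- e.g. '2017-10-01'
       )"""
)


def read_watermark(source):
    row = conn.execute(
        "SELECT last_success FROM watermark WHERE source = ?", (source,)
    ).fetchone()
    return row[0] if row else None


def write_watermark(source, day):
    # upsert: insert the row on first use, replace it afterwards
    conn.execute(
        "INSERT OR REPLACE INTO watermark (source, last_success) VALUES (?, ?)",
        (source, day),
    )
    conn.commit()


write_watermark("my_api", "2017-10-01")
print(read_watermark("my_api"))
```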

Master Guru

There has been a major upgrade to caching in Apache NiFi 1.4, and now you can use Redis!
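
If you go that route, the watermark is just a small key/value pair. A minimal sketch with the Python redis client (host and key name are assumptions; inside NiFi you would configure the Redis-backed distributed map cache client service rather than writing code):

```python
# Sketch only: Redis host and key name are placeholders.
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

KEY = "ingest:my_api:last_success"   # placeholder key

# read the watermark (None until it has been written once)
value = r.get(KEY)
last_success = value.decode("utf-8") if value else None
print(last_success)

# advance the watermark only after a successful load
r.set(KEY, "2017-10-01")
```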