Hi everyone - I have a NiFi question I was wondering if someone could help with. I have a sequence of steps as follows:

1) A process group that extracts data from multiple sources, merges it, and stores it in a MongoDB datastore.
2) A Python script that operates on this collection and outputs a separate collection (it does de-duplication / record linkage).
3) Finally, another process group that reads this new collection from MongoDB and publishes it to Elasticsearch.

I'm not convinced `ExecuteScript` is the nicest way of handling step 2, as the job could take an hour or two to run, and debugging the runs and getting visibility over what it is doing seems quite brittle. Has anyone any ideas about a nicer way of handling this? I had a look at Wait/Notify, but I'm not quite sure it fits my needs, nor how I'd signal to the script that the extract was done, and similarly tell the publish step to read from the new collection. Thanks, Gavin.
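For what it's worth, one pattern worth considering (a sketch under assumptions, not a definitive answer): rather than embedding the long-running job in `ExecuteScript`, run it as a standalone script via `ExecuteStreamCommand` and have it emit a small JSON status record on stdout, which NiFi can then route on. The collection names and the `do_deduplication` body below are hypothetical placeholders:

```python
#!/usr/bin/env python3
"""Long-running dedupe job, invoked from NiFi via ExecuteStreamCommand.

Emits a single JSON status record on stdout so the downstream flow
(e.g. EvaluateJsonPath -> RouteOnAttribute) can tell success from failure.
The collection names and the do_deduplication() body are placeholders.
"""
import json
import sys
import time


def do_deduplication(source_collection: str, target_collection: str) -> int:
    # Placeholder for the real work (e.g. pymongo reads + dedupe clustering).
    # Returns the number of clustered records written.
    return 0


def main(argv):
    source = argv[1] if len(argv) > 1 else "merged_customers"
    target = argv[2] if len(argv) > 2 else "deduped_customers"
    started = time.time()
    try:
        written = do_deduplication(source, target)
        status = {"status": "success", "source": source, "target": target,
                  "records_written": written,
                  "elapsed_secs": round(time.time() - started, 1)}
    except Exception as exc:  # surface failures to NiFi rather than dying silently
        status = {"status": "failure", "source": source,
                  "target": target, "error": str(exc)}
    return status


if __name__ == "__main__":
    print(json.dumps(main(sys.argv)))
```

The flow file coming out of `ExecuteStreamCommand` then carries the JSON status as its content, so routing on the `status` field gives per-run visibility without needing Wait/Notify for this hop.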
Hi @Steven Matison Many thanks for your response. That all sounds very interesting - I have no experience with Kafka, but I will check it out and see if it fits what I'm trying to achieve here. Unless I'm misunderstanding, my major issue is not having a common identifier between datasets to deduplicate on, so I have to rely on an external tool (such as dedupe) to do some fancy data-science work when clustering the duplicates, e.g. looking at forename, surname, and address and deciding whether records should be clustered together. There is also an element of training involved, which would need to happen externally, further complicating things since it is an external tool. I suppose if I enriched the data with a common cluster id I could then fire it to Kafka for the data-compaction bit, which would match what you have above. Anyway, good to know I'm going along the right track, so thanks again for your answer - ScrollElasticsearchHttp is interesting to read about! Cheers! Gavin.
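To make the compaction idea concrete (a toy sketch only - no Kafka client is involved, and `cluster_id` is the hypothetical enrichment field): if each record published to a log-compacted topic is keyed by its cluster id, the topic eventually retains only the latest record per cluster, which is effectively the deduplicated view. A dict stands in for the compacted topic here, since last value per key wins in both:

```python
# Toy simulation of Kafka log compaction keyed on cluster_id.
# Real code would use a Kafka producer with key=cluster_id; here a
# dict stands in for the compacted topic (last value per key wins).

def compact(records):
    """Return the latest record per cluster_id, in first-seen key order."""
    latest = {}
    for rec in records:
        latest[rec["cluster_id"]] = rec  # later records overwrite earlier ones
    return list(latest.values())


records = [
    {"cluster_id": 1, "forename": "Gavin", "surname": "Smith"},
    {"cluster_id": 2, "forename": "Ann", "surname": "Jones"},
    {"cluster_id": 1, "forename": "Gavin", "surname": "Smyth"},  # same cluster, later source
]
print(compact(records))  # two records survive: one per cluster
```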
Hi there! I’ve just heard about Apache NiFi through word of mouth and was wondering if somebody could point me in the right direction with my use case - my team has recently been thrown in at the deep end with some requirements and we would really appreciate the help.

Problem: Our end game is to build a federated search of customers over a variety of large, separate datasets which hold varying degrees of differing data about individuals, so it’s primarily an entity-resolution problem. I was thinking NiFi could help query our various databases, merge the results, deduplicate the entries via an external tool, and then push the result to an Elasticsearch instance for our applications to query. Roughly speaking, something like this (I haven’t tried implementing this flow yet!):

[flow diagram: pasted-graphic-2.png]

So, for example’s sake, the following data in the result database from the first flow:

[table screenshot: first.png]

Then run https://github.com/dedupeio/dedupe over this database table, which will add cluster ids to aid the record linkage, e.g.:

[table screenshot: second.png]

A second flow would then feed this result into the Elasticsearch instance for use by the API and front-end querying.

Questions: Does this approach sound feasible? How would I trigger dedupe to run, to ultimately cluster the duplicates after the merged content was pushed to the database? The corollary question - how would the second flow know when to fetch results for pushing into Elasticsearch? Periodic polling?
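To illustrate what the cluster-id step adds, here is a deliberately naive sketch: the real dedupe library does trained, probabilistic matching over fields like forename/surname/address, not the exact-key blocking shown here, and the field names are assumptions:

```python
# Naive illustration of assigning cluster ids: records whose normalised
# name + postcode collide get the same cluster_id. This is a stand-in for
# what dedupeio/dedupe does properly with training and fuzzy matching.

def normalise(value: str) -> str:
    return "".join(value.lower().split())


def assign_cluster_ids(records):
    clusters = {}  # blocking key -> cluster id
    next_id = 1
    for rec in records:
        key = (normalise(rec["forename"]), normalise(rec["surname"]),
               normalise(rec["postcode"]))
        if key not in clusters:
            clusters[key] = next_id
            next_id += 1
        rec["cluster_id"] = clusters[key]
    return records


rows = [
    {"forename": "Gavin", "surname": "Smith", "postcode": "BT1 1AA"},
    {"forename": "gavin", "surname": "SMITH", "postcode": "bt11aa"},
    {"forename": "Ann",   "surname": "Jones", "postcode": "BT2 2BB"},
]
assign_cluster_ids(rows)
# the first two rows share a cluster_id; the third gets its own
```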
Thanks for any insight anyone can give me regarding this. I’d be happy to consider any other bits of tech stack people might suggest if there were an entirely better way to approach it, as I’d like this to be as robust as possible. I appreciate this isn’t purely a NiFi question, and I haven’t considered any CDC process here to capture updates to the datasets, so I’d imagine this would get even more complicated… P.S. I’ve watched the HortonWorks talk here https://youtu.be/fblkgr1PJ0o?t=3149 which I found helpful and which mentioned these community forums. Cheers, Gavin.