Member since 02-24-2018 | 20 Posts | 10 Kudos Received | 1 Solution
03-13-2018 10:55 PM
Hi Raul,

It definitely helped. I'm still very much a beginner when it comes to NiFi (I come from a data science background). I'm surprised that I didn't notice the Batch Duration property earlier; I'll play around with it.

With regard to the "event size", where would this be set? I can't seem to find it in the processor. Also, would you mind explaining why it's tricky to "synchronise batch generation and batch ingestion"? Is this under the assumption that I have multiple NiFi instances running for my system?

The reason I asked is that I was under the impression that NiFi is optimized for small chunks of data (so single Scrapy items as FlowFiles sounded better than having the entire output of the spider as one FlowFile), but I suppose this depends on the situation. I'll most likely split it after ingestion, as you recommended.

As a sanity check: I am routing the output from my spider to an ExecuteStreamCommand processor (a Python script), which analyzes the data and adds a few extra properties to the JSON. These JSON lines are then fed into a Postgres DB. Would it make sense to do this directly, or would you advise using Kafka in this system? (I haven't used it so far, as I haven't studied it yet.)

Thanks a ton. And as always, any advice is appreciated!
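For illustration, here is a minimal sketch of the kind of Python script described above. ExecuteStreamCommand passes the FlowFile content to the command's stdin and takes its stdout as the output, so the script only has to read JSON lines, add properties, and write them back out. The field name `text` and the added properties `text_length` and `word_count` are hypothetical placeholders, not the actual analysis used in this flow.

```python
import json
import sys
from typing import IO


def enrich(item: dict) -> dict:
    """Return a copy of the item with extra, purely illustrative properties."""
    enriched = dict(item)
    # "text" is an assumed field name; a missing field is treated as empty.
    text = str(item.get("text", ""))
    enriched["text_length"] = len(text)
    enriched["word_count"] = len(text.split())
    return enriched


def process_stream(src: IO[str], dst: IO[str]) -> int:
    """Read one JSON object per line from src, write enriched lines to dst.

    Blank lines are skipped. Returns the number of items written.
    """
    count = 0
    for line in src:
        line = line.strip()
        if not line:
            continue
        dst.write(json.dumps(enrich(json.loads(line))) + "\n")
        count += 1
    return count


def main() -> None:
    # stdin carries the incoming FlowFile content, stdout the result.
    process_stream(sys.stdin, sys.stdout)
```

To use it as the actual script, the file would end with a call to `main()`. Keeping the logic in `process_stream` means it can be tested with in-memory streams, without a running NiFi instance.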
03-07-2018 11:19 PM | 1 Kudo
First off, thanks a ton for this tutorial. I'm currently building a system that includes various spiders, and this was a good starting point.

When running the processors, I noticed that ExecuteProcess seems to release the output only once the process (the spider) has finished running. This produces a single large FlowFile, as opposed to what I expected: one FlowFile for each extracted Scrapy Item (datadoc in your case). Do you know of a way to change this behavior?

I could split the resulting FlowFile into multiple FlowFiles (one JSON line each), but I feel it would be a lot cleaner if this were the output of the spider processor in the first place. Any advice is appreciated.
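As a rough sketch of the fallback mentioned above, the logic of splitting one large output into one record per JSON line looks like this in plain Python. It assumes the spider writes JSON Lines (one item per line); `split_json_lines` is a hypothetical helper name, and inside NiFi the split would normally be done by a processor rather than by custom code.

```python
import json
from typing import Iterator


def split_json_lines(payload: str) -> Iterator[str]:
    """Yield each non-empty line of a JSON Lines payload as its own record.

    Every line is parsed once so that malformed items fail loudly
    (json.loads raises ValueError) instead of passing through silently.
    """
    for line in payload.splitlines():
        line = line.strip()
        if not line:
            continue
        json.loads(line)  # validation only; the original text is kept
        yield line


# Example: two items separated by a blank line become two records.
records = list(split_json_lines('{"a": 1}\n\n{"b": 2}\n'))
```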