Dataflow question and special case with duplicates
Labels: Apache NiFi
Created 01-31-2021 02:31 AM
Hello community,
I started using Apache NiFi for my bachelor's thesis. The basics of the data flow are already working, but there are some cases I cannot quite get a grasp on.
I get my files via HTTP, and they are mostly in TXT, CSV or XML.
This is how my workflow (data flow?) should look:
- Multiple data sources (Question 1)
- Splitting the values in multiple lines
- Adding a timestamp as a column to each line (Question 2)
- Adding the source (name) as a column to each line (Question 2)
- Checking if the value was already seen (Question 3)
- Adding a new column to each line with the value "already seen" or "first seen" (Question 3)
- Merging the content
- Changing Filename
- PutFile (Question 4)
Question 1
Do I need to create a new data flow for each new source? Otherwise they all end up with either the same or a totally random file name.
Question 2
If I add a column with the same value to each line, is it better to add the value before, during or after splitting the text?
Question 3
Right now my data gets saved in separate files, for example: data_feed1.csv, data_feed2.csv.
How do I check whether a value in the current data flow is already in my locally saved data (CSV)?
I don't want to get rid of the duplicates, but I need to add a column that indicates whether the value was already seen or not. How is this possible?
Question 4
Lastly, I am struggling with how to save my files, because I need them to be saved separately and additionally appended to a combined file. The separate files are simple and working fine. About appending to a file, I have read about different solutions, mostly involving Groovy scripts.
Is ExecuteGroovyScript the right way to go?
I hope you can help me, and I am looking forward to your answers.
Best regards
Maurice
Created 02-01-2021 06:16 AM
I will try to nudge you in the right direction without spoiling everything:
Q1: Look into attributes. You could have a processor set an attribute on the flowfile when it is loaded in; this can later be used to route or name files.
Q2: If it is possible to use record-based processors and avoid splitting files into individual records, do it. It can be 100x more efficient.
Q3: NiFi is great for working with individual messages, not so much for working with context (e.g. whether a message is a duplicate). I suppose you could do some kind of lookup of new messages against existing messages, but you should avoid this where possible. Think about something like Spark/Flink, or even Python or SQL batch solutions, to detect duplicates.
Q4: I don't think you will run into NiFi limitations here any time soon. The question is probably more which file format can take all the updates and still perform well enough.
- Dennis Jaheruddin
If this answer helped, please mark it as 'solved' and/or if it is valuable for future readers please apply 'kudos'.
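For the Q3 suggestion of a Python batch solution, a minimal sketch could look like the following. It is only an illustration under stated assumptions, not something taken from the thread: the function names, the `value` column name, and the `status` column name are hypothetical, and it assumes the saved CSV files have a header row and are small enough to hold their values in memory.

```python
import csv
from pathlib import Path


def load_seen_values(directory, pattern, value_column):
    """Collect every value already stored in the CSV files matching pattern.

    Assumption: each file has a header row containing value_column.
    """
    seen = set()
    for path in sorted(Path(directory).glob(pattern)):
        with open(path, newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                seen.add(row[value_column])
    return seen


def mark_duplicates(rows, seen, value_column, status_column="status"):
    """Return copies of rows with an extra status column.

    Duplicates are kept, not dropped; they are only labelled.
    """
    marked = []
    for row in rows:
        value = row[value_column]
        status = "already seen" if value in seen else "first seen"
        # Repeats later in the same batch also count as already seen.
        seen.add(value)
        marked.append({**row, status_column: status})
    return marked
```

A call such as `load_seen_values("feeds", "data_feed*.csv", "value")` (directory and column name are placeholders) would build the lookup set once per batch, and `mark_duplicates` would then label each new row before it is written out.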
Created 02-04-2021 01:57 AM
Hello Dennis,
thank you for the reply. It really helped a lot!
Q1: That worked very well with the UpdateAttribute processor.
Q2: This also worked. I had the settings of the CSV writer service (used by UpdateRecord) messed up, but it works fine now.
Q3: That is a bummer. I had hoped it would be a piece of cake to implement. But I will look into one of the mentioned tools and figure it out.
Q4: True that. The file is going to explode with data.
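
For readers with the same Q4 requirement (separate files plus one growing combined file), here is a rough sketch of the append step as a plain Python batch job outside NiFi. The function name, the paths and the column names are hypothetical, and it assumes CSV as the combined format, which, as noted above, may not perform well once the file gets large.

```python
import csv
from pathlib import Path


def write_batch(rows, fieldnames, separate_path, combined_path):
    """Write rows to their own CSV file and append them to a combined CSV."""
    separate_path = Path(separate_path)
    combined_path = Path(combined_path)

    # Per-batch file: overwritten on each call, always starts with a header.
    with open(separate_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

    # Combined file: append mode; write the header only if the file is new or empty.
    needs_header = not combined_path.exists() or combined_path.stat().st_size == 0
    with open(combined_path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        if needs_header:
            writer.writeheader()
        writer.writerows(rows)
```

The sketch does no file locking, so it assumes only one process writes to the combined file at a time.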