
Dataflow question and special case with duplicates

Explorer

Hello community,  

 

I started using Apache NiFi for my bachelor's thesis. The basics of the data flow are already working, but there are some cases I can't really get a grasp on.

 

I get my files via HTTP, and they are mostly in TXT, CSV or XML.

 

What my workflow (data flow) should look like:

- Multiple data sources (Question 1)

- Splitting the values into multiple lines

- Adding a timestamp as a column to each line (Question 2)

- Adding the source (name) as a column to each line (Question 2)

- Checking if the value was already seen (Question 3)

- Adding a new column to each line with the value "already seen" or "first seen" (Question 3)

- Merging the content

- Changing Filename

- PutFile (Question 4)

 

Question 1

Do I need to make a new data flow for each new resource? Otherwise they all end up with the same, or a totally random, file name at the end.

 

Question 2

If I add a column with the same value to each line, is it better to add the value before, during, or after splitting the text?

 

Question 3

Right now my data gets saved in separate files, for example: data_feed1.csv, data_feed2.csv.

How do I check whether a value in the current data flow is already in my locally saved data (CSV)?

I don't want to get rid of the duplicates, but I need to add a column that indicates whether the value was already seen or not. How is this possible?

 

Question 4

Finally, I am struggling with how to save my files, because I need them to be saved separately and additionally appended to a combined file. The separate files are basic and working fine. About appending to a file, I have read about different solutions, mostly involving Groovy scripts.

Is ExecuteGroovyScript the right way to go?

 

I hope you can help me, and I am looking forward to your answers. 

 

 

Best regards

 

Maurice

 
1 ACCEPTED SOLUTION

Re: Dataflow question and special case with duplicates

Super Collaborator

I will try to nudge you in the right direction without spoiling everything:

 

Q1: Look into attributes. You could have a processor assign an attribute to the FlowFile when it is loaded in; this can later be used to route or name files.
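To illustrate the attribute idea (the attribute name `source.name` and the value `feed1` are made up for this sketch, not from the thread): one UpdateAttribute processor per source branch could label the FlowFile, and a later UpdateAttribute could build a unique `filename` from that label using NiFi Expression Language:

```
# Hypothetical UpdateAttribute properties (NiFi Expression Language)
source.name : feed1
filename    : ${source.name}_${now():format('yyyyMMddHHmmss')}.csv
```

With that in place, a single flow can serve multiple sources and PutFile will still write distinct, meaningful file names.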

Q2: If it is possible to use record-based processors and avoid splitting files into individual records, do it. It can be 100x more efficient.
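The record-based idea can be shown outside NiFi too. This is a minimal Python sketch (column names `timestamp`/`source` and the `feed1` label are assumptions) of adding the same two columns to every record in one pass over the file, instead of splitting it into one FlowFile per line:

```python
import csv
import io
from datetime import datetime, timezone

def add_columns(csv_text, source_name):
    """Return csv_text with 'timestamp' and 'source' columns appended
    to the header and to every data row, in a single pass."""
    ts = datetime.now(timezone.utc).isoformat()
    rows = list(csv.reader(io.StringIO(csv_text)))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    header, body = rows[0], rows[1:]
    writer.writerow(header + ["timestamp", "source"])
    for row in body:
        # Same timestamp and source value for the whole batch.
        writer.writerow(row + [ts, source_name])
    return out.getvalue()
```

In NiFi itself, UpdateRecord with a record reader/writer pair achieves the same effect without ever splitting the content.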

Q3: NiFi is great for working with individual messages, not so much for working with context (e.g. whether a message is a duplicate). I suppose you could do some kind of lookup of new messages against existing messages, but you should avoid this where possible. Think about something like Spark/Flink, or even Python or SQL batch solutions, to detect duplicates.
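A hedged sketch of that lookup idea in Python (the `value` column name and file paths are assumptions): load the values already stored in the local CSVs into a set, then tag each incoming record as "first seen" or "already seen" without dropping any duplicates:

```python
import csv

def load_seen(paths, column="value"):
    """Collect all previously stored values from the given CSV files."""
    seen = set()
    for path in paths:
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                seen.add(row[column])
    return seen

def tag_rows(rows, seen, column="value"):
    """Yield each row (as a dict) with an extra 'status' field.
    `seen` is updated in place, so duplicates within the same batch
    are also detected."""
    for row in rows:
        row = dict(row)
        row["status"] = "already seen" if row[column] in seen else "first seen"
        seen.add(row[column])
        yield row
```

This keeps every row and only annotates it, which matches the requirement of not removing duplicates.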

Q4: I don't think you will run into NiFi limitations here any time soon; the question is more which file format can take all the updates and still perform well enough.
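For the append-to-a-combined-file part, a Groovy script is not strictly required; the same can be sketched in a few lines of Python (paths and column names here are assumptions), writing the header only when the combined file is first created:

```python
import csv
import os

def append_to_combined(path, rows, fieldnames):
    """Append dict rows to a combined CSV, creating it (with a header)
    on first use and appending without a header afterwards."""
    new_file = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if new_file:
            writer.writeheader()
        writer.writerows(rows)
```

If the script stays inside NiFi, ExecuteStreamCommand or a scripted processor could call logic like this; the main concern, as noted above, is how large the combined file grows.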


- Dennis Jaheruddin

If this answer helped, please mark it as 'solved' and/or if it is valuable for future readers please apply 'kudos'. Also check out my technical portfolio at https://portfolio.jaheruddin.nl



Re: Dataflow question and special case with duplicates

Explorer

Hello Dennis, 

 

thank you for the reply. It really helped a lot!

 

Q1: That worked very well with the UpdateAttribute processor.

Q2: This also worked. I had the settings of the csvWriter service (UpdateRecord) messed up, but now it works fine.

Q3: That is a bummer; I hoped it would be a piece of cake to implement. But I will look into one of the mentioned tools and figure it out.

Q4: True that. The file is going to explode with data.