Dataflow question and special case with duplicates

Explorer

Hello community,  

 

I started using Apache NiFi for my bachelor's thesis. The basics of the data flow are already working, but there are some cases I cannot quite get a grasp on.

 

I get my files via HTTP, and they are mostly in TXT, CSV or XML.

 

What my workflow (data flow?) should look like:

- Multiple data sources (Question 1)

- Splitting the values in multiple lines

- Adding a timestamp as a column to each line (Question 2)

- Adding the source (name) as a column to each line (Question 2)

- Checking if the value was already seen (Question 3)

- Adding a new column to each line with the value "already seen" or "first seen" (Question 3)

- Merging the content

- Changing Filename

- PutFile (Question 4)

 

Question 1

Do I need to build a separate data flow for each new source? Otherwise all files end up with the same file name or a completely random one.

 

Question 2

If I add a column with the same value to each line, is it better to add the value before, at or after splitting the text?

 

Question 3

Right now my data gets saved in separate files, for example: dat_feed1.csv, data_feed2.csv.

How do I check whether a value in the current data flow is already in my locally saved data (CSV)?

I don't want to get rid of the duplicates, but I need to add a column that indicates whether the value was already seen or not. How is this possible?

 

Question 4

Finally, I am struggling with how to save my files, because I need them to be saved separately and additionally appended to a combined file. The separate files are basic and working fine. For appending to a file I have read about different solutions, mostly involving Groovy scripts.

Is ExecuteGroovyScript the right way to go?

 

I hope you can help me, and I am looking forward to your answers. 

 

 

Best regards

 

Maurice

 
Accepted Solution

Re: Dataflow question and special case with duplicates

Expert Contributor

I will try to nudge you in the right direction without spoiling everything:

 

Q1: Look into attributes. You could have a processor assign an attribute to the flowfile when it is loaded in; this attribute can later be used to route or name files.

Q2: If it is possible to use record-based processors and avoid splitting files into individual records, do it. It can be 100x more efficient.

Q3: NiFi is great for working with individual messages, not so much for working with context (e.g. whether a message is a duplicate). I suppose you could do some kind of lookup of new messages against existing messages, but you should avoid this where possible. Think about something like Spark/Flink, or even Python or SQL batch solutions, to detect duplicates.

Q4: I don't think you will run into NiFi limitations here any time soon. The question is probably more which file format can take all the updates and still perform well enough.
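
To make the Q3 suggestion a bit more concrete, here is a minimal sketch of a Python batch step that tags each incoming row as "already seen" or "first seen" without dropping duplicates. The function names, the sample values, and the assumption that the value to compare sits in the first column are all made up for illustration; this runs outside NiFi and is not a NiFi API.

```python
import csv
import io


def load_seen_values(csv_text, column=0):
    """Collect the values of one column from previously saved CSV text."""
    return {row[column] for row in csv.reader(io.StringIO(csv_text)) if row}


def tag_rows(rows, seen, column=0):
    """Append "already seen" or "first seen" to each row.

    Duplicates are kept. `seen` is updated while iterating, so repeats
    inside the same batch are tagged "already seen" as well.
    """
    tagged = []
    for row in rows:
        value = row[column]
        status = "already seen" if value in seen else "first seen"
        seen.add(value)
        tagged.append(list(row) + [status])
    return tagged


# Hypothetical sample data: one saved feed and one incoming batch.
saved = "10.0.0.1,feed1\n10.0.0.2,feed1\n"
incoming = [["10.0.0.2", "feed2"], ["10.0.0.3", "feed2"], ["10.0.0.3", "feed2"]]

seen = load_seen_values(saved)
result = tag_rows(incoming, seen)
for row in result:
    print(",".join(row))
```

Keeping the seen values in a set makes each lookup cheap, but the whole set has to fit in memory; for large histories a database or one of the tools mentioned above is probably the better fit.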


- Dennis Jaheruddin

If this answer helped, please mark it as 'solved' and/or if it is valuable for future readers please apply 'kudos'. Also check out my technical portfolio at https://portfolio.jaheruddin.nl

Re: Dataflow question and special case with duplicates

Explorer

Hello Dennis, 

 

thank you for the reply. It really helped a lot!

 

Q1: That worked very well with the UpdateAttribute processor.

Q2: This also worked. I had the settings of the CSV writer service (UpdateRecord) messed up, but it works fine now.

Q3: That is a bummer. I had hoped it would be a piece of cake to implement. But I will look into one of the mentioned tools and figure it out.

Q4: True that. The file is going to explode with data.
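
For the Q4 requirement (separate files plus one combined file), a plain Python batch step could look roughly like the sketch below. It is only an illustration under assumptions: the function name, the file names, and the temporary output directory are hypothetical, and it says nothing about how the same would be done inside NiFi.

```python
import csv
import os
import tempfile


def save_rows(rows, separate_path, combined_path):
    """Write rows to their own file and append the same rows to a combined file."""
    with open(separate_path, "w", newline="") as f:
        csv.writer(f).writerows(rows)
    # Mode "a" creates the file if it is missing and otherwise appends to the end.
    with open(combined_path, "a", newline="") as f:
        csv.writer(f).writerows(rows)


# Hypothetical output location and feed names.
out_dir = tempfile.mkdtemp()
combined = os.path.join(out_dir, "combined.csv")
save_rows([["a", "feed1"]], os.path.join(out_dir, "data_feed1.csv"), combined)
save_rows([["b", "feed2"]], os.path.join(out_dir, "data_feed2.csv"), combined)

with open(combined, newline="") as f:
    combined_rows = list(csv.reader(f))
```

Appending row by row keeps the write itself cheap, but a single ever-growing CSV gets slow to read back, which is the file-format concern raised in the accepted answer.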