Dataflow question and special case with duplicates

Mandrill — Sun, 31 Jan 2021 10:31:52 GMT

Hello community,

I started using Apache NiFi for my bachelor's thesis. The basics of the data flow are already working, but there are some cases I cannot quite get a grasp on.

I get my files via HTTP, and they are mostly in TXT, CSV or XML.

How my workflow (data flow?) should look:

- Multiple data sources (Question 1)

- Splitting the values into multiple lines

- Adding a timestamp as a column to each line (Question 2)

- Adding the source (name) as a column to each line (Question 2)

- Checking if the value was already seen (Question 3)

- Adding a new column to each line with the value "already seen" or "first seen" (Question 3)

- Merging the content

- Changing the filename

- PutFile (Question 4)

Question 1

Do I need to make a new data flow for each new data source? Otherwise the files all end up with either the same or a totally random file name.

Question 2

If I add a column with the same value to each line, is it better to add the value before, during, or after splitting the text?

Question 3

Right now my data gets saved in separate files, for example: dat_feed1.csv, data_feed2.csv.

How do I check whether a value in the current data flow is already in my locally saved data (CSV)?

I don't want to get rid of the duplicates, but I need to add a column that indicates whether the value was already seen or not. How is this possible?

Question 4

Finally, I am struggling with how to save my files, because I need them to be saved separately and additionally appended to a combined file. The separate files are basic and working fine. As for appending to a file, I have read about different solutions, mostly involving Groovy scripts.

Is ExecuteGroovyScript the right way to go?

I hope you can help me, and I am looking forward to your answers.

Best regards

Maurice

Re: Dataflow question and special case with duplicates

DennisJaheruddi — Mon, 01 Feb 2021 14:16:39 GMT

I will try to nudge you in the right direction without spoiling everything:

Q1: Look into attributes. You could have a processor set an attribute on the flowfile when it is loaded in; this attribute can later be used to route or name files.

Q2: If it is possible to use record-based processors and avoid splitting files into individual records, do it. It can be 100x more efficient.

Q3: NiFi is great for working with individual messages, not so much for working with context (e.g. whether a message is a duplicate). I suppose you could do some kind of lookup of new messages against existing messages, but you should avoid this where possible. Think about something like Spark/Flink, or even Python or SQL batch solutions, to detect duplicates.

Q4: I don't think you will run into NiFi limitations here any time soon. The question is probably more which file format can take all the updates and still perform well enough.
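
To illustrate the batch approach suggested for Q3, here is a minimal Python sketch that labels each incoming value as "first seen" or "already seen" by comparing it against values in previously saved CSV files. Duplicates are kept, only tagged. The function names and the `value` and `status` column names are assumptions made for illustration; they do not come from the thread or from NiFi.

```python
import csv
from pathlib import Path


def load_seen_values(csv_paths, column="value"):
    """Collect every value already stored in the given CSV files."""
    seen = set()
    for path in csv_paths:
        path = Path(path)
        if not path.exists():
            continue  # nothing has been saved for this feed yet
        with path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                if row.get(column):
                    seen.add(row[column])
    return seen


def tag_rows(rows, seen, column="value"):
    """Return copies of the rows with a 'status' column; duplicates are kept."""
    tagged = []
    for row in rows:
        value = row[column]
        status = "already seen" if value in seen else "first seen"
        seen.add(value)  # later repeats within the same batch count as seen
        tagged.append({**row, "status": status})
    return tagged
```

A call such as `tag_rows(new_rows, load_seen_values(["data_feed1.csv", "data_feed2.csv"]))` would then return the new rows with the extra column. Keeping all values in an in-memory set is only reasonable while the saved data stays small enough to fit in memory.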

Re: Dataflow question and special case with duplicates

Mandrill — Thu, 04 Feb 2021 09:57:40 GMT

Hello Dennis,

thank you for the reply. It really helped a lot!

Q1: That worked very well with the UpdateAttribute processor.

Q2: This also worked. I had the settings of the CSV writer service (UpdateRecord) messed up, but it works fine now.

Q3: That is a bummer. I had hoped it would be a piece of cake to implement. But I will look into one of the mentioned tools and figure it out.

Q4: True that. The file is going to explode with data.
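
For reference, the "separate files plus one combined file" requirement from Q4 could be sketched outside NiFi roughly as follows. This is a minimal Python sketch, not a NiFi solution: the function names, the `data_<source>.csv` and `combined.csv` file names, and the three column names are assumptions made for illustration. It also assumes that the per-source files are appended to rather than overwritten.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path


def append_rows(path, rows, fieldnames):
    """Append rows to a CSV file, writing the header only if the file is new or empty."""
    path = Path(path)
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerows(rows)


def save_batch(values, source, out_dir="."):
    """Add timestamp and source columns, then write to both output files."""
    timestamp = datetime.now(timezone.utc).isoformat()
    fieldnames = ["value", "timestamp", "source"]
    rows = [{"value": v, "timestamp": timestamp, "source": source} for v in values]
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    append_rows(out_dir / f"data_{source}.csv", rows, fieldnames)  # per-source file
    append_rows(out_dir / "combined.csv", rows, fieldnames)  # combined file
    return rows
```

Calling `save_batch(["a", "b"], "feed1")` would append two rows to `data_feed1.csv` and the same two rows to `combined.csv`. A plain CSV that only ever grows will eventually become slow to scan, which is the file format concern raised in the reply above.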