<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Dataflow question and special case with duplicates in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Dataflow-question-and-special-case-with-duplicates/m-p/310740#M224280</link>
    <description>&lt;P&gt;I will try to nudge you in the right direction without spoiling everything:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Q1: Look into attributes. You could have a processor assign an attribute to the flowfile when it is loaded in; this attribute can later be used to route or name files.&lt;/P&gt;&lt;P&gt;Q2: If it is possible to use record-based processors and avoid splitting files into individual records, do it. It can be 100x more efficient.&lt;/P&gt;&lt;P&gt;Q3: NiFi is great for working with individual messages, but not so much for working with context (e.g. whether a message is a duplicate). I suppose you could do some kind of lookup of new messages against existing messages, but you should avoid this where possible. Think about something like Spark/Flink, or even Python or SQL batch solutions, to detect duplicates.&lt;/P&gt;&lt;P&gt;Q4: I don't think you will run into NiFi limitations here anytime soon. The question is probably more which file format can take all the updates and still perform well enough.&lt;/P&gt;</description>
    <pubDate>Mon, 01 Feb 2021 14:16:39 GMT</pubDate>
    <dc:creator>DennisJaheruddi</dc:creator>
    <dc:date>2021-02-01T14:16:39Z</dc:date>
    <item>
      <title>Dataflow question and special case with duplicates</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Dataflow-question-and-special-case-with-duplicates/m-p/310692#M224257</link>
      <description>&lt;P&gt;Hello community,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I started using Apache NiFi for my bachelor thesis. The basics of the data flow are already working, but there are some cases I cannot quite get a grasp on.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I get my files via HTTP, and they are mostly in TXT, CSV or XML.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;How my workflow (data flow?) should look:&lt;/P&gt;&lt;P&gt;- Multiple data sources (Question 1)&lt;/P&gt;&lt;P&gt;- Splitting the values into multiple lines&lt;/P&gt;&lt;P&gt;- Adding a timestamp as a column to each line (Question 2)&lt;/P&gt;&lt;P&gt;- Adding the source (name) as a column to each line (Question 2)&lt;/P&gt;&lt;P&gt;- Checking if the value was already seen (Question 3)&lt;/P&gt;&lt;P&gt;- Adding a new column to each line with the value "already seen" or "first seen" (Question 3)&lt;/P&gt;&lt;P&gt;- Merging the content&lt;/P&gt;&lt;P&gt;- Changing the filename&lt;/P&gt;&lt;P&gt;- PutFile (Question 4)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Question 1&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Do I need to make a new data flow for each new data source?&lt;/STRONG&gt;&amp;nbsp;Otherwise all files end up with the same or a totally random file name.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Question 2&lt;/P&gt;&lt;P&gt;If I add a column with the same value to each line, &lt;STRONG&gt;is it better to add the value before, during or after splitting the text?&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Question 3&lt;/P&gt;&lt;P&gt;Right now my data gets saved in separate files, for example: dat_feed1.csv, data_feed2.csv.&lt;/P&gt;&lt;P&gt;How do I check if a &lt;STRONG&gt;value in the current data flow is already in my locally saved data (CSV)?&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;I don't want to get rid of the duplicates, but I need to &lt;STRONG&gt;add a column that signals whether the value was already seen or not&lt;/STRONG&gt;. How is this possible?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Question 4&lt;/P&gt;&lt;P&gt;Finally, I am struggling with how to save my files, because I need them to be saved separately and additionally appended to a combined file. Saving the separate files is basic and works fine. About appending to a file, I have read about different solutions, mostly involving Groovy scripts.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Is ExecuteGroovyScript the right way to go?&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I hope you can help me, and I am looking forward to your answers.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Best regards&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Maurice&lt;/P&gt;</description>
      <pubDate>Sun, 31 Jan 2021 10:31:52 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Dataflow-question-and-special-case-with-duplicates/m-p/310692#M224257</guid>
      <dc:creator>Mandrill</dc:creator>
      <dc:date>2021-01-31T10:31:52Z</dc:date>
    </item>
    <item>
      <title>Re: Dataflow question and special case with duplicates</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Dataflow-question-and-special-case-with-duplicates/m-p/310740#M224280</link>
      <description>&lt;P&gt;I will try to nudge you in the right direction without spoiling everything:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Q1: Look into attributes. You could have a processor assign an attribute to the flowfile when it is loaded in; this attribute can later be used to route or name files.&lt;/P&gt;&lt;P&gt;Q2: If it is possible to use record-based processors and avoid splitting files into individual records, do it. It can be 100x more efficient.&lt;/P&gt;&lt;P&gt;Q3: NiFi is great for working with individual messages, but not so much for working with context (e.g. whether a message is a duplicate). I suppose you could do some kind of lookup of new messages against existing messages, but you should avoid this where possible. Think about something like Spark/Flink, or even Python or SQL batch solutions, to detect duplicates.&lt;/P&gt;&lt;P&gt;Q4: I don't think you will run into NiFi limitations here anytime soon. The question is probably more which file format can take all the updates and still perform well enough.&lt;/P&gt;</description>
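      <content:encoded><![CDATA[
For the Q3 suggestion of a Python batch solution, here is a minimal sketch of how the "first seen" / "already seen" column could be added outside NiFi. It is an illustration only, not something from the thread: the function name annotate_rows, the column names (value, timestamp, source, status) and the in-memory set used as history are all assumptions. A real setup would probably load the history from the locally saved CSV files or a database.

```python
import csv
import io
from datetime import datetime, timezone


def annotate_rows(csv_text, source_name, seen_values, value_column="value"):
    """Add timestamp, source and seen-status columns to every CSV row.

    `seen_values` is a set that is updated in place, so passing the same
    set for later batches carries the history forward. Duplicates are
    kept, they are only labelled.
    """
    # One timestamp per batch (hypothetical choice; per-row would also work).
    timestamp = datetime.now(timezone.utc).isoformat()
    reader = csv.DictReader(io.StringIO(csv_text))
    annotated = []
    for row in reader:
        value = row[value_column]
        status = "already seen" if value in seen_values else "first seen"
        seen_values.add(value)
        annotated.append(
            {**row, "timestamp": timestamp, "source": source_name, "status": status}
        )
    return annotated


# Hypothetical example data: two small batches from two feeds.
seen = set()
batch_1 = annotate_rows("value\na\nb\na\n", "feed1", seen)
batch_2 = annotate_rows("value\nb\nc\n", "feed2", seen)

print([row["status"] for row in batch_1])
print([row["status"] for row in batch_2])
```

Because the set is shared between calls, a value that appeared in feed1 is labelled "already seen" when it shows up again in feed2. The same idea maps to a SQL lookup or a Spark/Flink job once the history no longer fits in memory.
      ]]></content:encoded>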
      <pubDate>Mon, 01 Feb 2021 14:16:39 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Dataflow-question-and-special-case-with-duplicates/m-p/310740#M224280</guid>
      <dc:creator>DennisJaheruddi</dc:creator>
      <dc:date>2021-02-01T14:16:39Z</dc:date>
    </item>
    <item>
      <title>Re: Dataflow question and special case with duplicates</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Dataflow-question-and-special-case-with-duplicates/m-p/310988#M224399</link>
      <description>&lt;P&gt;Hello Dennis,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thank you for the reply. It really helped a lot!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Q1: That worked very well with the UpdateAttribute processor.&lt;/P&gt;&lt;P&gt;Q2: This also worked. I had the settings of the csvWriter service (UpdateRecord) messed up, but it works fine now.&lt;/P&gt;&lt;P&gt;Q3: That is a bummer; I had hoped it would be a piece of cake to implement. But I will look into one of the mentioned tools and figure it out.&lt;/P&gt;&lt;P&gt;Q4: True that. The file is going to explode with data.&lt;/P&gt;</description>
      <pubDate>Thu, 04 Feb 2021 09:57:40 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Dataflow-question-and-special-case-with-duplicates/m-p/310988#M224399</guid>
      <dc:creator>Mandrill</dc:creator>
      <dc:date>2021-02-04T09:57:40Z</dc:date>
    </item>
  </channel>
</rss>

