<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Find and remove duplicate entries - NIFI in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Find-and-remove-duplicate-entries-NIFI/m-p/349541#M235689</link>
    <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;Is there a way to find and remove duplicate entries across two FlowFiles?&lt;/P&gt;&lt;P&gt;I have one FlowFile, generated by a SQL processor, with entries from my database. The other one contains new and "old" entries. So, in order to write only the new entries to the database, I have to detect and remove the entries that already exist in the other FlowFile.&lt;/P&gt;&lt;P&gt;I already tried the HashContent processor, but it hashes the content of the whole file. I would need a processor that hashes line by line (and then compares all hashes with each other).&lt;/P&gt;&lt;P&gt;Thanks for your help!&lt;/P&gt;</description>
    <pubDate>Thu, 04 Aug 2022 11:25:20 GMT</pubDate>
    <dc:creator>code</dc:creator>
    <dc:date>2022-08-04T11:25:20Z</dc:date>
    <item>
      <title>Find and remove duplicate entries - NIFI</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Find-and-remove-duplicate-entries-NIFI/m-p/349541#M235689</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;Is there a way to find and remove duplicate entries across two FlowFiles?&lt;/P&gt;&lt;P&gt;I have one FlowFile, generated by a SQL processor, with entries from my database. The other one contains new and "old" entries. So, in order to write only the new entries to the database, I have to detect and remove the entries that already exist in the other FlowFile.&lt;/P&gt;&lt;P&gt;I already tried the HashContent processor, but it hashes the content of the whole file. I would need a processor that hashes line by line (and then compares all hashes with each other).&lt;/P&gt;&lt;P&gt;Thanks for your help!&lt;/P&gt;</description>
      <pubDate>Thu, 04 Aug 2022 11:25:20 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Find-and-remove-duplicate-entries-NIFI/m-p/349541#M235689</guid>
      <dc:creator>code</dc:creator>
      <dc:date>2022-08-04T11:25:20Z</dc:date>
    </item>
    <item>
      <title>Re: Find and remove duplicate entries - NIFI</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Find-and-remove-duplicate-entries-NIFI/m-p/349589#M235695</link>
      <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/99564"&gt;@code&lt;/a&gt;&lt;BR /&gt;&lt;BR /&gt;Have you considered using &lt;A href="https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.16.3/org.apache.nifi.processors.standard.GenerateTableFetch/index.html" target="_self"&gt;GenerateTableFetch&lt;/A&gt; (which generates SQL that you then feed to &lt;A href="https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.16.3/org.apache.nifi.processors.standard.ExecuteSQL/index.html" target="_self"&gt;ExecuteSQL&lt;/A&gt;), &lt;A href="https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.16.3/org.apache.nifi.processors.standard.QueryDatabaseTable/index.html" target="_self"&gt;QueryDatabaseTable&lt;/A&gt;, or &lt;A href="https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.16.3/org.apache.nifi.processors.standard.QueryDatabaseTableRecord/index.html" target="_self"&gt;QueryDatabaseTableRecord&lt;/A&gt; to avoid getting old and new entries with each execution of your existing flow? Avoiding the ingestion of duplicate entries is better than trying to find duplicate entries across multiple FlowFiles.&lt;BR /&gt;&lt;BR /&gt;You can detect duplicates within a single FlowFile using &lt;A href="https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.16.3/org.apache.nifi.processors.standard.DeduplicateRecord/index.html" target="_self"&gt;DeduplicateRecord&lt;/A&gt;; however, this requires that all records be merged into a single FlowFile.&lt;BR /&gt;You can also use &lt;A href="https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.17.0/org.apache.nifi.processors.standard.DetectDuplicate/index.html" target="_self"&gt;DetectDuplicate&lt;/A&gt;; however, this requires that each FlowFile contain one entry to compare.&lt;BR /&gt;Either method adds a lot of processing to your dataflow or holds records in your flow longer than you want, so this is probably not the ideal solution.&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;If you found that this response helped with your query, please take a moment to log in and click on "&lt;STRONG&gt;Accept as Solution&lt;/STRONG&gt;" below this post.&lt;BR /&gt;&lt;BR /&gt;Thank you,&lt;/P&gt;&lt;P&gt;Matt&lt;/P&gt;</description>
      <pubDate>Thu, 04 Aug 2022 20:10:59 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Find-and-remove-duplicate-entries-NIFI/m-p/349589#M235695</guid>
      <dc:creator>MattWho</dc:creator>
      <dc:date>2022-08-04T20:10:59Z</dc:date>
    </item>
  </channel>
</rss>

