<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Duplicated rows are being generated during the use of GenerateTableFetch in incremental mode. in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Duplicated-rows-are-being-generated-during-the-use-of/m-p/385549#M245739</link>
    <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;I am transferring data from an Oracle database to HDFS in Parquet format. The workflow, shown in the attached screenshots, uses the GenerateTableFetch processor to ingest the data in segments. The ExecuteSQL processor runs the generated queries, UpdateAttribute adds an attribute, and QueryRecord adds a new column to the flow files.&lt;/P&gt;&lt;P&gt;The source table contains approximately 22 million records. In GenerateTableFetch, the 'Date' column is set as the 'Maximum-value Columns' property, and the partition size is set to one million rows. With this configuration, all 22 million records were transferred and stored in HDFS.&lt;/P&gt;&lt;P&gt;However, during data quality checks I found missing rows and duplicate records in HDFS. The source database shows neither problem: its record count is accurate and it contains no duplicates.&lt;/P&gt;&lt;P&gt;Could you help pinpoint the possible causes of these discrepancies?&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="NiFi_Incremental.png" style="width: 588px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/40150iDBED6B1F2B50E5D2/image-size/large?v=v2&amp;amp;px=999" role="button" title="NiFi_Incremental.png" alt="NiFi_Incremental.png" /&gt;&lt;/span&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="QueryRecord_NiFi.png" style="width: 999px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/40151iEEC9D3F509208B81/image-size/large?v=v2&amp;amp;px=999" role="button" title="QueryRecord_NiFi.png" alt="QueryRecord_NiFi.png" /&gt;&lt;/span&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="UpdateAttribute_NiFi.png" 
style="width: 999px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/40152iF596D51DF318F9BC/image-size/large?v=v2&amp;amp;px=999" role="button" title="UpdateAttribute_NiFi.png" alt="UpdateAttribute_NiFi.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
    <pubDate>Tue, 26 Mar 2024 10:41:36 GMT</pubDate>
    <dc:creator>arbenosm</dc:creator>
    <dc:date>2024-03-26T10:41:36Z</dc:date>
    <item>
      <title>Duplicated rows are being generated during the use of GenerateTableFetch in incremental mode.</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Duplicated-rows-are-being-generated-during-the-use-of/m-p/385549#M245739</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;I am transferring data from an Oracle database to HDFS in Parquet format. The workflow, shown in the attached screenshots, uses the GenerateTableFetch processor to ingest the data in segments. The ExecuteSQL processor runs the generated queries, UpdateAttribute adds an attribute, and QueryRecord adds a new column to the flow files.&lt;/P&gt;&lt;P&gt;The source table contains approximately 22 million records. In GenerateTableFetch, the 'Date' column is set as the 'Maximum-value Columns' property, and the partition size is set to one million rows. With this configuration, all 22 million records were transferred and stored in HDFS.&lt;/P&gt;&lt;P&gt;However, during data quality checks I found missing rows and duplicate records in HDFS. The source database shows neither problem: its record count is accurate and it contains no duplicates.&lt;/P&gt;&lt;P&gt;Could you help pinpoint the possible causes of these discrepancies?&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="NiFi_Incremental.png" style="width: 588px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/40150iDBED6B1F2B50E5D2/image-size/large?v=v2&amp;amp;px=999" role="button" title="NiFi_Incremental.png" alt="NiFi_Incremental.png" /&gt;&lt;/span&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="QueryRecord_NiFi.png" style="width: 999px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/40151iEEC9D3F509208B81/image-size/large?v=v2&amp;amp;px=999" role="button" title="QueryRecord_NiFi.png" alt="QueryRecord_NiFi.png" /&gt;&lt;/span&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="UpdateAttribute_NiFi.png" 
style="width: 999px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/40152iF596D51DF318F9BC/image-size/large?v=v2&amp;amp;px=999" role="button" title="UpdateAttribute_NiFi.png" alt="UpdateAttribute_NiFi.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 26 Mar 2024 10:41:36 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Duplicated-rows-are-being-generated-during-the-use-of/m-p/385549#M245739</guid>
      <dc:creator>arbenosm</dc:creator>
      <dc:date>2024-03-26T10:41:36Z</dc:date>
    </item>
    <item>
      <title>Re: Duplicated rows are being generated during the use of GenerateTableFetch in incremental mode.</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Duplicated-rows-are-being-generated-during-the-use-of/m-p/385572#M245749</link>
      <description>&lt;P&gt;Welcome to the community&amp;nbsp;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/109757"&gt;@arbenosm&lt;/a&gt;.&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;I tried to find some resources for you to look over while waiting for an expert to respond, but didn't see any exact matches. I did, however, see a couple of mentions stating that you should choose a timestamp column in your Oracle table that accurately reflects updates, to ensure you only fetch data that has changed since the last successful run. Hopefully that is helpful. Otherwise, maybe&amp;nbsp;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/95503"&gt;@steven-matison&lt;/a&gt;&amp;nbsp;or &lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/35454"&gt;@MattWho&lt;/a&gt;&amp;nbsp;may have some ideas.&lt;/P&gt;</description>
      <pubDate>Tue, 26 Mar 2024 14:33:13 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Duplicated-rows-are-being-generated-during-the-use-of/m-p/385572#M245749</guid>
      <dc:creator>cjervis</dc:creator>
      <dc:date>2024-03-26T14:33:13Z</dc:date>
    </item>
    <item>
      <title>Re: Duplicated rows are being generated during the use of GenerateTableFetch in incremental mode.</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Duplicated-rows-are-being-generated-during-the-use-of/m-p/385595#M245759</link>
      <description>&lt;P&gt;&lt;SPAN&gt;The 'Date' column in our table is indeed of the TIMESTAMP data type.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 26 Mar 2024 21:41:14 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Duplicated-rows-are-being-generated-during-the-use-of/m-p/385595#M245759</guid>
      <dc:creator>arbenosm</dc:creator>
      <dc:date>2024-03-26T21:41:14Z</dc:date>
    </item>
  </channel>
</rss>

