<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: CSV file with Duplicate Headers in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/CSV-file-with-Duplicate-Headers/m-p/277960#M207743</link>
    <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/69319"&gt;@budati&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Define the Avro schema for the record reader with generic field names such as col1, col2, etc.&lt;/P&gt;&lt;P&gt;Set the Treat First Line as Header property to &lt;U&gt;&lt;STRONG&gt;false&lt;/STRONG&gt;&lt;/U&gt;.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Add a new query in the QueryRecord processor:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;select * from FLOWFILE where col1 != 'SKIP'&lt;/LI-CODE&gt;&lt;P&gt;(or)&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;select * from FLOWFILE where col1 &amp;lt;&amp;gt; 'SKIP'&lt;/LI-CODE&gt;&lt;P&gt;**NOTE** This assumes &lt;STRONG&gt;col1&lt;/STRONG&gt; has "SKIP" in it.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;For the record writer, define an &lt;STRONG&gt;Avro schema with your actual field names&lt;/STRONG&gt;.&lt;/P&gt;&lt;P&gt;QueryRecord will then exclude all the records that have "SKIP" in col1 and write the flowfile with the actual field names in the format configured on the writer.&lt;/P&gt;</description>
    <pubDate>Fri, 20 Sep 2019 16:42:03 GMT</pubDate>
    <dc:creator>Shu_ashu</dc:creator>
    <dc:date>2019-09-20T16:42:03Z</dc:date>
    <item>
      <title>CSV file with Duplicate Headers</title>
      <link>https://community.cloudera.com/t5/Support-Questions/CSV-file-with-Duplicate-Headers/m-p/270677#M207483</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have a CSV file with dynamic columns and headers. Multiple columns share the same header name "SKIP" and need to be removed from the file before it is ingested into the database using PutDatabaseRecord. How can I delete the multiple columns with the header name "SKIP"?&lt;/P&gt;</description>
      <pubDate>Fri, 13 Sep 2019 19:55:56 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/CSV-file-with-Duplicate-Headers/m-p/270677#M207483</guid>
      <dc:creator>budati</dc:creator>
      <dc:date>2019-09-13T19:55:56Z</dc:date>
    </item>
    <item>
      <title>Re: CSV file with Duplicate Headers</title>
      <link>https://community.cloudera.com/t5/Support-Questions/CSV-file-with-Duplicate-Headers/m-p/270687#M207488</link>
      <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/69319"&gt;@budati&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;There is a good response by Burgess that should also work for you: &lt;A href="https://community.cloudera.com/t5/Support-Questions/How-to-remove-the-header-when-using-NiFi-SplitText-processor/m-p/227810/highlight/true#M189670" target="_blank" rel="noopener"&gt;CSV with duplicate headers&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Sat, 14 Sep 2019 00:13:58 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/CSV-file-with-Duplicate-Headers/m-p/270687#M207488</guid>
      <dc:creator>Shelton</dc:creator>
      <dc:date>2019-09-14T00:13:58Z</dc:date>
    </item>
    <item>
      <title>Re: CSV file with Duplicate Headers</title>
      <link>https://community.cloudera.com/t5/Support-Questions/CSV-file-with-Duplicate-Headers/m-p/277562#M207612</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I think my case is slightly different. That thread is about removing the header when a file is split into smaller files, whereas I have one file with multiple columns that share the same header name, and I need to ignore certain columns based on the column name.&lt;/P&gt;</description>
      <pubDate>Tue, 17 Sep 2019 22:36:52 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/CSV-file-with-Duplicate-Headers/m-p/277562#M207612</guid>
      <dc:creator>budati</dc:creator>
      <dc:date>2019-09-17T22:36:52Z</dc:date>
    </item>
    <item>
      <title>Re: CSV file with Duplicate Headers</title>
      <link>https://community.cloudera.com/t5/Support-Questions/CSV-file-with-Duplicate-Headers/m-p/277573#M207616</link>
      <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/69319"&gt;@budati&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;You can use the QueryRecord processor and &lt;STRONG&gt;add a new SQL query to select only the records&lt;/STRONG&gt; that don't have the value &lt;STRONG&gt;"SKIP"&lt;/STRONG&gt; for the field, using the &lt;A href="https://calcite.apache.org/docs/reference.html" target="_self"&gt;Apache Calcite SQL parser&lt;/A&gt;.&lt;/P&gt;&lt;P&gt;-&lt;/P&gt;&lt;P&gt;For more information on the QueryRecord processor, refer to &lt;A href="https://community.cloudera.com/t5/Community-Articles/Running-SQL-on-FlowFiles-using-QueryRecord-Processor-Apache/ta-p/246671" target="_self"&gt;this&lt;/A&gt; link.&lt;/P&gt;</description>
      <pubDate>Wed, 18 Sep 2019 04:31:59 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/CSV-file-with-Duplicate-Headers/m-p/277573#M207616</guid>
      <dc:creator>Shu_ashu</dc:creator>
      <dc:date>2019-09-18T04:31:59Z</dc:date>
    </item>
    <item>
      <title>Re: CSV file with Duplicate Headers</title>
      <link>https://community.cloudera.com/t5/Support-Questions/CSV-file-with-Duplicate-Headers/m-p/277894#M207710</link>
      <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/55311"&gt;@Shu_ashu&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;SKIP is a column header name, so what does the query look like to exclude columns whose header name is "SKIP"?&lt;/P&gt;</description>
      <pubDate>Thu, 19 Sep 2019 23:17:06 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/CSV-file-with-Duplicate-Headers/m-p/277894#M207710</guid>
      <dc:creator>budati</dc:creator>
      <dc:date>2019-09-19T23:17:06Z</dc:date>
    </item>
    <item>
      <title>Re: CSV file with Duplicate Headers</title>
      <link>https://community.cloudera.com/t5/Support-Questions/CSV-file-with-Duplicate-Headers/m-p/277960#M207743</link>
      <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/69319"&gt;@budati&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Define the Avro schema for the record reader with generic field names such as col1, col2, etc.&lt;/P&gt;&lt;P&gt;Set the Treat First Line as Header property to &lt;U&gt;&lt;STRONG&gt;false&lt;/STRONG&gt;&lt;/U&gt;.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Add a new query in the QueryRecord processor:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;select * from FLOWFILE where col1 != 'SKIP'&lt;/LI-CODE&gt;&lt;P&gt;(or)&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;select * from FLOWFILE where col1 &amp;lt;&amp;gt; 'SKIP'&lt;/LI-CODE&gt;&lt;P&gt;**NOTE** This assumes &lt;STRONG&gt;col1&lt;/STRONG&gt; has "SKIP" in it.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;For the record writer, define an &lt;STRONG&gt;Avro schema with your actual field names&lt;/STRONG&gt;.&lt;/P&gt;&lt;P&gt;QueryRecord will then exclude all the records that have "SKIP" in col1 and write the flowfile with the actual field names in the format configured on the writer.&lt;/P&gt;</description>
      <pubDate>Fri, 20 Sep 2019 16:42:03 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/CSV-file-with-Duplicate-Headers/m-p/277960#M207743</guid>
      <dc:creator>Shu_ashu</dc:creator>
      <dc:date>2019-09-20T16:42:03Z</dc:date>
    </item>
    <item>
      <title>Re: CSV file with Duplicate Headers</title>
      <link>https://community.cloudera.com/t5/Support-Questions/CSV-file-with-Duplicate-Headers/m-p/277974#M207754</link>
      <description>&lt;P&gt;I think maybe I didn't explain it well. The SKIP value is not in the rows; SKIP is in the column header. When we say col1 &amp;lt;&amp;gt; 'SKIP', I believe it will skip all rows with the value 'SKIP', but my intention is to remove the columns whose header is 'SKIP'.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Here is an example of a file header:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;FirstName|Skip|Skip|City|State|ZipCode|Skip|LastVisitDate|Skip|ExtId&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The file layout is not predefined: the number of columns varies, the header positions vary, and the place where 'SKIP' appears in the header varies from file to file.&lt;/P&gt;</description>
      <pubDate>Fri, 20 Sep 2019 20:33:43 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/CSV-file-with-Duplicate-Headers/m-p/277974#M207754</guid>
      <dc:creator>budati</dc:creator>
      <dc:date>2019-09-20T20:33:43Z</dc:date>
    </item>
    <item>
      <title>Re: CSV file with Duplicate Headers</title>
      <link>https://community.cloudera.com/t5/Support-Questions/CSV-file-with-Duplicate-Headers/m-p/277990#M207765</link>
      <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/69319"&gt;@budati&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;For this case, define your Avro schema (with one field)&amp;nbsp;to read the incoming flowfile using a delimiter that doesn't exist in the flowfile.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;That way the &lt;U&gt;&lt;STRONG&gt;whole row&lt;/STRONG&gt;&lt;/U&gt; will be &lt;STRONG&gt;read as a string&lt;/STRONG&gt;, and we can then filter out the records by using &lt;STRONG&gt;not like&lt;/STRONG&gt;&lt;BR /&gt;(or) using a &lt;STRONG&gt;regex&lt;/STRONG&gt; operator in Apache Calcite.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;Select * from flowfile where col1 not like '%SKIP%'&lt;/LI-CODE&gt;&lt;P&gt;Now the output flowfile will not have any records that contain SKIP, and this solution will work dynamically for any number of columns.&lt;/P&gt;</description>
      <pubDate>Sat, 21 Sep 2019 17:04:06 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/CSV-file-with-Duplicate-Headers/m-p/277990#M207765</guid>
      <dc:creator>Shu_ashu</dc:creator>
      <dc:date>2019-09-21T17:04:06Z</dc:date>
    </item>
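    <!--
The requirement stated in the thread (remove every column whose header is Skip, wherever it appears and however many columns the file has) can be sketched in plain Python. This is only an illustration under stated assumptions, not the QueryRecord approach from the replies above: the function name drop_skip_columns is hypothetical, the pipe delimiter is taken from the example header in the thread, the header match ignores case because the thread writes both SKIP and Skip, and the data row in the sample is made up.

```python
import csv
import io


def drop_skip_columns(text, delimiter="|", skip_name="skip"):
    """Return delimited text with every column whose header equals skip_name removed.

    Hypothetical helper: the name and defaults are assumptions for illustration.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return ""
    header = rows[0]
    # Indexes of the columns to keep. The comparison ignores case and
    # surrounding whitespace (an assumption, not something the thread states).
    keep = [i for i, name in enumerate(header) if name.strip().lower() != skip_name]
    out = io.StringIO()
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    for row in rows:
        # Skip indexes past the end of a short row instead of raising IndexError.
        writer.writerow([row[i] for i in keep if i < len(row)])
    return out.getvalue()


# Header copied from the thread; the data row is invented sample data.
sample = (
    "FirstName|Skip|Skip|City|State|ZipCode|Skip|LastVisitDate|Skip|ExtId\n"
    "Ann|x|y|Austin|TX|73301|z|2019-09-01|w|42\n"
)
result = drop_skip_columns(sample)
print(result)
```

Because the columns are selected by header position at run time, nothing about the column count or the placement of Skip has to be known in advance. Inside NiFi, logic of this kind would have to live in a scripting processor rather than in a QueryRecord query, which filters rows.
    -->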
  </channel>
</rss>

