question Re: CSV file with Duplicate Headers in Support Questions

CSV file with Duplicate Headers

budati — Fri, 13 Sep 2019 19:55:56 GMT

Hi,

I have a csv file with dynamic columns and headers and there are multiple columns with same header name "SKIP" that needs to be removed from the file before ingesting them into the database using PutDatabaseRecord. How can I delete the multiple columns with header name "SKIP'?

Re: CSV file with Duplicate Headers

Shelton — Sat, 14 Sep 2019 00:13:58 GMT

@budati

There is a good response by Burgess that should work out even for you CSV with duplicate headers

Re: CSV file with Duplicate Headers

budati — Tue, 17 Sep 2019 22:36:52 GMT

Hi,

I think it is slightly different, it is remove header when a file is split in to smaller files where as for me it is one file with multiple columns with same header name and i need to ignore certain columns based on a column name.

Re: CSV file with Duplicate Headers

Shu_ashu — Wed, 18 Sep 2019 04:31:59 GMT

@budati

You can use QueryRecord processor and add new SQL query to select only the records that don't have value "SKIP" for the field by using Apache Calicite SQL parser.

For more reference regards to QueryRecord processor refer to this link.

Re: CSV file with Duplicate Headers

budati — Thu, 19 Sep 2019 23:17:06 GMT

@Shu_ashu

Skip is column header name, so how does the query look like to exclude columns whose header name is "SKIP"?

Re: CSV file with Duplicate Headers

Shu_ashu — Fri, 20 Sep 2019 16:42:03 GMT

@budati

Define Avro schema for record reader as col1 and col2...etc.

Treat first line as header property value as false

Add new query in QueryRecord processor as

select * from FLOWFILE where col1 != "SKIP"

(or)

select * from FLOWFILE where col1 <> "SKIP"

**NOTE** assuming col1 has "SKIP" in it.

For record writer define avro schema with your actual fileldnames.

Now queryrecord will exclude all the records that have "SKIP" in them and writes the flowfile with actual fieldnames in mentioned format.

Re: CSV file with Duplicate Headers

budati — Fri, 20 Sep 2019 20:33:43 GMT

I think maybe I didn't explain it well. SKIP value is not in the rows, SKIP is in the column header. when we say col1 <> 'SKIP' i believe it will skip all rows with value 'SKIP' but my intention is to remove columns whose header is called 'SKIP'

Here is an example of file header :

The file is not predefined, no of columns in a file will vary, header positions will vary, place where 'SKIP' can exist in the header will vary from file to file.

Re: CSV file with Duplicate Headers

Shu_ashu — Sat, 21 Sep 2019 17:04:06 GMT

@budati

For this case define your avro schema(with one field) to read incoming flowfile with some delimiter that doesn't exist in flowfile.

So that whole row will be read as string then we can filter out the records by using not like
(or) using regex operator in apache calicite.

Select * from flowfile where col1 not like 'SKIP'

Now output flowfile will not having any records that have SKIP in them and this solution will work dynamically for any number of columns.