Support Questions

Find answers, ask questions, and share your expertise

CSV file with Duplicate Headers

avatar
Explorer

Hi,

 

I have a csv file with dynamic columns and headers and there are multiple columns with same header name "SKIP"  that needs to be removed from the file before ingesting them into the database using PutDatabaseRecord. How can I delete the multiple columns with header name "SKIP'?

1 ACCEPTED SOLUTION

avatar
Master Guru

@budati 

 

For this case define your avro schema(with one field) to read incoming flowfile with some delimiter that doesn't exist in flowfile.

 

So that whole row will be read as string then we can filter out the records by using not like
(or) using regex operator in apache calicite.

 

Select * from flowfile where col1 not like 'SKIP'

Now output flowfile will not having any records that have SKIP in them and this solution will work dynamically for any number of columns.

View solution in original post

7 REPLIES 7

avatar
Master Mentor

@budati 

There is a good response by Burgess  that  should work out even for you  CSV with duplicate headers 

avatar
Explorer

Hi,

 

I think it is slightly different, it is remove header when a file is split in to smaller files where as for me it is one file with multiple columns with same header name and i need to ignore certain columns based on a column name.

avatar
Master Guru

@budati 

You can use QueryRecord processor and add new SQL query to select only the records that don't have value "SKIP" for the field by using Apache Calicite SQL parser.

-

For more reference regards to QueryRecord processor refer to this link.

avatar
Explorer

@Shu_ashu 

 

Skip is column header name, so how does the query look like to exclude columns whose header name is "SKIP"?

avatar
Master Guru

@budati 

Define Avro schema for record reader as col1 and col2...etc.

Treat first line as header property value as false

 

Add new query in QueryRecord processor as

select * from FLOWFILE where col1 != "SKIP"

(or)

select * from FLOWFILE where col1 <> "SKIP"

**NOTE** assuming col1 has "SKIP" in it.

 

For record writer define avro schema with your actual fileldnames.

Now queryrecord will exclude all the records that have "SKIP" in them and writes the flowfile with actual fieldnames in mentioned format.

avatar
Explorer

I think maybe I didn't explain it well. SKIP value is not in the rows, SKIP is in the column header. when we say  col1 <> 'SKIP' i believe it will skip all rows with value 'SKIP' but my intention is to remove columns whose header is called 'SKIP'

 

Here is an example of file header :

 

FirstName|Skip|Skip|City|State|ZipCode|Skip|LastVisitDate|Skip|ExtId

 

The file is not predefined, no of columns in a file will vary, header positions will vary, place where 'SKIP' can exist in the header will vary from file to file.

 

 

avatar
Master Guru

@budati 

 

For this case define your avro schema(with one field) to read incoming flowfile with some delimiter that doesn't exist in flowfile.

 

So that whole row will be read as string then we can filter out the records by using not like
(or) using regex operator in apache calicite.

 

Select * from flowfile where col1 not like 'SKIP'

Now output flowfile will not having any records that have SKIP in them and this solution will work dynamically for any number of columns.