Support Questions

budati · ‎09-13-2019

Hi,

I have a csv file with dynamic columns and headers and there are multiple columns with same header name "SKIP" that needs to be removed from the file before ingesting them into the database using PutDatabaseRecord. How can I delete the multiple columns with header name "SKIP'?

Shu_ashu · ‎09-21-2019

@budati

For this case define your avro schema(with one field) to read incoming flowfile with some delimiter that doesn't exist in flowfile.

So that whole row will be read as string then we can filter out the records by using not like
(or) using regex operator in apache calicite.

Select * from flowfile where col1 not like 'SKIP'

Now output flowfile will not having any records that have SKIP in them and this solution will work dynamically for any number of columns.

View solution in original post

Shelton · ‎09-13-2019

@budati

There is a good response by Burgess that should work out even for you CSV with duplicate headers

budati · ‎09-17-2019

Hi,

I think it is slightly different, it is remove header when a file is split in to smaller files where as for me it is one file with multiple columns with same header name and i need to ignore certain columns based on a column name.

Shu_ashu · ‎09-17-2019

@budati

You can use QueryRecord processor and add new SQL query to select only the records that don't have value "SKIP" for the field by using Apache Calicite SQL parser.

-

For more reference regards to QueryRecord processor refer to this link.

budati · ‎09-19-2019

@Shu_ashu

Skip is column header name, so how does the query look like to exclude columns whose header name is "SKIP"?

Shu_ashu · ‎09-20-2019

@budati

Define Avro schema for record reader as col1 and col2...etc.

Treat first line as header property value as false

Add new query in QueryRecord processor as

select * from FLOWFILE where col1 != "SKIP"

(or)

select * from FLOWFILE where col1 <> "SKIP"

**NOTE** assuming col1 has "SKIP" in it.

For record writer define avro schema with your actual fileldnames.

Now queryrecord will exclude all the records that have "SKIP" in them and writes the flowfile with actual fieldnames in mentioned format.

budati · ‎09-20-2019

I think maybe I didn't explain it well. SKIP value is not in the rows, SKIP is in the column header. when we say col1 <> 'SKIP' i believe it will skip all rows with value 'SKIP' but my intention is to remove columns whose header is called 'SKIP'

Here is an example of file header :

The file is not predefined, no of columns in a file will vary, header positions will vary, place where 'SKIP' can exist in the header will vary from file to file.

Shu_ashu · ‎09-21-2019

@budati

For this case define your avro schema(with one field) to read incoming flowfile with some delimiter that doesn't exist in flowfile.

So that whole row will be read as string then we can filter out the records by using not like
(or) using regex operator in apache calicite.

Select * from flowfile where col1 not like 'SKIP'

Now output flowfile will not having any records that have SKIP in them and this solution will work dynamically for any number of columns.

Cloudera Community

Support Questions

CSV file with Duplicate Headers