Created 09-13-2019 12:55 PM
Hi,
I have a csv file with dynamic columns and headers and there are multiple columns with same header name "SKIP" that needs to be removed from the file before ingesting them into the database using PutDatabaseRecord. How can I delete the multiple columns with header name "SKIP'?
Created 09-21-2019 10:04 AM
For this case define your avro schema(with one field) to read incoming flowfile with some delimiter that doesn't exist in flowfile.
So that whole row will be read as string then we can filter out the records by using not like
(or) using regex operator in apache calicite.
Select * from flowfile where col1 not like 'SKIP'
Now output flowfile will not having any records that have SKIP in them and this solution will work dynamically for any number of columns.
Created 09-13-2019 05:13 PM
There is a good response by Burgess that should work out even for you CSV with duplicate headers
Created 09-17-2019 03:36 PM
Hi,
I think it is slightly different, it is remove header when a file is split in to smaller files where as for me it is one file with multiple columns with same header name and i need to ignore certain columns based on a column name.
Created 09-17-2019 09:31 PM
You can use QueryRecord processor and add new SQL query to select only the records that don't have value "SKIP" for the field by using Apache Calicite SQL parser.
-
For more reference regards to QueryRecord processor refer to this link.
Created 09-19-2019 04:17 PM
Skip is column header name, so how does the query look like to exclude columns whose header name is "SKIP"?
Created 09-20-2019 09:42 AM
Define Avro schema for record reader as col1 and col2...etc.
Treat first line as header property value as false
Add new query in QueryRecord processor as
select * from FLOWFILE where col1 != "SKIP"
(or)
select * from FLOWFILE where col1 <> "SKIP"
**NOTE** assuming col1 has "SKIP" in it.
For record writer define avro schema with your actual fileldnames.
Now queryrecord will exclude all the records that have "SKIP" in them and writes the flowfile with actual fieldnames in mentioned format.
Created 09-20-2019 01:33 PM
I think maybe I didn't explain it well. SKIP value is not in the rows, SKIP is in the column header. when we say col1 <> 'SKIP' i believe it will skip all rows with value 'SKIP' but my intention is to remove columns whose header is called 'SKIP'
Here is an example of file header :
FirstName|Skip|Skip|City|State|ZipCode|Skip|LastVisitDate|Skip|ExtId
The file is not predefined, no of columns in a file will vary, header positions will vary, place where 'SKIP' can exist in the header will vary from file to file.
Created 09-21-2019 10:04 AM
For this case define your avro schema(with one field) to read incoming flowfile with some delimiter that doesn't exist in flowfile.
So that whole row will be read as string then we can filter out the records by using not like
(or) using regex operator in apache calicite.
Select * from flowfile where col1 not like 'SKIP'
Now output flowfile will not having any records that have SKIP in them and this solution will work dynamically for any number of columns.