I have a CSV file that I want to insert into a table. Sometimes I have string columns to insert in long columns that pose a normal problem. I'm looking for a way to put those lines aside ( in a flowfile ) and insert the rest. I'm using the putDatabaseRecord processor to insert the csv file.
Follows an example implementation using ValidateCSV processor.
Name,age,gender,score A,23,m,56 B,35,f,76 C,"54",m,87 D,19,f,56
String Columns - Name & Gender
Long Columns - Age & Score
Bad Record - Record #3. Has age as "54", a string value. Expected Long.
Pay attention that ValidateCSV has two downstream relations, a valid and an invalid relation.
Follows the snapshot of the ValidateCSV processor.
Pay attention to the schema, it has four functions separated by a comma and the actual value of the property is
ParseChar(), ParseLong(),ParseChar(), ParseLong()
This means, the incoming flow file has 4 columns and expected data types of the columns are Char, Long, Char, and Long respectively.
Also, I have set the Validation Strategy to "Line by line". This property will evaluate every line of every flow file individually and will filter out only bad records. The other option is "FlowFile validation", which will filter the entire record if the actual schema of all records does not match expected schema. You can choose according to your use case.
If you see, the valid and invalid relation both have 1 flow file each. If you list the queue and examine the flow files, you will find out that all the valid records, that record matching the expected schema, are in one flow file and have been redirected to the valid relation and rest of the records have been merged to one flow file and have been redirected to the invalid relation. Follows the snapshots for a quick illustration.
Valid relation queue
Invalid relation queue
You can do your required operations now with both the valid data matching your schema and invalid data which is not as per expectations!
Hope that helps!
thank you for your very detailed answer. The problem is that I have to define the schema of my table in validateCSV. Firstly in my CSV file I have 15 columns and I want to automate the work so that it is not dependent on a single file and therefore I avoid entering values manually.
There can be multiple ways of doing that. You may have an attribute associated to your files which will help you identify the file type and can use associated schema, that you will define one time, and can reuse it wherever you want.