Insert CSV: route erroneous lines to failure

Contributor

Hello,

I have a CSV file that I want to insert into a table. Sometimes the file contains string values in columns that are typed as long, which understandably causes insert failures. I'm looking for a way to set those lines aside (in a separate flow file) and insert the rest. I'm using the PutDatabaseRecord processor to insert the CSV file.

Thank you


Hi @Amira khalifa,

You might want to have a look at the ValidateRecord or ValidateCsv processors.

Hope this helps.

Hi @Amira khalifa

Here is an example implementation using the ValidateCsv processor.

Sample Data

Name,age,gender,score
A,23,m,56
B,35,f,76
C,"54",m,87
D,19,f,56

String Columns - Name & Gender

Long Columns - Age & Score

Bad Record - Record #3. Has age as "54", a string value. Expected Long.

Sample flow

[screenshot: sample flow]

Note that ValidateCsv has two downstream relationships: valid and invalid.

Here is a snapshot of the ValidateCsv processor configuration.

[screenshot: ValidateCsv processor configuration]

Pay attention to the Schema property: it consists of four functions (cell processors) separated by commas, and the actual value of the property is

ParseChar(), ParseLong(), ParseChar(), ParseLong()

This means the incoming flow file has 4 columns, and the expected data types of the columns are Char, Long, Char, and Long, respectively.
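
Roughly speaking, and just as a plain-Python illustration of what that schema expresses (not how ValidateCsv checks it internally; is_char and is_long here are my own stand-ins for ParseChar and ParseLong):

def is_char(value):
    # Stand-in for ParseChar(): a single character.
    return len(value) == 1

def is_long(value):
    # Stand-in for ParseLong(): the value parses as an integer.
    try:
        int(value)
        return True
    except ValueError:
        return False

# One check per column: Name, age, gender, score.
CHECKS = [is_char, is_long, is_char, is_long]

def record_is_valid(row):
    return len(row) == len(CHECKS) and all(
        check(value) for check, value in zip(CHECKS, row))

print(record_is_valid(["A", "23", "m", "56"]))  # True: Char, Long, Char, Long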

Also, I have set the Validation Strategy to "Line by line". With this setting every line of every flow file is evaluated individually, so only the bad records are filtered out. The other option is "FlowFile validation", which routes the entire flow file to the invalid relationship if any of its records does not match the expected schema. You can choose according to your use case (a small sketch contrasting the two strategies follows after the snapshots below).

As you can see, the valid and invalid relations each hold one flow file. If you list the queues and examine the flow files, you will find that all the valid records (the records matching the expected schema) have been merged into one flow file and routed to the valid relation, while the remaining records have been merged into another flow file and routed to the invalid relation. The snapshots below illustrate this.

Valid relation queue

[screenshot: valid relation queue]

Invalid relation queue

[screenshot: invalid relation queue]

You can now carry out the operations you need on both the valid data that matches your schema and the invalid data that does not meet your expectations!
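
To make the difference between the two Validation Strategy options more concrete, here is a minimal plain-Python sketch (an illustration only, not NiFi's implementation; for brevity it only checks that the age and score columns parse as longs):

import csv
import io

def parses_as_long(value):
    # Stand-in for ParseLong(): does the value parse as an integer?
    try:
        int(value)
        return True
    except ValueError:
        return False

# Column positions the example schema expects to be Long (age, score).
LONG_COLUMNS = (1, 3)

def record_is_valid(row):
    return len(row) > max(LONG_COLUMNS) and all(
        parses_as_long(row[i]) for i in LONG_COLUMNS)

def split_line_by_line(csv_text):
    # "Line by line": each record is judged on its own, so only the bad
    # records end up on the invalid side.
    records = list(csv.reader(io.StringIO(csv_text)))[1:]  # skip header
    valid = [r for r in records if record_is_valid(r)]
    invalid = [r for r in records if not record_is_valid(r)]
    return valid, invalid

def split_whole_flowfile(csv_text):
    # "FlowFile validation": a single bad record sends the entire file
    # to the invalid relationship.
    records = list(csv.reader(io.StringIO(csv_text)))[1:]  # skip header
    if all(record_is_valid(r) for r in records):
        return records, []
    return [], records

In other words, with "Line by line" a bad row only costs you that row, while with "FlowFile validation" it costs you the whole file.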

Hope that helps!

Contributor

@Rahul Soni

Thank you for your very detailed answer. The problem is that I have to define the schema of my table in ValidateCsv. My CSV file has 15 columns, and I want to automate the work so that it does not depend on a single file and I avoid entering values manually.

There can be multiple ways of doing that. You could have an attribute associated with your files that helps you identify the file type, and then use the associated schema, which you define once and reuse wherever you want (see the sketch at the end of this reply).

This was just an example to show you how the validation works.
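
For what it's worth, here is a minimal sketch of that idea in plain Python; the file-type values and the 15-column schema string are hypothetical placeholders, not something taken from your flow:

# Schemas keyed by a hypothetical file-type value, e.g. an attribute you
# set with UpdateAttribute when the file enters the flow. Each schema is
# written once and reused for every file of that type.
SCHEMAS = {
    # the 4-column sample used in this answer: Char, Long, Char, Long
    "sample_4_columns": "ParseChar(), ParseLong(), ParseChar(), ParseLong()",
    # hypothetical placeholder for your 15-column file; fill in the real
    # cell processors once here
    "my_15_column_file": "ParseChar(), ParseLong(), ...",
}

def schema_for(file_type):
    # Return the ValidateCsv Schema value to use for a given file type.
    return SCHEMAS[file_type]

print(schema_for("sample_4_columns"))

In NiFi itself you would typically keep that mapping in flow file attributes (for example via UpdateAttribute) and, if the Schema property accepts Expression Language in your version, reference the attribute from ValidateCsv, so each schema is still defined only once per file type.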
