
How to validate schema in Spark?


Hi All,

We have a JSON file as input to the Spark program (it describes the schema definition and the constraints we want to check on each column), and we want to perform some data quality checks (NOT NULL, UNIQUE) as well as schema validation.


JSON File:

{
  "id": "1",
  "name": "employee",
  "source": "local",
  "file_type": "text",
  "sub_file_type": "csv",
  "delimeter": ",",
  "path": "/user/all/dqdata/data/emp.txt",
  "columns": [
    {"column_name": "empid", "datatype": "integer", "constraints": ["not null", "unique"], "values_permitted": ["1", "2"]},
    {"column_name": "empname", "datatype": "string", "constraints": ["not null", "unique"], "values_permitted": ["1", "2"]},
    {"column_name": "salary", "datatype": "double", "constraints": ["not null", "unique"], "values_permitted": ["1", "2"]},
    {"column_name": "doj", "datatype": "date", "constraints": ["not null", "unique"], "values_permitted": ["1", "2"]},
    {"column_name": "location", "datatype": "string", "constraints": ["not null", "unique"], "values_permitted": ["1", "2"]}
  ]
}
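As a first step, the JSON config above can be parsed into a column-name → datatype mapping. The sketch below is plain Python (no Spark dependency); the helper name `load_schema` and the shortened config are illustrative assumptions, but in a real job this mapping could be translated into a Spark `StructType`.

```python
import json

def load_schema(config_text):
    """Parse the JSON config into {column_name: datatype}.

    Assumes the layout shown in the question: a top-level "columns" list
    whose entries carry "column_name" and "datatype" keys.
    """
    config = json.loads(config_text)
    return {col["column_name"]: col["datatype"] for col in config["columns"]}

# Shortened version of the config from the question, for illustration.
config_text = '''{
  "columns": [
    {"column_name": "empid", "datatype": "integer"},
    {"column_name": "empname", "datatype": "string"},
    {"column_name": "salary", "datatype": "double"}
  ]
}'''

schema = load_schema(config_text)
print(schema)  # {'empid': 'integer', 'empname': 'string', 'salary': 'double'}
```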

Sample CSV input:

empId,empname,salar,dob,location
1,a,10000,11-03-2019,pune
2,b,10020,14-03-2019,pune
3,a,10010,15-03-2019,pune
a,1,10010,15-03-2019,pune

Keep in mind that:

1) I have intentionally put invalid data in the empId and empname fields (see the last record).
2) The number of columns in the JSON file is not fixed.

Question:

How can I verify whether every record in the input data file conforms to the datatypes given in the JSON file?

We have tried the following:

1) If we load the data from the CSV file into a DataFrame with an external schema applied, Spark throws a cast exception (NumberFormatException, etc.) and terminates the program abnormally. Instead, I want execution to continue and to log a specific error such as "Datatype mismatch error for column empId".
The above scenario is triggered only when we call some RDD action on the DataFrame, which feels like a weird way to validate a schema.
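To continue execution and log mismatches instead of aborting, one option is to validate each row against the declared datatypes. The sketch below is pure Python for illustration; in Spark the same check could run inside a map over the rows, and the `CASTERS` mapping and `validate_row` helper are assumptions, not a fixed API. (Separately, Spark's CSV reader supports `.option("mode", "PERMISSIVE")` together with `columnNameOfCorruptRecord`, which keeps malformed rows in a dedicated column rather than failing.)

```python
# Hypothetical mapping from the config's datatype strings to Python casters;
# in a real Spark job these would correspond to Spark SQL types.
CASTERS = {"integer": int, "double": float, "string": str}

def validate_row(row, schema):
    """Return a list of error messages for one row; empty list means valid."""
    errors = []
    for (column, datatype), value in zip(schema.items(), row):
        try:
            CASTERS[datatype](value)
        except ValueError:
            errors.append("Datatype mismatch error for column %s" % column)
    return errors

schema = {"empId": "integer", "empname": "string", "salary": "double"}
rows = [
    ["1", "a", "10000"],
    ["a", "1", "10010"],  # invalid empId, like the last sample record
]
for row in rows:
    for err in validate_row(row, schema):
        print(err)  # prints "Datatype mismatch error for column empId"
```

Because the row is only logged, not rejected by a cast, the job keeps running over the remaining records, which matches the desired behavior described above.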


Please guide me: how can we achieve this in Spark?

Thanks in advance.
