We have a JSON file as input to the Spark program (it describes the schema definition and the constraints we want to check on each column through Spark), and we want to perform data quality checks (NOT NULL, UNIQUE) as well as schema validation.
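For illustration, such a JSON schema file might look like the following. This is a hypothetical shape (the field names `columns`, `name`, `type`, and `constraints` are assumptions, not the actual format of our file):

```json
{
  "columns": [
    {"name": "empId", "type": "integer", "constraints": ["NOT NULL", "UNIQUE"]},
    {"name": "name",  "type": "string",  "constraints": ["NOT NULL"]}
  ]
}
```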
1) I have intentionally put invalid data in the empId and name fields (check the last record). 2) The number of columns in the JSON file is not fixed.
How can I ensure that every record in an input data file conforms to the datatypes given in the JSON file?
We have tried the following:
1) If we load the data from the CSV file into a DataFrame with an external schema applied, Spark throws a cast exception (NumberFormatException, etc.) and terminates the program abnormally. But I want the execution flow to continue and to log a specific error such as "Datatype mismatch error for column empId". Moreover, the exception only surfaces when we call some RDD action on the DataFrame, which feels like a weird way to validate a schema.