
Data validation in pyspark




1) I have a text file with 74 columns in each row. There are several thousand rows.

2) The metadata (length and datatype of each column) needs to be validated for every row.

3) Validate that the datatype and length of each column are in sync with the metadata (the client has provided a spreadsheet containing the metadata information).

4) If the length or datatype does not match the metadata, write the row to an exception file (filename_exception) and throw an exception with the row number.

5) If a field has leading or trailing whitespace, write it to the exception file along with the row number.

I am new to PySpark. Kindly help me.
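One possible approach is to write the per-row validation as a plain Python function and then apply it to each line of the file with an RDD transformation. The sketch below is only an illustration under assumptions I have made up: the metadata dictionary (`METADATA`), the delimiter (`|`), and the column specs are placeholders, since the actual 74-column spec would come from the client's spreadsheet.

```python
# Hypothetical metadata spec: column index -> (expected datatype, max length).
# In practice this would be loaded from the client's spreadsheet.
METADATA = {
    0: ("int", 5),
    1: ("str", 10),
    2: ("str", 8),
}

def validate_row(row_number, line, metadata, delimiter="|"):
    """Return a list of exception records for one delimited row.

    Each record is (row_number, column_index_or_reason, error_kind).
    """
    errors = []
    fields = line.split(delimiter)
    # Column-count mismatch makes per-column checks meaningless; stop here.
    if len(fields) != len(metadata):
        errors.append((row_number, "column count", len(fields)))
        return errors
    for idx, field in enumerate(fields):
        expected_type, max_len = metadata[idx]
        # Rule 5: flag leading/trailing whitespace.
        if field != field.strip():
            errors.append((row_number, idx, "whitespace"))
        value = field.strip()
        # Rules 3/4: length check against the metadata.
        if len(value) > max_len:
            errors.append((row_number, idx, "length"))
        # Rules 3/4: simple datatype check (int columns must parse as int).
        if expected_type == "int":
            try:
                int(value)
            except ValueError:
                errors.append((row_number, idx, "datatype"))
    return errors

# In PySpark the same function can be applied line by line, e.g.:
#   rdd = spark.sparkContext.textFile("data.txt")
#   exceptions = (rdd.zipWithIndex()
#                    .flatMap(lambda p: validate_row(p[1] + 1, p[0], METADATA)))
#   exceptions.map(str).saveAsTextFile("filename_exception")
```

Keeping the check as an ordinary function makes it easy to unit-test locally before wiring it into Spark; `zipWithIndex` supplies the row number required for the exception file.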