Support Questions
Find answers, ask questions, and share your expertise

Data validation in pyspark




1) I have a text file with 74 columns in each row. There are several thousand rows.

2) The metadata (length and data type of each column) needs to be validated.

3) Validate that the data type and length of each column are in sync with the metadata (the client has provided a spreadsheet with the metadata information).

4) If the length or data type does not match the metadata, create a file (filename_exception) and record an exception there with the row number.

5) If a field has leading or trailing whitespace, put it in the exception file along with the row number.

I am new to PySpark. Kindly help me.
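One possible starting point: a plain-Python row validator that covers steps 3–5 and can later be applied per row in Spark. This is only a sketch; the metadata format (a list of `(dtype, max_length)` pairs), the `"|"` delimiter, and the file names are assumptions, since the actual spreadsheet layout and file format were not shared.

```python
# Sketch: validate one row against client-provided metadata.
# `metadata` is assumed to be a list of (dtype, max_length) pairs,
# one per column, built from the client's spreadsheet.

def validate_row(row_num, fields, metadata):
    """Return (row_num, column_index, reason) tuples for every violation."""
    issues = []
    for col, (field, (dtype, max_len)) in enumerate(zip(fields, metadata)):
        if field != field.strip():
            issues.append((row_num, col, "whitespace"))  # step 5
        value = field.strip()
        if len(value) > max_len:
            issues.append((row_num, col, "length"))      # step 4: length check
        if dtype == "int" and not value.lstrip("-").isdigit():
            issues.append((row_num, col, "datatype"))    # step 4: type check
    return issues

# In PySpark the same function could run row by row (delimiter "|"
# and file names are hypothetical):
#   rdd = sc.textFile("input.txt").zipWithIndex()
#   bad = rdd.flatMap(lambda p: validate_row(p[1] + 1, p[0].split("|"), metadata))
#   bad.saveAsTextFile("filename_exception")
```

Keeping the validation logic in an ordinary function makes it easy to unit-test locally before distributing it with `flatMap`; `zipWithIndex` supplies the row number the exception file needs.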
