
How to achieve data profiling using Spark?


Hi All,


I am going to use Spark to gather profiling information for each attribute (column).

Is it possible to compute the profiles below using Spark?


The profilers are as follows:

1) Missing values: find the null count and null percentage for every column.

2) Unique values: find the unique count and unique percentage for every column.

In similar fashion, I am planning to calculate other profilers as well, such as blank count, populated count, data type length, duplicate count, min, max, median, and average.


Duplicate count: how many duplicate values occur in a specific column.

Unique count: the number of distinct values in a column.

Populated count: how many values are not null in a column.



Is it possible to implement all of the above profiles using Spark, and if so, how?


As far as I know, a Spark DataFrame has a describe() function, but it works only on numeric columns and not on string columns. So how can min and max be calculated for a string-type column?


Please share your experiences with data profiling, or if you are aware of any relevant links, please share them with me.


FYI, my source is a Hive table, on which we are planning to implement the profiling.