How to achieve data profiling using Spark?


Hi All,

 

I am going to use Spark to gather profiling information for each attribute.

Is it possible to do this in Spark for the profiles below?

 

The profilers are as follows:

1) Missing values: find out the null count and null percentage for every column.

2) Unique values: find out the unique count and unique percentage for every column.

In a similar fashion, I am planning to calculate other profilers as well, such as blank count, populated count, data-type length, duplicate count, min, max, median, and average.

 

Duplicate count: how many times duplicate values occur in a specific column.

Unique count: the number of distinct values in a column.

Populated count: how many values are not null in a column.
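
To make these concrete, here is a rough sketch of the kind of per-column computation I have in mind (Scala; `spark`, `df`, the method name profileColumns and the output column names are my own placeholders, not from any library):

// Rough sketch: per-column profiling counts on a DataFrame `df`.
// `spark`, `df`, the method name and the output column names are placeholders.
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

def profileColumns(spark: SparkSession, df: DataFrame): DataFrame = {
  import spark.implicits._
  val total = df.count()

  val rows = df.columns.toSeq.map { c =>
    val nullCount      = df.filter(col(c).isNull).count()
    val populatedCount = total - nullCount                      // non-null values
    val distinctCount  = df.select(col(c)).distinct().count()
    // number of values that occur more than once in the column
    val duplicateCount = df.groupBy(col(c)).count().filter($"count" > 1).count()

    (c, total, nullCount, 100.0 * nullCount / total,
        distinctCount, 100.0 * distinctCount / total,
        populatedCount, duplicateCount)
  }

  // one row of metrics per source column
  rows.toDF("column", "total_rows", "null_count", "null_pct",
            "distinct_count", "distinct_pct", "populated_count", "duplicate_count")
}

I realise this scans the table several times per column; perhaps the counts could be combined into a single agg() pass, so suggestions on that are welcome as well.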

 

 

Is it possible to implement all of the above profiles using Spark, and if so, how?

 

As far as I know, a DataFrame in Spark has a describe() function, but it works only on numeric columns and not on string columns. So how can I calculate the min and max for a string-type column?
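
For min and max specifically, would explicit aggregate expressions be a reasonable workaround? My assumption (please correct me) is that the min/max functions also accept string columns and compare them lexicographically:

// Sketch: min/max for every column via explicit aggregates instead of describe().
// Assumes min/max order string columns lexicographically; `df` is the source DataFrame.
import org.apache.spark.sql.functions.{col, max, min}

val minMaxExprs = df.columns.flatMap { c =>
  Seq(min(col(c)).alias(s"${c}_min"), max(col(c)).alias(s"${c}_max"))
}
df.agg(minMaxExprs.head, minMaxExprs.tail: _*).show(false)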

 

Please share any pointers you have on data profiling, or if you are aware of any relevant links, please share them with me.

 

FYI, my source is a Hive table, on which we are planning to implement the profiling.
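
I assume the starting point would be something like this (the database/table name below is a placeholder), with the profiling sketch above applied to the resulting DataFrame:

// Sketch: load the source Hive table and run the profiling function from above.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("data-profiling")
  .enableHiveSupport()            // needed so spark.table() can see Hive tables
  .getOrCreate()

val df = spark.table("my_db.my_table")   // placeholder database.table name
profileColumns(spark, df).show(false)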
