I am planning to use Spark to gather profiling information for each attribute (column) of a table.
Is it possible to compute the profiles below with Spark?
The profiles are as follows:
1) Missing values: the null count and null percentage for every column.
2) Unique values: the unique count and unique percentage for every column.
In a similar fashion, I plan to calculate other profiles as well, such as blank count, populated count, data type length, duplicate count, min, max, median and average.
Duplicate count: the number of times duplicate values occur in a given column.
Unique count: the number of distinct values in a column.
Populated count: the number of values in a column that are not null.
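To make these definitions concrete, here is a small plain-Python sketch of what I expect each number to be for a single column. It is not Spark code; the function name `profile_column` and the sample values are made up for illustration. Two points are my own assumptions: duplicate count is taken as the number of non-null values beyond the first occurrence of each distinct value, and percentages are taken relative to the total row count.

```python
from collections import Counter


def profile_column(values):
    """Compute basic profile metrics for one column given as a Python list.

    None stands in for a null value. Hypothetical helper, for illustration only.
    """
    total = len(values)
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)

    null_count = total - len(non_null)
    unique_count = len(counts)  # number of distinct non-null values
    # Assumption: every repeat after the first occurrence counts as a duplicate.
    duplicate_count = len(non_null) - unique_count
    # Blank: a non-null string that is empty or whitespace only.
    blank_count = sum(1 for v in non_null if isinstance(v, str) and v.strip() == "")

    def pct(n):
        # Assumption: percentages are relative to the total number of rows.
        return round(100.0 * n / total, 2) if total else 0.0

    return {
        "null_count": null_count,
        "null_pct": pct(null_count),
        "unique_count": unique_count,
        "unique_pct": pct(unique_count),
        "populated_count": len(non_null),
        "blank_count": blank_count,
        "duplicate_count": duplicate_count,
    }


sample = ["a", "b", "a", None, "", "a", None, "c"]
profile = profile_column(sample)
print(profile)
```

For this sample I would expect 2 nulls (25%), 4 distinct values (50%), 6 populated values, 1 blank and 2 duplicates. What I am looking for is the Spark equivalent of this, computed for every column of the table.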
Is it possible to implement all of the above profiles using Spark, and if so, how?
As far as I know, a Spark DataFrame has a describe() function, but it works only on numeric columns and not on string columns. So how can I calculate min and max for a string-type column?
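For string columns, the result I am after looks roughly like the plain-Python sketch below. Again this is not Spark code, and `profile_string_column` is a hypothetical name. I am assuming that min and max mean lexicographic order, that nulls are ignored, and that "data type length" means the shortest and longest value length. I have not verified whether Spark orders strings in exactly the same way.

```python
def profile_string_column(values):
    """Min, max and length range for a string column given as a Python list.

    None stands in for a null value and is ignored. Hypothetical helper.
    """
    non_null = [v for v in values if v is not None]
    if not non_null:
        # Nothing to compare when every value is null.
        return {"min": None, "max": None, "min_length": None, "max_length": None}

    lengths = [len(v) for v in non_null]
    return {
        # Python compares strings by code point, so uppercase sorts before lowercase.
        "min": min(non_null),
        "max": max(non_null),
        "min_length": min(lengths),
        "max_length": max(lengths),
    }


names = ["pear", None, "apple", "Fig", "banana"]
string_profile = profile_string_column(names)
print(string_profile)
```

Note that with this ordering the minimum is "Fig" rather than "apple", because the comparison is case sensitive.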
Please share any pointers you have on data profiling, or links to other resources if you know of any.
FYI: my source is a Hive table, on which we plan to implement the profiling.