I am trying to implement data profiling using Apache Spark. We have a requirement to compute profiles such as null count, blank count, populated count, minimum, and maximum at the attribute (column) level. For that, I developed Spark code using the Dataset API, as below.
val column = structField.name
val nullCount = sourceDf.filter(sourceDf(column).isNull).count()
val minimumValue = sourceDf.filter(sourceDf(column).isNull || sourceDf(column) === "").first().get(0).toString.toInt
I calculated the other profilers in a similar way using the Dataset API.
Note: the code pattern is iterative; it calculates every profiler at the column level, one by one.
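For completeness, here is a simplified sketch of the per-column pattern I am using (the helper name profileColumn and the println output are illustrative, not my exact code; the filter conditions mirror the snippet above, and min/max are taken over non-null, non-blank values only):

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.{col, min, max}

// Illustrative helper: profiles one column of an already-loaded DataFrame.
def profileColumn(sourceDf: DataFrame, column: String): Unit = {
  val totalCount     = sourceDf.count()
  val nullCount      = sourceDf.filter(sourceDf(column).isNull).count()
  val blankCount     = sourceDf.filter(sourceDf(column) === "").count()
  val populatedCount = totalCount - nullCount - blankCount

  // Minimum and maximum over the populated values only.
  val Row(minVal, maxVal) = sourceDf
    .filter(sourceDf(column).isNotNull && sourceDf(column) =!= "")
    .agg(min(col(column)), max(col(column)))
    .first()

  println(s"$column: null=$nullCount blank=$blankCount populated=$populatedCount min=$minVal max=$maxVal")
}

Each call starts from the same sourceDf reference, so my expectation is that every filter produces an independent result and the original dataset is untouched between profilers.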
As you can see in the code, I apply a filter on the dataset for the null-count calculation. In the code after that, I have observed unpredictable behaviour of the Dataset. Ideally, the data inside the dataset should be passed as-is to the other profiler calculations, but that is not the case here: the minimum, maximum, and blank count are being calculated on the filtered dataset.
I have no clue why the filtered data is being passed to the other profiler calculations instead of the original dataset.
Please help me understand this issue and its solution.