
Unexpected behaviour of dataset in spark


Hi All,


I am trying to implement data profiling using Apache Spark. We need to compute profiles such as null count, blank count, populated count, minimum, and maximum for each attribute (column). For that, I have written Spark code using the Dataset API, as below.
{structfield=>

val column =

val nullcount= sourceDf.filter(sourceDf(column).isNull).count()

val minimumValue = sourceDf.filter(sourceDf(column).isNull || sourceDf(column) === "").first().get(0).toInt



I have calculated the other profiles in a similar way using the Dataset API.




Note: The code is iterative; it calculates each profile one by one, column by column.
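For reference, here is a minimal sketch of how the per-column profiles described above might be computed in one aggregation over the unfiltered DataFrame, so that no profile depends on a previously filtered result. This is not the original code: `ColumnProfile`, `ProfileSketch`, `profileColumn`, and the sample `id`/`amount` data are hypothetical names introduced for illustration, and "blank" is assumed to mean an empty or whitespace-only string.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{count, lit, max, min, trim, when}

// Hypothetical container for the results of one column.
case class ColumnProfile(column: String, nullCount: Long, blankCount: Long,
                         populatedCount: Long, minimum: Option[Any], maximum: Option[Any])

object ProfileSketch {
  // Computes every profile for one column with a single agg() call on the
  // DataFrame that is passed in; no intermediate filtered Dataset is kept.
  def profileColumn(df: DataFrame, column: String): ColumnProfile = {
    val raw = df(column)
    val asText = raw.cast("string")
    // Assumption: "blank" means an empty or whitespace-only value.
    val isBlank = asText.isNotNull && (trim(asText) === "")
    val isPopulated = asText.isNotNull && (trim(asText) =!= "")
    val row = df.agg(
      count(when(raw.isNull, lit(1))).as("nulls"),       // count() skips the nulls produced by when()
      count(when(isBlank, lit(1))).as("blanks"),
      count(when(isPopulated, lit(1))).as("populated"),
      min(when(isPopulated, raw)).as("minimum"),         // ordering follows the column's own type,
      max(when(isPopulated, raw)).as("maximum")          // so string columns compare lexicographically
    ).first()
    ColumnProfile(column, row.getLong(0), row.getLong(1), row.getLong(2),
      Option(row.get(3)), Option(row.get(4)))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("profile-sketch").getOrCreate()
    import spark.implicits._
    // Hypothetical sample data: one blank and one null value in "amount".
    val sourceDf = Seq((1, "10"), (2, ""), (3, null), (4, "3")).toDF("id", "amount")
    sourceDf.columns.foreach(name => println(profileColumn(sourceDf, name)))
    spark.stop()
  }
}
```

If the minimum and maximum should be numeric for a string column, the column would need to be cast to a numeric type before `min` and `max` are applied.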



As the nullcount line shows, a filter is applied to the Dataset when calculating the null count. In the code that follows it, I have observed unpredictable Dataset behaviour. Ideally, the data in the Dataset should be passed unchanged to the other profile calculations, but that is not the case here. The minimum, maximum, and blank count are being calculated on the filtered Dataset.


I have no idea why the filtered data, rather than the original Dataset, is being passed to the other profile calculations.
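One way to narrow this down is a small check like the sketch below. It relies on the fact that `filter` returns a new Dataset and does not modify the Dataset it is called on, so the row count of `sourceDf` should be the same before and after the null count is taken. The object and method names and the sample data are hypothetical. If the check passes in the real job as well, a possible cause (only an assumption, since the full code is not shown) is that the filtered result is being assigned back to a shared variable somewhere in the loop.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object FilterCheckSketch {
  // Returns (rows before, null rows, rows after) for the given column.
  // The filter result is used only for its count; sourceDf is never reassigned.
  def countsAroundFilter(sourceDf: DataFrame, column: String): (Long, Long, Long) = {
    val before = sourceDf.count()
    val nulls = sourceDf.filter(sourceDf(column).isNull).count()
    val after = sourceDf.count()
    (before, nulls, after)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("filter-check").getOrCreate()
    import spark.implicits._
    // Hypothetical sample data with one null in "amount".
    val sourceDf = Seq((1, "10"), (2, ""), (3, null), (4, "3")).toDF("id", "amount")

    val (before, nulls, after) = countsAroundFilter(sourceDf, "amount")
    println(s"before=$before nulls=$nulls after=$after")
    assert(before == after, "filter() must not change the Dataset it was called on")

    // By contrast, reassigning a var replaces the reference, and every later
    // calculation on `working` then sees only the filtered rows.
    var working = sourceDf
    working = working.filter(working("amount").isNull)
    println(s"rows visible through the reassigned var: ${working.count()}")

    spark.stop()
  }
}
```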


Please help me understand this issue and its solution.

Thank you in advance