
Unexpected behaviour of dataset in spark

Hi All,

 

I am trying to do data profiling using Apache Spark. We have a requirement to compute profiles such as null count, blank count, populated count, minimum, and maximum at the attribute (column) level. For that, I have developed Spark code using the Dataset API, as below.

sourceDf.schema.map { structfield =>

  val column = structfield.name

  val nullCount = sourceDf.filter(sourceDf(column).isNull).count()

  val minimumValue = sourceDf.filter(sourceDf(column).isNull || sourceDf(column) === "").first().get(0).toString.toInt

  // ... in a similar way I have calculated the other profiles using the Dataset API

}
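To make the intent of the loop above concrete, here is a minimal, self-contained sketch of the same per-column profiling logic in plain Scala (no Spark), where a column is modelled as a sequence of Option[String] values — None standing in for null and Some("") for blank. The sample values are made up for illustration:

```scala
// Hypothetical sample column: None = null, Some("") = blank
val values: Seq[Option[String]] = Seq(Some("10"), None, Some(""), Some("3"), Some("25"))

val nullCount      = values.count(_.isEmpty)            // rows that are null
val blankCount     = values.count(_.contains(""))       // rows that are blank
val populatedCount = values.count(_.exists(_.nonEmpty)) // rows with a real value

// Min/max computed over the populated numeric values only
val numeric      = values.flatMap(v => v.filter(_.nonEmpty)).map(_.toInt)
val minimumValue = numeric.min
val maximumValue = numeric.max
```

Each profile is an independent pass over the same immutable sequence, which is the behaviour I also expect from the Dataset version.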

 

Note: the code pattern is iterative; it calculates every profile at column level, one by one.

 

Question:

You can see in the code that I applied a filter on the dataset for the null-count calculation. In the code after that, I observed unpredictable behaviour of the Dataset. Ideally, the data inside the dataset should be passed as-is to the other profile calculations, but that is not the case here: the minimum, maximum, and blank counts are being calculated on the filtered dataset.
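My understanding is that a Dataset, like an ordinary Scala collection, is immutable: filter should return a new Dataset and leave the one it was called on untouched. A plain-Scala sketch of the contract I expect (the values here are made up):

```scala
// Hypothetical stand-in for a Dataset column: None = null row
val original = Seq(Some("a"), None, Some("b"))

// filter returns a NEW collection; 'original' is not modified
val nullsOnly = original.filter(_.isEmpty)

val originalSize = original.size // still 3: later computations see all rows
val filteredSize = nullsOnly.size // 1: only the null row
```

That is why I am surprised the other profile calculations appear to run on the filtered data.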

 

I have no clue why the filtered data is being passed to the other profile calculations instead of the original dataset.

 

Please help me understand this issue and its solution.

Thank you in advance.

 

 
