
How does Spark MLlib handle NaN data?

New Contributor



I am testing some MLlib algorithms and was wondering how Spark internally handles null or NaN values. Let me explain:


In both the Java and Scala APIs, most MLlib algorithms take a JavaRDD&lt;Vector&gt; as an argument. There are two implementations of Vector: dense and sparse. DenseVector takes an array of doubles for its initialization, so you can initialize a vector with Double.NaN values, but I have no idea how Spark evaluates these NaNs inside an algorithm.


For example, I want to calculate the correlation coefficient matrix of several variables, but my data set has some null entries. I can initialize the vectors with Double.NaN where values are null, but I'm unsure whether the results will be meaningful.
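For what it's worth, here is a minimal sketch (plain Java, no Spark) of why a NaN entry is worrying: a single Double.NaN poisons every sum in a hand-rolled Pearson correlation. The class and method names are made up for illustration, not part of any Spark API:

```java
// Sketch: how one Double.NaN propagates through a Pearson correlation.
public class NaNCorrelation {
    public static double pearson(double[] x, double[] y) {
        int n = x.length;
        double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx  += x[i];        sy  += y[i];
            sxx += x[i] * x[i]; syy += y[i] * y[i];
            sxy += x[i] * y[i];
        }
        double cov = sxy - sx * sy / n;   // n * covariance
        double vx  = sxx - sx * sx / n;   // n * variance of x
        double vy  = syy - sy * sy / n;   // n * variance of y
        return cov / Math.sqrt(vx * vy);
    }

    public static void main(String[] args) {
        double[] clean = {1.0, 2.0, 3.0};
        double[] dirty = {1.0, Double.NaN, 3.0};
        System.out.println(pearson(clean, clean)); // 1.0
        System.out.println(pearson(clean, dirty)); // NaN: the NaN taints every sum
    }
}
```

Since NaN compares false with everything and propagates through arithmetic, any correlation involving a NaN column would come out as NaN rather than raising an error.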


Thank you in advance!





Re: How does Spark MLlib handle NaN data?

Master Collaborator

There is not one answer for this, as the various implementations are fairly different. You'd have to experiment and/or trace the code to be sure. I would assume you'd get NaN correlations in this case.
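If NaN correlations are the likely outcome, one common workaround is to drop rows containing NaN before handing the data to MLlib. A hedged plain-Java sketch of that filter (the class and method names are invented; on a real JavaRDD&lt;Vector&gt; the same predicate could be applied with rdd.filter(...)):

```java
import java.util.*;
import java.util.stream.*;

// Sketch: drop any row that contains a NaN before computing statistics.
public class DropNaNRows {
    public static List<double[]> dropNaNRows(List<double[]> rows) {
        return rows.stream()
                   .filter(r -> Arrays.stream(r).noneMatch(Double::isNaN))
                   .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<double[]> rows = Arrays.asList(
                new double[]{1.0, 2.0},
                new double[]{Double.NaN, 3.0},  // this row is discarded
                new double[]{4.0, 5.0});
        System.out.println(dropNaNRows(rows).size()); // 2
    }
}
```

Whether dropping rows (versus imputing values) is appropriate depends on how much data you lose and whether the missingness is random, so treat this as pre-processing to consider, not a universal fix.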


Re: How does Spark MLlib handle NaN data?

New Contributor

Hi Ivan,


I'm interested in the same question.

How did you get this sorted out?