New Contributor

How does Spark MLlib handle NaN data?



I am testing some MLlib algorithms and was wondering how Spark internally handles null or NaN values. Let me explain:


In both the Java and Scala APIs, most MLlib algorithms take a JavaRDD&lt;Vector&gt; as an argument. There are two implementations of Vector: dense and sparse. DenseVector is initialized from an array of doubles, so nothing stops you from filling the vector with Double.NaN values, but I have no idea how Spark evaluates these NaNs inside an algorithm.
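To make the concern concrete: since a DenseVector is backed by a plain double[], a NaN can be stored without complaint, and ordinary IEEE 754 arithmetic will then silently propagate it. A minimal sketch in plain Java (no Spark dependency; the array is just a stand-in for a DenseVector's backing storage):

```java
public class NaNPropagation {
    public static void main(String[] args) {
        // Stand-in for a DenseVector's backing array, with one missing entry.
        double[] values = {1.0, 2.0, Double.NaN, 4.0};

        double sum = 0.0;
        for (double v : values) {
            sum += v;  // a single NaN poisons the whole sum
        }

        System.out.println("sum = " + sum);                  // NaN
        System.out.println("isNaN = " + Double.isNaN(sum));  // true

        // Note: NaN is not equal to itself, so detect it with Double.isNaN,
        // never with ==.
        System.out.println("NaN == NaN: " + (Double.NaN == Double.NaN)); // false
    }
}
```

So any algorithm that simply sums or multiplies the vector's entries, without explicitly checking for NaN first, will produce NaN results.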


For example, I want to calculate the correlation coefficient matrix of several variables, but my data set has some null entries. I can initialize the Vectors with Double.NaN in place of the null values, but I'm wary of the results.


Thank you in advance!




Cloudera Employee

Re: How does Spark MLlib handle NaN data?

There is no single answer here, as the various implementations differ quite a bit. You'd have to try it and see, and/or trace the code. My assumption is that you would get NaN correlations in this case.
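That assumption is easy to check for Pearson correlation specifically, since every term of the formula touches every value. A minimal Pearson implementation in plain Java (not Spark's actual code, just an illustration of why one NaN entry yields a NaN coefficient):

```java
public class NaNCorrelation {
    // Textbook single-pass Pearson correlation of two equal-length series.
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx  += x[i];        sy  += y[i];
            sxx += x[i] * x[i]; syy += y[i] * y[i];
            sxy += x[i] * y[i];
        }
        double cov = sxy - sx * sy / n;   // n * covariance
        double vx  = sxx - sx * sx / n;   // n * variance of x
        double vy  = syy - sy * sy / n;   // n * variance of y
        return cov / Math.sqrt(vx * vy);
    }

    public static void main(String[] args) {
        double[] other = {2, 4, 6, 8};

        // Clean data: perfectly correlated.
        System.out.println(pearson(new double[]{1, 2, 3, 4}, other));           // 1.0

        // One NaN entry: every accumulator becomes NaN, so the result is NaN.
        System.out.println(pearson(new double[]{1, 2, Double.NaN, 4}, other));  // NaN
    }
}
```

In other words, nothing fails loudly; the NaN just flows through the sums and surfaces in the output, which is why it's worth filtering or imputing missing values before calling the statistics routines.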

New Contributor

Re: How does Spark MLlib handle NaN data?

Hi Ivan,


I'm interested in the same question.

How did you get this sorted out?