New Contributor
Posts: 3
Registered: 05-13-2015

How does Spark MLlib handle NaN data?

Hello!

I am testing some MLlib algorithms and was wondering how Spark internally handles null or NaN values. Let me explain:

In both the Java and Scala APIs, most MLlib algorithms take a JavaRDD<Vector> as an argument. There are two Vector implementations, dense and sparse. DenseVector is initialized from an array of doubles, so nothing stops you from filling it with Double.NaN values, but I have no idea how Spark evaluates those NaN entries inside an algorithm.
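
For reference, this is roughly what I mean, in Scala (the values are just placeholders; as far as I can tell, Vectors.dense accepts NaN without complaint):

import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Nothing stops me from building a dense vector that contains a NaN entry.
val v: Vector = Vectors.dense(1.0, Double.NaN, 3.0)
println(v)  // prints [1.0,NaN,3.0]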

For example, I want to calculate the correlation coefficient matrix of several variables, but my data set has some null entries. I could initialize the Vectors with Double.NaN wherever a value is null, but I'm worried about what that does to the results.
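
Concretely, this is the kind of call I want to make (just a sketch with toy values, assuming sc is the usual spark-shell SparkContext; I don't know what the resulting matrix will contain once a NaN is present):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// Toy data set: three observations of three variables, with one missing value encoded as NaN.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, Double.NaN, 6.0),
  Vectors.dense(7.0, 8.0, 9.0)
))

// Pearson correlation matrix over the columns; how is the NaN entry treated here?
val corrMatrix = Statistics.corr(rows, "pearson")
println(corrMatrix)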

Thank you in advance!

Cloudera Employee
Posts: 481
Registered: 08-11-2014

Re: How does Spark MLlib handle NaN data?

There is not one answer to this, as the various implementations differ quite a bit. I think you'd have to try it and see, and/or trace the code. My guess is that you would get NaN correlations in this case.
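
If NaN results are what you see, the usual workaround is to clean the data before calling the statistics routines, for example by dropping (or imputing) any observation that contains a NaN. A rough Scala sketch, assuming your observations are an RDD[Vector] (dropNaNRows is just an illustrative helper name):

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Drop every observation that contains at least one NaN entry.
// Imputing a column mean or median instead would keep more data, depending on the use case.
def dropNaNRows(rows: RDD[Vector]): RDD[Vector] =
  rows.filter(v => !v.toArray.exists(_.isNaN))

You can then run Statistics.corr on the filtered RDD and compare it against the result on the raw data.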

New Contributor
Posts: 2
Registered: 11-19-2015

Re: How does Spark MLlib handle NaN data?

Hi Ivan,

I'm interested in the same question.

How did you get this sorted out?

mathieu