
How does Spark MLlib handle NaN data?

New Contributor

Hello!


I am testing some MLlib algorithms and was wondering how Spark internally handles null or NaN values. Let me explain:


In both the Java and Scala APIs, most MLlib algorithms take a JavaRDD&lt;Vector&gt; as an argument. There are two implementations of Vector, dense and sparse. DenseVector is initialized from an array of doubles, which gives you the option of filling the vector with Double.NaN values, but I have no idea how Spark evaluates these NaNs inside an algorithm.
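For reference, here is roughly how the two kinds of vectors are built (a minimal Scala sketch; the values are made up for illustration):

import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Dense vector: backed by an array of doubles, so a missing value
// can be represented with Double.NaN
val withNaN: Vector = Vectors.dense(1.0, Double.NaN, 3.0)

// Sparse vector: size 3, with non-zero values at indices 0 and 2
val sparse: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))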


For example, I want to compute the correlation coefficient matrix of several variables, but my data set has some null entries. I can put Double.NaN in the vectors where values are null, but I'm worried about the results.
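Concretely, this is the kind of computation I mean (a minimal sketch, assuming an existing SparkContext called sc; the data is made up):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// Rows of observations; Double.NaN marks the missing second value
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0),
  Vectors.dense(2.0, Double.NaN),
  Vectors.dense(3.0, 6.0)
))

// Pearson correlation matrix across the columns
val corrMatrix = Statistics.corr(rows, "pearson")
println(corrMatrix)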


Thank you in advance!


2 REPLIES

Re: How does Spark MLlib handle NaN data?

Master Collaborator

There isn't one answer to this, as the various implementations differ quite a bit. I think you'd have to try it and see, and/or trace the code. I'd assume you would get NaN correlations in this case.
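For example, something like the following would let you try it and see, along with the usual workaround of filtering out rows that contain NaN first (a sketch, assuming an existing RDD[Vector] of observations called rows):

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

def compareCorrelations(rows: RDD[Vector]): Unit = {
  // With NaN left in: NaN would propagate through the sums used by
  // the Pearson computation, so affected entries come out as NaN
  println(Statistics.corr(rows, "pearson"))

  // Workaround: drop every row containing a NaN before computing
  // statistics, so no NaN reaches the algorithm
  val clean = rows.filter(v => !v.toArray.exists(_.isNaN))
  println(Statistics.corr(clean, "pearson"))
}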

Re: How does Spark MLlib handle NaN data?

New Contributor

Hi Ivan,


I'm interested in the same question.

How did you get this sorted out?


mathieu