05-26-2015 07:47 AM
I am testing some MLlib algorithms and was wondering how Spark internally handles null or NaN values. Let me explain:
In both the Java and Scala APIs, most MLlib algorithms take an RDD of Vector (a JavaRDD<Vector> in the Java API) as an argument. There are two different Vector implementations, dense and sparse. DenseVector is initialized from an array of doubles, so you can put Double.NaN values into it, but I have no idea how Spark evaluates those NaN values inside an algorithm.
For example, I want to calculate the correlation coefficient matrix of several variables, but my data set has some null entries. I could initialize the Vectors with Double.NaN where the values are null, but I'm not sure what that would do to the results.
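To make it concrete, this is roughly how I would build the input (just a sketch from the spark-shell, where `sc` is already defined; the values are made up, and in practice the rows would come from my data set with nulls replaced by Double.NaN):

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Each Vector holds one observation of three variables; the missing
// (null) entry in the second row is encoded as Double.NaN.
val observations: RDD[Vector] = sc.parallelize(Seq(
  Vectors.dense(1.2, 5.0, 0.3),
  Vectors.dense(2.1, Double.NaN, 0.7),
  Vectors.dense(3.5, 4.8, 1.1)
))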
Thank you in advance!
05-26-2015 08:10 AM
There isn't a single answer to this, as the various implementations differ quite a bit. I think you'd have to try it and see, and/or trace the code. My assumption is that you would get NaN correlations in this case.
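For example, a quick probe from the spark-shell (assuming an RDD[Vector] like the one you sketched, with NaN standing in for missing values) might look like this; inspecting the printed matrix would tell you whether NaN propagates for your version of Spark:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// Small test set: the second variable has a missing value encoded as NaN.
val data = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(2.0, Double.NaN, 5.0),
  Vectors.dense(3.0, 6.0, 7.0)
))

// Pearson correlation matrix over the columns. NaN generally propagates
// through the sums used in the computation, so cells involving the second
// column would likely come back as NaN -- but check the output to be sure.
println(Statistics.corr(data, "pearson"))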