Support Questions

laia_subirats · ‎08-17-2016

Hi,

I ran the following code to obtain a description of the PCA:

import org.apache.spark.mllib.feature.PCA
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

val unparseddata = sc.textFile("hdfs:///tmp/epidemiological10.csv")
val data = unparseddata.map { line =>
  val parts = line.split(',').map(_.toDouble)
  LabeledPoint(parts.last, Vectors.dense(parts.slice(0, parts.length)))
}

val pca = new PCA(5).fit(data.map(_.features))
val projected = data.map(p => p.copy(features = pca.transform(p.features)))

val collect = projected.collect()
println("Projected vector of principal component:")
collect.foreach { vector => println(vector)}

and I obtained the following result:

(160.0,[-226.2602388674248,-28.5763504459316,-167.30588000588938,-169.403316284169,23.09294762015914])

(176.0,[-248.89483793051159,-21.97201619037966,-193.69749510702238,-108.81814406079761,20.90854574732602])

(179.0,[-253.1354367540671,-29.972928370070743,-244.2610705303066,-129.17921788251297,20.090356540571392])

(172.7,[-244.22812858428057,-21.1460977635957,-179.6413565398707,-106.6403738598213,23.450082340280513])

...

I assume the values in brackets are the first five components of the PCA, but I would like to know what the numbers I put in bold (the first value on each line, e.g. 160.0) mean.

Thanks in advance,

Laia

sball · ‎08-18-2016

The bold number is the label from the LabeledPoint. The map that creates projected copies each LabeledPoint, replacing the features member with the principal components but leaving the label untouched. Hence each output line is a (label, features) pair, where features is your PCA result and label is the original label.
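
To illustrate the copy behaviour described above, here is a minimal sketch in plain Scala that runs without Spark. Point, CopyDemo and withFeatures are hypothetical names made up for this illustration; Point is only a stand-in for Spark's LabeledPoint, and the feature values are shortened versions of the first output line.

```scala
// Hypothetical stand-in for Spark's LabeledPoint, so the sketch runs without Spark.
final case class Point(label: Double, features: Vector[Double])

object CopyDemo {
  // Mirrors p.copy(features = pca.transform(p.features)):
  // only the named field is replaced, every other field is carried over.
  def withFeatures(p: Point, newFeatures: Vector[Double]): Point =
    p.copy(features = newFeatures)

  def main(args: Array[String]): Unit = {
    val original  = Point(160.0, Vector(1.0, 2.0, 3.0))
    val projected = withFeatures(original, Vector(-226.26, -28.58))

    println(projected.label)    // the original label, untouched by copy
    println(projected.features) // the replaced features
  }
}
```

If only the principal components are wanted, mapping over the features in the original code, for example projected.map(_.features), should drop the label from the printed output.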

laia_subirats · ‎08-18-2016

Thank you.
