Description of the PCA

Contributor

Hi,

I executed the following code to obtain a description of the PCA

import org.apache.spark.mllib.feature.PCA
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

val unparseddata = sc.textFile("hdfs:///tmp/epidemiological10.csv")
val data = unparseddata.map { line =>
  val parts = line.split(',').map(_.toDouble)
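  // The last column is the label. Note that slice(0, parts.length) also keeps
  // that column among the features; slice(0, parts.length - 1) would exclude it.
  LabeledPoint(parts.last, Vectors.dense(parts.slice(0, parts.length)))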
}

val pca = new PCA(5).fit(data.map(_.features))
val projected = data.map(p => p.copy(features = pca.transform(p.features)))

val collect = projected.collect()
println("Projected vector of principal component:")
collect.foreach { vector => println(vector)}

and I obtained the following result:

(160.0,[-226.2602388674248,-28.5763504459316,-167.30588000588938,-169.403316284169,23.09294762015914])

(176.0,[-248.89483793051159,-21.97201619037966,-193.69749510702238,-108.81814406079761,20.90854574732602])

(179.0,[-253.1354367540671,-29.972928370070743,-244.2610705303066,-129.17921788251297,20.090356540571392])

(172.7,[-244.22812858428057,-21.1460977635957,-179.6413565398707,-106.6403738598213,23.450082340280513])

...

I assume that the values in brackets are the first five components of the PCA, but I would like to know what the numbers I put in bold mean (the first value in each line, e.g. 160.0).

Thanks in advance,

Laia

1 ACCEPTED SOLUTION

Guru

The bold number is the label from the LabeledPoint. The map that creates projected makes a copy of each LabeledPoint, replacing the features member with the principal components but leaving the label untouched. Hence each output line is a (label, features) pair, where features is your PCA result and label is the original label.
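
To illustrate why the label survives, here is a minimal sketch in plain Scala that runs without Spark. Point and project are hypothetical stand-ins I made up for LabeledPoint and pca.transform (project just truncates the values, it is not a real PCA), and the sample numbers are invented. The only thing it is meant to show is how copy(features = ...) on a case class behaves.

```scala
object LabelDemo {
  // Hypothetical stand-in for org.apache.spark.mllib.regression.LabeledPoint,
  // which is likewise a case class with a label and a features field.
  case class Point(label: Double, features: Seq[Double])

  // Hypothetical placeholder for pca.transform: keeps only the first k values.
  // This is NOT a PCA projection; it only changes the features.
  def project(features: Seq[Double], k: Int): Seq[Double] = features.take(k)

  // Invented sample row: three measurements plus the label value.
  val original: Point = Point(160.0, Seq(1.0, 2.0, 3.0, 160.0))

  // Same pattern as in the question: copy the point, replace only the features.
  val projected: Point = original.copy(features = project(original.features, 2))

  def main(args: Array[String]): Unit = {
    println(projected)          // label first, then the replaced features
    println(projected.features) // the replaced features on their own
  }
}
```

If you only want the components without the label, something like projected.map(_.features) in your original code should give you just the vectors.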

Contributor

Thank you.