Created 08-17-2016 07:06 AM
Hi,
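I executed the following code to obtain a description of the PCA:

import org.apache.spark.mllib.feature.PCA
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Read the CSV file as an RDD of text lines
val unparseddata = sc.textFile("hdfs:///tmp/epidemiological10.csv")

// Parse each line into a LabeledPoint; the last column is used as the label.
// Note: slice(0, parts.length) keeps every column, including the label, as a feature.
val data = unparseddata.map { line =>
  val parts = line.split(',').map(_.toDouble)
  LabeledPoint(parts.last, Vectors.dense(parts.slice(0, parts.length)))
}

// Fit a PCA model with 5 principal components on the feature vectors
val pca = new PCA(5).fit(data.map(_.features))

// Replace each point's features with their projection; the label is kept
val projected = data.map(p => p.copy(features = pca.transform(p.features)))

val collect = projected.collect()
println("Projected vector of principal component:")
collect.foreach { vector => println(vector) }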
and I obtained the following result:
(160.0,[-226.2602388674248,-28.5763504459316,-167.30588000588938,-169.403316284169,23.09294762015914])
(176.0,[-248.89483793051159,-21.97201619037966,-193.69749510702238,-108.81814406079761,20.90854574732602])
(179.0,[-253.1354367540671,-29.972928370070743,-244.2610705303066,-129.17921788251297,20.090356540571392])
(172.7,[-244.22812858428057,-21.1460977635957,-179.6413565398707,-106.6403738598213,23.450082340280513])
...
I assume that the brackets contain the first five principal components, but I would like to know what the numbers I put in bold (the first value on each line, e.g. 160.0) mean.
Thanks in advance,
Laia
Created 08-18-2016 09:44 AM
The bold number is the label from the LabeledPoint. The map that creates projected makes a copy of each LabeledPoint, replacing the features member with the principal components but leaving the label untouched. Hence, each output line is a (label, features) pair, where features is your PCA result and label is the original label.
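To make the copy behaviour concrete, here is a minimal sketch that runs in a plain Scala REPL without Spark. The names Point and project are hypothetical stand-ins for LabeledPoint and pca.transform (project is not a real PCA); only the way copy treats the untouched field is the point.

```scala
// Hypothetical stand-in for Spark's LabeledPoint: a case class
// holding a label and a feature vector.
case class Point(label: Double, features: Vector[Double])

// Hypothetical stand-in for pca.transform: it simply keeps the first
// two features. This is NOT a real PCA, only a placeholder projection.
def project(features: Vector[Double]): Vector[Double] = features.take(2)

val original = Point(160.0, Vector(1.0, 2.0, 3.0))

// copy replaces only the named field; label is carried over unchanged.
val projected = original.copy(features = project(original.features))

println(projected)           // label first, then the new features
println(projected.features)  // features only, without the label
```

In the original code, the same idea means that projected.map(_.features) would give only the projected vectors, without the label in front.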
Created 08-18-2016 09:56 AM
Thank you.