Member since: 08-01-2016
Posts: 12
Kudos Received: 3
Solutions: 0
10-18-2016
04:49 PM
1 Kudo
Hello,
I would like to calculate the recall of each of the two classes in my dataset and perform cross-validation. With the following code, however, I only obtain the weighted recall (a single recall value):

val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("recall") // available metrics: "f1", "precision", "recall", "weightedPrecision", "weightedRecall"

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(nFolds)

val model = cv.fit(trainingData) // trainingData: DataFrame
val predictions = model.transform(testData)
predictions.show()

val recall = evaluator.evaluate(predictions)
println("Recall: " + recall)

Do you know what code would compute the recall for each class in the cross-validation?
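For reference, one way to get per-class recall is the RDD-based MulticlassMetrics API rather than the DataFrame evaluator. A minimal sketch, assuming the "prediction" and "label" columns produced by the code above (variable names are illustrative):

import org.apache.spark.mllib.evaluation.MulticlassMetrics

// Pair up predicted and true labels as an RDD[(Double, Double)].
val predictionAndLabels = predictions
  .select("prediction", "label")
  .rdd
  .map(row => (row.getDouble(0), row.getDouble(1)))

val metrics = new MulticlassMetrics(predictionAndLabels)

// Recall (and, similarly, precision / fMeasure) for every class label.
metrics.labels.foreach { l =>
  println(s"Recall for class $l: ${metrics.recall(l)}")
}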
Thanks,
Laia
09-21-2016
08:27 AM
Hello,
Now I can perform the hashing trick properly. I have tried:

val documents: RDD[Seq[String]] = sc.textFile("hdfs:///tmp/new_cromosoma12v2.csv").map(_.split(",").toSeq)
val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documents)
val splits = tfidfIgnore.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))
val numClasses = 3
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 10
val featureSubsetStrategy = "auto"
val impurity = "gini"
val maxDepth = 8
val maxBins = 32
val trainingData2 = LabeledPoint(0.0, trainingData.collect())
val model = RandomForest.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)

But now I have the following error:

found   : Array[org.apache.spark.mllib.linalg.Vector]
required: org.apache.spark.mllib.linalg.Vector
    val trainingData2 = LabeledPoint(0.0, trainingData.collect())

Do you know what I can do? Thanks, Laia
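For what it's worth, LabeledPoint takes a single Vector, not an Array[Vector], which is what collect() produces. A minimal sketch of one alternative, keeping everything as an RDD (the name labeledTrainingData is only illustrative, and the 0.0 label is a placeholder that would still need to be replaced by the real class labels):

// Sketch: wrap each hashed vector in its own LabeledPoint instead of collecting.
// Note: 0.0 is a placeholder label; RandomForest needs the true label per row.
val labeledTrainingData = trainingData.map(v => LabeledPoint(0.0, v))

val model = RandomForest.trainClassifier(labeledTrainingData, numClasses,
  categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)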
09-19-2016
10:09 AM
Hello,
I am dealing with genetic data and I would like to perform the hashing trick to reduce the number of features. I have written the following code:

val unparseddata = sc.textFile("hdfs:///tmp/new_cromosoma12v2.csv")
val data = unparseddata.map { line =>
val parts = line.split(',').map(_.toDouble)
LabeledPoint(parts.last%2, Vectors.dense(parts.slice(0, parts.length - 1)))
}
val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(data)
tf.cache()
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)
val idfIgnore = new IDF(minDocFreq = 2).fit(tf)
val tfidfIgnore: RDD[Vector] = idfIgnore.transform(tf)
val splits = tfidfIgnore.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

But I have this error:

<console>:90: error: inferred type arguments [org.apache.spark.mllib.regression.LabeledPoint] do not conform to method transform's type parameter bounds [D <: Iterable[_]]
    val tf: RDD[Vector] = hashingTF.transform(data)

Do you know how I can solve this problem? Thank you, Laia
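For context, HashingTF.transform is typed over Iterable, so it hashes a sequence of terms rather than a LabeledPoint. A minimal sketch of one way to keep the labels while hashing the raw string fields of each row (the numFeatures value and variable names here are only illustrative):

import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint

val hashingTF = new HashingTF(numFeatures = 1 << 18) // illustrative size

// Hash the raw fields of each line into a fixed-size vector, keeping the label.
val hashedData = unparseddata.map { line =>
  val parts = line.split(',')
  val label = parts.last.toDouble % 2
  val terms = parts.dropRight(1).toSeq // Iterable[_], as transform expects
  LabeledPoint(label, hashingTF.transform(terms))
}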
09-07-2016
07:42 AM
Hello Marco,
Dan already answered the question here: https://community.hortonworks.com/answers/55111/view.html
Thanks, Laia
09-07-2016
07:40 AM
Hello Dan,
Thank you very much for the help, it worked! In addition, I would like to get the recall, precision, and F1 score, and I would also like to see the random forest trees. Do you know how I can do that? I have 2 imbalanced classes, so I would like these metrics for each class... Best regards, Laia
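For reference, per-class recall/precision/F1 can be computed with org.apache.spark.mllib.evaluation.MulticlassMetrics (see the sketch in the 10-18-2016 post above), and the fitted trees can be printed from the cross-validated model. A rough sketch, assuming `model` is the CrossValidatorModel and the random forest is the last pipeline stage:

import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.classification.RandomForestClassificationModel

// Pull the fitted pipeline chosen by cross-validation, then its forest stage.
val bestPipeline = model.bestModel.asInstanceOf[PipelineModel]
val rfModel = bestPipeline.stages.last.asInstanceOf[RandomForestClassificationModel]

// Full textual description of every tree in the forest.
println(rfModel.toDebugString)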
09-06-2016
07:45 AM
1 Kudo
Hello,
I have the following error: java.lang.IllegalArgumentException: Field "label_idx" does not exist. It occurs after executing this code:

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.sql.types._
import sqlContext.implicits._
import org.apache.spark.ml.attribute.NominalAttribute
import org.apache.spark.ml.feature.StringIndexer

val unparseddata = sc.textFile("hdfs:///tmp/epidemiological16.csv")
val data = unparseddata.map { line =>
  val parts = line.split(',').map(_.toDouble)
  LabeledPoint(parts.last % 2, Vectors.dense(parts.slice(0, parts.length - 1)))
}
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData2, testData2) = (splits(0), splits(1))
val trainingData = trainingData2.toDF

val nFolds: Int = 10
val NumTrees: Int = 3

val rf = new RandomForestClassifier()
  .setNumTrees(NumTrees)
  .setFeaturesCol("features")

val indexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("label_idx")
  .fit(trainingData)

rf.setLabelCol("label_idx").fit(indexer.transform(trainingData))

val pipeline = new Pipeline().setStages(Array(rf))
val paramGrid = new ParamGridBuilder().build()

val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(nFolds)

val model = cv.fit(trainingData)

Do you know where the problem could be? Thanks, Laia
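A possible sketch of a fix, under the assumption that the error comes from the StringIndexer not being part of the pipeline that CrossValidator fits (so "label_idx" never exists on the cross-validation folds): put the indexer into the pipeline stages and point both the classifier and the evaluator at the indexed label column.

// Sketch only: include the StringIndexer as a pipeline stage so that
// "label_idx" is created inside every cross-validation fold.
val indexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("label_idx")

val rf = new RandomForestClassifier()
  .setNumTrees(NumTrees)
  .setFeaturesCol("features")
  .setLabelCol("label_idx")

val pipeline = new Pipeline().setStages(Array(indexer, rf))

val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label_idx")
  .setPredictionCol("prediction")

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(nFolds)

val model = cv.fit(trainingData)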
09-05-2016
12:46 PM
Hello,
I have the following error: java.lang.IllegalArgumentException: Field "label_idx" does not exist. It occurs after executing this code:

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.sql.types._
import sqlContext.implicits._
import org.apache.spark.ml.attribute.NominalAttribute
import org.apache.spark.ml.feature.StringIndexer

val unparseddata = sc.textFile("hdfs:///tmp/epidemiological16.csv")
val data = unparseddata.map { line =>
  val parts = line.split(',').map(_.toDouble)
  LabeledPoint(parts.last % 2, Vectors.dense(parts.slice(0, parts.length - 1)))
}
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData2, testData2) = (splits(0), splits(1))
val trainingData = trainingData2.toDF

val nFolds: Int = 10
val NumTrees: Int = 3

val rf = new RandomForestClassifier()
  .setNumTrees(NumTrees)
  .setFeaturesCol("features")

val indexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("label_idx")
  .fit(trainingData)

rf.setLabelCol("label_idx").fit(indexer.transform(trainingData))

val pipeline = new Pipeline().setStages(Array(rf))
val paramGrid = new ParamGridBuilder().build()

val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(nFolds)

val model = cv.fit(trainingData)

Do you know where the problem could be? Thanks, Laia
09-05-2016
09:32 AM
Hello,
I am using this MLlib random forest code in Scala. I wonder whether it uses 10-fold cross-validation. If not, I would like to know how to do that in Scala. Thanks, Laia
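For the RDD-based mllib API, one option is MLUtils.kFold, which splits an RDD into (training, validation) pairs. A minimal sketch, assuming `data` is an RDD[LabeledPoint] and the usual random forest parameters (numClasses, numTrees, etc.) are already defined as in the mllib example:

import org.apache.spark.mllib.util.MLUtils

// Ten (training, validation) splits for 10-fold cross-validation.
val folds = MLUtils.kFold(data, numFolds = 10, seed = 12345)

folds.foreach { case (training, validation) =>
  val model = RandomForest.trainClassifier(training, numClasses,
    categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
  // ... evaluate `model` on `validation` and aggregate the metrics ...
}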
08-17-2016
07:06 AM
Hi,
I executed the following code to obtain a description of the PCA:

import org.apache.spark.mllib.feature.PCA
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

val unparseddata = sc.textFile("hdfs:///tmp/epidemiological10.csv")
val data = unparseddata.map { line =>
val parts = line.split(',').map(_.toDouble)
LabeledPoint(parts.last, Vectors.dense(parts.slice(0, parts.length)))
}
val pca = new PCA(5).fit(data.map(_.features))
val projected = data.map(p => p.copy(features = pca.transform(p.features)))
val collect = projected.collect()
println("Projected vector of principal component:")
collect.foreach { vector => println(vector) }

and I obtained the following result:

(160.0,[-226.2602388674248,-28.5763504459316,-167.30588000588938,-169.403316284169,23.09294762015914])
(176.0,[-248.89483793051159,-21.97201619037966,-193.69749510702238,-108.81814406079761,20.90854574732602])
(179.0,[-253.1354367540671,-29.972928370070743,-244.2610705303066,-129.17921788251297,20.090356540571392])
(172.7,[-244.22812858428057,-21.1460977635957,-179.6413565398707,-106.6403738598213,23.450082340280513])
...

I assume the values in brackets are the five principal components, but I would like to know what the leading numbers (the ones I put in bold) mean. Thanks in advance, Laia
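For reference, `projected` is an RDD of LabeledPoints, so each printed tuple is (label, projected features): the leading number is the label taken from the last column of the CSV, and the bracketed array holds that row's coordinates on the five principal components. A small sketch to print them separately:

// Each element of `projected` is a LabeledPoint(label, projected feature vector).
projected.take(3).foreach { p =>
  println(s"label = ${p.label}, principal components = ${p.features}")
}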
08-02-2016
05:34 AM
Hi,
I am working with genetic data, which is why I have so many columns. What other technique can I use to reduce dimensionality? I also run into out-of-memory problems when running random forest, which is why I tried to do a PCA first. Thank you, Laia
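One MLlib option that avoids building a dense covariance matrix is chi-squared feature selection. This is only a sketch: the feature count of 500 is an arbitrary illustration, and ChiSqSelector treats feature values as categorical, which may or may not fit the genotype encoding.

import org.apache.spark.mllib.feature.ChiSqSelector
import org.apache.spark.mllib.regression.LabeledPoint

// Keep the 500 features most associated with the label (illustrative number).
val selector = new ChiSqSelector(500)
val selectorModel = selector.fit(data) // data: RDD[LabeledPoint]

val reducedData = data.map { lp =>
  LabeledPoint(lp.label, selectorModel.transform(lp.features))
}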
08-01-2016
09:49 AM
1 Kudo
Hello,
I want to perform a PCA with Spark because I have many columns and I want to reduce dimensionality. The code is the following:

val unparseddata = sc.textFile("hdfs:///tmp/new_data_binary_5.csv")
val data = unparseddata.map { line =>
val parts = line.split(',').map(_.toDouble)
LabeledPoint(parts.last%2, Vectors.dense(parts.slice(0, parts.length - 1)))
}
val pca = new PCA(5).fit(data.map(_.features))
val projected = data.map(p => p.copy(features = pca.transform(p.features)))

And I have the following error:

java.lang.IllegalArgumentException: Argument with more than 65535 cols: 4605966
at org.apache.spark.mllib.linalg.distributed.RowMatrix.checkNumColumns(RowMatrix.scala:132)
at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeCovariance(RowMatrix.scala:329)
at org.apache.spark.mllib.linalg.distributed.RowMatrix.computePrincipalComponents(RowMatrix.scala:389)
at org.apache.spark.mllib.feature.PCA.fit(PCA.scala:46)
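For context, a rough back-of-the-envelope sketch (not from the original post): RowMatrix.computePrincipalComponents materializes a dense covariance matrix with one entry per pair of columns, which is why MLlib caps it at 65,535 columns; with 4,605,966 columns that matrix alone would be far beyond any memory budget.

// Rough size of the dense covariance matrix PCA would need to materialize.
val numCols = 4605966L
val covarianceBytes = numCols * numCols * 8L // 8 bytes per Double entry
println(covarianceBytes / 1e12 + " TB")      // roughly 170 TB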