Created 09-06-2016 07:45 AM
Hello,
I get the following error: java.lang.IllegalArgumentException: Field "label_idx" does not exist.
It happens after executing this code:
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.sql.types._
import sqlContext.implicits._
import org.apache.spark.ml.attribute.NominalAttribute
import org.apache.spark.ml.feature.StringIndexer
val unparseddata = sc.textFile("hdfs:///tmp/epidemiological16.csv")
val data = unparseddata.map { line =>
  val parts = line.split(',').map(_.toDouble)
  LabeledPoint(parts.last % 2, Vectors.dense(parts.slice(0, parts.length - 1)))
}
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData2, testData2) = (splits(0), splits(1))
val trainingData = trainingData2.toDF
val nFolds: Int = 10
val NumTrees: Int = 3
val rf = new RandomForestClassifier() .setNumTrees(NumTrees) .setFeaturesCol("features")
val indexer = new StringIndexer() .setInputCol("label") .setOutputCol("label_idx") .fit(trainingData)
rf.setLabelCol("label_idx").fit(indexer.transform(trainingData))
val pipeline = new Pipeline().setStages(Array(rf))
val paramGrid = new ParamGridBuilder().build()
val evaluator = new MulticlassClassificationEvaluator() .setLabelCol("label") .setPredictionCol("prediction")
val cv = new CrossValidator() .setEstimator(pipeline) .setEvaluator(evaluator) .setEstimatorParamMaps(paramGrid) .setNumFolds(nFolds)
val model = cv.fit(trainingData)
Do you know where the problem might be?
Thanks,
Laia
Created 09-06-2016 03:41 PM
Hey Laia, you're close, but the indexer is fitted and applied outside the pipeline, so its output never reaches the cross-validator.
label_idx is not visible to the random forest because cv.fit(trainingData) receives the original DataFrame, and the pipeline contains only rf. The label_idx column exists only in the DataFrame returned by indexer.transform(trainingData), so inside the pipeline it "does not exist". If the indexer runs before rf inside the pipeline, it should work.
I'd recommend decoupling the indexer from the rf object and running both as stages of the pipeline. Here's the code that I got to work. I also added a few lines at the bottom to show the predictions and accuracy (feel free to modify to fit your requirements). Let me know if this helps.
---
val unparseddata = sc.textFile("hdfs:///tmp/your_data.csv")
// Parse each CSV line: the last column (mod 2) is the label, the rest are features
val data = unparseddata.map { line =>
  val parts = line.split(',').map(_.toDouble)
  LabeledPoint(parts.last % 2, Vectors.dense(parts.slice(0, parts.length - 1)))
}.toDF()
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
val nFolds: Int = 10
val NumTrees: Int = 3
// Leave the indexer unfitted; the pipeline fits it inside each cross-validation fold
val indexer = new StringIndexer().setInputCol("label").setOutputCol("label_idx")
val rf = new RandomForestClassifier().setNumTrees(NumTrees).setFeaturesCol("features").setLabelCol("label_idx")
val pipeline = new Pipeline().setStages(Array(indexer, rf))
val paramGrid = new ParamGridBuilder().build()
// Predictions are in the indexed label space, so evaluate against "label_idx" rather than "label"
val evaluator = new MulticlassClassificationEvaluator().setLabelCol("label_idx").setPredictionCol("prediction")
val cv = new CrossValidator().setEstimator(pipeline).setEvaluator(evaluator).setEstimatorParamMaps(paramGrid).setNumFolds(nFolds)
val model = cv.fit(trainingData)
val predictions = model.transform(testData)
// Show model predictions
predictions.show()
// Caution: unless setMetricName is called, this evaluator returns F1, not accuracy.
// The accuracy metric is named "precision" in Spark 1.x and "accuracy" in Spark 2.x.
val accuracy = evaluator.evaluate(predictions)
println("Accuracy: " + accuracy)
println("Error Rate: " + (1.0 - accuracy))
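The "does not exist" error explained in this answer can be confirmed by comparing column names before and after indexing. The following is only a small sketch for the spark-shell; it assumes the `trainingData` DataFrame and the unfitted `indexer` defined in the code of this answer, and the name `indexed` is introduced here purely for illustration.

```scala
// Assumes `trainingData` and the unfitted `indexer` from the answer's code.
// The raw DataFrame only has the columns produced by LabeledPoint.
println(trainingData.columns.mkString(", "))

// Fitting and applying the indexer returns a NEW DataFrame with "label_idx" appended;
// trainingData itself is left unchanged.
val indexed = indexer.fit(trainingData).transform(trainingData)
println(indexed.columns.mkString(", "))

// A pipeline containing only rf sees the raw DataFrame when cv.fit(trainingData) is called:
println(trainingData.columns.contains("label_idx")) // false
// With the indexer as the first pipeline stage, rf sees the column it needs:
println(indexed.columns.contains("label_idx"))      // true
```

Because DataFrames are immutable, calling transform outside the pipeline never changes what cv.fit receives, which is why the indexer has to be a pipeline stage.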
Created 09-07-2016 07:40 AM
Hello Dan,
Thanks a lot for the help, it worked!
In addition, I would like to get the recall, precision and F1 score, and to see the trees of the random forest. Do you know how I can do that? I have 2 imbalanced classes, so I would like to have these metrics for each class.
Best regards,
Laia
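One possible way to get per-class metrics and inspect the trees is sketched below. It is not from the thread's authors. It assumes the fitted `model` (the CrossValidatorModel) and the `predictions` DataFrame from the answer above, and that the pipeline stages are the indexer followed by the random forest, in that order. The names `predictionAndLabels`, `metrics`, `bestPipeline`, `indexerModel` and `rfModel` are introduced here for illustration. Per-class figures come from `MulticlassMetrics` in spark.mllib, and the tree dump from `toDebugString` on the fitted forest.

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.classification.RandomForestClassificationModel
import org.apache.spark.ml.feature.StringIndexerModel
import org.apache.spark.mllib.evaluation.MulticlassMetrics

// Assumes `model` and `predictions` from the answer above.
// MulticlassMetrics expects an RDD of (prediction, label) pairs; both columns are doubles.
val predictionAndLabels = predictions
  .select("prediction", "label_idx")
  .rdd
  .map(row => (row.getDouble(0), row.getDouble(1)))
val metrics = new MulticlassMetrics(predictionAndLabels)

// The best model found by cross-validation is the fitted pipeline:
// stage 0 is the fitted indexer, stage 1 the fitted forest (assumed order).
val bestPipeline = model.bestModel.asInstanceOf[PipelineModel]
val indexerModel = bestPipeline.stages(0).asInstanceOf[StringIndexerModel]
val rfModel = bestPipeline.stages(1).asInstanceOf[RandomForestClassificationModel]

// Per-class precision, recall and F1.
// indexerModel.labels maps each index back to the original label value.
metrics.labels.foreach { idx =>
  val original = indexerModel.labels(idx.toInt)
  println(s"class $original (index $idx): " +
    s"precision=${metrics.precision(idx)}, " +
    s"recall=${metrics.recall(idx)}, " +
    s"f1=${metrics.fMeasure(idx)}")
}

// Rows are actual classes, columns are predicted classes, ordered by index.
println("Confusion matrix:\n" + metrics.confusionMatrix)

// Text dump of every tree in the forest (splits and leaf predictions).
println(rfModel.toDebugString)
```

With imbalanced classes, the per-class recall of the minority class and the confusion matrix are usually more informative than a single overall score. Note that the metrics are keyed by the indexed label, which is why the sketch maps each index back through `indexerModel.labels`.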