laia_subirats · ‎09-06-2016

Hello,

I get the following error: java.lang.IllegalArgumentException: Field "label_idx" does not exist.

It occurs after executing this code:

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.sql.types._
import sqlContext.implicits._
import org.apache.spark.ml.attribute.NominalAttribute
import org.apache.spark.ml.feature.StringIndexer

val unparseddata = sc.textFile("hdfs:///tmp/epidemiological16.csv")

val data = unparseddata.map { line =>
  val parts = line.split(',').map(_.toDouble)
  LabeledPoint(parts.last % 2, Vectors.dense(parts.slice(0, parts.length - 1)))
}

val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData2, testData2) = (splits(0), splits(1))
val trainingData = trainingData2.toDF

val nFolds: Int = 10
val NumTrees: Int = 3

val rf = new RandomForestClassifier().setNumTrees(NumTrees).setFeaturesCol("features")
val indexer = new StringIndexer().setInputCol("label").setOutputCol("label_idx").fit(trainingData)
rf.setLabelCol("label_idx").fit(indexer.transform(trainingData))

val pipeline = new Pipeline().setStages(Array(rf))
val paramGrid = new ParamGridBuilder().build()
val evaluator = new MulticlassClassificationEvaluator().setLabelCol("label").setPredictionCol("prediction")
val cv = new CrossValidator().setEstimator(pipeline).setEvaluator(evaluator).setEstimatorParamMaps(paramGrid).setNumFolds(nFolds)

val model = cv.fit(trainingData)

Do you know where the problem might be?

Thanks,

Laia

dzaratsian · ‎09-06-2016

Hey Laia, you're close, but the indexer and the random forest are wired together in the wrong order.

label_idx is not visible to the random forest during cross-validation: the indexer is fitted and applied outside the pipeline, so cv.fit(trainingData) receives the original DataFrame, which has no label_idx column (hence "does not exist"). If you make the indexer a pipeline stage ahead of the random forest, it should work.

I'd recommend decoupling the indexer and the rf object and running both as stages of the pipeline. Here's the code that I got to work. I also added a few lines at the bottom to show the predictions and accuracy (feel free to modify them to fit your requirements). Let me know if this helps.

---

val unparseddata = sc.textFile("hdfs:///tmp/your_data.csv")

// Parse each CSV line into a LabeledPoint: last column (mod 2) is the label, the rest are features
val data = unparseddata.map { line =>
  val parts = line.split(',').map(_.toDouble)
  LabeledPoint(parts.last % 2, Vectors.dense(parts.slice(0, parts.length - 1)))
}.toDF()

val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

val nFolds: Int = 10
val NumTrees: Int = 3

// The indexer is left unfitted here; the pipeline fits it before the random forest
val indexer = new StringIndexer().setInputCol("label").setOutputCol("label_idx")
val rf = new RandomForestClassifier().setNumTrees(NumTrees).setFeaturesCol("features").setLabelCol("label_idx")
val pipeline = new Pipeline().setStages(Array(indexer, rf))

val paramGrid = new ParamGridBuilder().build()
// Note: with no metric name set, this evaluator reports its default metric (F1)
val evaluator = new MulticlassClassificationEvaluator().setLabelCol("label").setPredictionCol("prediction")
val cv = new CrossValidator().setEstimator(pipeline).setEvaluator(evaluator).setEstimatorParamMaps(paramGrid).setNumFolds(nFolds)

val model = cv.fit(trainingData)
val predictions = model.transform(testData)

// Show model predictions
predictions.show()

val accuracy = evaluator.evaluate(predictions)
println("Accuracy: " + accuracy)
println("Error Rate: " + (1.0 - accuracy))
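
To see why the column went missing in the first version, it can help to compare the columns of the raw DataFrame with those of the indexed one. This is only a small sketch; it assumes the spark-shell session above, where trainingData is the DataFrame built from LabeledPoint rows, and the names fittedIndexer and indexed are made up for illustration.

```scala
import org.apache.spark.ml.feature.StringIndexer

// Assumption: `trainingData` is the DataFrame from the session above (columns: label, features).
val fittedIndexer = new StringIndexer().setInputCol("label").setOutputCol("label_idx").fit(trainingData)

// The indexed column exists only on the DataFrame returned by transform(),
// not on trainingData itself.
val indexed = fittedIndexer.transform(trainingData)

println(trainingData.columns.mkString(", ")) // label, features
println(indexed.columns.mkString(", "))      // label, features, label_idx
```

Anything fitted on trainingData directly, such as a pipeline that contains only the random forest, never sees label_idx. Putting the indexer into the pipeline means each cross-validation fold gets the column added before the forest is trained.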

laia_subirats · ‎09-07-2016

Hello Dan,

Thank you very much for the help; it worked!

In addition, I would like to get the recall, precision, and F1 score, and to inspect the trees of the random forest. Do you know how I can do that? I have two imbalanced classes, so I would like these metrics for each class.

Best regards,

Laia
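
As a starting point for the per-class question, here is a minimal plain-Scala sketch of how precision, recall and F1 are computed for each class from (prediction, label) pairs. The names ClassMetrics and perClassMetrics and the toy data are hypothetical, not part of Spark. In Spark itself, MulticlassMetrics (already imported in the original question) is built from an RDD of (prediction, label) pairs and offers precision(label), recall(label) and fMeasure(label). Since the forest is trained on label_idx, it is probably safer to pair the prediction column with label_idx than with label. For the trees, a fitted RandomForestClassificationModel has a toDebugString member.

```scala
// Per-class precision, recall and F1 from (prediction, trueLabel) pairs.
// Plain Scala, no Spark needed; paste into a Scala REPL or spark-shell.
final case class ClassMetrics(precision: Double, recall: Double, f1: Double)

def perClassMetrics(pairs: Seq[(Double, Double)]): Map[Double, ClassMetrics] = {
  // Every class that appears as either a prediction or a true label.
  val classes = pairs.flatMap { case (p, l) => Seq(p, l) }.distinct
  classes.map { c =>
    val tp = pairs.count { case (p, l) => p == c && l == c } // true positives
    val fp = pairs.count { case (p, l) => p == c && l != c } // false positives
    val fn = pairs.count { case (p, l) => p != c && l == c } // false negatives
    // Guard against division by zero when a class is never predicted or never present.
    val precision = if (tp + fp == 0) 0.0 else tp.toDouble / (tp + fp)
    val recall = if (tp + fn == 0) 0.0 else tp.toDouble / (tp + fn)
    val f1 = if (precision + recall == 0.0) 0.0 else 2 * precision * recall / (precision + recall)
    c -> ClassMetrics(precision, recall, f1)
  }.toMap
}

// Toy data, made up for illustration: (prediction, true label).
val pairs = Seq((0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0))
val metrics = perClassMetrics(pairs)

metrics.toSeq.sortBy(_._1).foreach { case (c, m) =>
  println(f"class $c%.1f: precision=${m.precision}%.3f recall=${m.recall}%.3f f1=${m.f1}%.3f")
}
```

With imbalanced classes, the per-class numbers for the minority class are usually the ones to watch, since a single aggregate score can hide poor recall on it.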