Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant

error: java.lang.IllegalArgumentException: Field "label_idx" does not exist

New Member

Hello,

I have the following error: java.lang.IllegalArgumentException: Field "label_idx" does not exist.

After executing this code:

import org.apache.spark.mllib.tree.RandomForest

import org.apache.spark.mllib.tree.model.RandomForestModel

import org.apache.spark.mllib.util.MLUtils

import org.apache.spark.mllib.linalg.Vectors

import org.apache.spark.mllib.regression.LabeledPoint

import org.apache.spark.mllib.evaluation.MulticlassMetrics

import org.apache.spark.ml.Pipeline

import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}

import org.apache.spark.ml.classification.RandomForestClassifier

import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

import org.apache.spark.sql.types._

import sqlContext.implicits._

import org.apache.spark.ml.attribute.NominalAttribute

import org.apache.spark.ml.feature.StringIndexer

val unparseddata = sc.textFile("hdfs:///tmp/epidemiological16.csv")

val data = unparseddata.map { line =>
  val parts = line.split(',').map(_.toDouble)
  LabeledPoint(parts.last%2, Vectors.dense(parts.slice(0, parts.length - 1)))
}

val splits = data.randomSplit(Array(0.7, 0.3))

val (trainingData2, testData2) = (splits(0), splits(1))

val trainingData = trainingData2.toDF

val nFolds: Int = 10

val NumTrees: Int = 3

val rf = new RandomForestClassifier() .setNumTrees(NumTrees) .setFeaturesCol("features")

val indexer = new StringIndexer() .setInputCol("label") .setOutputCol("label_idx") .fit(trainingData)

rf.setLabelCol("label_idx").fit(indexer.transform(trainingData))

val pipeline = new Pipeline().setStages(Array(rf))

val paramGrid = new ParamGridBuilder().build()

val evaluator = new MulticlassClassificationEvaluator() .setLabelCol("label") .setPredictionCol("prediction")

val cv = new CrossValidator() .setEstimator(pipeline) .setEvaluator(evaluator) .setEstimatorParamMaps(paramGrid) .setNumFolds(nFolds)

val model = cv.fit(trainingData)

Do you know where the problem could be?

Thanks,

Laia

1 ACCEPTED SOLUTION

Hey Laia, you're close. The problem is that the indexer never runs as part of the pipeline you hand to the CrossValidator.

In your code the indexer is only applied in the standalone rf.fit(indexer.transform(trainingData)) call. The pipeline contains just rf, which expects a "label_idx" column, but cv.fit(trainingData) passes in the original DataFrame, which only has "label" and "features". That is why Spark reports that the field "does not exist".
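
To see why the column is missing, it can help to compare the columns before and after the indexer runs. This is only a minimal sketch, assuming a Spark 1.x spark-shell where sqlContext is predefined; the names toy and toyIndexer and the sample values are made up for illustration.

```scala
import org.apache.spark.ml.feature.StringIndexer
import sqlContext.implicits._

// Tiny stand-in for the real training data (values are invented for illustration).
val toy = Seq((0.0, "a"), (1.0, "b"), (1.0, "c")).toDF("label", "note")

// Fitting the indexer learns the label-to-index mapping; it does not modify `toy`.
val toyIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("label_idx")
  .fit(toy)

// transform returns a NEW DataFrame with the extra column; the original is unchanged.
println(toy.columns.mkString(", "))                        // label, note
println(toyIndexer.transform(toy).columns.mkString(", "))  // label, note, label_idx
```

Any estimator that is fitted on toy directly will not find label_idx, which matches the error in the question.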

I'd recommend decoupling the indexer from the rf object and running both as stages of the pipeline. Here's the code that I got to work. I also added a few lines at the bottom to show the predictions and an evaluation score (feel free to modify to fit your requirements). Let me know if this helps.

---

val unparseddata = sc.textFile("hdfs:///tmp/your_data.csv")

val data = unparseddata.map {

line => val parts = line.split(',').map(_.toDouble)

LabeledPoint(parts.last%2, Vectors.dense(parts.slice(0, parts.length - 1)))

}.toDF()

val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

val nFolds: Int = 10

val NumTrees: Int = 3

val indexer = new StringIndexer().setInputCol("label").setOutputCol("label_idx")

val rf = new RandomForestClassifier().setNumTrees(NumTrees).setFeaturesCol("features").setLabelCol("label_idx")

val pipeline = new Pipeline().setStages(Array(indexer, rf))

val paramGrid = new ParamGridBuilder().build()

// Compare predictions against the indexed label, the column the forest is trained on

val evaluator = new MulticlassClassificationEvaluator().setLabelCol("label_idx").setPredictionCol("prediction")

val cv = new CrossValidator() .setEstimator(pipeline) .setEvaluator(evaluator) .setEstimatorParamMaps(paramGrid) .setNumFolds(nFolds)

val model = cv.fit(trainingData)

val predictions = model.transform(testData)

// Show model predictions

predictions.show()

// The evaluator's default metric is F1, so this value is an F1 score rather than plain accuracy

val f1 = evaluator.evaluate(predictions)

println("F1: " + f1)
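
One caveat about the label columns in the code above: by default StringIndexer gives index 0 to the most frequent label, so with imbalanced classes label_idx is not guaranteed to equal label, and predictions are best compared against label_idx. The sketch below is a plain-Scala imitation of that ordering, not Spark's implementation; frequencyIndex is a hypothetical helper and the sample values are made up.

```scala
// Hypothetical helper: imitates StringIndexer's default frequency-descending ordering.
// It is not Spark code and does not define tie-breaking between equally frequent labels.
def frequencyIndex(labels: Seq[Double]): Map[Double, Double] = {
  val byCountDesc = labels
    .groupBy(identity)
    .toSeq
    .sortBy { case (_, occurrences) => -occurrences.size }
    .map { case (label, _) => label }
  byCountDesc.zipWithIndex.map { case (label, idx) => label -> idx.toDouble }.toMap
}

// Made-up sample in which class 1.0 is the majority.
val sample = Seq(1.0, 1.0, 1.0, 0.0)
val mapping = frequencyIndex(sample)

println(mapping(1.0)) // 0.0 (the majority class gets index 0)
println(mapping(0.0)) // 1.0
```

In this sample the original label 1.0 ends up as index 0.0, so a prediction of 0.0 would actually mean class 1.0.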


New Member

Hello Dan,

Thanks a lot for the help, it worked!

I would also like to get the recall, precision and F1 score, and to see the trees of the random forest. Do you know how I can do that? I have 2 imbalanced classes, so I would like the metrics for each class.

Best regards,

Laia
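
One possible way to get per-class metrics and print the trees, offered only as a sketch: it assumes the predictions DataFrame and the cross-validated model from the accepted solution, that the forest is the second pipeline stage (index 1), and that label_idx is the label column. ClassMetrics, perClassMetrics and forestDescription are hypothetical helper names, not Spark APIs.

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.classification.RandomForestClassificationModel
import org.apache.spark.ml.tuning.CrossValidatorModel
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.sql.DataFrame

// Hypothetical container for the per-class results.
case class ClassMetrics(label: Double, precision: Double, recall: Double, f1: Double)

// Computes precision, recall and F1 for every class found in the data.
// Assumes both columns hold Double values, as in the accepted solution.
def perClassMetrics(predictions: DataFrame, labelCol: String = "label_idx"): Seq[ClassMetrics] = {
  val predictionAndLabels = predictions
    .select("prediction", labelCol)
    .rdd
    .map(row => (row.getDouble(0), row.getDouble(1)))
  val metrics = new MulticlassMetrics(predictionAndLabels)
  metrics.labels.toSeq.map { l =>
    ClassMetrics(l, metrics.precision(l), metrics.recall(l), metrics.fMeasure(l))
  }
}

// Returns a text dump of all trees in the best model found by cross-validation.
// rfStageIndex = 1 assumes the pipeline stages are Array(indexer, rf).
def forestDescription(cvModel: CrossValidatorModel, rfStageIndex: Int = 1): String = {
  val pipelineModel = cvModel.bestModel.asInstanceOf[PipelineModel]
  val forest = pipelineModel.stages(rfStageIndex).asInstanceOf[RandomForestClassificationModel]
  forest.toDebugString
}
```

With the variables from the accepted solution, perClassMetrics(predictions).foreach(println) would list the metrics per class and println(forestDescription(model)) would print the trees. Note that the class values reported are the indexed labels, not necessarily the original ones.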