Support Questions

Find answers, ask questions, and share your expertise

10-fold cross validation in Random Forests

Contributor

Hello,

I am using this Scala MLlib code for random forests. I wonder whether this code uses 10-fold cross-validation; if not, I would like to know how to do it in Scala.

Thanks,

Laia

1 ACCEPTED SOLUTION

Expert Contributor

No, that code is not using cross-validation. An example of how to use cross-validation can be found here. It requires the DataFrame API, so you should refer to this for the Random Forest implementation.
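For intuition about what `CrossValidator` does with `numFolds = 10`, here is a minimal plain-Scala sketch of k-fold splitting, with no Spark dependency. The object and the round-robin fold assignment are illustrative assumptions, not Spark's actual implementation:

```scala
// Hypothetical sketch of k-fold cross-validation splitting:
// partition the row indices into k disjoint folds; each fold serves once
// as the validation set while the remaining folds form the training set.
object KFoldSketch {
  def kFoldIndices(nRows: Int, k: Int): Seq[(Seq[Int], Seq[Int])] = {
    // Round-robin assignment of row indices to k folds.
    val folds = (0 until nRows).groupBy(_ % k).values.toSeq
    folds.map { validation =>
      // Training set = everything not in the current validation fold.
      val training = (0 until nRows).filterNot(validation.contains)
      (training, validation)
    }
  }
}
```

`CrossValidator` fits the estimator on each training split, evaluates on the matching validation split, and averages the k metric values per parameter combination before refitting the best model on the full dataset.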


5 REPLIES


Contributor

Hello,

I get the following error: java.lang.IllegalArgumentException: Field "label_idx" does not exist.

After executing this code:

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.sql.types._
import sqlContext.implicits._
import org.apache.spark.ml.attribute.NominalAttribute
import org.apache.spark.ml.feature.StringIndexer

val unparseddata = sc.textFile("hdfs:///tmp/epidemiological16.csv")

val data = unparseddata.map { line =>
  val parts = line.split(',').map(_.toDouble)
  LabeledPoint(parts.last % 2, Vectors.dense(parts.slice(0, parts.length - 1)))
}

val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData2, testData2) = (splits(0), splits(1))
val trainingData = trainingData2.toDF

val nFolds: Int = 10
val NumTrees: Int = 3

val rf = new RandomForestClassifier()
  .setNumTrees(NumTrees)
  .setFeaturesCol("features")

val indexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("label_idx")
  .fit(trainingData)

rf.setLabelCol("label_idx").fit(indexer.transform(trainingData))

val pipeline = new Pipeline().setStages(Array(rf))

val paramGrid = new ParamGridBuilder().build()

val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(nFolds)

val model = cv.fit(trainingData)

Do you know where the problem could be?

Thanks,

Laia

Expert Contributor

Could you please post the full stack trace of the exception? It looks like the indexer is not creating the label_idx column properly...

Contributor

Hello Marco,

Dan already answered the question here https://community.hortonworks.com/answers/55111/view.html

Thanks,

Laia

New Contributor

I would like to perform 10-fold CV with a random forest on an RDD input, but I am having a problem converting the RDD input to a DataFrame. I am using this code, as you recommended:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

var inputPath = "..."
var text = sc.textFile(inputPath)
var rows = text.map(line => line.split(",").map(_.trim)).map(a => Row.fromSeq(a))
var header = rows.first()

val schema = StructType(header.map(fieldName => StructField(fieldName.asInstanceOf[String], StringType, true)))
val df = spark.createDataFrame(rows, schema)

val nFolds: Int = 10
val NumTrees: Int = 30
val metric: String = "accuracy"

val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setNumTrees(NumTrees)

val pipeline = new Pipeline().setStages(Array(rf))
val paramGrid = new ParamGridBuilder().build() // No parameter search

val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName(metric)

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(nFolds)

val model = cv.fit(df) // trainingData: DataFrame

Any help please? Thank you.
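One likely issue in the code above: `rows.first()` returns the header row, but that row is never filtered out of `rows`, and every column in the schema is StringType, whereas `RandomForestClassifier` expects a numeric `label` column and a vector `features` column. A plain-Scala sketch of separating the header from the data and parsing the numeric fields, using hypothetical CSV lines in place of `sc.textFile(inputPath)` (no Spark needed):

```scala
// Hypothetical CSV content standing in for the lines read from inputPath.
val lines = Seq(
  "f1,f2,label",
  "1.0,2.0,0",
  "3.0,4.0,1"
)

// Take the header separately, so it is not parsed as a data row.
val header: Array[String] = lines.head.split(",").map(_.trim)

// Parse only the remaining lines into numeric values.
val data: Seq[Array[Double]] =
  lines.tail.map(_.split(",").map(_.trim.toDouble))
```

In Spark, the same idea is usually expressed by filtering the header row out of the RDD (e.g. `text.filter(_ != headerLine)`) before parsing, or by reading with a CSV data source that handles headers; the parsed doubles can then be assembled into the `label` and `features` columns the classifier expects.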