10-fold cross validation in Random Forests

laia_subirats — Mon, 05 Sep 2016 16:32:00 GMT

Hello,

I am using this MLlib Scala code for random forests, and I wonder whether it uses 10-fold cross-validation. If not, I would like to know how to do it in Scala.

Thanks,

Laia

Re: 10-fold cross validation in Random Forests

mgaido — Mon, 05 Sep 2016 17:38:49 GMT

No, that code does not use cross-validation. An example of how to use cross-validation can be found here. It requires the DataFrame API, so you should refer to this for the Random Forest implementation.
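For readers without the linked examples, a minimal sketch of 10-fold cross-validation around a random forest with the DataFrame-based API might look like the following. It assumes Spark 2.x or later and a local SparkSession; the toy data and the names `toy` and `cvModel` are hypothetical and do not come from this thread.

```scala
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; on a cluster, reuse the existing one.
val spark = SparkSession.builder().appName("rf-cv-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical toy data: 200 rows with a 0.0/1.0 label and two numeric features.
val toy = (0 until 200).map { i =>
  val label = (i % 2).toDouble
  (label, Vectors.dense(label * 2.0 + (i % 5) * 0.1, (i % 7).toDouble))
}.toDF("label", "features")

val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setNumTrees(3)

val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")

// An empty grid yields a single parameter map, so only the default settings are evaluated.
val cv = new CrossValidator()
  .setEstimator(rf)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(new ParamGridBuilder().build())
  .setNumFolds(10)

// Trains on 9 folds and evaluates on the held-out fold, 10 times, then refits on all rows.
val cvModel = cv.fit(toy)
println(cvModel.avgMetrics.mkString(", "))
```

`avgMetrics` holds one averaged score per parameter map, so with an empty grid it contains a single value.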

Re: 10-fold cross validation in Random Forests

laia_subirats — Mon, 05 Sep 2016 19:46:48 GMT

Hello,

I get the following error: java.lang.IllegalArgumentException: Field "label_idx" does not exist.

It occurs after executing this code:

import org.apache.spark.mllib.tree.RandomForest

import org.apache.spark.mllib.tree.model.RandomForestModel

import org.apache.spark.mllib.util.MLUtils

import org.apache.spark.mllib.linalg.Vectors

import org.apache.spark.mllib.regression.LabeledPoint

import org.apache.spark.mllib.evaluation.MulticlassMetrics

import org.apache.spark.ml.Pipeline

import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}

import org.apache.spark.ml.classification.RandomForestClassifier

import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

import org.apache.spark.sql.types._

import sqlContext.implicits._

import org.apache.spark.ml.attribute.NominalAttribute

import org.apache.spark.ml.feature.StringIndexer

val unparseddata = sc.textFile("hdfs:///tmp/epidemiological16.csv")

val data = unparseddata.map { line =>
  val parts = line.split(',').map(_.toDouble)
  LabeledPoint(parts.last%2, Vectors.dense(parts.slice(0, parts.length - 1)))
}

val splits = data.randomSplit(Array(0.7, 0.3))

val (trainingData2, testData2) = (splits(0), splits(1))

val trainingData = trainingData2.toDF

val nFolds: Int = 10

val NumTrees: Int = 3

val rf = new RandomForestClassifier()
  .setNumTrees(NumTrees)
  .setFeaturesCol("features")

val indexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("label_idx")
  .fit(trainingData)

rf.setLabelCol("label_idx").fit(indexer.transform(trainingData))

val pipeline = new Pipeline().setStages(Array(rf))

val paramGrid = new ParamGridBuilder().build()

val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(nFolds)

val model = cv.fit(trainingData)

Do you know where the problem might be?

Thanks,

Laia

Re: 10-fold cross validation in Random Forests

mgaido — Tue, 06 Sep 2016 14:53:01 GMT

Could you please post the full stack trace of the exception? It looks like the indexer is not creating the label_idx column properly...

Re: 10-fold cross validation in Random Forests

laia_subirats — Wed, 07 Sep 2016 14:42:28 GMT

Hello Marco,

Dan already answered the question here: https://community.hortonworks.com/answers/55111/view.html

Thanks,

Laia
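The linked answer is not reproduced in this thread. Judging only from the code posted above, one likely cause is that the pipeline contains just rf, whose label column is set to "label_idx", while cv.fit receives trainingData, which never passes through the indexer. A minimal sketch of one possible fix, assuming Spark 2.x or later, is to make the indexer a pipeline stage and evaluate against the indexed label. The toy `trainingData` below is a hypothetical stand-in for the real data.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration only.
val spark = SparkSession.builder().appName("label-idx-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for trainingData: a 0.0/1.0 label and two numeric features.
val trainingData = (0 until 200).map { i =>
  val label = (i % 2).toDouble
  (label, Vectors.dense(label + (i % 5) * 0.1, (i % 7).toDouble))
}.toDF("label", "features")

// Left unfitted on purpose: the pipeline fits it on each training fold.
val indexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("label_idx")

val rf = new RandomForestClassifier()
  .setLabelCol("label_idx")
  .setFeaturesCol("features")
  .setNumTrees(3)

// With the indexer as the first stage, "label_idx" exists by the time rf runs.
val pipeline = new Pipeline().setStages(Array[PipelineStage](indexer, rf))

// Predictions are in index space, so the evaluator compares against "label_idx".
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label_idx")
  .setPredictionCol("prediction")

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(new ParamGridBuilder().build())
  .setNumFolds(10)

val model = cv.fit(trainingData)
```

Putting the indexer inside the pipeline also means the same indexing is applied automatically whenever the fitted model transforms new data.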

Re: 10-fold cross validation in Random Forests

zeinebchelly — Fri, 04 Aug 2017 21:47:16 GMT

I would like to perform 10-fold cross-validation with a random forest on an RDD input, but I am having a problem converting the RDD to a DataFrame. I am using this code, as you recommended:

import org.apache.spark.ml.Pipeline;

import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator};

import org.apache.spark.ml.classification.RandomForestClassifier;

import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator;

var inputPath = "..."

var text = sc.textFile(inputPath)

var rows = text.map(line => line.split(",").map(_.trim)).map(a => Row.fromSeq(a))

var header = rows.first()

val schema = StructType(header.map(fieldName => StructField(fieldName.asInstanceOf[String],StringType,true)))

val df = spark.createDataFrame(rows,schema)

val nFolds: Int = 10

val NumTrees: Int = 30

val metric: String = "accuracy"

val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setNumTrees(NumTrees)

val pipeline = new Pipeline().setStages(Array(rf))

val paramGrid = new ParamGridBuilder().build() // No parameter search

val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName(metric)

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(nFolds)

val model = cv.fit(df) // trainingData: DataFrame

Any help please? Thank you.
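The exact error is not stated, but the posted conversion keeps the header line among the data rows, types every column as a string, and never builds a vector-valued "features" column, any of which could make cv.fit fail. A minimal sketch of one way to do the conversion, assuming Spark 2.x or later, an all-numeric CSV, and a header that names the label column "label" (all assumptions, as are the sample rows):

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Assumption: a local session for illustration only.
val spark = SparkSession.builder().appName("csv-to-features-sketch").master("local[*]").getOrCreate()

// Hypothetical stand-in for sc.textFile(inputPath): a header line, then numeric rows.
val text = spark.sparkContext.parallelize(Seq(
  "f1,f2,label",
  "1.0, 2.0, 0",
  "3.5, 0.5, 1",
  "2.0, 2.5, 0"
))

val header = text.first()
val columnNames = header.split(",").map(_.trim)

// Drop the header line, then parse every remaining field as a Double.
val rows = text
  .filter(_ != header)
  .map(line => Row.fromSeq(line.split(",").map(_.trim.toDouble).toSeq))

// One DoubleType column per header name.
val schema = StructType(columnNames.map(name => StructField(name, DoubleType, nullable = false)))
val raw = spark.createDataFrame(rows, schema)

// Pack every non-label column into the single vector column the classifier expects.
val assembler = new VectorAssembler()
  .setInputCols(columnNames.filter(_ != "label"))
  .setOutputCol("features")

val df = assembler.transform(raw).select("label", "features")
df.show()
```

The resulting `df` has the "label" and "features" columns that the classifier and evaluator in the post are configured to read, so it could be passed to cv.fit in place of the string-typed DataFrame.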