10-fold cross validation in Random Forests

Hello,

I am using this Scala MLlib code for random forests. Does this code use 10-fold cross-validation? If not, I would like to know how to do it in Scala.

Thanks,

Laia

1 ACCEPTED SOLUTION

Re: 10-fold cross validation in Random Forests

No, that code does not use cross-validation. An example of how to use cross-validation can be found here. It requires the DataFrame API, so you should refer to this for the Random Forest implementation.
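For readers unsure what 10-fold cross-validation involves, here is a minimal plain-Scala sketch of the idea: the records are shuffled, dealt into k folds, and each fold serves once as the validation set while the other k - 1 folds form the training set. The function name `kFoldIndices` and the seed are illustrative assumptions, and Spark's own fold assignment may differ in detail.

```scala
import scala.util.Random

// Hypothetical helper: returns k (training, validation) pairs of record indices.
def kFoldIndices(n: Int, k: Int, seed: Long = 42L): Seq[(Seq[Int], Seq[Int])] = {
  require(k >= 2 && k <= n, "need 2 <= k <= n")
  // Shuffle the indices so folds do not depend on the input order.
  val shuffled = new Random(seed).shuffle((0 until n).toVector)
  // The index at shuffled position p goes to fold p % k.
  val folds: Seq[Vector[Int]] = shuffled.zipWithIndex
    .groupBy { case (_, position) => position % k }
    .toSeq
    .sortBy { case (foldId, _) => foldId }
    .map { case (_, members) => members.map(_._1) }
  // Each fold is the validation set exactly once; the rest is training data.
  folds.indices.map { i =>
    val validation = folds(i)
    val training = folds.indices.filter(_ != i).flatMap(j => folds(j))
    (training, validation)
  }
}

val splits = kFoldIndices(n = 100, k = 10)
splits.zipWithIndex.foreach { case ((training, validation), i) =>
  println(s"fold $i: ${training.length} training, ${validation.length} validation")
}
```

A model would be trained and scored once per pair, and the k scores averaged.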

Re: 10-fold cross validation in Random Forests

Hello,

I get the following error: java.lang.IllegalArgumentException: Field "label_idx" does not exist.

It is thrown after executing this code:

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.sql.types._
import sqlContext.implicits._
import org.apache.spark.ml.attribute.NominalAttribute
import org.apache.spark.ml.feature.StringIndexer

val unparseddata = sc.textFile("hdfs:///tmp/epidemiological16.csv")

val data = unparseddata.map { line =>
  val parts = line.split(',').map(_.toDouble)
  LabeledPoint(parts.last % 2, Vectors.dense(parts.slice(0, parts.length - 1)))
}

val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData2, testData2) = (splits(0), splits(1))
val trainingData = trainingData2.toDF

val nFolds: Int = 10
val NumTrees: Int = 3

val rf = new RandomForestClassifier()
  .setNumTrees(NumTrees)
  .setFeaturesCol("features")

val indexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("label_idx")
  .fit(trainingData)

rf.setLabelCol("label_idx").fit(indexer.transform(trainingData))

val pipeline = new Pipeline().setStages(Array(rf))

val paramGrid = new ParamGridBuilder().build()

val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(nFolds)

val model = cv.fit(trainingData)

Do you know where the problem might be?

Thanks,

Laia

Re: 10-fold cross validation in Random Forests

Could you please post the full stack trace of the exception? It looks like the indexer is not creating the label_idx column properly...
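Judging from the posted code alone, one likely cause is that the pipeline handed to CrossValidator contains only the random forest, which reads its labels from label_idx, while cv.fit receives the un-indexed trainingData, which has no such column. A minimal sketch of one possible fix follows: put an unfitted StringIndexer into the pipeline so that every fold creates label_idx, and point the evaluator at the same column. It assumes Spark 2.x or later (SparkSession, org.apache.spark.ml.linalg), and the in-memory toy data is a hypothetical stand-in for the CSV file.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cv-label-idx-sketch")
  .master("local[*]") // assumption: local run, adjust for a cluster
  .getOrCreate()
import spark.implicits._

// Hypothetical toy data with the same column names as in the post.
val trainingData = (0 until 200).map { i =>
  val label = (i % 2).toDouble
  (label, Vectors.dense(label + (i % 5) * 0.1, (i % 7).toDouble))
}.toDF("label", "features")

// Unfitted indexer: the pipeline fits it on the training part of each fold.
val indexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("label_idx")

val rf = new RandomForestClassifier()
  .setLabelCol("label_idx")
  .setFeaturesCol("features")
  .setNumTrees(3)

// The indexer runs before the forest, so label_idx exists when rf needs it.
val pipeline = new Pipeline().setStages(Array(indexer, rf))

// Compare predictions against the indexed label, not the raw one.
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label_idx")
  .setPredictionCol("prediction")

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(new ParamGridBuilder().build()) // no parameter search
  .setNumFolds(10)

val model = cv.fit(trainingData)
println(s"Average metric over the folds: ${model.avgMetrics.mkString(", ")}")
```

With an empty parameter grid, avgMetrics holds a single value: the evaluator's metric averaged over the 10 folds.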

Re: 10-fold cross validation in Random Forests

Hello Marco,

Dan already answered the question here https://community.hortonworks.com/answers/55111/view.html

Thanks,

Laia

Re: 10-fold cross validation in Random Forests

I would like to perform 10-fold cross-validation with a random forest on an RDD input, but I am having a problem converting the RDD input to a DataFrame. I am using this code, as you recommended:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

var inputPath = "..."
var text = sc.textFile(inputPath)
var rows = text.map(line => line.split(",").map(_.trim)).map(a => Row.fromSeq(a))
var header = rows.first()

val schema = StructType(header.map(fieldName => StructField(fieldName.asInstanceOf[String], StringType, true)))
val df = spark.createDataFrame(rows, schema)

val nFolds: Int = 10
val NumTrees: Int = 30
val metric: String = "accuracy"

val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setNumTrees(NumTrees)

val pipeline = new Pipeline().setStages(Array(rf))
val paramGrid = new ParamGridBuilder().build() // No parameter search

val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName(metric)

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(nFolds)

val model = cv.fit(df) // trainingData: DataFrame

Any help please? Thank you.
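Going by the posted code only, a few things would likely stop cv.fit from working: every column is typed as StringType, the header row stays in the data, and there is no vector-valued "features" column, which RandomForestClassifier expects alongside a numeric label. A minimal sketch of one way to prepare such a DataFrame is shown below. It assumes Spark 2.2 or later, a CSV with a header row and a numeric column named label, and uses hypothetical in-memory lines in place of the real file.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("csv-to-features-sketch")
  .master("local[*]") // assumption: local run, adjust for a cluster
  .getOrCreate()
import spark.implicits._

// Hypothetical CSV content; for a real file, pass its path to csv(...) instead.
val lines = Seq(
  "f1,f2,label",
  "1.0,0.5,0",
  "0.2,1.5,1",
  "0.9,0.4,0",
  "0.1,1.7,1"
).toDS()

// Let Spark consume the header row and infer numeric column types.
val raw = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(lines)

// Pack every non-label column into a single vector column named "features".
val featureCols = raw.columns.filter(_ != "label")
val assembler = new VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features")

val prepared = assembler.transform(raw).select("label", "features")
prepared.printSchema()
```

A DataFrame shaped like prepared (numeric label, vector features) is what the CrossValidator in the post would be fitted on.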
