About zeinebchelly

zeinebchelly · ‎08-04-2017

I would like to run 10-fold cross-validation with a random forest on an RDD input, but I am having a problem converting the RDD input to a DataFrame. I am using this code, as you recommended:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
    import org.apache.spark.ml.classification.RandomForestClassifier
    import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

    var inputPath = "..."
    var text = sc.textFile(inputPath)
    var rows = text.map(line => line.split(",").map(_.trim)).map(a => Row.fromSeq(a))
    var header = rows.first()
    val schema = StructType(header.map(fieldName => StructField(fieldName.asInstanceOf[String], StringType, true)))
    val df = spark.createDataFrame(rows, schema)

    val nFolds: Int = 10
    val NumTrees: Int = 30
    val metric: String = "accuracy"

    val rf = new RandomForestClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setNumTrees(NumTrees)
    val pipeline = new Pipeline().setStages(Array(rf))
    val paramGrid = new ParamGridBuilder().build() // No parameter search
    val evaluator = new MulticlassClassificationEvaluator()
      .setLabelCol("label")
      .setPredictionCol("prediction")
      .setMetricName(metric)
    val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(evaluator)
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(nFolds)
    val model = cv.fit(df) // trainingData: DataFrame

Any help, please? Thank you.
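The post does not include the error message, so the following is only a guess at the cause. Three things stand out in the posted code. It uses Row, StructType, StructField and StringType without importing them. The header line is still part of the data. Every column is a string, while RandomForestClassifier expects a numeric label column and a features column of Vector type. Below is a minimal sketch of one way to build such a DataFrame from an RDD of text lines. It assumes the last CSV column is the class label and all other columns are numeric. The in-memory lines, the column names f1, f2 and label, and the names rawLabel and cvModel are hypothetical stand-ins, not taken from the original post.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// In spark-shell, getOrCreate() returns the session that already exists.
val spark = SparkSession.builder().appName("rf-cv-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for sc.textFile(inputPath): a header plus synthetic rows.
val headerLine = "f1,f2,label"
val dataLines = (1 to 200).map { i =>
  val c = i % 2
  s"${c + (i % 7) * 0.01},${i % 5},class$c"
}
val text = spark.sparkContext.parallelize(headerLine +: dataLines)

// Drop the header so it is not parsed as a data row.
val header = text.first()
val rows = text.filter(_ != header).map(_.split(",").map(_.trim))

// Assumption: last column is the label, all other columns are numeric features.
val raw = rows
  .map(a => (a.last, Vectors.dense(a.init.map(_.toDouble))))
  .toDF("rawLabel", "features")

// Map string labels to numeric indices in a "label" column.
val df = new StringIndexer()
  .setInputCol("rawLabel")
  .setOutputCol("label")
  .fit(raw)
  .transform(raw)

val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setNumTrees(30)
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")
val cv = new CrossValidator()
  .setEstimator(new Pipeline().setStages(Array(rf)))
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(new ParamGridBuilder().build()) // no parameter search
  .setNumFolds(10)

val cvModel = cv.fit(df)
// Cross-validated accuracy averaged over the 10 folds (one entry per param map).
println(cvModel.avgMetrics.mkString(", "))
```

With a real file, the parallelize call would be replaced by sc.textFile(inputPath). If the label is not the last column, or some features are categorical, the parsing step has to change accordingly.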

Last Visited	‎08-04-2017 02:47 PM

Member Since	‎08-04-2017 10:47 AM
Posts	1

Cloudera Community

Re: 10-fold cross validation in Random Forests