Created 09-05-2016 09:32 AM
Hello,
I am using this MLlib random forest code in Scala. I wonder whether this code uses 10-fold cross-validation; if not, I would like to know how to do it in Scala.
Thanks,
Laia
Created 09-05-2016 10:38 AM
Created 09-05-2016 12:46 PM
Hello,
I have the following error: java.lang.IllegalArgumentException: Field "label_idx" does not exist.
After executing this code:
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.sql.types._
import sqlContext.implicits._
import org.apache.spark.ml.attribute.NominalAttribute
import org.apache.spark.ml.feature.StringIndexer
val unparseddata = sc.textFile("hdfs:///tmp/epidemiological16.csv")
val data = unparseddata.map { line =>
  val parts = line.split(',').map(_.toDouble)
  LabeledPoint(parts.last % 2, Vectors.dense(parts.slice(0, parts.length - 1)))
}
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData2, testData2) = (splits(0), splits(1))
val trainingData = trainingData2.toDF
val nFolds: Int = 10
val NumTrees: Int = 3
val rf = new RandomForestClassifier()
  .setNumTrees(NumTrees)
  .setFeaturesCol("features")
val indexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("label_idx")
  .fit(trainingData)
rf.setLabelCol("label_idx").fit(indexer.transform(trainingData))
val pipeline = new Pipeline().setStages(Array(rf))
val paramGrid = new ParamGridBuilder().build()
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(nFolds)
val model = cv.fit(trainingData)
Do you know where the problem could be?
Thanks,
Laia
Created 09-06-2016 07:53 AM
Could you please post the full stack trace of the exception? It looks like the indexer is not properly creating the label_idx column...
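In the meantime, here is a sketch of one likely fix (assuming your raw label column is named "label"): the StringIndexer is fitted once but never added to the pipeline, so when CrossValidator refits the pipeline on each fold, the classifier sees data without a label_idx column. Putting the indexer inside the pipeline, and pointing the evaluator at the indexed column, should avoid that:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}

// Do NOT call .fit here: leave the indexer as an Estimator so the
// Pipeline (and thus CrossValidator) fits it on each training fold
val indexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("label_idx")

val rf = new RandomForestClassifier()
  .setLabelCol("label_idx")
  .setFeaturesCol("features")
  .setNumTrees(3)

// The indexer must come before the classifier in the stages array
val pipeline = new Pipeline().setStages(Array(indexer, rf))

val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label_idx") // must match the indexer's output column
  .setPredictionCol("prediction")

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(new ParamGridBuilder().build())
  .setNumFolds(10)

// trainingData: DataFrame with "label" and "features" columns
val model = cv.fit(trainingData)
```

This is only a guess without the stack trace, but mismatched label columns between the evaluator and the classifier is a common cause of this exact error.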
Created 09-07-2016 07:42 AM
Hello Marco,
Dan already answered the question here https://community.hortonworks.com/answers/55111/view.html
Thanks,
Laia
Created 08-04-2017 02:47 PM
I would like to perform 10-fold CV with a random forest on an RDD input, but I am having a problem when converting the input RDD to a DataFrame. I am using this code, as you recommended:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}
var inputPath = "..."
var text = sc.textFile(inputPath)
var rows = text.map(line => line.split(",").map(_.trim)).map(a => Row.fromSeq(a))
var header = rows.first()
val schema = StructType(header.map(fieldName => StructField(fieldName.asInstanceOf[String],StringType,true)))
val df = spark.createDataFrame(rows,schema)
val nFolds: Int = 10
val NumTrees: Int = 30
val metric: String = "accuracy"
val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setNumTrees(NumTrees)
val pipeline = new Pipeline().setStages(Array(rf))
val paramGrid = new ParamGridBuilder().build() // No parameter search
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName(metric)
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(nFolds)
val model = cv.fit(df) // trainingData: DataFrame
Any help please? Thank you.
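For reference, here is a sketch of one way around this kind of conversion problem, under two assumptions: the file's first line is a header, and the label is in a column literally named "label" (rename yours if it differs). The code above builds an all-StringType DataFrame that still contains the header row and has no "features" vector column, so CrossValidator has nothing to fit on. Reading the CSV with the DataFrame reader and assembling the numeric columns avoids the manual Row/schema work:

```scala
import org.apache.spark.ml.feature.VectorAssembler

// Read the CSV directly as a DataFrame; "header" skips the first line
// and "inferSchema" produces numeric columns instead of strings
val raw = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(inputPath)

// Assemble every non-label column into the "features" vector
// that RandomForestClassifier expects
val featureCols = raw.columns.filter(_ != "label")
val assembler = new VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features")

val df = assembler.transform(raw)
val model = cv.fit(df) // cv as defined above
```

The VectorAssembler could also be added as the first stage of the Pipeline instead of applied up front; either way works, as long as it runs before the classifier.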