Created 07-05-2017 02:27 PM
Hi, I have been following some online examples while trying to build a model. I am using a CSV data set; below is a snippet of the headings and some of the data:
TrialID ObsNum IsAlert P1 P2 P3 P4 P5 P6 P7 P8 E1 E2
0 0 138.4294 10.9435 1000 60 0.302277 508 118.11 0 0 0
0 1 138.3609 15.3212 1000 60 0.302277 508 118.11 0 0 0
The third column, IsAlert, is the ground truth.
This is the code I have been trying, amongst some others:
val training = sc.textFile("hdfs:///ford/fordTrain.csv")
val header = training.first
val inferSchema = true
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)
// Fit the model
val lrModel = lr.fit(training)
// Print the coefficients and intercept for logistic regression
println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")
// We can also use the multinomial family for binary classification
val mlr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)
  .setFamily("multinomial")
val mlrModel = mlr.fit(training)
// Print the coefficients and intercepts for logistic regression with multinomial family
println(s"Multinomial coefficients: ${mlrModel.coefficientMatrix}")
println(s"Multinomial intercepts: ${mlrModel.interceptVector}")
This is the error I am receiving:
import org.apache.spark.sql.types.{StructType, StructField, StringType}
training: org.apache.spark.rdd.RDD[String] = hdfs:///ford/fordTrain.csv MapPartitionsRDD[7] at textFile at <console>:188
header: String = TrialID,ObsNum,IsAlert,P1,P2,P3,P4,P5,P6,P7,P8,E1,E2,E3,E4,E5,E6,E7,E8,E9,E10,E11,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11
inferSchema: Boolean = true
lr: org.apache.spark.ml.classification.LogisticRegression = logreg_1049bed7e9a0
<console>:192: error: type mismatch;
found : org.apache.spark.rdd.RDD[String]
required: org.apache.spark.sql.DataFrame
val lrModel = lr.fit(training)
^
I would be grateful for any help. Thank you.
Created 07-05-2017 10:12 PM
@Roger Young The newer APIs assume you have a DataFrame, not an RDD, so the easiest thing to do is to import the implicits from either sqlContext.implicits._ or spark.implicits._ and then either call .toDF on the initial load or create a DataFrame from your training RDD.
You could alternatively use LogisticRegressionWithSGD or LogisticRegressionWithLBFGS, which can operate on RDDs, but then you'll have to convert your input to LabeledPoints.
FWIW, I'd make sure to convert the columns in your training data to their respective data types, just to make sure that your continuous variables are treated as such and not as categorical.
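For the RDD route, a rough sketch of what the LabeledPoint conversion could look like (the column positions are assumptions based on the header in the question: IsAlert at index 2 as the label, the columns after it as features):

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val raw = sc.textFile("hdfs:///ford/fordTrain.csv")
val header = raw.first

// Drop the header row, then parse each line into a LabeledPoint:
// label = IsAlert (index 2), features = the remaining sensor columns.
val points = raw
  .filter(_ != header)
  .map { line =>
    val cols = line.split(",").map(_.toDouble)
    LabeledPoint(cols(2), Vectors.dense(cols.drop(3)))
  }

val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(2)
  .run(points)
```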
import spark.implicits._
...
val training = sc.textFile("hdfs:///ford/fordTrain.csv")
val df = training.toDF
// fixup your data to ensure your columns are the expected data type
...
val lrModel = lr.fit(df)
...
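Alternatively, since you already have a CSV with a header, a cleaner path (sketched here, not tested against your data) is to let Spark's CSV reader infer the numeric types and then use VectorAssembler to build the single features column that LogisticRegression expects. The choice of which columns to exclude from the features is an assumption on my part:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler

// Read the CSV with the header row and inferred column types.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///ford/fordTrain.csv")

// Assemble everything except the IDs and the label into one vector column.
val featureCols = df.columns.filterNot(Seq("TrialID", "ObsNum", "IsAlert").contains)
val assembler = new VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features")

val lr = new LogisticRegression()
  .setLabelCol("IsAlert")
  .setFeaturesCol("features")
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)

val lrModel = lr.fit(assembler.transform(df))
```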
Created 07-06-2017 12:18 PM