Created 07-05-2017 02:27 PM
Hi, I have been following some online examples while trying to build a model. I am using a CSV data set; below is a snippet of the headers and some of the data:
TrialID ObsNum IsAlert P1 P2 P3 P4 P5 P6 P7 P8 E1 E2
0 0 138.4294 10.9435 1000 60 0.302277 508 118.11 0 0 0
0 1 138.3609 15.3212 1000 60 0.302277 508 118.11 0 0 0
The third column, IsAlert, is the ground truth.
This is the code I have been trying, among some others:
val training = sc.textFile("hdfs:///ford/fordTrain.csv")
val header = training.first
val inferSchema = true

val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)

// Fit the model
val lrModel = lr.fit(training)

// Print the coefficients and intercept for logistic regression
println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")

// We can also use the multinomial family for binary classification
val mlr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)
  .setFamily("multinomial")

val mlrModel = mlr.fit(training)

// Print the coefficients and intercepts for logistic regression with multinomial family
println(s"Multinomial coefficients: ${mlrModel.coefficientMatrix}")
println(s"Multinomial intercepts: ${mlrModel.interceptVector}")

This is the error I am receiving:

import org.apache.spark.sql.types.{StructType, StructField, StringType}
training: org.apache.spark.rdd.RDD[String] = hdfs:///ford/fordTrain.csv MapPartitionsRDD[7] at textFile at <console>:188
header: String = TrialID,ObsNum,IsAlert,P1,P2,P3,P4,P5,P6,P7,P8,E1,E2,E3,E4,E5,E6,E7,E8,E9,E10,E11,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11
inferSchema: Boolean = true
lr: org.apache.spark.ml.classification.LogisticRegression = logreg_1049bed7e9a0
<console>:192: error: type mismatch;
 found   : org.apache.spark.rdd.RDD[String]
 required: org.apache.spark.sql.DataFrame
       val lrModel = lr.fit(training)
                            ^

I would be grateful for any help, thank you.
Created 07-05-2017 10:12 PM
@Roger Young The newer APIs expect a DataFrame, not an RDD, so the easiest fix is to import the implicits from either sqlContext.implicits._ or spark.implicits._ and then either call .toDF on the initial load or create a DataFrame from your training RDD.
You could alternatively use LogisticRegressionWithSGD or LogisticRegressionWithLBFGS, which operate on RDDs, but then you'll have to convert your input to LabeledPoints.
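If you go the RDD route, each CSV line has to be split into a label (IsAlert, column index 2) and a feature array (the remaining numeric columns) before it can be wrapped in a LabeledPoint. The parsing step itself is plain Scala; a minimal sketch, with column positions taken from the header above:

```scala
// Parse one CSV line into (label, features).
// TrialID and ObsNum (indices 0 and 1) are row identifiers, so they are
// dropped; IsAlert (index 2) is the label; everything after it is a feature.
def parseLine(line: String): (Double, Array[Double]) = {
  val cols = line.split(",").map(_.trim.toDouble)
  val label = cols(2)
  val features = cols.drop(3)
  (label, features)
}

// Shortened example row for illustration (real rows have all 33 columns)
val (label, features) = parseLine("0,1,1.0,138.36,15.32,1000,60")
println(label)           // 1.0
println(features.length) // 4
```

From there, `LabeledPoint(label, Vectors.dense(features))` (from org.apache.spark.mllib) gives you the input type those RDD-based trainers expect; remember to filter out the header line first.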
FWIW, I'd make sure to convert the columns in your training data to their respective data types, so that your continuous variables are treated as such rather than as categorical.
import spark.implicits._
...
val training = sc.textFile("hdfs:///ford/fordTrain.csv")
val df = training.toDF
// fix up your data to ensure your columns are the expected data types
...
val lrModel = lr.fit(df)
...
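Putting those pieces together, a sketch of the DataFrame route (assuming Spark 2.x with a SparkSession available, and the column names from the header above): read the CSV directly with the DataFrame reader so header/inferSchema give you named, typed columns, then assemble the feature columns into the single vector column that LogisticRegression expects.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ford").getOrCreate()

// Read the CSV as a DataFrame; header + inferSchema produce typed columns
// instead of one String per line, which avoids the type-mismatch error.
val raw = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///ford/fordTrain.csv")

// LogisticRegression wants a "label" column and a "features" vector column.
// Everything except the label and the row identifiers is used as a feature.
val featureCols = raw.columns
  .filter(c => c != "IsAlert" && c != "TrialID" && c != "ObsNum")

val assembler = new VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features")

val training = assembler.transform(raw)
  .withColumnRenamed("IsAlert", "label")

val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)

val lrModel = lr.fit(training)
println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")
```

Same idea as the snippet above, just letting spark.read do the parsing instead of sc.textFile + .toDF, so there's no header line or String column to clean up afterwards.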
Created 07-06-2017 12:18 PM