Created 07-05-2017 02:27 PM
Hi, I have been following some online examples while trying to build a model. I am using a CSV data set; below is a snippet of the headers and some of the data:
TrialID ObsNum IsAlert P1 P2 P3 P4 P5 P6 P7 P8 E1 E2
0 0 138.4294 10.9435 1000 60 0.302277 508 118.11 0 0 0
0 1 138.3609 15.3212 1000 60 0.302277 508 118.11 0 0 0
The third column, IsAlert, is the ground truth.
This is the code I have been trying, among some others:
val training = sc.textFile("hdfs:///ford/fordTrain.csv")
val header = training.first
val inferSchema = true

val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)

// Fit the model
val lrModel = lr.fit(training)

// Print the coefficients and intercept for logistic regression
println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")

// We can also use the multinomial family for binary classification
val mlr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)
  .setFamily("multinomial")

val mlrModel = mlr.fit(training)

// Print the coefficients and intercepts for logistic regression with multinomial family
println(s"Multinomial coefficients: ${mlrModel.coefficientMatrix}")
println(s"Multinomial intercepts: ${mlrModel.interceptVector}")

This is the error I am receiving:

import org.apache.spark.sql.types.{StructType, StructField, StringType}
training: org.apache.spark.rdd.RDD[String] = hdfs:///ford/fordTrain.csv MapPartitionsRDD[7] at textFile at <console>:188
header: String = TrialID,ObsNum,IsAlert,P1,P2,P3,P4,P5,P6,P7,P8,E1,E2,E3,E4,E5,E6,E7,E8,E9,E10,E11,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11
inferSchema: Boolean = true
lr: org.apache.spark.ml.classification.LogisticRegression = logreg_1049bed7e9a0
<console>:192: error: type mismatch;
 found   : org.apache.spark.rdd.RDD[String]
 required: org.apache.spark.sql.DataFrame
       val lrModel = lr.fit(training)
                            ^

I would be grateful for any help, thank you.
Created 07-05-2017 10:12 PM
@Roger Young The newer APIs expect a DataFrame, not an RDD, so the easiest fix is to import the implicits from either sqlContext.implicits._ or spark.implicits._ and then either call .toDF on the initial load or create a DataFrame from your training RDD.
You could alternatively use LogisticRegressionWithSGD or LogisticRegressionWithLBFGS, which operate on RDDs, but then you'll have to convert your input to LabeledPoints.
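If you go the RDD route, each CSV line has to be split into a label (IsAlert, column index 2) and a feature array (the remaining numeric columns) before it can be wrapped in a LabeledPoint. The parsing step itself is plain Scala; a minimal sketch, with column positions taken from the header above:

```scala
// Parse one CSV line into (label, features).
// TrialID and ObsNum (indices 0 and 1) are row identifiers, so they are
// dropped; IsAlert (index 2) is the label; everything after it is a feature.
def parseLine(line: String): (Double, Array[Double]) = {
  val cols = line.split(",").map(_.trim.toDouble)
  val label = cols(2)
  val features = cols.drop(3)
  (label, features)
}

// Shortened example row for illustration (real rows have all 33 columns)
val (label, features) = parseLine("0,1,1.0,138.36,15.32,1000,60")
println(label)           // 1.0
println(features.length) // 4
```

From there, `LabeledPoint(label, Vectors.dense(features))` (from org.apache.spark.mllib) gives you the input type those RDD-based trainers expect; remember to filter out the header line first.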
FWIW, I'd make sure to convert the columns in your training data to their respective data types, so that your continuous variables are treated as such rather than as categorical.
import spark.implicits._
...
val training = sc.textFile("hdfs:///ford/fordTrain.csv")
val df = training.toDF
// fix up your data to ensure your columns are the expected data types
...
val lrModel = lr.fit(df)
...
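Putting those pieces together, a sketch of the DataFrame route (assuming Spark 2.x with a SparkSession available, and the column names from the header above): read the CSV directly with the DataFrame reader so header/inferSchema give you named, typed columns, then assemble the feature columns into the single vector column that LogisticRegression expects.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ford").getOrCreate()

// Read the CSV as a DataFrame; header + inferSchema produce typed columns
// instead of one String per line, which avoids the type-mismatch error.
val raw = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///ford/fordTrain.csv")

// LogisticRegression wants a "label" column and a "features" vector column.
// Everything except the label and the row identifiers is used as a feature.
val featureCols = raw.columns
  .filter(c => c != "IsAlert" && c != "TrialID" && c != "ObsNum")

val assembler = new VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features")

val training = assembler.transform(raw)
  .withColumnRenamed("IsAlert", "label")

val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)

val lrModel = lr.fit(training)
println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")
```

Same idea as the snippet above, just letting spark.read do the parsing instead of sc.textFile + .toDF, so there's no header line or String column to clean up afterwards.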
Created 07-06-2017 12:18 PM