Created 11-08-2016 10:46 PM
Hi experts,
I am using Spark 2.0.0 with an airline dataset. I created a SparkR DataFrame and was able to run some of the SparkR DataFrame API functions, but I am running into exceptions while building a linear model with the Gaussian family. Here is my command:
model <- glm(train_data, ARR_DELAY ~ MONTH + DEP_HOUR + DEP_DELAY + WEEKEND + DISTANCE, family = "gaussian")
ERROR:
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
org.apache.spark.sql.AnalysisException: Cannot resolve column name "formula" among (YEAR, MONTH, DAY_OF_MONTH, DAY_OF_WEEK, CARRIER, FL_NUM, ORIGIN, DEST, DEP_TIME, DEP_DELAY, ARR_TIME, ARR_DELAY, CANCELLED, CANCELLATION_CODE, AIR_TIME, DISTANCE, WEEKEND, DEP_HOUR, DELAY_LABELED);
For some reason it tries to resolve a column named "formula", so I replaced the command above with:
model <- glm(train_data, formula = ARR_DELAY ~ MONTH + DEP_HOUR + DEP_DELAY + WEEKEND + DISTANCE, family = "gaussian")
This time, I got this error:
ERROR Executor: Exception in task 0.0 in stage 45.0 (TID 95)
scala.MatchError: [null,1.0,[1.0,11.0,5.0,1.0,2475.0]] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
Has anyone seen this kind of behaviour? Thanks in advance.
Created 11-12-2016 01:14 AM
Got this one working: there were some null values in the output (dependent) variable ARR_DELAY. I replaced them with the mean value of the column.
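For anyone landing here later, the fix described above might look roughly like the following in SparkR. This is only a sketch, assuming `train_data` is the SparkDataFrame from the question and a Spark 2.0 session is already running; the names `null_count`, `mean_delay`, and `train_filled` are illustrative and not from the original post.

```r
library(SparkR)

# Count rows where the label is null; the thread identifies these
# rows as the cause of the scala.MatchError.
null_count <- count(filter(train_data, isNull(train_data$ARR_DELAY)))

# mean() skips nulls, so this is the mean of the observed delays.
mean_delay <- collect(agg(train_data, m = mean(train_data$ARR_DELAY)))$m

# Replace nulls in ARR_DELAY only; other columns are left untouched.
train_filled <- fillna(train_data, list("ARR_DELAY" = mean_delay))

# Refit with the formula first and the data passed by name.
model <- glm(ARR_DELAY ~ MONTH + DEP_HOUR + DEP_DELAY + WEEKEND + DISTANCE,
             data = train_filled, family = "gaussian")
summary(model)
```

Imputing the mean keeps every row but pulls the label towards the average; whether that is acceptable depends on how many rows `null_count` reports.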
Created 11-11-2016 08:37 PM
It would be a great help if someone could reply to this thread; I am kind of stuck here. Thanks.
Created 04-25-2018 04:47 AM
You could also drop the rows that have a null value in that column with:
train_df <- dropna(train_df, cols = 'ARR_DELAY')