Error in building the Generalized Linear Model in SparkR

Super Collaborator

Hi Experts,

I am using Spark 2.0.0 with an airline dataset. I created a SparkR DataFrame and was able to run some of the SparkR DataFrame API functions, but I am running into exceptions while building a linear model with the Gaussian family. Here is my command:

model <- glm(train_data, ARR_DELAY ~ MONTH + DEP_HOUR + DEP_DELAY + WEEKEND + DISTANCE, family = "gaussian")

ERROR:

Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :

org.apache.spark.sql.AnalysisException: Cannot resolve column name "formula" among (YEAR, MONTH, DAY_OF_MONTH, DAY_OF_WEEK, CARRIER, FL_NUM, ORIGIN, DEST, DEP_TIME, DEP_DELAY, ARR_TIME, ARR_DELAY, CANCELLED, CANCELLATION_CODE, AIR_TIME, DISTANCE, WEEKEND, DEP_HOUR, DELAY_LABELED);

For some reason it tries to resolve a column named formula, so I replaced the above command with:

model <- glm(train_data, formula = ARR_DELAY ~ MONTH + DEP_HOUR + DEP_DELAY + WEEKEND + DISTANCE, family = "gaussian")

This time, I got this error:

ERROR Executor: Exception in task 0.0 in stage 45.0 (TID 95)

scala.MatchError: [null,1.0,[1.0,11.0,5.0,1.0,2475.0]] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)

Has anyone seen this kind of behaviour? Thanks in advance.

1 ACCEPTED SOLUTION

Super Collaborator

Got this one working: there were some null values in the output (dependent) variable ARR_DELAY. I replaced them with the mean value of the column.
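The fix described above can be sketched roughly as follows. This is only an illustration under assumptions: the toy data frame stands in for the airline data, `mean_delay` and the alias `m` are names made up for the example, and a local Spark 2.x installation is assumed to be available to SparkR.

```r
library(SparkR)
sparkR.session()  # assumption: a local Spark 2.x installation is available

# Toy stand-in for the airline data; the NA in ARR_DELAY becomes a null in Spark.
local_df <- data.frame(ARR_DELAY = c(10, NA, 20), DEP_DELAY = c(5, 7, 9))
train_data <- createDataFrame(local_df)

# avg() ignores nulls, so this is the mean of the observed delays only.
mean_delay <- collect(agg(train_data, m = avg(train_data$ARR_DELAY)))$m

# Fill nulls in ARR_DELAY only; the other columns are left untouched.
train_data <- fillna(train_data, list(ARR_DELAY = mean_delay))
```

Mean imputation keeps every row but pulls the label distribution toward its centre, so it may be worth comparing the fitted model against one trained with the null rows simply removed.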


REPLIES

Super Collaborator

It would be a great help if someone could reply to this thread; I am kind of stuck here. Thanks.

Contributor

You could also drop the rows that have null values in that column with:

train_df <- dropna(train_df, cols = 'ARR_DELAY')
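Putting the pieces of this thread together, a minimal end-to-end sketch might look like the one below. The small data frame and the name `n_null` are invented for illustration, and a local Spark 2.x installation is assumed. SparkR's `glm` takes the formula as its first argument and the data as a later one, so passing everything by name avoids the positional mix-up from the first attempt in the question.

```r
library(SparkR)
sparkR.session()  # assumption: a local Spark 2.x installation is available

# Toy stand-in for the airline data, with one missing label.
local_df <- data.frame(
  ARR_DELAY = c(12, NA, 3, 25, 8, 15),
  DEP_DELAY = c(10, 4, 1, 20, 7, 11),
  DISTANCE  = c(500, 300, 250, 1200, 800, 640)
)
train_df <- createDataFrame(local_df)

# Count rows whose label is null before fitting.
n_null <- count(filter(train_df, isNull(train_df$ARR_DELAY)))

# Drop only the rows where the label is null.
train_df <- dropna(train_df, cols = "ARR_DELAY")

# Name every argument so the formula and the data cannot be swapped.
model <- glm(formula = ARR_DELAY ~ DEP_DELAY + DISTANCE,
             data = train_df, family = "gaussian")
summary(model)
```

Checking the null count first makes it easy to see how many rows `dropna` will discard before deciding between dropping and imputing.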