Error in building the Generalized Linear Model in SparkR

Super Collaborator

Hi Experts,

I am using Spark 2.0.0 with an airline dataset. I created a SparkR DataFrame and can run some of the SparkR DataFrame API functions, but I run into exceptions when building a linear model with the Gaussian family. Here is my command:

model <- glm(train_data, ARR_DELAY ~ MONTH + DEP_HOUR + DEP_DELAY + WEEKEND + DISTANCE, family = "gaussian")

ERROR:

Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :

org.apache.spark.sql.AnalysisException: Cannot resolve column name "formula" among (YEAR, MONTH, DAY_OF_MONTH, DAY_OF_WEEK, CARRIER, FL_NUM, ORIGIN, DEST, DEP_TIME, DEP_DELAY, ARR_TIME, ARR_DELAY, CANCELLED, CANCELLATION_CODE, AIR_TIME, DISTANCE, WEEKEND, DEP_HOUR, DELAY_LABELED);

For some reason it tries to resolve a column named "formula", so I replaced the command above with:

model <- glm(train_data, formula = ARR_DELAY ~ MONTH + DEP_HOUR + DEP_DELAY + WEEKEND + DISTANCE, family = "gaussian")

This time, I got this error:

ERROR Executor: Exception in task 0.0 in stage 45.0 (TID 95)

scala.MatchError: [null,1.0,[1.0,11.0,5.0,1.0,2475.0]] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)

Has anyone seen this kind of behaviour? Thanks in advance.

Accepted Solutions

Re: Error in building the Generalized Linear Model in SparkR

Super Collaborator

I got this working. There were some null values in the dependent (output) variable ARR_DELAY; I replaced them with the mean value of the column.
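
For anyone hitting the same MatchError, the fix described above might look roughly like the sketch below. It is not the poster's exact code: it assumes an active SparkR session, that train_data is the SparkDataFrame from the question, and that ARR_DELAY is stored as a numeric column. The names delay_stats and mean_delay are made up for illustration. The formula is passed first and the data by name, which is the argument order SparkR's glm documents.

```r
library(SparkR)

# Assumption: train_data is the SparkDataFrame from the question and
# ARR_DELAY is numeric. delay_stats and mean_delay are illustrative names.

# Average of the non-null ARR_DELAY values, collected to a local data.frame.
delay_stats <- collect(agg(train_data, mean_delay = avg(train_data$ARR_DELAY)))
mean_delay <- delay_stats$mean_delay[1]

# Replace nulls in ARR_DELAY only; other columns are left untouched.
train_data <- fillna(train_data, list(ARR_DELAY = mean_delay))

# Refit: formula first, data passed by name.
model <- glm(ARR_DELAY ~ MONTH + DEP_HOUR + DEP_DELAY + WEEKEND + DISTANCE,
             data = train_data, family = "gaussian")
summary(model)
```

If any of the predictor columns also contain nulls, they would presumably need the same treatment before fitting.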

Replies

Re: Error in building the Generalized Linear Model in SparkR

Super Collaborator

It would be a great help if someone could reply to this thread; I am stuck here. Thanks.

Re: Error in building the Generalized Linear Model in SparkR

New Contributor

You could also drop the rows where ARR_DELAY is null with:

train_df <- dropna(train_df,cols = 'ARR_DELAY')
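
Before dropping rows, it may be worth checking how many would be lost. A minimal sketch, assuming an active SparkR session and that train_df is the same SparkDataFrame as in the line above; n_total and n_missing are illustrative names:

```r
library(SparkR)

# Assumption: train_df is the SparkDataFrame used in the reply above.
n_total <- nrow(train_df)
n_missing <- nrow(filter(train_df, isNull(train_df$ARR_DELAY)))
cat("Rows with null ARR_DELAY:", n_missing, "of", n_total, "\n")

# Drop only the rows whose ARR_DELAY is null.
train_df <- dropna(train_df, cols = "ARR_DELAY")
```

If a large share of rows is affected, replacing the nulls with the column mean instead keeps those rows in the training set, at the cost of introducing imputed labels.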