Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant
Announcements
This board is archived and read-only for historical reference. To ask a new question, please post a new topic on the appropriate active board.

spark scala Dataframe adding new column Error

avatar

Hi All,

Im trying to add a column to a dataframe based on multiple check condition, one of the operation that we are doing is we need to take sum of rows, but im getting Below error:

Exception in thread "main" java.lang.RuntimeException: Unsupported literal type class org.apache.spark.sql.Dataset [Column_12: double] at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:77) at org.apache.spark.sql.catalyst.expressions.Literal$anonfun$create$2.apply(literals.scala:163) at org.apache.spark.sql.catalyst.expressions.Literal$anonfun$create$2.apply(literals.scala:163) at scala.util.Try.getOrElse(Try.scala:79) at org.apache.spark.sql.catalyst.expressions.Literal$.create(literals.scala:162) at org.apache.spark.sql.functions$.typedLit(functions.scala:112) at org.apache.spark.sql.functions$.lit(functions.scala:95) at MYDev.ReconTest$.main(ReconTest.scala:35) at MYDev.ReconTest.main(ReconTest.scala)

and the Query im using is:

var df = inputDf
df = df.persist()
inputDf = inputDf.withColumn("newColumn",
when(df("MinBusinessDate") < "2018-08-8" && df("MaxBusinessDate") > "2018-08-08",
lit(df.groupBy(df("tableName"),df("runDate"))
.agg(sum(when(df("business_date") > "2018-08-08", df("rowCount")))
.alias("finalSRCcount"))
.drop("tableName","runDate"))))

Cheers,

MJ

1 ACCEPTED SOLUTION

avatar

Hi Issue got resolved,

i'm trying to perform Group by operation inside a Columns literal, group by itself will produce a new columns instead writing a query like i asked above we have to change our query accordingly as follow.

  1. inputDf = inputDf.groupBy(col("tableName"),col("runDate"))
  2. .agg(sum(when(col("MinBusinessDate")< col("runDate")&& col("MaxBusinessDate")> col("runDate"),
  3. when(col("business_date")> col("runDate"), col("rowCount")))).alias("NewColumnName"))

View solution in original post

1 REPLY 1

avatar

Hi Issue got resolved,

i'm trying to perform Group by operation inside a Columns literal, group by itself will produce a new columns instead writing a query like i asked above we have to change our query accordingly as follow.

  1. inputDf = inputDf.groupBy(col("tableName"),col("runDate"))
  2. .agg(sum(when(col("MinBusinessDate")< col("runDate")&& col("MaxBusinessDate")> col("runDate"),
  3. when(col("business_date")> col("runDate"), col("rowCount")))).alias("NewColumnName"))