Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

How to fix the exception while applying a regex_replace on a dataframe in Scala?

Highlighted

How to fix the exception while applying a regex_replace on a dataframe in Scala?

Contributor

I have a dataframe created from reading an RDBMS table as below:

val dataDF = spark.read.format("jdbc").option("url", connectionUrl)
                                                .option("dbtable", s"(${query}) as year2017")
                                                .option("user", devUserName)
                                                .option("password", devPassword)
                                                .option("numPartitions",15)
                                                .load()

Before ingesting the data into the Hive tables on HDFS, we were asked to apply a regex_replace pattern on the columns of the dataframe that are of String datatype. This is how I applied it:

val regExpr = dataDF.schema.fields.map { x =>
  if (x.dataType == StringType)
  "regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(%s, E'[\\\\n]+', ' ', 'g' ), E'[\\\\r]+', ' ', 'g' ), E'[\\\\t]+', ' ', 'g' ), E'[\\\\cA]+', ' ', 'g' ), E'[\\\\ca]+', ' ', 'g' ) as %s".format(x.name, x.name)
  else
    x.name
}
dataDF.selectExpr(regExpr:_*)

But when I execute the code, it ends with following exception:

Exception in thread "main" org.apache.spark.sql.catalyst.parser.ParseException:
Literals of type 'E' are currently not supported.(line 1, pos 88)

== SQL ==
regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(period_name, E'[\\n]+', ' ', 'g' ), E'[\\r]+', ' ', 'g' ), E'[\\t]+', ' ', 'g' ), E'[\\cA]+', ' ', 'g' ), E'[\\ca]+', ' ', 'g' ) as period_name
----------------------------------------------------------------------------------------^^^

I printed the schema using: println(dataDF.schema). The code is identifying the String column correctly where you can see the column name:

period_name
Schema: StructType(StructField(forecast_id,LongType,true), StructField(period_num,DecimalType(15,0),true), StructField(period_name,StringType,true), StructField(drm_org,StringType,true), StructField(ledger_id,LongType,true), StructField(currency_code,StringType,true), 

The table read is present on postgres database. I tried to understand why is E causing the exception but I am not able to get a clue. Could anyone let me know how can I fix this exception ?