Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

How to fix the exception: java.math.BigDecimal is not a valid external type for schema of double while re-applying schema on a datadframe?

Highlighted

How to fix the exception: java.math.BigDecimal is not a valid external type for schema of double while re-applying schema on a datadframe?

New Contributor

I am trying to move data from table: system_releases from Greenplum to Hive in the below manner:

<code>val yearDF = spark.read.format("jdbc").option("url","urltemplate;MaxNumericScale=30;MaxNumericPrecision=40;").option("dbtable", s"(${execQuery}) as year2016").option("user","user").option("password","pwd").option("partitionColumn","release_number").option("lowerBound",306).option("upperBound",500).option("numPartitions",2).load()<br>

Inferred Schema of the dataFrame yearDF by spark:

<code>description:string
status_date:timestamp
time_zone:string
table_refresh_delay_min:decimal(38,30)
online_patching_enabled_flag:string
release_number:decimal(38,30)
change_number:decimal(38,30)
interface_queue_enabled_flag:string
rework_enabled_flag:string
smart_transfer_enabled_flag:string
patch_number:decimal(38,30)
threading_enabled_flag:string
drm_gl_source_name:string
reverted_flag:string
table_refresh_delay_min_text:string
release_number_text:string
change_number_text:string<br><br>

I have the same table on hive with following datatypes:

<code>val hiveCols=string,status_date:timestamp,time_zone:string,table_refresh_delay_min:double,online_patching_enabled_flag:string,release_number:double,change_number:double,interface_queue_enabled_flag:string,rework_enabled_flag:string,smart_transfer_enabled_flag:string,patch_number:double,threading_enabled_flag:string,drm_gl_source_name:string,reverted_flag:string,table_refresh_delay_min_text:string,release_number_text:string,change_number_text:string<br><br>

The columns: table_refresh_delay_min, release_number, change_number and patch_number are giving too many decimal points even though there aren't many in GP. So I tried to save it as a CSV file to take a look at how data is being read by spark. For example, the max number of release_number on GP is: 306.00 but in the csv file I saved the dataframe: yearDF, the value becomes 306.000000000000000000.

I tried to take the hive table schema and converted to StructType to apply that on yearDF as below.

<code>def convertDatatype(datatype:String):DataType={val convert = datatype match{case"string"=>StringTypecase"bigint"=>LongTypecase"int"=>IntegerTypecase"double"=>DoubleTypecase"date"=>TimestampTypecase"boolean"=>BooleanTypecase"timestamp"=>TimestampType}
  convert
}val schemaList        = hiveCols.split(",")val schemaStructType  =newStructType(schemaList.map(col => col.split(":")).map(e =>StructField(e(0), convertDatatype(e(1)),true)))val newDF = spark.createDataFrame(yearDF.rdd, schemaStructType)
newDF.write.format("csv").save("hdfs/location")<br><br>

But I am getting the error:

<code>Caused by: java.lang.RuntimeException: java.math.BigDecimal is not a valid external typefor schema of double
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.evalIfFalseExpr8$(UnknownSource)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_2$(UnknownSource)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(UnknownSource)
    at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:287)...17 more<br><br>

I tried to cast the decimal columns into DoubleType in the below manner but I still face the same exception.

<code>val pattern ="""DecimalType\(\d+,(\d+)\)""".r
  val df2 = dataDF.dtypes.
    collect{case(dn, dt)if pattern.findFirstMatchIn(dt).map(_.group(1)).getOrElse("0")!="0"=> dn }.
    foldLeft(dataDF)((accDF, c)=> accDF.withColumn(c, col(c).cast("Double")))Caused by: java.lang.RuntimeException: java.math.BigDecimal is not a valid external typefor schema of double
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.evalIfFalseExpr8$(UnknownSource)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_2$(UnknownSource)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(UnknownSource)
    at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:287)...17 more<br><br>

I am out of ideas after trying to implement the above two ways. Could anyone let me know how can I cast the columns of a dataframe properly to the required datatypes ?