How to fix the exception "java.math.BigDecimal is not a valid external type for schema of double" while re-applying a schema to a DataFrame?

I am trying to move data from the table system_releases from Greenplum to Hive in the following manner:

val yearDF = spark.read.format("jdbc")
  .option("url", "urltemplate;MaxNumericScale=30;MaxNumericPrecision=40;")
  .option("dbtable", s"(${execQuery}) as year2016")
  .option("user", "user")
  .option("password", "pwd")
  .option("partitionColumn", "release_number")
  .option("lowerBound", 306)
  .option("upperBound", 500)
  .option("numPartitions", 2)
  .load()

Schema of the DataFrame yearDF as inferred by Spark:

description:string
status_date:timestamp
time_zone:string
table_refresh_delay_min:decimal(38,30)
online_patching_enabled_flag:string
release_number:decimal(38,30)
change_number:decimal(38,30)
interface_queue_enabled_flag:string
rework_enabled_flag:string
smart_transfer_enabled_flag:string
patch_number:decimal(38,30)
threading_enabled_flag:string
drm_gl_source_name:string
reverted_flag:string
table_refresh_delay_min_text:string
release_number_text:string
change_number_text:string

I have the same table on Hive with the following datatypes:

val hiveCols = "description:string,status_date:timestamp,time_zone:string,table_refresh_delay_min:double,online_patching_enabled_flag:string,release_number:double,change_number:double,interface_queue_enabled_flag:string,rework_enabled_flag:string,smart_transfer_enabled_flag:string,patch_number:double,threading_enabled_flag:string,drm_gl_source_name:string,reverted_flag:string,table_refresh_delay_min_text:string,release_number_text:string,change_number_text:string"

The columns table_refresh_delay_min, release_number, change_number and patch_number show far more decimal digits than they have in Greenplum. So I saved the DataFrame as a CSV file to see how Spark is reading the data. For example, the maximum release_number in Greenplum is 306.00, but in the CSV saved from yearDF the value becomes 306.000000000000000000.
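
As a quick sanity check without writing a CSV, one of the affected columns can be compared against an explicit double cast. This is just an illustrative, untested snippet; release_number is one of the columns listed above:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Show the decimal(38,30) value side by side with an explicit double cast
// to see how Spark has read release_number from Greenplum.
yearDF
  .select(col("release_number"), col("release_number").cast(DoubleType).as("release_number_dbl"))
  .show(5, truncate = false)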

I tried to take the Hive table schema, convert it to a StructType, and apply it to yearDF as below.

def convertDatatype(datatype: String): DataType = {
  val convert = datatype match {
    case "string"    => StringType
    case "bigint"    => LongType
    case "int"       => IntegerType
    case "double"    => DoubleType
    case "date"      => TimestampType
    case "boolean"   => BooleanType
    case "timestamp" => TimestampType
  }
  convert
}

val schemaList        = hiveCols.split(",")
val schemaStructType  = new StructType(schemaList.map(col => col.split(":")).map(e => StructField(e(0), convertDatatype(e(1)), true)))
val newDF             = spark.createDataFrame(yearDF.rdd, schemaStructType)
newDF.write.format("csv").save("hdfs/location")

But I am getting the error:

Caused by: java.lang.RuntimeException: java.math.BigDecimal is not a valid external type for schema of double
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.evalIfFalseExpr8$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_2$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
    at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:287)
    ... 17 more
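
If I understand the error correctly, spark.createDataFrame(yearDF.rdd, schemaStructType) only attaches the new schema to the existing rows; the Row objects still hold java.math.BigDecimal values, which the double fields in the new schema reject. So I assume that staying on the RDD route would need the values themselves converted first, something like this rough, untested sketch:

import org.apache.spark.sql.Row

// Convert BigDecimal values to Double inside each Row before re-applying
// the double-based schema, so the external types match the declared types.
val convertedRDD = yearDF.rdd.map { row =>
  Row.fromSeq(row.toSeq.map {
    case d: java.math.BigDecimal => d.doubleValue()
    case other                   => other
  })
}
val convertedDF = spark.createDataFrame(convertedRDD, schemaStructType)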

I tried to cast the decimal columns into DoubleType in the below manner, but I still face the same exception.

val pattern = """DecimalType\(\d+,(\d+)\)""".r
val df2 = dataDF.dtypes
  .collect { case (dn, dt) if pattern.findFirstMatchIn(dt).map(_.group(1)).getOrElse("0") != "0" => dn }
  .foldLeft(dataDF)((accDF, c) => accDF.withColumn(c, col(c).cast("Double")))

Caused by: java.lang.RuntimeException: java.math.BigDecimal is not a valid external type for schema of double
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.evalIfFalseExpr8$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_2$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
    at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:287)
    ... 17 more

I am out of ideas after trying the two approaches above. Could anyone let me know how I can properly cast the columns of a DataFrame to the required datatypes?
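
To clarify what I am after: the end result I want is each column of yearDF cast to the type declared in hiveCols, roughly like the sketch below (untested, reusing hiveCols and convertDatatype from above), instead of round-tripping through the RDD:

import org.apache.spark.sql.functions.col

// Cast every column of yearDF to the type declared in hiveCols, letting
// Catalyst handle the decimal(38,30) -> double conversion per column.
val castedDF = hiveCols.split(",")
  .map(_.split(":"))
  .foldLeft(yearDF)((df, e) => df.withColumn(e(0), col(e(0)).cast(convertDatatype(e(1)))))

castedDF.write.format("csv").save("hdfs/location")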

2 REPLIES

New Contributor

Hi Sidhartha,

Were you able to fix the issue? If so, what is the fix?

 

I'm facing the issue below:

error: java.lang.NoSuchMethodError: org.apache.hadoop.hive.common.type.HiveDecimal.enforcePrecisionScale(Ljava/math/BigDecimal;II)Ljava/math/BigDecimal
 
Can anyone help, please?
Thank You,
Amudhan Karunanidhi

Community Manager

Welcome to the community @AtomJunkYarder. As this is an older post, you would have a better chance of receiving a resolution by starting a new thread. This will also give you the opportunity to provide details specific to your environment, which could help others give a more accurate answer to your question.


Cy Jervis, Manager, Community Program
Was your question answered? Make sure to mark the answer as the accepted solution.
If you find a reply useful, say thanks by clicking on the thumbs up button.