
pyspark + SparkSql + transactional orc table throws NumberFormatException



New Contributor

Versions:

HDP-2.6.1

Hive 1.2.1000.2.6.1.0-129

Spark-2.1.1

Python 2.7.13

This issue occurs only with transactional Hive tables.

In HDFS, the data file for a transactional Hive table is created under a delta directory, as shown below:

/user/acid_table/load_date=2018-01-14/delta_0018772_0018772_0000/bucket_00000
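The directory name appears to follow the Hive ACID delta naming pattern, delta_&lt;minWriteId&gt;_&lt;maxWriteId&gt;, plus a statement-id suffix appended by newer Hive writers. A minimal sketch of that naming (the field widths are assumptions read off the observed path, not taken from Hive source):

```python
# Hedged sketch of the Hive ACID delta directory naming, based on the
# observed path: delta_<minWriteId>_<maxWriteId> with an optional
# statement-id suffix.
def delta_dir(min_txn, max_txn, stmt_id=None):
    name = "delta_{:07d}_{:07d}".format(min_txn, max_txn)
    if stmt_id is not None:
        name += "_{:04d}".format(stmt_id)
    return name

print(delta_dir(18772, 18772))     # old two-field format: delta_0018772_0018772
print(delta_dir(18772, 18772, 0))  # observed three-field format: delta_0018772_0018772_0000
```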

A NumberFormatException is thrown when the delta directory name is parsed:

Caused by: java.util.concurrent.ExecutionException: java.lang.NumberFormatException: For input string: "0018773_0000"
	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
.....
INFO PerfLogger: <PERFLOG method=OrcGetSplits from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl>
Traceback (most recent call last):
  File "/home/../ex.py", line 24, in <module>
    sc1.sql("select * from default.acid_table").toPandas()
  File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 1585, in toPandas
  File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 391, in collect
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o71.collectToPython.
: java.lang.RuntimeException: serious problem
	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021)
	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
	at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)

Code:

from pyspark.sql import SparkSession

# Create a SparkSession with Hive support and query the transactional table
hiveContext = SparkSession.builder.enableHiveSupport().getOrCreate()
hiveContext.sql("select * from default.acid_table").toPandas()

Everything works fine when the '_0000' suffix is removed from the delta directory name.

Please suggest.

1 ACCEPTED SOLUTION

Re: pyspark + SparkSql + transactional orc table throws NumberFormatException

New Contributor

According to https://issues.apache.org/jira/browse/SPARK-15348, Spark does not currently support transactional (ACID) Hive tables.

3 REPLIES


Re: pyspark + SparkSql + transactional orc table throws NumberFormatException

Super Guru

You will have to wait for the next release of HDP for Spark to support Hive ACID tables.

Re: pyspark + SparkSql + transactional orc table throws NumberFormatException

New Contributor

Hi @Timothy Spann

So is this feature now supported in HDP 3.0?
