pyspark + SparkSql + transactional orc table throws NumberFormatException

New Contributor

Versions:

HDP-2.6.1

Hive 1.2.1000.2.6.1.0-129

Spark-2.1.1

Python 2.7.13

This issue occurs only on a transactional Hive table.

In HDFS, the data files of a transactional Hive table are written under a delta directory, as shown below:

/user/acid_table/load_date=2018-01-14/delta_0018772_0018772_0000/bucket_00000

A NumberFormatException is thrown when the delta directory is read:

Caused by: java.util.concurrent.ExecutionException: java.lang.NumberFormatException: For input string: "0018773_0000"
	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
.....
INFO PerfLogger: <PERFLOG method=OrcGetSplits from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl>
Traceback (most recent call last):
  File "/home/../ex.py", line 24, in <module>
    sc1.sql("select * from default.acid_table").toPandas()
  File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 1585, in toPandas
  File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 391, in collect
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o71.collectToPython.
: java.lang.RuntimeException: serious problem
	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021)
	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
	at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)

Code:

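from pyspark.sql import SparkSession
# Build a Hive-enabled session, then pull the whole table into a pandas DataFrame.
hiveContext = SparkSession.builder.enableHiveSupport().getOrCreate()
hiveContext.sql("select * from default.acid_table").toPandas()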

Everything works fine when the '0000' suffix is removed from the delta directory name.

Please suggest.

1 ACCEPTED SOLUTION

Contributor

According to https://issues.apache.org/jira/browse/SPARK-15348, Spark does not currently support transactional Hive tables.
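To illustrate why the extra suffix trips the reader, here is a minimal sketch in plain Python (it does not call Spark or Hive). It assumes the delta directory name follows delta_<minTxn>_<maxTxn> with an optional trailing statement ID, which is inferred from the path in the question, and that the ORC reader Spark uses only expects the first two fields. Both function names are hypothetical stand-ins, not the actual Hive code.

```python
import re

# Assumed layout, inferred from the path in the question:
#   delta_<minTxn>_<maxTxn>[_<statementId>]
DELTA_PREFIX = "delta_"
DELTA_RE = re.compile(r"^delta_(\d+)_(\d+)(?:_(\d+))?$")


def parse_delta_two_fields(name):
    """Hypothetical stand-in for a reader that expects delta_<min>_<max> only."""
    rest = name[len(DELTA_PREFIX):]
    split = rest.index("_")  # split at the FIRST underscore only
    min_txn, max_txn = rest[:split], rest[split + 1:]
    if not max_txn.isdigit():
        # Mirrors the wording of Java's NumberFormatException message.
        raise ValueError('For input string: "%s"' % max_txn)
    return int(min_txn), int(max_txn)


def parse_delta_with_statement_id(name):
    """Parser that also accepts the optional trailing statement ID."""
    match = DELTA_RE.match(name)
    if match is None:
        raise ValueError("not a delta directory: %r" % name)
    min_txn, max_txn, stmt_id = match.groups()
    return int(min_txn), int(max_txn), None if stmt_id is None else int(stmt_id)
```

With this sketch, parse_delta_two_fields("delta_0018773_0018773_0000") fails on the leftover "0018773_0000", the same string reported in the stack trace, while the same name without the suffix parses cleanly. That is consistent with the observation that removing the suffix makes the query work, but renaming directories of an ACID table by hand is not a safe fix.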

3 REPLIES

Master Guru

You will have to wait for the next release of HDP for Spark to support Hive ACID tables.

Explorer

Hi @Timothy Spann

So is this feature now supported in HDP 3.0?