
pyspark + SparkSql + transactional orc table throws NumberFormatException

New Contributor

Versions:

HDP-2.6.1

Hive 1.2.1000.2.6.1.0-129

Spark-2.1.1

Python 2.7.13

This issue occurs only with transactional (ACID) Hive tables.

In HDFS, the data files of a transactional Hive table are written under delta directories, as shown below:

/user/acid_table/load_date=2018-01-14/delta_0018772_0018772_0000/bucket_00000

A NumberFormatException is thrown while parsing the delta directory name:

Caused by: java.util.concurrent.ExecutionException: java.lang.NumberFormatException: For input string: "0018773_0000"
	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
.....
INFO PerfLogger: <PERFLOG method=OrcGetSplits from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl>
Traceback (most recent call last):
  File "/home/../ex.py", line 24, in <module>
    sc1.sql("select * from default.acid_table").toPandas()
  File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 1585, in toPandas
  File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 391, in collect
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o71.collectToPython.
: java.lang.RuntimeException: serious problem
	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021)
	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
	at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)

Code:

from pyspark.sql import SparkSession

hiveContext = SparkSession.builder.enableHiveSupport().getOrCreate()
hiveContext.sql("select * from default.acid_table").toPandas()

Everything works fine when the '0000' suffix is removed from the delta directory name.
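The '0000' suffix appears to be the statement id that newer Hive versions append to delta directory names (delta_&lt;minTxn&gt;_&lt;maxTxn&gt;_&lt;stmtId&gt;), while a Hive 1.2-era reader expects only delta_&lt;minTxn&gt;_&lt;maxTxn&gt;. A minimal sketch of the mismatch, assuming that naming convention (this is an illustration, not Hive's actual parser):

```python
old_style = "delta_0018772_0018772"        # Hive 1.2 layout: delta_<min>_<max>
new_style = "delta_0018772_0018772_0000"   # newer layout: ..._<stmtId>

def parse_old_style(dirname):
    """Parse a delta dir name the way an old-style reader would:
    everything after 'delta_<minTxn>_' is taken as maxTxn."""
    prefix, min_txn, max_txn = dirname.split("_", 2)
    return int(min_txn), int(max_txn)

print(parse_old_style(old_style))  # works: (18772, 18772)

# On the new-style name, the maxTxn field becomes the string
# "0018772_0000"; Java's Long.parseLong rejects it, which matches the
# observed NumberFormatException. (Python's int() happens to tolerate
# the underscore, so here we just show the extra field.)
assert len(new_style.split("_")) == 4  # one field more than expected
```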

Please suggest.

1 ACCEPTED SOLUTION

Contributor

According to https://issues.apache.org/jira/browse/SPARK-15348, Spark does not currently support transactional (ACID) Hive tables.
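One commonly suggested workaround, sketched here under the assumption that you can run HiveQL directly (e.g. via beeline) and with a hypothetical target table name, is to materialize a plain non-transactional ORC copy of the table from Hive itself and point Spark at the copy:

```python
def ctas_copy_sql(src_table, dst_table):
    """Build a HiveQL CTAS statement that writes a plain (non-ACID)
    ORC copy of a transactional table. Run it in Hive (e.g. beeline),
    not Spark, since Spark cannot read the ACID source directly."""
    return ("CREATE TABLE {dst} STORED AS ORC AS SELECT * FROM {src}"
            .format(dst=dst_table, src=src_table))

print(ctas_copy_sql("default.acid_table", "default.acid_table_copy"))
# After running the statement in Hive, the copy is readable from PySpark:
#   spark.sql("select * from default.acid_table_copy").toPandas()
```

The copy is a snapshot, so it must be refreshed whenever the source table changes.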


3 REPLIES 3

Contributor

According to https://issues.apache.org/jira/browse/SPARK-15348, Spark does not currently support transactional (ACID) Hive tables.

Master Guru

You will have to wait for the next release of HDP for Spark to support Hive ACID tables.

Explorer

Hi @Timothy Spann

So is this feature now supported in HDP 3.0?
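For reference, on HDP 3.x reading Hive ACID tables from Spark goes through the Hive Warehouse Connector (HWC) rather than Spark's built-in Hive reader. A sketch, assuming the HWC jar and the pyspark_llap zip are configured for the Spark job:

```python
def read_acid_table_hdp3(table="default.acid_table"):
    """Sketch: read a Hive ACID table on HDP 3.x via the Hive
    Warehouse Connector. Imports are deferred so this module loads
    even where pyspark_llap is not installed."""
    from pyspark.sql import SparkSession
    from pyspark_llap import HiveWarehouseSession

    spark = SparkSession.builder.getOrCreate()
    hive = HiveWarehouseSession.session(spark).build()
    # executeQuery runs through HiveServer2/LLAP, which understands
    # ACID delta directories, unlike the Hive 1.2 reader bundled
    # with earlier Spark builds.
    return hive.executeQuery("SELECT * FROM " + table)
```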