Support Questions
Find answers, ask questions, and share your expertise

Hive Transactional Tables are not readable by Spark

Hi,

I am having the below problem when connecting from a Spark Program to Hive tables with Transaction = True.

When a select * is executed on these tables, only the table meta data (columns) are displayed but not the records.

Is this a limitation in Spark? Is there a way to get around this by setting Spark on a ready only mode on these tables?

Any suggestions welcome. Thanks in advance.

Note:

The main reason for enabling Transaction=True for hive tables was, the PutHiveStreaming Processor of Nifi expected the table to be ACID Compliant for it to work. Now we put the data into Hive, but Spark is not able to read it.

1 ACCEPTED SOLUTION

Expert Contributor

Spark doesn't support reading Hive Acid tables directly. (https://issues.apache.org/jira/browse/SPARK-15348/SPARK-16996)

It can be done (WIP) via LLAP - tracked in https://issues.apache.org/jira/browse/HIVE-12991

View solution in original post

6 REPLIES 6

Expert Contributor

Spark doesn't support reading Hive Acid tables directly. (https://issues.apache.org/jira/browse/SPARK-15348/SPARK-16996)

It can be done (WIP) via LLAP - tracked in https://issues.apache.org/jira/browse/HIVE-12991

Thanks for the response Eugene Koifman. Also is there a way Spark can access those tables only in read mode by specifying the mode (READ only) or specify transaction as false on Spark side.

Expert Contributor

Not generally. The data layout for transactional tables requires special logic to decide which directories to read and how to combine them correctly. Some data files may represent updates of previously written rows, for example. Also, if you are reading while something is writing to this table your read may fail (w/o the special logic) because it will try to read incomplete ORC files. Compaction may (again w/o the special logic) may make it look like your data is duplicated.

Makes sense. Thanks @Eugene Koifman

Explorer

@ekoifman we have tried with

val sparLLP="com.hortonworks.spark" % "spark-llap-assembly_2.11" % "1.1.3-2.1"

resolvers +="hortonWorks Group" at "http://repo.hortonworks.com/content/groups/public/"

with sample code ..

val table1="transactional_table"

val sparkConf = new SparkConf()

sparkConf.set("spark.sql.warehouse.dir",<<dir>>)

sparkConf.set("hive.exec.dynamic.partition", "true")

sparkConf.set("hive.exec.dynamic.partition.mode", "nonstrict")

sparkConf.set("hive.enforce.bucketing", "true")

sparkConf.set("spark.sql.hive.llap", "true")

sparkConf.set("spark.sql.hive.hiveserver2.jdbc.url","jdbc:hive2://host1:2181,host2:2181,host3:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2;principal=hive/_HOST@R2.MADM.NET")

val spark = SparkSession.builder.appName("test").enableHiveSupport().config(sparkConf).getOrCreate()

val sqlContext: SQLContext = spark.sqlContext

val df = spark.sql(s"SELECT * FROM $table1")

We are getting following error -

"main" java.sql.SQLException: Could not open client transport for any of the Server URI's in ZooKeeper: Unable to read HiveServer2 uri from ZooKeeper

Whereas without "llap" we are able to run spark sql.

New Contributor

I faced the same issue (Spark for Hive acid tables )and I can able to manage with JDBC call from Spark. Maybe I can use this JDBC call from spark until we get the native ACID support from Spark.


Please refer the below link for code and steps to achieve the same.

https://github.com/Gowthamsb12/Spark/blob/master/Spark_ACID

Stackoverflow Answer - https://stackoverflow.com/questions/50254590/how-to-read-orc-transaction-hive-table-in-spark/5623315...

Thanks

Gowtham

; ;