Hive Transactional Tables are not readable by Spark
Labels: Apache Hive, Apache Spark
Created 12-29-2017 09:48 PM
Hi,
I am having the problem below when connecting from a Spark program to Hive tables with transactional = true.
When a SELECT * is executed on these tables, only the table metadata (columns) is displayed, but not the records.
Is this a limitation in Spark? Is there a way to get around this by setting Spark to a read-only mode on these tables?
Any suggestions welcome. Thanks in advance.
Note:
The main reason for enabling transactional = true on the Hive tables was that the PutHiveStreaming processor of NiFi expects the table to be ACID compliant in order to work. Now we put the data into Hive, but Spark is not able to read it.
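A minimal sketch of the symptom as described (table and column names here are illustrative, and spark is assumed to be an active SparkSession with Hive support):
// the table was created in Hive (not Spark), roughly as:
//   CREATE TABLE events (id INT, payload STRING)
//   CLUSTERED BY (id) INTO 4 BUCKETS
//   STORED AS ORC
//   TBLPROPERTIES ('transactional'='true');
val df = spark.sql("SELECT * FROM events")
df.printSchema()  // the columns show up fine
df.show()         // but no rows are returned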
Created 01-03-2018 12:31 AM
Spark doesn't support reading Hive ACID tables directly (see https://issues.apache.org/jira/browse/SPARK-15348 and https://issues.apache.org/jira/browse/SPARK-16996).
It can be done (work in progress) via LLAP - tracked in https://issues.apache.org/jira/browse/HIVE-12991
Created 01-04-2018 05:06 PM
Thanks for the response, Eugene Koifman. Also, is there a way for Spark to access those tables in read-only mode, by specifying the mode (read only) or by setting transactional to false on the Spark side?
Created 01-04-2018 05:57 PM
Not generally. The data layout for transactional tables requires special logic to decide which directories to read and how to combine them correctly. Some data files may represent updates of previously written rows, for example. Also, if you are reading while something is writing to this table, your read may fail (without the special logic) because it will try to read incomplete ORC files. Compaction may (again, without the special logic) make it look like your data is duplicated.
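To illustrate (directory names below are representative of the pre-Hive-3 ACID layout, not taken from this cluster): a transactional table's data is a compacted base plus delta directories, and only a reader with the merge logic knows which events supersede which.
transactional_table/
  base_0000005/bucket_00000            <- rows through write id 5, produced by compaction
  delta_0000006_0000006/bucket_00000   <- later insert/update events
  delta_0000007_0000007/bucket_00000
After a compaction produces a new base, the old base and delta directories remain until the cleaner removes them, so a naive reader that simply lists all files would count those rows twice.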
Created 01-04-2018 07:02 PM
Makes sense. Thanks @Eugene Koifman
Created 03-20-2018 09:17 PM
@ekoifman we have tried with:
// build.sbt
val sparkLlap = "com.hortonworks.spark" % "spark-llap-assembly_2.11" % "1.1.3-2.1"
libraryDependencies += sparkLlap
resolvers += "Hortonworks Group" at "http://repo.hortonworks.com/content/groups/public/"
with this sample code:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val table1 = "transactional_table"
val sparkConf = new SparkConf()
sparkConf.set("spark.sql.warehouse.dir", <<dir>>)  // <<dir>> is our warehouse path
sparkConf.set("hive.exec.dynamic.partition", "true")
sparkConf.set("hive.exec.dynamic.partition.mode", "nonstrict")
sparkConf.set("hive.enforce.bucketing", "true")
// route Hive reads through LLAP instead of Spark's native reader
sparkConf.set("spark.sql.hive.llap", "true")
sparkConf.set("spark.sql.hive.hiveserver2.jdbc.url",
  "jdbc:hive2://host1:2181,host2:2181,host3:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2;principal=hive/_HOST@R2.MADM.NET")
val spark = SparkSession.builder.appName("test").enableHiveSupport().config(sparkConf).getOrCreate()
val df = spark.sql(s"SELECT * FROM $table1")
We are getting the following error:
Exception in thread "main" java.sql.SQLException: Could not open client transport for any of the Server URI's in ZooKeeper: Unable to read HiveServer2 uri from ZooKeeper
Without the "llap" settings we are able to run Spark SQL fine.
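One thing worth checking (an assumption about the setup, not something visible from the snippet): spark-llap talks to the LLAP-enabled HiveServer2 Interactive instance, which on HDP typically registers in ZooKeeper under its own namespace (often hiveserver2-interactive) rather than the plain hiveserver2 namespace. If the JDBC URL points at the regular HiveServer2 entry, the connector may find no URI to read, which matches this error. A possible variant to try:
sparkConf.set("spark.sql.hive.hiveserver2.jdbc.url",
  // "hiveserver2-interactive" is an assumption - verify the actual namespace in Ambari/ZooKeeper
  "jdbc:hive2://host1:2181,host2:2181,host3:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-interactive;principal=hive/_HOST@R2.MADM.NET")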
Created 05-21-2019 01:45 PM
I faced the same issue (Spark reading Hive ACID tables) and was able to manage with a JDBC call from Spark. This JDBC call from Spark can serve as a workaround until we get native ACID support in Spark.
Please refer to the links below for code and steps to achieve the same.
https://github.com/Gowthamsb12/Spark/blob/master/Spark_ACID
Stackoverflow Answer - https://stackoverflow.com/questions/50254590/how-to-read-orc-transaction-hive-table-in-spark/5623315...
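For reference, a minimal sketch of the JDBC approach (host, port, and table name are placeholders): HiveServer2 executes the query with the full ACID read logic, so Spark only receives the finished result set.
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:hive2://hs2-host:10000/default")  // placeholder HiveServer2 endpoint
  .option("driver", "org.apache.hive.jdbc.HiveDriver")
  .option("dbtable", "transactional_table")
  .load()
// note: the Hive JDBC driver can return column names prefixed with the
// table name (e.g. transactional_table.id), so aliasing columns in a
// query passed via the dbtable option can be useful.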
Thanks
Gowtham
