Created 12-29-2017 09:48 PM
Hi,
I am running into the problem below when connecting from a Spark program to Hive tables created with transactional = true.
When a SELECT * is executed on these tables, only the table metadata (columns) is returned, but no records.
Is this a limitation in Spark? Is there a way to work around it by putting Spark into a read-only mode on these tables?
Any suggestions welcome. Thanks in advance.
Note:
The main reason for enabling transactional = true on the Hive tables was that the PutHiveStreaming processor in NiFi requires the target table to be ACID compliant in order to work. Now we can put the data into Hive, but Spark is not able to read it.
Created 01-03-2018 12:31 AM
Spark doesn't support reading Hive ACID tables directly (see https://issues.apache.org/jira/browse/SPARK-15348 and https://issues.apache.org/jira/browse/SPARK-16996).
It can be done (work in progress) via LLAP - tracked in https://issues.apache.org/jira/browse/HIVE-12991
Created 01-04-2018 05:06 PM
Thanks for the response, Eugene Koifman. Also, is there a way Spark can access those tables in read-only mode, either by specifying the mode (READ only) or by setting transactional to false on the Spark side?
Created 01-04-2018 05:57 PM
Not generally. The data layout for transactional tables requires special logic to decide which directories to read and how to combine them correctly. Some data files may represent updates of previously written rows, for example. Also, if you read while something is writing to this table, your read may fail (without the special logic) because it will try to read incomplete ORC files. Compaction may (again, without the special logic) make it look like your data is duplicated.
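For context, a simplified sketch of why this needs special logic (the directory and file names below are typical examples, not taken from this thread): on disk, a transactional table is a set of base and delta directories, and a correct reader has to pick a consistent snapshot of them and merge the row-level updates:

    warehouse/transactional_table/
      base_0000005/              <- compacted rows up to write ID 5
        bucket_00000             <- ORC file
      delta_0000006_0000006/     <- rows written by transaction 6
        bucket_00000
      delta_0000007_0000007/     <- rows written by transaction 7 (may update rows above)
        bucket_00000

A plain directory scan, which is what a non-ACID-aware reader does, either finds nothing it can interpret or sees the same rows in both base and delta files after compaction, which matches the behaviour described in the question.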
Created 01-04-2018 07:02 PM
Makes sense. Thanks @Eugene Koifman
Created 03-20-2018 09:17 PM
@ekoifman we have tried with

val sparLLP = "com.hortonworks.spark" % "spark-llap-assembly_2.11" % "1.1.3-2.1"
resolvers += "Hortonworks Group" at "http://repo.hortonworks.com/content/groups/public/"

with this sample code:

import org.apache.spark.SparkConf
import org.apache.spark.sql.{SQLContext, SparkSession}

val table1 = "transactional_table"
val sparkConf = new SparkConf()
sparkConf.set("spark.sql.warehouse.dir", <<dir>>)  // placeholder for the warehouse directory
sparkConf.set("hive.exec.dynamic.partition", "true")
sparkConf.set("hive.exec.dynamic.partition.mode", "nonstrict")
sparkConf.set("hive.enforce.bucketing", "true")
sparkConf.set("spark.sql.hive.llap", "true")
sparkConf.set("spark.sql.hive.hiveserver2.jdbc.url", "jdbc:hive2://host1:2181,host2:2181,host3:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2;principal=hive/_HOST@R2.MADM.NET")
val spark = SparkSession.builder.appName("test").enableHiveSupport().config(sparkConf).getOrCreate()
val sqlContext: SQLContext = spark.sqlContext
val df = spark.sql(s"SELECT * FROM $table1")
We are getting the following error:

Exception in thread "main" java.sql.SQLException: Could not open client transport for any of the Server URI's in ZooKeeper: Unable to read HiveServer2 uri from ZooKeeper

Without the "llap" settings, we are able to run Spark SQL.
Created 05-21-2019 01:45 PM
I faced the same issue (Spark with Hive ACID tables) and was able to manage it with a JDBC call from Spark. This JDBC approach can be used from Spark until we get native ACID support in Spark.
Please refer to the link below for the code and steps to achieve this.
https://github.com/Gowthamsb12/Spark/blob/master/Spark_ACID
Stackoverflow Answer - https://stackoverflow.com/questions/50254590/how-to-read-orc-transaction-hive-table-in-spark/5623315...
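To make the workaround concrete, here is a minimal hedged sketch of the idea (not the exact code from the linked repo): Spark reads the table through the Hive JDBC driver, so HiveServer2, which understands the ACID layout, does the actual reading. The host, port, database, table name, and credentials are placeholders, and org.apache.hive.jdbc.HiveDriver must be on the Spark classpath:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder
      .appName("read-acid-over-jdbc")
      .getOrCreate()

    // HiveServer2 resolves the base/delta directories; Spark only sees the result set.
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:hive2://host1:10000/default")  // placeholder HS2 URL
      .option("driver", "org.apache.hive.jdbc.HiveDriver")
      .option("dbtable", "transactional_table")           // placeholder table name
      .option("user", "hive")                             // placeholder credentials
      .option("password", "")
      .load()

    df.show()

One caveat to watch for: the Hive JDBC driver may return column names prefixed with the table name (e.g. transactional_table.col1), so the columns may need to be renamed after loading.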
Thanks
Gowtham