Spark 2 beta load or save Hive managed table
Labels: Apache Hive, Apache Spark
Created on 11-15-2016 10:39 AM - edited 09-16-2022 03:47 AM
Hi, when I try to list tables in Hive, it shows nothing.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .appName("Spark2 Hive Example")
  .config("spark.sql.warehouse.dir", "hdfs://quickstart.cloudera/user/hive/warehouse")
  .enableHiveSupport()
  .getOrCreate()
spark.catalog.listTables("default").show()
I'm using CDH-5.9.0-1.cdh5.9.0.p0.23 and SPARK2-2.0.0.cloudera.beta2-1.cdh5.7.0.p0.110234.
Could anyone show me how to load a DataFrame from, and save one to, a Hive managed table?
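Concretely, what I'm after is a round trip like this (a sketch only; the table names are hypothetical, and it assumes the session above was built with enableHiveSupport() and can actually reach the metastore):

```scala
// Read an existing Hive managed table into a DataFrame,
// then write it back under a new name (names are hypothetical).
val df = spark.sql("SELECT * FROM default.src_table")
df.write
  .mode("overwrite")
  .format("parquet")
  .saveAsTable("default.dst_table")
```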
Thanks.
Created 11-15-2016 06:55 PM
I'm so stupid...
There's a Hive Service configuration item in Spark 2.0.0 beta2.
Just check that it is enabled and points to the correct Hive service in CDH.
Created 11-15-2016 02:10 PM
If you are only concerned with static DataFrames (and not streaming), this is pretty straightforward programmatically.
To create the DataFrame from a Hive table with example query:
df = spark.sql("SELECT * FROM table_name1")
To save a DataFrame back to a Hive table:
df.write.saveAsTable('table_name2', format='parquet', mode='overwrite')
Now, you may want to try listing databases instead of tables. Listing tables will only list the tables associated with your current database. The default database is likely empty if you're just starting out.
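To check which database you're pointed at (shown in Scala to match the original question; a sketch assuming a Hive-enabled session, with a hypothetical database name):

```scala
// List all databases visible through the metastore, then switch
// to the one that holds your tables before listing them.
spark.catalog.listDatabases().show()
spark.catalog.setCurrentDatabase("test_db")  // hypothetical name
spark.catalog.listTables().show()
```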
My struggle is in Spark Streaming with version 2.0.0.cloudera.beta1, where the saveAsTable method is not available for a streaming DataFrame. That makes it all a bit trickier than the static DataFrame read/write.
Created 11-15-2016 05:13 PM
Thank you, Brian.
I think the problem is that Spark 2 cannot connect to the Hive metastore.
The default database is not empty.
If I use spark-shell in Spark 1.6.0 (/opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/spark):
scala> sqlContext.sql("show tables").show()
scala> sys.env("HADOOP_CONF_DIR")
res1: String = /usr/lib/spark/conf/yarn-conf:/etc/hive/conf:/etc/hive/conf
It works well and prints all the tables managed by Hive.
However, in Spark 2.0.0 (SPARK2-2.0.0.cloudera.beta1-1.cdh5.7.0.p0.108015):
scala> sys.env("HADOOP_CONF_DIR")
res0: String = /opt/cloudera/parcels/SPARK2-2.0.0.cloudera.beta2-1.cdh5.7.0.p0.110234/lib/spark2/conf/yarn-conf
There's no Hive-related conf dir in $HADOOP_CONF_DIR.
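One way to confirm this from inside the Spark 2 shell (a sketch: Spark picks up hive-site.xml from the classpath, so a null here means the metastore configuration is invisible to the session):

```scala
// Quick check: is hive-site.xml on this shell's classpath?
// Prints a URL if found, or null if the Hive config is missing.
println(getClass.getClassLoader.getResource("hive-site.xml"))
```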
By the way, in Spark 2.0.0:
scala> val df = spark.read.parquet("/user/hive/warehouse/test_db.db/test_table_pqt")
scala> df.show(5)
This works for a table previously managed by Hive under Spark 1.6.0.
Created 11-15-2016 05:28 PM
Might be something wrong with the Hive jar dependencies.
Created 05-11-2017 04:14 PM
Ran into the same problem; resolved it by enabling 'Hive Service' in Spark2.
