Created 11-19-2015 03:03 AM
I'm currently using Spark 1.4 and I'm loading some data into a DataFrame using jdbc:
val jdbcDF = sqlContext.load("jdbc", options)
How can I save the jdbcDF DataFrame to a Hive table using the ORC file format?
Created 11-19-2015 03:07 AM
df.write.format("orc") will get you there.
See: http://hortonworks.com/blog/bringing-orc-support-into-apache-spark/ or http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_spark-guide/content/ch_orc-spark.html
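Putting the two snippets together, a minimal sketch (the JDBC URL, driver, table name, and output path below are placeholders, and an existing sqlContext is assumed):

```scala
// Spark 1.4-era API: load a table over JDBC, then write it back out as ORC files.
val options = Map(
  "url"     -> "jdbc:postgresql://dbhost:5432/mydb", // placeholder connection string
  "dbtable" -> "source_table",                       // placeholder source table
  "driver"  -> "org.postgresql.Driver"
)
val jdbcDF = sqlContext.load("jdbc", options)

// Write the DataFrame in ORC format; the path is where the ORC files land.
jdbcDF.write.format("orc").save("/tmp/jdbc_orc_output")
```

Note this writes an ORC directory on HDFS; it does not by itself register a table in the Hive metastore.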
Created 11-19-2015 03:15 AM
Thanks for the helpful links! Should I create the Hive table ahead of time, or could I do everything within Spark?
Created 11-19-2015 03:52 AM
@Kit Menke You can create it within Spark. Please see this https://hortonworks.com/hadoop-tutorial/apache-spark-1-5-1-technical-preview-with-hdp-2-3/
Created 12-12-2015 01:33 AM
If you want to access your table from Hive, you have two options:
1- create the table in Hive ahead of time and use df.write.format("orc") to write into its location.
2- use Brandon's suggestion in this thread: register df as a temp table and do CREATE TABLE ... AS SELECT FROM the temp table.
If you use the saveAsTable function, it will create a table in the Hive metastore, but Hive won't be able to query it. Only Spark can use a table written this way.
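A sketch of option 1, assuming a HiveContext hc and an existing DataFrame df; the table, column, and temp-table names are hypothetical:

```scala
// Option 1 sketch: pre-create an ORC-backed Hive table, then insert into it.
hc.sql("CREATE TABLE IF NOT EXISTS people (name STRING, age INT) STORED AS ORC")

// Register the DataFrame as a temp table so it can be referenced in SQL,
// then insert its rows into the pre-created Hive table.
df.registerTempTable("people_staging")
hc.sql("INSERT INTO TABLE people SELECT name, age FROM people_staging")
```

Because the table was created through the HiveContext, it lives in the Hive metastore and is queryable from Hive as well as Spark.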
Created 11-19-2015 03:24 AM
You can just write out the DF as ORC and the underlying directory will be created. LMK, if this doesn't work.
Created 11-19-2015 03:41 PM
Yep, the ORC directory is created but a Hive table is not.
Created 12-17-2015 01:56 AM
I am also facing the same issue. I saved the data in ORC format from a DataFrame and created an external Hive table over it. When I do SHOW TABLES in the HiveContext in Spark, it shows me the table, but I couldn't see any table in my Hive warehouse when I query the Hive external table from Hive. When I just create the Hive table using the HiveContext (no DataFrame, no data processing), the table gets created and I am able to query it too. Unable to understand this strange behaviour. Am I missing something?
for ex : hiveContext.sql("CREATE TABLE IF NOT EXISTS TestTable (name STRING, age STRING)")
shows me the table in hive also.
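For comparison, an external table over an ORC directory written by Spark would be declared along these lines (the table name and location path are placeholders):

```scala
// Hypothetical sketch: external ORC table pointing at a directory Spark wrote.
hiveContext.sql(
  """CREATE EXTERNAL TABLE IF NOT EXISTS TestTableExt (name STRING, age STRING)
    |STORED AS ORC
    |LOCATION '/tmp/jdbc_orc_output'""".stripMargin)
```

An external table only registers metadata; the data stays at the given LOCATION rather than being moved into the Hive warehouse directory, which may explain why nothing appears under the warehouse path.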
Created 11-19-2015 03:41 AM
The way I have done this is to first register a temp table in Spark and then leverage the sql method of the HiveContext to create a new table in Hive using the data from the temp table. For example, if I have a DataFrame df and a HiveContext hc, the general process is:
df.registerTempTable("my_temp_table")
hc.sql("CREATE TABLE new_table_name STORED AS ORC AS SELECT * FROM my_temp_table")
Created 11-19-2015 03:39 PM
Very interesting! I will try this out!