Support Questions


How do I create an ORC Hive table from Spark?

Expert Contributor

I'm currently using Spark 1.4 and I'm loading some data into a DataFrame using jdbc:

val jdbcDF = sqlContext.load("jdbc", options)

How can I save the jdbcDF DataFrame to a Hive table using the ORC file format?

1 ACCEPTED SOLUTION
12 REPLIES

Expert Contributor

Thanks for the helpful links! Should I create the Hive table ahead of time or could I do everything within spark?

Master Mentor

@Kit Menke

If you want to access your table from Hive, you have two options:

1. Create the table ahead of time and write into it with df.write.format("orc").

2. Use Brandon's suggestion here: register the DataFrame as a temp table and do a CREATE TABLE AS SELECT from the temp table.
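A minimal sketch of both options, assuming a HiveContext named sqlContext (in Spark 1.4, ORC support requires a HiveContext) and hypothetical table and path names:

```scala
// Sketch only; "jdbcDF" is the DataFrame from the question, and the
// table name and warehouse path below are hypothetical.

// Option 1: the Hive table was created ahead of time (STORED AS ORC);
// write ORC files directly into its warehouse location.
jdbcDF.write.format("orc").mode("append")
  .save("/apps/hive/warehouse/mydb.db/orc_table")

// Option 2: register a temp table and create the Hive table via CTAS.
jdbcDF.registerTempTable("jdbc_tmp")
sqlContext.sql("CREATE TABLE orc_table STORED AS ORC AS SELECT * FROM jdbc_tmp")
```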

See code examples here:

https://community.hortonworks.com/questions/6023/orgapachesparksparkexception-task-failed-while-wri....

If you use the saveAsTable function, it will create a table in the Hive metastore, but Hive won't be able to query it. Only Spark can use a table created this way.


You can just write out the DF as ORC and the underlying directory will be created. LMK, if this doesn't work.
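For reference, a sketch of what writing the DataFrame out as ORC looks like, with a hypothetical output path (a HiveContext is needed for ORC in Spark 1.4):

```scala
// Writes ORC files under the given directory, creating it if needed.
// Note: this produces files only; it does not register a Hive table.
jdbcDF.write.format("orc").save("/tmp/jdbc_orc_output")
```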

Expert Contributor

Yep, the ORC directory is created but a Hive table is not.

Expert Contributor
@vshukla

I am also facing the same issue. I saved the data in ORC format from the DataFrame and created an external Hive table. When I do SHOW TABLES in the HiveContext in Spark, it shows me the table, but I can't see any table in my Hive warehouse, so I can't query the Hive external table. When I just create the Hive table (no DataFrame, no data processing) using the HiveContext, the table gets created and I'm able to query it too. I can't understand this strange behaviour. Am I missing something?

for ex : hiveContext.sql("CREATE TABLE IF NOT EXISTS TestTable (name STRING, age STRING)")

shows me the table in hive also.
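One way to tie the two steps together is to point an external Hive table at the directory the DataFrame was saved to. This is a sketch with hypothetical path and table names, assuming hiveContext is a HiveContext:

```scala
// Assumes the ORC files were already written out with
// df.write.format("orc").save("/tmp/jdbc_orc_output").
hiveContext.sql(
  """CREATE EXTERNAL TABLE IF NOT EXISTS ext_test_table (name STRING, age STRING)
    |STORED AS ORC
    |LOCATION '/tmp/jdbc_orc_output'""".stripMargin)
```

Because the table is EXTERNAL, dropping it later leaves the ORC files in place.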


The way I have done this is to first register a temp table in Spark and then leverage the sql method of the HiveContext to create a new table in Hive using the data from the temp table. For example, if I have a DataFrame df and a HiveContext hc, the general process is:

df.registerTempTable("my_temp_table")
hc.sql("CREATE TABLE new_table_name STORED AS ORC AS SELECT * FROM my_temp_table")

Expert Contributor

Very interesting! I will try this out!