It is not clear how the 1st step of creating the hive table via hiveContext is related to the later step of saving the "results" into that table. It seems it is merely being saved to a file in hive's warehouse directory but how will hive know that this file is for the previous table. Looks like you missed a step!
it seems to me, that tutorial has some errors (tagging @Robert Hryniewicz to notify him about errors in tutorial). See this:
1. At first, in section "Creating ORC tables", we create table named "yahoo_orc_table", which is empty (but specified to be stored as ORC)
2. Then we load data, specify the schema in section "Creating a schema" using Scalas case class
3. In section "Registering a temporary table" we register a temp table "yahoo_stocks_temp" - name different from one in point (1) (otherwise SparkSQL would complain). This table holds our data - we have loaded the data and specified schema.
4. Then we query temp table - this is still not stored in HDFS, nor Hive Metastore knows about it.
5. In section "Saving as an ORC file" we store the "results" DataFrame, which is same as "stockprice" DataFrame, as "results" has came into our runtime namespace by selecting everything from temp table "yahoo_stocks_temp", which is only a registration of "stockprice" DataFrame. (Really confusing tutorial for newbies...). And we save thing "results" with command:
by which we create a /user/our_id/yahoo_stocks_orc directory in HDFS. Yes, it is ORC, but Hive does not know about this table at all. We just dumped the data to HDFS.
Then, there a notice:
To store results in a hive directory rather than user directory, use this path instead: /apps/hive/warehouse/yahoo_stocks_orc
Doing command as supposed in notice
We get our ORC data inside /apps/hive/warehouse/yahoo_stocks_orc.
But that is only dump of data. Hive Metastore has no metadata regarding the table. Try shutting down the SparkSQL and use whatever method to execute "show tables" on Hive. Our yahoo_stocks_orc is nowhere to be found.
But Hive knows about yahoo_orc_table we created earlier in (1), but thats an empty table.
6. In section "Reading the ORC file" - we read the data from HDFS, register them as temp table and run queries against it. But again, anybody not using SparkSQL this way, does not know about the table. Hive Metastore does not store information on this table. We have not integrated Hive and SparkSQL, we only stored the table sa ORC into HDFS.
That is the main error of tutorial. To completely integrate Hive and SparkSQL, we should have notified Hive Metastore about table existence in warehouse directory.
I know only about two ways how to do it:
A. store the DataFrame using results.saveAsTable("yahoo_stocks_parquet"). This will properly create subdirectory in /apps/hive/warehouse named yahoo_stocks_parquet. Unfortunately, saveAsTable (deprecated!) uses Parquet by default. I am unable to find how to change this behavior to use ORC format.
B. store the DataFrame into pre-exising Hive warehouse directory. First, one has to create table as in (1), then go thorough the tutorial and finally before writing ORC do this:
hdfs dfs -rm -r /apps/warehouse/yahoo_orc_table
And then store our "results" dataframe into "yahoo_orc_table" (this is the name of table from (1) - pre existing one in Hive Metastore):
This way, Hive will retain metadata (schema) created in (1) and load the data from HDFS located in
Which are files created using our SparkSQL DataFrameWriter (ORC).
As I found C to be cumbersome to specify the schema twice (once in CREATE TABLE statement in Hive and second time using case class in SparkSQL/Scala), I have developed third way (not part of this tutorial):
C. Create your dataframe results, run:
results.limit(1).saveAsTable("yahoo_orc_table") hdfs dfs -rm -r /apps/hive/warehouse/yahoo_orc_table results.write.format("orc").save("/apps/hive/warehouse/yahoo_orc_table")
This will first save only 1 row as a table yahoo_orc_table - notifying Hive Metastore of table existence, then we remove the data from HDFS and store whole table as ORC from SparkSQL. Other Hive users will then see the table existence and may query it.