Hive and Avro schema defined in TBLPROPERTIES vs. SERDEPROPERTIES

I have about 18 external tables that are stored as Avro, and now I am trying to access them using Spark's HiveContext from a PySpark program. I am getting the same errors that are described here. I've been pumping data into these tables for a while now, and queries against them work fine using Hive on Tez - only Spark access is problematic for some reason. Because these are external tables, I can simply drop them and recreate them using the SERDEPROPERTIES method instead of the TBLPROPERTIES approach described in the link. I tried it on one table and it worked. When testing on one of my external tables, I tried to use ALTER TABLE to add the SERDEPROPERTIES with the Avro schema URL, hoping that would be an easy fix, but I still got the same Spark errors. So it looks like the drop/recreate approach must be done.
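
To make the difference concrete, the two approaches differ only in where avro.schema.url is declared. A minimal sketch of each (the table names, column, and paths below are placeholders, not my real ones):

-- TBLPROPERTIES style (how my tables were originally built): the schema URL is a table property.
CREATE EXTERNAL TABLE example_tblprops_style (example_col bigint)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs:///some/external/location'
TBLPROPERTIES ('avro.schema.url'='hdfs:///some/schemas/example.avsc');

-- SERDEPROPERTIES style (the one Spark's HiveContext accepts for me): the schema URL is handed to the SerDe.
CREATE EXTERNAL TABLE example_serdeprops_style (example_col bigint)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.url'='hdfs:///some/schemas/example.avsc')
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs:///some/external/location';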

Before the alter table, the table looked like this:

+---------------------------------------------------------------------------------------+--+
|                                    createtab_stmt                                     |
+---------------------------------------------------------------------------------------+--+
| CREATE EXTERNAL TABLE `cro_priority`(                                                 |
|   `gateway_id` bigint COMMENT '',                                                     |
|   `cro_id` string COMMENT '',                                                         |
|   `collection_start_time` bigint COMMENT '',                                          |
|   `max_queue_depth` bigint COMMENT '')                                                |
| PARTITIONED BY (                                                                      |
|   `year` int,                                                                         |
|   `month` int,                                                                        |
|   `day` int)                                                                          |
| ROW FORMAT SERDE                                                                      |
|   'org.apache.hadoop.hive.serde2.avro.AvroSerDe'                                      |
| WITH SERDEPROPERTIES (                                                                |
|   'avro.schema.url'='hdfs://vmwhaddev01:8020/jupstats/schemas/cro_priority_1.avsc')   |
| STORED AS INPUTFORMAT                                                                 |
|   'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'                        |
| OUTPUTFORMAT                                                                          |
|   'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'                       |
| LOCATION                                                                              |
|   'hdfs://vmwhaddev01:8020/jupstats/external_tables/cro_priority'                     |
| TBLPROPERTIES (                                                                       |
|   'transient_lastDdlTime'='1454967431')                                               |
+---------------------------------------------------------------------------------------+--+
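
The kind of ALTER TABLE statement I mean is something like this (not necessarily the exact command I ran, just a sketch using the table name and schema URL from the output above):

ALTER TABLE cro_priority
  SET SERDEPROPERTIES ('avro.schema.url'='hdfs://vmwhaddev01:8020/jupstats/schemas/cro_priority_1.avsc');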

After the alter table, it looked like this. Note that the Avro schema URL is now defined in BOTH the SERDEPROPERTIES and the TBLPROPERTIES. I'm not sure if that is why it did not fix it.

+---------------------------------------------------------------------------------------+--+
|                                    createtab_stmt                                     |
+---------------------------------------------------------------------------------------+--+
| CREATE EXTERNAL TABLE `cro_priority`(                                                 |
|   `gateway_id` bigint COMMENT '',                                                     |
|   `cro_id` string COMMENT '',                                                         |
|   `collection_start_time` bigint COMMENT '',                                          |
|   `max_queue_depth` bigint COMMENT '')                                                |
| PARTITIONED BY (                                                                      |
|   `year` int,                                                                         |
|   `month` int,                                                                        |
|   `day` int)                                                                          |
| ROW FORMAT SERDE                                                                      |
|   'org.apache.hadoop.hive.serde2.avro.AvroSerDe'                                      |
| WITH SERDEPROPERTIES (                                                                |
|   'avro.schema.url'='hdfs://vmwhaddev01:8020/jupstats/schemas/cro_priority_1.avsc')   |
| STORED AS INPUTFORMAT                                                                 |
|   'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'                        |
| OUTPUTFORMAT                                                                          |
|   'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'                       |
| LOCATION                                                                              |
|   'hdfs://vmwhaddev01:8020/jupstats/external_tables/cro_priority'                     |
| TBLPROPERTIES (                                                                       |
|   'avro.schema.url'='hdfs://vmwhaddev01:8020/jupstats/schemas/cro_priority_1.avsc',   |
|   'transient_lastDdlTime'='1454967431')                                               |
+---------------------------------------------------------------------------------------+--+

So, I had to drop and recreate the table with just the SERDEPROPERTIES to make it work in Spark.
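
For completeness, the drop/recreate sequence looks roughly like this (a sketch based on the table shown above; MSCK REPAIR TABLE is one way to re-register the existing partitions, assuming the directories follow the year=/month=/day= layout, otherwise ALTER TABLE ... ADD PARTITION per partition works too):

-- Dropping an EXTERNAL table only removes the metastore entry; the Avro files stay in HDFS.
DROP TABLE cro_priority;

-- Recreate with avro.schema.url in SERDEPROPERTIES only (no avro.schema.url in TBLPROPERTIES).
CREATE EXTERNAL TABLE cro_priority (
  gateway_id bigint,
  cro_id string,
  collection_start_time bigint,
  max_queue_depth bigint)
PARTITIONED BY (`year` int, `month` int, `day` int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.url'='hdfs://vmwhaddev01:8020/jupstats/schemas/cro_priority_1.avsc')
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs://vmwhaddev01:8020/jupstats/external_tables/cro_priority';

-- Put the existing partition directories back in the metastore.
MSCK REPAIR TABLE cro_priority;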

My specific question is this... I have other Hive-managed tables that I cannot so easily convert, because I cannot drop them without losing data, and they are very large tables. Is there a way to alter such a table so that I end up in the same state as I do with a drop and recreate?

2 REPLIES

Re: Hive and Avro schema defined in TBLPROPERTIES vs. SERDEPROPERTIES

@Mark Petronic You have answered your own question :)

Re: Hive and Avro schema defined in TBLPROPERTIES vs. SERDEPROPERTIES

@Mark Petronic See if this makes sense

(attached screenshot: 1968-screen-shot-2016-02-09-at-101724-pm.png)