Created 02-08-2016 11:25 PM
I have about 18 external tables that are stored as Avro, and now I am trying to access them using Spark's HiveContext from a PySpark program. I am getting the same errors that are described here. I have been pumping data into these tables for a while now, and queries against them work fine using Hive on Tez; only Spark access is problematic for some reason. Because these are external tables, I can simply drop them and recreate them using the SERDEPROPERTIES method instead of the TBLPROPERTIES approach described in the link. I tried it on one table and it worked. When testing on one of my external tables, I also tried using ALTER TABLE to add the SERDEPROPERTIES with the Avro schema URL, hoping that would be an easy fix, but I still got the same Spark errors. So it looks like the drop/recreate approach must be done.
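For reference, the ALTER TABLE I tried was along these lines (same table and schema URL as in the DDL listings below):

-- Attempted quick fix: point the SerDe at the Avro schema directly.
-- This did NOT clear the Spark errors for me.
ALTER TABLE cro_priority
  SET SERDEPROPERTIES (
    'avro.schema.url'='hdfs://vmwhaddev01:8020/jupstats/schemas/cro_priority_1.avsc'
  );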
Before the alter table, the table looked like this:
CREATE EXTERNAL TABLE `cro_priority`(
  `gateway_id` bigint COMMENT '',
  `cro_id` string COMMENT '',
  `collection_start_time` bigint COMMENT '',
  `max_queue_depth` bigint COMMENT '')
PARTITIONED BY (
  `year` int,
  `month` int,
  `day` int)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
  'avro.schema.url'='hdfs://vmwhaddev01:8020/jupstats/schemas/cro_priority_1.avsc')
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION
  'hdfs://vmwhaddev01:8020/jupstats/external_tables/cro_priority'
TBLPROPERTIES (
  'transient_lastDdlTime'='1454967431')
After the alter table, it looked like this. Note that the Avro URL is now defined in BOTH the SERDEPROPERTIES and the TBLPROPERTIES. I am not sure if that duplication is why the alter did not fix it.
CREATE EXTERNAL TABLE `cro_priority`(
  `gateway_id` bigint COMMENT '',
  `cro_id` string COMMENT '',
  `collection_start_time` bigint COMMENT '',
  `max_queue_depth` bigint COMMENT '')
PARTITIONED BY (
  `year` int,
  `month` int,
  `day` int)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
  'avro.schema.url'='hdfs://vmwhaddev01:8020/jupstats/schemas/cro_priority_1.avsc')
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION
  'hdfs://vmwhaddev01:8020/jupstats/external_tables/cro_priority'
TBLPROPERTIES (
  'avro.schema.url'='hdfs://vmwhaddev01:8020/jupstats/schemas/cro_priority_1.avsc',
  'transient_lastDdlTime'='1454967431')
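If the duplicated avro.schema.url key is what keeps Spark unhappy, something like the following should strip the TBLPROPERTIES copy while leaving the SERDEPROPERTIES one in place, though I have not verified that this clears the Spark error:

-- Remove only the table-level copy of the schema URL;
-- the SERDEPROPERTIES entry is left untouched.
ALTER TABLE cro_priority
  UNSET TBLPROPERTIES IF EXISTS ('avro.schema.url');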
So, I had to drop and recreate the table with just the SERDEPROPERTIES to make it work in Spark.
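Roughly the sequence that worked for me, which is safe here only because the table is EXTERNAL (so DROP leaves the HDFS files in place), and assuming the partition directories follow the year=/month=/day= naming convention so MSCK can find them:

-- DROP on an EXTERNAL table removes only the metadata, not the data files.
DROP TABLE cro_priority;

-- Re-run the CREATE EXTERNAL TABLE from the first listing above,
-- i.e. with 'avro.schema.url' only in SERDEPROPERTIES.

-- Re-register the existing partition directories with the new table:
MSCK REPAIR TABLE cro_priority;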
My specific question is this: I have other Hive-managed tables that I cannot convert so easily, because they are very large and I cannot drop them without losing the data. Is there a way to alter such a table so that I end up in the same state as I do with a drop and recreate?
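One idea I have not yet dared to try on these large tables: flip the managed table to external first, so that a subsequent DROP keeps the files, then drop and recreate it with the SERDEPROPERTIES-style DDL as above. A sketch (untested by me; my_managed_table is a placeholder, and note that 'TRUE' must be uppercase in some Hive versions):

-- Make Hive treat the table as external so DROP will not delete the data:
ALTER TABLE my_managed_table SET TBLPROPERTIES ('EXTERNAL'='TRUE');
-- Then drop, recreate with SERDEPROPERTIES, and MSCK REPAIR as above.

Would that be a sane way to get there, or is there a more direct ALTER?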
Created 02-10-2016 03:11 AM
@Mark Petronic You have answered your own question :)
Created on 02-10-2016 03:18 AM - edited 08-19-2019 02:22 AM
@Mark Petronic See if this makes sense