Created 05-21-2016 12:46 PM
I have Hive external tables using Avro serdes that look like this:
CREATE EXTERNAL TABLE `rm`(
  `gw_id` bigint COMMENT '',
  `rm_id` string COMMENT '',
  `start_time` bigint COMMENT '',
  `end_time` bigint COMMENT '',
  `switch_failure_count` bigint COMMENT '')
PARTITIONED BY (
  `year` int,
  `month` int,
  `day` int)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
  'avro.schema.url'='hdfs://mpws:8020/jup1_stats/schema/rm_1.avsc')
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION
  'hdfs://mpws:8020/jup1_stats/external_tables/rm'
TBLPROPERTIES (
  'last_modified_by'='mpetronic',
  'last_modified_time'='1463833730',
  'transient_lastDdlTime'='1463833730');
The Avro schemas are stored on HDFS in /jup1_stats/schema/. I want to change that path from this:
/jup1_stats/schema/
to this:
/jup1_stats/avro/
So, I copied my avro schema files from /jup1_stats/schema to /jup1_stats/avro and did this:
alter table rm set serdeproperties ('avro.schema.url'='hdfs://mpws:8020/jup1_stats/avro/rm_1.avsc');
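The table-level definition does pick up the new URL after this alter; you can confirm it with the statement below, which should print the updated avro.schema.url in the WITH SERDEPROPERTIES clause:

show create table rm;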
I ran a test query and it worked. But not so fast! Then I cleaned up by removing the /jup1_stats/schema directory, ran the test query again, and the query failed with a pile of stack traces:
ERROR : Vertex failed, vertexName=Map 1, vertexId=vertex_1463688329951_0007_4_00, diagnostics=[Task failed, taskId=task_1463688329951_0007_4_00_000000, diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.RuntimeException: java.io.IOException: java.io.IOException: org.apache.hadoop.hive.serde2.avro.AvroSerdeException: Unable to read schema from given path: hdfs://mpws:8020/jup1_stats/schema/rm_1.avsc
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:173)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:139)
    at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:344)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:181)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:172)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:168)
    at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: java.io.IOException: java.io.IOException: org.apache.hadoop.hive.serde2.avro.AvroSerdeException: Unable to read schema from given path: hdfs://mpws:8020/jup1_stats/schema/rm_1.avsc
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:196)
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.<init>(TezGroupedSplitsInputFormat.java:135)
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat.getRecordReader(TezGroupedSplitsInputFormat.java:101)
    at org.apache.tez.mapreduce.lib.MRReaderMapred.setupOldRecordReader(MRReaderMapred.java:149)
    at org.apache.tez.mapreduce.lib.MRReaderMapred.setSplit(MRReaderMapred.java:80)
    at org.apache.tez.mapreduce.input.MRInput.initFromEventInternal(MRInput.java:650)
    at org.apache.tez.mapreduce.input.MRInput.initFromEvent(MRInput.java:621)
    at org.apache.tez.mapreduce.input.MRInputLegacy.checkAndAwaitRecordReaderInitialization(MRInputLegacy.java:145)
    at org.apache.tez.mapreduce.input.MRInputLegacy.init(MRInputLegacy.java:109)
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.getMRInput(MapRecordProcessor.java:406)
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.init(MapRecordProcessor.java:128)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:149)
    ... 14 more
Caused by: java.io.IOException: java.io.IOException: org.apache.hadoop.hive.serde2.avro.AvroSerdeException: Unable to read schema from given path: hdfs://mpws:8020/jup1_stats/schema/rm_1.avsc
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:251)
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:193)
    ... 25 more
It appears that Hive is still looking for the schema in the old location based on this error trace:
Unable to read schema from given path: hdfs://mpws:8020/jup1_stats/schema/rm_1.avsc
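One way to see which schema URL a given partition actually resolves to is to describe the partition directly; the SerDe properties printed for the partition, not the table, are what the query uses. (A sketch; the partition spec is just an example from my layout.)

describe formatted rm partition (year=2016, month=5, day=10);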
I also tried this, but the queries still failed:
msck repair table rm;
If I recreate the schema files in the old location (/jup1_stats/schema), the query then works again.
These are external tables, and I did try dropping the table and recreating it with the new location in avro.schema.url, and that works. But I would like to understand why I cannot use "alter table" to achieve the same result.
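For reference, the drop-and-recreate workaround amounts to roughly this (a sketch based on the DDL above; since the table is external, dropping it does not delete the data files, but the existing partitions have to be re-registered afterward):

-- drop only removes metadata for an external table
drop table rm;

create external table `rm` (
  `gw_id` bigint,
  `rm_id` string,
  `start_time` bigint,
  `end_time` bigint,
  `switch_failure_count` bigint)
partitioned by (`year` int, `month` int, `day` int)
row format serde 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
with serdeproperties ('avro.schema.url'='hdfs://mpws:8020/jup1_stats/avro/rm_1.avsc')
stored as inputformat 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
location 'hdfs://mpws:8020/jup1_stats/external_tables/rm';

-- rediscover the existing partition directories
msck repair table rm;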
Thanks!
Created 05-21-2016 02:32 PM
I stumbled onto my own answer. It appears that you have to provide a partition spec to the alter command. I figured this out after poking around in MySQL to look at my Hive metastore to see if that would give me a clue. These queries are what made me think that the serde information is stored on a partition-by-partition basis:
mysql> select * from TBLS where TBL_NAME='rm'\G
*************************** 1. row ***************************
            TBL_ID: 170
       CREATE_TIME: 1463833647
             DB_ID: 11
  LAST_ACCESS_TIME: 0
             OWNER: mpetronic
         RETENTION: 0
             SD_ID: 227
          TBL_NAME: rm
          TBL_TYPE: EXTERNAL_TABLE
VIEW_EXPANDED_TEXT: NULL
VIEW_ORIGINAL_TEXT: NULL
    LINK_TARGET_ID: NULL
1 row in set (0.00 sec)

mysql> select * from SDS where CD_ID=170\G
*************************** 1. row ***************************
                    SD_ID: 227
                    CD_ID: 170
             INPUT_FORMAT: org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat
            IS_COMPRESSED:
IS_STOREDASSUBDIRECTORIES:
                 LOCATION: hdfs://mpws:8020/jup1_stats/external_tables/rm
              NUM_BUCKETS: -1
            OUTPUT_FORMAT: org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat
                 SERDE_ID: 227
*************************** 2. row ***************************
                    SD_ID: 228
                    CD_ID: 170
             INPUT_FORMAT: org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat
            IS_COMPRESSED:
IS_STOREDASSUBDIRECTORIES:
                 LOCATION: hdfs://mpws:8020/jup1_stats/external_tables/rm/year=2016/month=5/day=10
              NUM_BUCKETS: -1
            OUTPUT_FORMAT: org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat
                 SERDE_ID: 228
*************************** 3. row ***************************
                    SD_ID: 229
                    CD_ID: 170
             INPUT_FORMAT: org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat
            IS_COMPRESSED:
IS_STOREDASSUBDIRECTORIES:
                 LOCATION: hdfs://mpws:8020/jup1_stats/external_tables/rm/year=2016/month=5/day=11
              NUM_BUCKETS: -1
            OUTPUT_FORMAT: org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat
                 SERDE_ID: 229
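The avro.schema.url values themselves live in the SERDE_PARAMS metastore table, keyed by the SERDE_ID values shown above, so each partition carries its own copy. A sketch, assuming the standard metastore schema and the IDs from my output:

mysql> select SERDE_ID, PARAM_KEY, PARAM_VALUE
    -> from SERDE_PARAMS
    -> where PARAM_KEY = 'avro.schema.url' and SERDE_ID in (227, 228, 229);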
Once I saw all those LOCATION values in the SDS table, I tried the following command to alter the table, and then the query worked. Interesting. Does this imply that you could have a different schema for each partition? Can anyone comment on why the Avro schema is tied to each partition and not simply to the whole table?
alter table rm partition (year=2016) set serdeproperties ('avro.schema.url' = 'hdfs://mpws:8020/jup1_stats/avro/rm_1.avsc');
Since all my data lives in partitions under the "year=2016" partition, I was able to specify just that one partition, and the change was applied to all partitions beneath it.
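So the complete fix seems to be a pair of alters: one for the existing partitions and one for the table itself, since new partitions appear to inherit their serde properties from the table-level definition (that last point is my understanding rather than something I verified here):

-- existing partitions (all of mine sit under year=2016)
alter table rm partition (year=2016) set serdeproperties ('avro.schema.url'='hdfs://mpws:8020/jup1_stats/avro/rm_1.avsc');
-- table-level default, which partitions added later should pick up
alter table rm set serdeproperties ('avro.schema.url'='hdfs://mpws:8020/jup1_stats/avro/rm_1.avsc');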