Created 05-21-2016 12:46 PM
I have Hive external tables using Avro serdes that look like this:
CREATE EXTERNAL TABLE `rm`(
  `gw_id` bigint COMMENT '',
  `rm_id` string COMMENT '',
  `start_time` bigint COMMENT '',
  `end_time` bigint COMMENT '',
  `switch_failure_count` bigint COMMENT '')
PARTITIONED BY (`year` int, `month` int, `day` int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.url'='hdfs://mpws:8020/jup1_stats/schema/rm_1.avsc')
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs://mpws:8020/jup1_stats/external_tables/rm'
TBLPROPERTIES (
  'last_modified_by'='mpetronic',
  'last_modified_time'='1463833730',
  'transient_lastDdlTime'='1463833730')
The Avro schemas are stored on HDFS in /jup1_stats/schema/. I want to change that path from this:
/jup1_stats/schema/
to this:
/jup1_stats/avro/
So, I copied my Avro schema files from /jup1_stats/schema to /jup1_stats/avro and did this:

alter table rm set serdeproperties ('avro.schema.url'='hdfs://mpws:8020/jup1_stats/avro/rm_1.avsc');

I ran a test query and it worked. But not so fast! Then I did the cleanup, removed the /jup1_stats/schema directory, and ran the test query again. It failed with a pile of stack traces:
ERROR : Vertex failed, vertexName=Map 1, vertexId=vertex_1463688329951_0007_4_00, diagnostics=[Task failed, taskId=task_1463688329951_0007_4_00_000000, diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.RuntimeException: java.io.IOException: java.io.IOException: org.apache.hadoop.hive.serde2.avro.AvroSerdeException: Unable to read schema from given path: hdfs://mpws:8020/jup1_stats/schema/rm_1.avsc
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:173)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:139)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:344)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:181)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:172)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:168)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: java.io.IOException: java.io.IOException: org.apache.hadoop.hive.serde2.avro.AvroSerdeException: Unable to read schema from given path: hdfs://mpws:8020/jup1_stats/schema/rm_1.avsc
at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:196)
at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.<init>(TezGroupedSplitsInputFormat.java:135)
at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat.getRecordReader(TezGroupedSplitsInputFormat.java:101)
at org.apache.tez.mapreduce.lib.MRReaderMapred.setupOldRecordReader(MRReaderMapred.java:149)
at org.apache.tez.mapreduce.lib.MRReaderMapred.setSplit(MRReaderMapred.java:80)
at org.apache.tez.mapreduce.input.MRInput.initFromEventInternal(MRInput.java:650)
at org.apache.tez.mapreduce.input.MRInput.initFromEvent(MRInput.java:621)
at org.apache.tez.mapreduce.input.MRInputLegacy.checkAndAwaitRecordReaderInitialization(MRInputLegacy.java:145)
at org.apache.tez.mapreduce.input.MRInputLegacy.init(MRInputLegacy.java:109)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.getMRInput(MapRecordProcessor.java:406)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.init(MapRecordProcessor.java:128)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:149)
... 14 more
Caused by: java.io.IOException: java.io.IOException: org.apache.hadoop.hive.serde2.avro.AvroSerdeException: Unable to read schema from given path: hdfs://mpws:8020/jup1_stats/schema/rm_1.avsc
at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:251)
at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:193)
... 25 more

It appears that Hive is still looking for the schema in the old location, based on this error trace:
Unable to read schema from given path: hdfs://mpws:8020/jup1_stats/schema/rm_1.avsc
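One way to see what the ALTER actually changed at the table level is a standard Hive command (a minimal sanity check; nothing is assumed here beyond the table name):

DESCRIBE FORMATTED rm;
-- The table-level avro.schema.url is listed under "Storage Desc Params".
-- After the ALTER above it should show the new /jup1_stats/avro path,
-- even though queries still fail looking for the old one.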
I also tried this but the queries still fail:
msck repair table rm;
If I recreate the schema files in the old location (/jup1_stats/schema), the query then works again.
These are external tables, and I did try dropping the table and recreating it with the new location in avro.schema.url, and that works. But I would like to understand why I cannot use ALTER TABLE to achieve the same result.
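For reference, the drop-and-recreate workaround amounts to something like this (a sketch; because the table is EXTERNAL, the drop removes only metastore entries, not the HDFS data):

DROP TABLE rm;
-- ...re-run the full CREATE EXTERNAL TABLE statement from above, changing only
-- avro.schema.url to 'hdfs://mpws:8020/jup1_stats/avro/rm_1.avsc'...
MSCK REPAIR TABLE rm;  -- re-register the year=/month=/day= partition directories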
Thanks!
Created 05-21-2016 02:32 PM
I stumbled onto my own answer. It appears that you have to provide a partition spec to the ALTER command. I figured this out after poking around in MySQL at my Hive metastore to see if that would give me a clue. These queries are what made me think that the serde information is stored on a partition-by-partition basis:
mysql> select * from TBLS where TBL_NAME='rm'\G
*************************** 1. row ***************************
TBL_ID: 170
CREATE_TIME: 1463833647
DB_ID: 11
LAST_ACCESS_TIME: 0
OWNER: mpetronic
RETENTION: 0
SD_ID: 227
TBL_NAME: rm
TBL_TYPE: EXTERNAL_TABLE
VIEW_EXPANDED_TEXT: NULL
VIEW_ORIGINAL_TEXT: NULL
LINK_TARGET_ID: NULL
1 row in set (0.00 sec)
mysql> select * from SDS where CD_ID=170\G
*************************** 1. row ***************************
SD_ID: 227
CD_ID: 170
INPUT_FORMAT: org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat
IS_COMPRESSED:
IS_STOREDASSUBDIRECTORIES:
LOCATION: hdfs://mpws:8020/jup1_stats/external_tables/rm
NUM_BUCKETS: -1
OUTPUT_FORMAT: org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat
SERDE_ID: 227
*************************** 2. row ***************************
SD_ID: 228
CD_ID: 170
INPUT_FORMAT: org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat
IS_COMPRESSED:
IS_STOREDASSUBDIRECTORIES:
LOCATION: hdfs://mpws:8020/jup1_stats/external_tables/rm/year=2016/month=5/day=10
NUM_BUCKETS: -1
OUTPUT_FORMAT: org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat
SERDE_ID: 228
*************************** 3. row ***************************
SD_ID: 229
CD_ID: 170
INPUT_FORMAT: org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat
IS_COMPRESSED:
IS_STOREDASSUBDIRECTORIES:
LOCATION: hdfs://mpws:8020/jup1_stats/external_tables/rm/year=2016/month=5/day=11
NUM_BUCKETS: -1
OUTPUT_FORMAT: org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat
SERDE_ID: 229
Once I saw all those 'LOCATION' values in the SDS table, I tried the following command to alter the table, and then the query worked. Interesting. Does this imply that you could have a different schema for each partition? Can anyone comment on why the Avro schema is tied to a partition and not simply to the whole table?
alter table rm partition (year=2016) set serdeproperties ('avro.schema.url'='hdfs://mpws:8020/jup1_stats/avro/rm_1.avsc');

Since all my data is in partitions under the "year=2016" partition, I was able to specify just that one partial spec and the change was applied to all partitions under it.
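To double-check that every partition picked up the new URL (the partition values below are taken from the SDS rows above), something like this works; any partition that was missed can be altered individually with a full partition spec:

DESCRIBE FORMATTED rm PARTITION (year=2016, month=5, day=10);
-- avro.schema.url under "Storage Desc Params" should now show /jup1_stats/avro
ALTER TABLE rm PARTITION (year=2016, month=5, day=11)
  SET SERDEPROPERTIES ('avro.schema.url'='hdfs://mpws:8020/jup1_stats/avro/rm_1.avsc');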