How to change location of avro.schema.url location in Hive

Mark_Petronic — Sat, 21 May 2016 19:46:08 GMT

I have Hive external tables using Avro serdes that look like this:

CREATE EXTERNAL TABLE `rm`(
  `gw_id` bigint COMMENT '', 
  `rm_id` string COMMENT '', 
  `start_time` bigint COMMENT '', 
  `end_time` bigint COMMENT '', 
  `switch_failure_count` bigint COMMENT '')
PARTITIONED BY ( 
  `year` int, 
  `month` int, 
  `day` int)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe' 
WITH SERDEPROPERTIES ( 
  'avro.schema.url'='hdfs://mpws:8020/jup1_stats/schema/rm_1.avsc') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION
  'hdfs://mpws:8020/jup1_stats/external_tables/rm'
TBLPROPERTIES (
  'last_modified_by'='mpetronic', 
  'last_modified_time'='1463833730', 
  'transient_lastDdlTime'='1463833730')

The Avro schemas are stored on HDFS in /jup1_stats/schema/. I want to change that path from this:

/jup1_stats/schema/

to this:

/jup1_stats/avro/

So, I copied my avro schema files from /jup1_stats/schema to /jup1_stats/avro and did this:

alter table rm set serdeproperties ('avro.schema.url'='hdfs://mpws:8020/stats/avro/rm_1.avsc');

I ran a test query and it worked. But not so fast! Then I did the clean up and removed the /jup1_stats/schema directory and ran the test query again and got tons of stack traces and the query failed.

ERROR : Vertex failed, vertexName=Map 1, vertexId=vertex_1463688329951_0007_4_00, diagnostics=[Task failed, taskId=task_1463688329951_0007_4_00_000000, diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.RuntimeException: java.io.IOException: java.io.IOException: org.apache.hadoop.hive.serde2.avro.AvroSerdeException: Unable to read schema from given path: hdfs://mpws:8020/jup1_stats/schema/rm_1.avsc
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:173)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:139)
    at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:344)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:181)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:172)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:168)
    at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: java.io.IOException: java.io.IOException: org.apache.hadoop.hive.serde2.avro.AvroSerdeException: Unable to read schema from given path: hdfs://mpws:8020/jup1_stats/schema/rm_1.avsc
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:196)
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.<init>(TezGroupedSplitsInputFormat.java:135)
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat.getRecordReader(TezGroupedSplitsInputFormat.java:101)
    at org.apache.tez.mapreduce.lib.MRReaderMapred.setupOldRecordReader(MRReaderMapred.java:149)
    at org.apache.tez.mapreduce.lib.MRReaderMapred.setSplit(MRReaderMapred.java:80)
    at org.apache.tez.mapreduce.input.MRInput.initFromEventInternal(MRInput.java:650)
    at org.apache.tez.mapreduce.input.MRInput.initFromEvent(MRInput.java:621)
    at org.apache.tez.mapreduce.input.MRInputLegacy.checkAndAwaitRecordReaderInitialization(MRInputLegacy.java:145)
    at org.apache.tez.mapreduce.input.MRInputLegacy.init(MRInputLegacy.java:109)
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.getMRInput(MapRecordProcessor.java:406)
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.init(MapRecordProcessor.java:128)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:149)
    ... 14 more
Caused by: java.io.IOException: java.io.IOException: org.apache.hadoop.hive.serde2.avro.AvroSerdeException: Unable to read schema from given path: hdfs://mpws:8020/jup1_stats/schema/rm_1.avsc
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:251)
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:193)
    ... 25 more

It appears that Hive is still looking for the schema in the old location based on this error trace:

Unable to read schema from given path: hdfs://mpws:8020/jup1_stats/schema/rm_1.avsc

I also tried this but the queries still fail:

msck repair table rm;

If I recreate the schema files in the old location (/jup1_stats/schema), the query then works again.

These are external tables, and I did try dropping the table and recreating it with the new location in avro.schema.url and that works. But, I would like to understand why I cannot use "alter table" to achieve this same result.

Thanks!

Re: How to change location of avro.schema.url location in Hive

Mark_Petronic — Sat, 21 May 2016 21:32:07 GMT

I stumbled onto my own answer. Appears that you have to provide a partition spec to the alter command. I figured this out after poking around in mysql to look at my hive metastore to see if that would give me a clue. These queries are what made me think that the serde information is on a partition-by-partition basis:

mysql> select * from TBLS where TBL_NAME='rm'\G
*************************** 1. row ***************************
            TBL_ID: 170
       CREATE_TIME: 1463833647
             DB_ID: 11
  LAST_ACCESS_TIME: 0
             OWNER: mpetronic
         RETENTION: 0
             SD_ID: 227
          TBL_NAME: rm
          TBL_TYPE: EXTERNAL_TABLE
VIEW_EXPANDED_TEXT: NULL
VIEW_ORIGINAL_TEXT: NULL
    LINK_TARGET_ID: NULL
1 row in set (0.00 sec)

mysql> select * from SDS where CD_ID=170\G
*************************** 1. row ***************************
                    SD_ID: 227
                    CD_ID: 170
             INPUT_FORMAT: org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat
            IS_COMPRESSED:  
IS_STOREDASSUBDIRECTORIES:  
                 LOCATION: hdfs://mpws:8020/jup1_stats/external_tables/rm
              NUM_BUCKETS: -1
            OUTPUT_FORMAT: org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat
                 SERDE_ID: 227
*************************** 2. row ***************************
                    SD_ID: 228
                    CD_ID: 170
             INPUT_FORMAT: org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat
            IS_COMPRESSED:  
IS_STOREDASSUBDIRECTORIES:  
                 LOCATION: hdfs://mpws:8020/jup1_stats/external_tables/rm/year=2016/month=5/day=10
              NUM_BUCKETS: -1
            OUTPUT_FORMAT: org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat
                 SERDE_ID: 228
*************************** 3. row ***************************
                    SD_ID: 229
                    CD_ID: 170
             INPUT_FORMAT: org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat
            IS_COMPRESSED:  
IS_STOREDASSUBDIRECTORIES:  
                 LOCATION: hdfs://mpws:8020/jup1_stats/external_tables/rm/year=2016/month=5/day=11
              NUM_BUCKETS: -1
            OUTPUT_FORMAT: org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat
                 SERDE_ID: 229

Once I saw all those 'LOCATION' values in the SDS table, I tried the following command to alter the table and then the query worked. Interesting. Does this imply that you could have a different schema for each partition? Can anyone comment as to why the avro schema is tied to a partition and not simply to the whole table?

alter table rm partition (year=2016) set serdeproperties ('avro.schema.url' = 'hdfs://mpws:8020/jup1_stats/avro/rm_1.avsc');

Since all my data is in partitions under the "year=2016" partition, I was able to just specify that one partition and it applied the change to all partitions under that.

question How to change location of avro.schema.url location in Hive in Archives of Support Questions (Read Only)

How to change location of avro.schema.url location in Hive

Re: How to change location of avro.schema.url location in Hive