How to change the avro.schema.url location in Hive

Expert Contributor

I have Hive external tables using the Avro SerDe that look like this:

CREATE EXTERNAL TABLE `rm`(
  `gw_id` bigint COMMENT '', 
  `rm_id` string COMMENT '', 
  `start_time` bigint COMMENT '', 
  `end_time` bigint COMMENT '', 
  `switch_failure_count` bigint COMMENT '')
PARTITIONED BY ( 
  `year` int, 
  `month` int, 
  `day` int)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe' 
WITH SERDEPROPERTIES ( 
  'avro.schema.url'='hdfs://mpws:8020/jup1_stats/schema/rm_1.avsc') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION
  'hdfs://mpws:8020/jup1_stats/external_tables/rm'
TBLPROPERTIES (
  'last_modified_by'='mpetronic', 
  'last_modified_time'='1463833730', 
  'transient_lastDdlTime'='1463833730')

The Avro schemas are stored on HDFS in /jup1_stats/schema/. I want to change that path from this:

/jup1_stats/schema/

to this:

/jup1_stats/avro/

So, I copied my Avro schema files from /jup1_stats/schema to /jup1_stats/avro and did this:

alter table rm set serdeproperties ('avro.schema.url'='hdfs://mpws:8020/jup1_stats/avro/rm_1.avsc');
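One way to sanity-check the table-level metadata after the ALTER (just a sketch; note that these only show the table-level SerDe properties, not what each partition holds):

show create table rm;
describe formatted rm;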

I ran a test query and it worked. But not so fast! When I then cleaned up by removing the /jup1_stats/schema directory and reran the test query, I got tons of stack traces and the query failed.

ERROR : Vertex failed, vertexName=Map 1, vertexId=vertex_1463688329951_0007_4_00, diagnostics=[Task failed, taskId=task_1463688329951_0007_4_00_000000, diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.RuntimeException: java.io.IOException: java.io.IOException: org.apache.hadoop.hive.serde2.avro.AvroSerdeException: Unable to read schema from given path: hdfs://mpws:8020/jup1_stats/schema/rm_1.avsc
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:173)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:139)
    at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:344)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:181)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:172)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:168)
    at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: java.io.IOException: java.io.IOException: org.apache.hadoop.hive.serde2.avro.AvroSerdeException: Unable to read schema from given path: hdfs://mpws:8020/jup1_stats/schema/rm_1.avsc
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:196)
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.<init>(TezGroupedSplitsInputFormat.java:135)
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat.getRecordReader(TezGroupedSplitsInputFormat.java:101)
    at org.apache.tez.mapreduce.lib.MRReaderMapred.setupOldRecordReader(MRReaderMapred.java:149)
    at org.apache.tez.mapreduce.lib.MRReaderMapred.setSplit(MRReaderMapred.java:80)
    at org.apache.tez.mapreduce.input.MRInput.initFromEventInternal(MRInput.java:650)
    at org.apache.tez.mapreduce.input.MRInput.initFromEvent(MRInput.java:621)
    at org.apache.tez.mapreduce.input.MRInputLegacy.checkAndAwaitRecordReaderInitialization(MRInputLegacy.java:145)
    at org.apache.tez.mapreduce.input.MRInputLegacy.init(MRInputLegacy.java:109)
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.getMRInput(MapRecordProcessor.java:406)
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.init(MapRecordProcessor.java:128)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:149)
    ... 14 more
Caused by: java.io.IOException: java.io.IOException: org.apache.hadoop.hive.serde2.avro.AvroSerdeException: Unable to read schema from given path: hdfs://mpws:8020/jup1_stats/schema/rm_1.avsc
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:251)
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:193)
    ... 25 more

It appears that Hive is still looking for the schema in the old location, based on this line from the trace:

Unable to read schema from given path: hdfs://mpws:8020/jup1_stats/schema/rm_1.avsc

I also tried this, but the queries still fail:

msck repair table rm;
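From what I can tell, MSCK REPAIR only syncs partition directories between HDFS and the metastore; it does not rewrite the SerDe properties already stored on each partition. A quick way to see where an existing partition still points (partition values taken from the metastore dump further down):

describe formatted rm partition (year=2016, month=5, day=10);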

If I recreate the schema files in the old location (/jup1_stats/schema), the query then works again.

These are external tables, and dropping the table and recreating it with the new location in avro.schema.url does work. But I would like to understand why I cannot use ALTER TABLE to achieve the same result.

Thanks!

1 ACCEPTED SOLUTION

Expert Contributor

I stumbled onto my own answer. It appears that you have to provide a partition spec to the ALTER command. I figured this out after poking around in MySQL at my Hive metastore to see if it would give me a clue. These queries are what made me realize that the SerDe information is stored on a partition-by-partition basis:

mysql> select * from TBLS where TBL_NAME='rm'\G
*************************** 1. row ***************************
            TBL_ID: 170
       CREATE_TIME: 1463833647
             DB_ID: 11
  LAST_ACCESS_TIME: 0
             OWNER: mpetronic
         RETENTION: 0
             SD_ID: 227
          TBL_NAME: rm
          TBL_TYPE: EXTERNAL_TABLE
VIEW_EXPANDED_TEXT: NULL
VIEW_ORIGINAL_TEXT: NULL
    LINK_TARGET_ID: NULL
1 row in set (0.00 sec)

mysql> select * from SDS where CD_ID=170\G
*************************** 1. row ***************************
                    SD_ID: 227
                    CD_ID: 170
             INPUT_FORMAT: org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat
            IS_COMPRESSED:  
IS_STOREDASSUBDIRECTORIES:  
                 LOCATION: hdfs://mpws:8020/jup1_stats/external_tables/rm
              NUM_BUCKETS: -1
            OUTPUT_FORMAT: org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat
                 SERDE_ID: 227
*************************** 2. row ***************************
                    SD_ID: 228
                    CD_ID: 170
             INPUT_FORMAT: org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat
            IS_COMPRESSED:  
IS_STOREDASSUBDIRECTORIES:  
                 LOCATION: hdfs://mpws:8020/jup1_stats/external_tables/rm/year=2016/month=5/day=10
              NUM_BUCKETS: -1
            OUTPUT_FORMAT: org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat
                 SERDE_ID: 228
*************************** 3. row ***************************
                    SD_ID: 229
                    CD_ID: 170
             INPUT_FORMAT: org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat
            IS_COMPRESSED:  
IS_STOREDASSUBDIRECTORIES:  
                 LOCATION: hdfs://mpws:8020/jup1_stats/external_tables/rm/year=2016/month=5/day=11
              NUM_BUCKETS: -1
            OUTPUT_FORMAT: org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat
                 SERDE_ID: 229

Once I saw all those LOCATION values in the SDS table, I tried the following command to alter the table, and then the query worked. Interesting. Does this imply that you could have a different schema for each partition? Can anyone comment on why the Avro schema is tied to a partition and not simply to the whole table?

alter table rm partition (year=2016) set serdeproperties ('avro.schema.url' = 'hdfs://mpws:8020/jup1_stats/avro/rm_1.avsc');

Since all my data lives under the year=2016 partition, I was able to specify just that one partial partition spec, and the change was applied to all partitions beneath it.
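To recap for anyone else hitting this, the sequence that ended up working (paths as used above) was one ALTER for the table-level property, which, as far as I understand, is what newly created partitions inherit, plus one partitioned ALTER to fix the partitions that already exist:

-- table-level default, picked up by partitions added from here on
alter table rm set serdeproperties ('avro.schema.url'='hdfs://mpws:8020/jup1_stats/avro/rm_1.avsc');

-- existing partitions; a partial spec like year=2016 cascades to the month/day partitions beneath it
alter table rm partition (year=2016) set serdeproperties ('avro.schema.url'='hdfs://mpws:8020/jup1_stats/avro/rm_1.avsc');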
