Support Questions


Can Hive avro tables support changing schemas?

Rising Star

Avro, in general, supports the idea of evolving schemas, and I'm trying to support that with an external Hive table. In other words, I declare an external Hive table and define its schema through the table properties, pointing at a schema file in HDFS:

TBLPROPERTIES ('avro.schema.url'='hdfs://namenode/common/schemas/schema_v1.avsc')
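For context, the full table declaration behind that property looks roughly like the sketch below (the table name and locations are placeholders, not my real ones; with the Avro SerDe there is no column list, since the columns come from the schema file):

CREATE EXTERNAL TABLE category_data
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  LOCATION 'hdfs://namenode/common/data/category_data'
  TBLPROPERTIES ('avro.schema.url'='hdfs://namenode/common/schemas/schema_v1.avsc');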

This works fine when all of the files in the directory for the external table are created with schema version 1. However, if I add Avro files written with schema version 2 and update the tblproperties accordingly, the table becomes unusable. I see errors like this on a select count(*) statement:

Caused by: org.apache.avro.AvroTypeException: Found com.target.category_data, expecting com.target.category_data
at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:231)
at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
at org.apache.avro.io.ResolvingDecoder.readFieldOrder(ResolvingDecoder.java:127)
at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:176)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:151)
at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:193)
at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:183)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:151)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
at org.apache.avro.file.DataFileStream.next(DataFileStream.java:233)
at org.apache.avro.file.DataFileStream.next(DataFileStream.java:220)
at org.apache.hadoop.hive.ql.io.avro.AvroGenericRecordReader.next(AvroGenericRecordReader.java:153)
at org.apache.hadoop.hive.ql.io.avro.AvroGenericRecordReader.next(AvroGenericRecordReader.java:52)
at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:347)

It appears that the Hive Avro SerDe is using the schema in tblproperties to decode the individual Avro files (or, more accurately, the individual file splits, I suppose) instead of the schema in the header of each Avro file. I would like binary Avro files created with different Avro schemas to be readable by the same Hive table, which may itself carry a different Avro schema. Could anyone suggest ways to do that? I am not opposed to making code changes on the read side (i.e. the Hive SerDe) or the write side (I'm using Storm's Avro bolt to write my files).
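For illustration, the shape of the schema change is along these lines (the record name matches the one in the stack trace, but the field names are made up, not my actual columns):

schema_v1.avsc:
{"type":"record", "name":"category_data", "namespace":"com.target",
 "fields": [ {"name":"category", "type":"string"} ]}

schema_v2.avsc:
{"type":"record", "name":"category_data", "namespace":"com.target",
 "fields": [ {"name":"category", "type":"string"},
             {"name":"category_rank", "type":"int", "default":0} ]}

Whether the mix of v1 and v2 files stays readable comes down to Avro's schema resolution rules, as the answers below explain.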

Thanks! -Aaron


5 REPLIES

Expert Contributor

Hive is trying to provide a value for the new column for those records where it did not exist. You need to specify a default for the new columns, e.g. using the 'avro.schema.literal' table property.

In the example below, the original table has just one column; age is added in the second version of the schema. If the file being read was written with a different schema, Hive will attempt to convert it using Avro schema resolution. The entire schema definition is in the avro.schema.literal property.

ALTER TABLE test_avro SET TBLPROPERTIES (
  'avro.schema.literal'='{"name":"test_record",
    "type":"record",
    "fields": [
      {"name":"full_name", "type":"string"},
      {"name":"age", "type":"int", "default":999}]}');

Hope this helps.

Rising Star

Thanks @Gangadhar. The schema changes should have been compatible in the way you describe; they are just in an HDFS file, not in a schema literal. I'll try to recreate this problem with a simple change.

Rising Star

You were exactly right. A particular change I made was backward compatible, but not forward compatible. When tested the right way, Hive performed exactly as expected.
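One hypothetical change with exactly that asymmetry is dropping a field between versions, say:

v1 fields: [ {"name":"id", "type":"long"}, {"name":"legacy_flag", "type":"boolean"} ]
v2 fields: [ {"name":"id", "type":"long"} ]

Reading v1 files with the v2 schema works, because the extra field is simply ignored (backward compatible); reading v2 files with the v1 schema fails, because legacy_flag has no value and no default (not forward compatible). So the table's schema has to be the newest version for the older files to resolve against it.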

Expert Contributor

I built a DWH-style application on Hive that is centered on Avro to handle schema evolution. My source data is CSV, and the layouts change when new releases of the applications are deployed (adding more columns, removing columns, etc.). So I decided to convert everything to Avro on ingest and then strictly deal with Avro inside Hadoop. For most of my data, I convert to Avro, drop it into HDFS in date partitions, and let users access that data through external Hive tables. For some of these CSV files, there are up to 6 versions in flight at once (each defined using a separate Avro schema), all landing in the same Hive table. So, like you, I wanted to be able to query over all the data no matter what the version. The approach seems to be working well.

I did, however, want to chime in on TBLPROPERTIES. I just posted a question related to this. It seems like we should be using SERDEPROPERTIES, not TBLPROPERTIES, when defining the URL to the schema file. All my Avro schema files are in an HDFS directory; I did not want to use the literal approach of defining the schema inline in the TBLPROPERTIES. I was creating my Hive tables using TBLPROPERTIES pointing to the URL of the newest schema, which is defined with defaults and is a proper superset of all earlier schemas, allowing them to all coexist in the same Hive table. However, I recently tried to build a Spark SQL application using HiveContext to read these tables and was surprised to find that Spark threw Avro exceptions. Creating the table using SERDEPROPERTIES to define the avsc URL was the solution to make the data accessible from both Hive and Spark. Just thought I would mention it to save you some hassle down the road if you ever need Spark SQL access to that data.
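For anyone setting this up, a minimal sketch of such a table, assuming the date-partitioned layout described above (the table name, schema file, and paths are placeholders):

CREATE EXTERNAL TABLE events_avro
  PARTITIONED BY (ingest_date STRING)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  WITH SERDEPROPERTIES (
    'avro.schema.url'='hdfs://namenode/common/schemas/events_latest.avsc')
  STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  LOCATION 'hdfs://namenode/data/events';

Each day's files then get registered with ALTER TABLE ... ADD PARTITION, and, per the experience above, the table stays readable from both Hive and Spark SQL because the schema URL sits in the SerDe properties.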

Expert Contributor