Avro schema update with two schemas in one Avro file

New Contributor

I have one Avro file written with a first schema; I then updated the schema, and the writer appends to the same file. So now I have two schemas in one file. How does Avro handle this scenario? Will the new fields be added to the file, or will I lose any data when reading it? This is a real-time streaming application that writes data to HDFS. My upstream system might update the schema while the HDFS writer is still on the old schema, so the HDFS Avro file will have two schemas until I update the writer to handle the newer schema.

Note: I don't have a schema registry and I am creating one Avro file per day. So if the schema is updated in the middle of the day, I will have one Avro file with two schemas.

2 REPLIES

Basics about Avro (which differentiate it from Thrift), given that there is no schema registry:
1. Avro serialized data has no schema saved in the file.
2. The user has to provide the schema, both at write time and at read time.
3. Avro provides a utility to check schema evolution for compatibility, so it is up to the user to make sure the Avro schema evolution is compatible.

In your case you will have to provide the schema while reading.
1. Avro will make a best effort to convert the saved data according to the "read" schema.

2. If the read schema has missing or extra fields compared with the write schema, they are filled in from default values or nullified.

3. If the "write" and "read" schemas are incompatible, you will get an exception.
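
To make the three rules above concrete, here is a minimal plain-Java sketch of the idea. It does not use the Avro library at all: `SchemaResolutionSketch`, `ReaderField`, and `resolve` are hypothetical names made up for illustration, a record is modelled as a simple map, and the real Avro resolution rules cover more cases (type promotion, unions, aliases) than shown here.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of reader/writer resolution; NOT the Avro API.
final class SchemaResolutionSketch {

    // A field of the "read" schema: a name and an optional default value.
    static final class ReaderField {
        final String name;
        final boolean hasDefault;
        final Object defaultValue;

        private ReaderField(String name, boolean hasDefault, Object defaultValue) {
            this.name = name;
            this.hasDefault = hasDefault;
            this.defaultValue = defaultValue;
        }

        static ReaderField required(String name) {
            return new ReaderField(name, false, null);
        }

        static ReaderField withDefault(String name, Object defaultValue) {
            return new ReaderField(name, true, defaultValue);
        }
    }

    // Converts a record as it was written into the shape the reader expects.
    static Map<String, Object> resolve(Map<String, Object> written, List<ReaderField> readerFields) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (ReaderField field : readerFields) {
            if (written.containsKey(field.name)) {
                // Field known to both sides: keep the written value.
                result.put(field.name, written.get(field.name));
            } else if (field.hasDefault) {
                // Field only in the read schema: fill it from the default.
                result.put(field.name, field.defaultValue);
            } else {
                // Field only in the read schema and no default: incompatible.
                throw new IllegalStateException("No value and no default for field: " + field.name);
            }
        }
        // Fields present only in the written record are never copied, so they are dropped.
        return result;
    }

    public static void main(String[] args) {
        Map<String, Object> written = new LinkedHashMap<>();
        written.put("id", 1L);
        written.put("legacy", "dropped on read");

        List<ReaderField> readerFields = java.util.Arrays.asList(
                ReaderField.required("id"),
                ReaderField.withDefault("source", "unknown"));

        System.out.println(resolve(written, readerFields));
    }
}
```

The point of the sketch is only that a field missing from the written data needs a default in the read schema, and a field the read schema does not know about is ignored.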

SchemaCompatibility.SchemaPairCompatibility compatResult = SchemaCompatibility.checkReaderWriterCompatibility(newSchema, oldSchema);
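
A slightly fuller sketch of that check, assuming the Apache Avro Java library is on the classpath. The `Event` record, its `id` and `source` fields, and the class name `CompatibilityCheckSketch` are made up for illustration. Note that the first argument of `checkReaderWriterCompatibility` is the reader schema and the second is the writer schema.

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;
import org.apache.avro.SchemaCompatibility.SchemaCompatibilityType;
import org.apache.avro.SchemaCompatibility.SchemaPairCompatibility;

// Hypothetical example class; only the org.apache.avro calls are real API.
final class CompatibilityCheckSketch {

    // Old schema: a single field.
    static final String OLD_SCHEMA_JSON =
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"}]}";

    // New schema: adds a field WITH a default, so old data can still be read.
    static final String NEW_SCHEMA_JSON =
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"},"
            + "{\"name\":\"source\",\"type\":\"string\",\"default\":\"unknown\"}]}";

    // Same addition WITHOUT a default, which cannot read old data.
    static final String NEW_NO_DEFAULT_SCHEMA_JSON =
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"},"
            + "{\"name\":\"source\",\"type\":\"string\"}]}";

    // A fresh parser per schema, because one parser refuses to define "Event" twice.
    static Schema parse(String json) {
        return new Schema.Parser().parse(json);
    }

    // True if data written with `writer` can be read using `reader`.
    static boolean canRead(Schema reader, Schema writer) {
        SchemaPairCompatibility result =
                SchemaCompatibility.checkReaderWriterCompatibility(reader, writer);
        return result.getType() == SchemaCompatibilityType.COMPATIBLE;
    }

    public static void main(String[] args) {
        Schema oldSchema = parse(OLD_SCHEMA_JSON);
        Schema newSchema = parse(NEW_SCHEMA_JSON);
        Schema newNoDefault = parse(NEW_NO_DEFAULT_SCHEMA_JSON);

        System.out.println("new reads old: " + canRead(newSchema, oldSchema));
        System.out.println("old reads new: " + canRead(oldSchema, newSchema));
        System.out.println("new (no default) reads old: " + canRead(newNoDefault, oldSchema));
    }
}
```

Running the check in both directions is useful in a setup like yours, since during a rollout either side may be the one still on the old schema.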

For more details, please follow this link:

http://bytepadding.com/big-data/spark/avro/avro-schema-compatibility-test/

New Contributor

Thanks, kgautam, for the reply. You said Avro serialized data has no schema saved in the file. This is what I read: "Schema is stored along with the Avro data in a file."

So basically, if I am using an Avro writer that uses the old schema, this same old writer should be able to handle records coming in with the new schema (as long as we follow the rules) until I update the writer to use the new schema. If I later update the writer to use the new schema, will it be able to give me all the data with the new schema? Again, all the data with the new and old schemas is in the same file. I want to understand whether I will lose any new fields that I add while the data is still being written with the old schema.

Thanks