Created 12-08-2015 02:09 PM
Reference: https://avro.apache.org/docs/1.7.7/spec.html#Aliases
I am setting up a new cluster and workflow to ingest CSV into Hive for a data warehouse style application. I have chosen to convert everything to Avro at the edge of ingest, to standardize all processing on Avro, better handle schema evolution and detection, and unify downstream processing. My question is about best practice recommendations for using aliases in Avro schemas when evolving one 'class' of CSV stats over time, where a 'class' is one or more CSV files that all contain the same logical set of stats but, over time, gain more stats as the feature that generates them is enhanced. So, for example, I might see files like these over time, all of which belong in the same Hive table and the same stats 'class':
StatClassX_Version_1.csv
    device_id,timestamp,bytes_sent,bytes_received

StatClassX_Version_2.csv (adds one new field)
    device_id,timestamp,bytes_sent,bytes_received,packet_discarded

StatClassX_Version_3.csv (adds one new field, renames one existing field)
    device_id,timestamp,bytes_sent,bytes_received,packets_discarded,retries
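To make the conversion concrete, the edge mapping itself is simple. Here is a rough Java sketch of the kind of thing I do (the class and method names are just illustrative, and it only handles the version-1 layout; it assumes Avro 1.7.x on the classpath):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class CsvToAvro {
    // Version-1 schema for StatClassX (same as the first schema shown below).
    static final Schema V1 = new Schema.Parser().parse(
        "{\"type\": \"record\", \"name\": \"StatClassX\", "
      + "\"namespace\": \"com.example.stats\", \"fields\": ["
      + "{\"name\": \"device_id\", \"type\": \"string\"},"
      + "{\"name\": \"timestamp\", \"type\": \"string\"},"
      + "{\"name\": \"bytes_sent\", \"type\": \"long\"},"
      + "{\"name\": \"bytes_received\", \"type\": \"long\"}]}");

    // Map one version-1 CSV line onto an Avro record.
    static GenericRecord fromCsvLine(String line) {
        String[] f = line.split(",", -1);
        GenericRecord rec = new GenericData.Record(V1);
        rec.put("device_id", f[0]);
        rec.put("timestamp", f[1]);
        rec.put("bytes_sent", Long.parseLong(f[2]));
        rec.put("bytes_received", Long.parseLong(f[3]));
        return rec;
    }
}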
So, I want to drop all three of these CSV versions in one 'StatClassX' Hive table. This is fully supported in Hive when using the Avro storage format and following proper schema evolution policies. That much I have proven and have no questions about. My question is more about best practices for evolving the Avro schema itself.
Making this example work requires three schemas like the following:
{ "type": "record", "name": "StatClassX", "version": "1", "namespace": "com.example.stats", "fields": [{ "name": "device_id", "type": "string" }, { "name": "timestamp", "type": "string" }, { "name": "bytes_sent", "type": "long" }, { "name": "bytes_received", "type": "long" }] } { "type": "record", "name": "StatClassX", "version": "2", "namespace": "com.example.stats", "fields": [{ "name": "device_id", "type": "string" }, { "name": "timestamp", "type": "string" }, { "name": "bytes_sent", "type": "long" }, { "name": "bytes_received", "type": "long" }, { "name": "packet_discarded", "type": "long" "default": "0" }] } { "type": "record", "name": "StatClassX", "version": "3", "namespace": "com.example.stats", "fields": [{ "name": "device_id", "type": "string" }, { "name": "timestamp", "type": "string" }, { "name": "bytes_sent", "type": "long" }, { "name": "bytes_received", "type": "long" }, { "name": "packets_discarded", "type": "long", "default": "0", "aliases": ["packet_discarded"] }, { "name": "retries", "type": "int", "default": "0" }] }
In these schemas, I have used a user-defined property, "version", to indicate the schema version, which my custom conversion logic uses for reasons outside the scope of this question. It also happens to be the only human-readable, easy way to tell one schema from another and somehow tie it back to, say, a Word document that defines the source CSV schema. Getting to the point about aliases: I have introduced one in version 3 to handle the name change of one field. However, what I discovered recently is that you can also use aliases on the schema name itself; I thought they only applied to field names.
Question: I know I MUST use the alias on the renamed field. However, I could also give each schema a name that captures the version. For example, versions 1-3 could be named as follows using header-level aliases (Avro names only allow letters, digits, and underscores, so I use an underscore rather than a dot for the version suffix):
{ "type": "record", "name": "StatClassX.1", "namespace": "com.example.stats", "fields": [{...}] } { "type": "record", "name": "StatClassX.2", "aliases": ["StatClassX.1"], "namespace": "com.example.stats", "fields": [{...}] } { "type": "record", "name": "StatClassX.3", "aliases": ["StatClassX.1", "StatClassX.2"], "namespace": "com.example.stats", "fields": [{...}] }
This naming scheme intuitively feels weird. I prefer that the schema name reflect the 'class' of stats and not be overloaded with some notion of version. But I wonder what use case might exist for giving different schema names to what is really the same class of data. One thought is that you might decide to adopt a different naming convention, and aliases would let you roll that out. For versioning, it does not seem to fit, but I want to solicit thoughts from some experts nonetheless.
Thanks!
Created 01-15-2016 05:49 PM
@Mark Petronic did you ever find out the answer?
Created 01-15-2016 06:23 PM
@Mark Herring No, I haven't actually pursued this much beyond this post. My gut feel, and the way I have implemented it, is to use the same name for all schema versions that are of the same "logical type". That just felt the most right to me. I still include the custom "version" field, as I noted in the post, both as a book-keeping feature (for human consumption) and to use programmatically. For example, I am coalescing thousands of smaller CSV files into one larger Avro file. But since I have multiple versions of a given CSV "logical type" in flight, say three, I create up to three different Avro files and pack each with all the CSV data that aligns with one of the three Avro schemas. In that case, my "packer" app uses the version field from the schema to build a filename that shows, for example, that a file contains version 2 data. This is probably more for me to 'see' it in directory listings, but it helps me in debugging and monitoring. As this is my first build on Hadoop, the more human-friendly the file names and paths are, the easier I can ensure my design is working. I would rather see vsat_modc_stats__v2__20160114_123456.avro than 000000023_1. 🙂 The __v2__ is the schema version in this example.
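For anyone curious, the naming step is easy because Avro keeps unknown top-level schema properties, like my custom "version", available on the parsed schema. A rough sketch of the packer's naming logic (the class and method names are illustrative, and it assumes the "version" property is present):

import java.io.File;
import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class Packer {
    // Build a human-readable file name from the schema's custom "version"
    // property, e.g. vsat_modc_stats__v2__20160114_123456.avro.
    static File outputFile(String statClass, Schema schema) {
        String version = schema.getProp("version");  // custom schema property
        String stamp = new SimpleDateFormat("yyyyMMdd_HHmmss").format(new Date());
        return new File(statClass + "__v" + version + "__" + stamp + ".avro");
    }

    // Open an Avro container file to hold the coalesced CSV records.
    static DataFileWriter<GenericRecord> openWriter(String statClass, Schema schema)
            throws Exception {
        DataFileWriter<GenericRecord> writer =
            new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
        return writer.create(schema, outputFile(statClass, schema));
    }
}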