The article explores how an Avro schema registry can bring data governance to streaming data and the benefits that come with it. This tutorial demonstrates the implementation of this concept and some of the resulting features.
Download HDP 2.5 Sandbox
Modify the hosts file on the local machine to resolve sandbox.hortonworks.com to 127.0.0.1
SSH to the Sandbox (ssh email@example.com -p 2222)
Make sure to set the Ambari password to "admin"
Log in to Ambari (http://sandbox.hortonworks.com:8080/) and start the following services
Search for "avro_schema". The search should return a list of schemas that were created when the request to register schemas was made via the REST service call.
Click into one of the schemas and notice the available information about the top-level record
The record will have a "fields" attribute that contains links to other sub elements and in some cases, other schemas
Any field of any registered schema can now be searched and tagged. Schemas can also be associated with Kafka topics, which makes the streaming data sources on those topics discoverable. Notice, too, that the curl REST call returned a GUID. That GUID can be used to retrieve the schema that was registered, which means a message read from a Kafka topic can be deserialized automatically based on the "fingerprint" associated with that message. A standard client that relies on the Avro Schema Registry to look up the schema could do this.
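The fingerprint-based deserialization described above can be sketched roughly as follows. This is a minimal illustration and not the prototype's client. It makes three assumptions: the registry is a plain in-memory dictionary standing in for the REST service, each message carries its 36-character GUID as a fixed-width header (a hypothetical framing), and only the Avro `string`, `int`, and `long` field types are handled. The GUID and the `Truck` schema are made up for the example.

```python
import io
import json

GUID_LEN = 36   # assumed framing: every message starts with its schema GUID
REGISTRY = {}   # hypothetical stand-in for the schema registry REST service


def register(guid, schema_json):
    """Store a parsed Avro record schema under its GUID."""
    REGISTRY[guid] = json.loads(schema_json)


def _write_long(out, n):
    # Avro int/long: zigzag-encode, then emit 7 bits per byte, low bits first.
    n = (n << 1) ^ (n >> 63)
    while n & ~0x7F:
        out.write(bytes([(n & 0x7F) | 0x80]))
        n >>= 7
    out.write(bytes([n]))


def _read_long(buf):
    shift, acc = 0, 0
    while True:
        b = buf.read(1)[0]
        acc |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (acc >> 1) ^ -(acc & 1)   # undo the zigzag encoding


def encode(guid, record):
    """Producer side: GUID header followed by the Avro binary record body."""
    out = io.BytesIO()
    out.write(guid.encode("ascii"))
    for field in REGISTRY[guid]["fields"]:
        value = record[field["name"]]
        if field["type"] == "string":
            data = value.encode("utf-8")
            _write_long(out, len(data))   # strings are length-prefixed
            out.write(data)
        elif field["type"] in ("int", "long"):
            _write_long(out, value)
        else:
            raise ValueError("type not handled in this sketch: %r" % field["type"])
    return out.getvalue()


def decode(message):
    """Consumer side: read the GUID, look up the schema, decode the fields."""
    guid = message[:GUID_LEN].decode("ascii")
    schema = REGISTRY[guid]   # a real client would call the registry here
    buf = io.BytesIO(message[GUID_LEN:])
    record = {}
    for field in schema["fields"]:
        if field["type"] == "string":
            record[field["name"]] = buf.read(_read_long(buf)).decode("utf-8")
        elif field["type"] in ("int", "long"):
            record[field["name"]] = _read_long(buf)
        else:
            raise ValueError("type not handled in this sketch: %r" % field["type"])
    return record


# Made-up GUID and schema, for illustration only.
guid = "123e4567-e89b-12d3-a456-426614174000"
register(guid, json.dumps({
    "type": "record",
    "name": "Truck",
    "fields": [{"name": "driver", "type": "string"},
               {"name": "speed", "type": "int"}],
}))
message = encode(guid, {"driver": "ann", "speed": -3})
decoded = decode(message)
```

The consumer needs no schema compiled in: everything it requires to interpret the payload is resolved at read time from the identifier carried with the message.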
To retrieve the Avro-compliant schema notation:
Get the GUID that the curl command returned after the sample schema was registered
The response should be an Avro-compliant schema descriptor
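As a rough sketch of the retrieval steps above, the snippet below fetches a schema by GUID and performs a basic sanity check that the response looks like an Avro record schema. The registry's URL layout is not shown here, so `base_url` is left as a caller-supplied placeholder, and the helper names `fetch_schema_text` and `check_record_schema` are hypothetical. The check covers only the attributes the Avro specification requires of a record (`type`, `name`, `fields`); it is not full schema validation.

```python
import json
import urllib.parse
import urllib.request


def fetch_schema_text(base_url, guid):
    """GET <base_url>/<guid> and return the body as text.

    base_url is an assumption: substitute the registry endpoint that the
    registration call was made against.
    """
    url = base_url.rstrip("/") + "/" + urllib.parse.quote(guid)
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")


def check_record_schema(text):
    """Parse the response and confirm it has the shape of an Avro record."""
    schema = json.loads(text)
    if not isinstance(schema, dict) or schema.get("type") != "record":
        raise ValueError("not an Avro record schema")
    if "name" not in schema or not isinstance(schema.get("fields"), list):
        raise ValueError("record schema needs 'name' and a 'fields' array")
    for field in schema["fields"]:
        if "name" not in field or "type" not in field:
            raise ValueError("each field needs 'name' and 'type'")
    return schema


# Offline example with a made-up schema; with a live registry this text
# would come from fetch_schema_text("<registry-url>", "<guid>").
sample = '{"type": "record", "name": "Sample", "fields": [{"name": "id", "type": "long"}]}'
schema = check_record_schema(sample)
field_names = [f["name"] for f in schema["fields"]]
```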
This prototype does not handle schema validation or compatibility enforcement. It also does not do any caching to optimize performance or leverage Kafka for asynchronous notification. However, it does demonstrate how the described capabilities can be achieved.