This article is a companion to "Avro Schema Registry with Apache Atlas for Streaming Data Management".

https://community.hortonworks.com/articles/51379/avro-schema-registry-with-apache-atlas-for-streami....

That article explores how an Avro schema registry can bring data governance to streaming data, and the benefits that come with it. This tutorial demonstrates an implementation of that concept and some of the resulting features.

  1. Download HDP 2.5 Sandbox
  2. Modify the hosts file on the local machine to resolve sandbox.hortonworks.com to 127.0.0.1 (add the line: 127.0.0.1 sandbox.hortonworks.com)
  3. SSH to the Sandbox (ssh root@sandbox.hortonworks.com -p 2222)
    1. Run ambari-admin-password-reset
      1. Make sure to set the Ambari password to "admin"
  4. Log into Ambari (http://sandbox.hortonworks.com:8080/) and start the following services:
    1. HBase
    2. Log Search
    3. Kafka
    4. Atlas
  5. From the SSH console:
    1. git clone https://github.com/vakshorton/AvroSchemaShredder
    2. cd /root/AvroSchemaShredder
    3. chmod 755 install.sh
    4. ./install.sh
    5. java -jar AvroSchemaShredder-jar-with-dependencies.jar
  6. Leave the service from the previous step running and open a second SSH session to the Sandbox (ssh root@sandbox.hortonworks.com -p 2222)
    1. cd /root/AvroSchemaShredder
    2. curl -u admin:admin -d @schema/schema_1.json -H "Content-Type: application/json" -X POST http://sandbox.hortonworks.com:8090/schemaShredder/storeSchema
      1. This curl command makes a REST API call to the AvroSchemaShredder service, which parses the sample Avro schema and stores it in Atlas.
  7. Log into Atlas: http://sandbox.hortonworks.com:21000 (user: admin, password: admin)
  8. Search for "avro_schema". The search should return the list of schemas that were created by the REST call in step 6.
    1. Click into one of the schemas and note the information available about the top-level record.
    2. The record will have a "fields" attribute that contains links to other sub-elements and, in some cases, to other schemas.
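
Step 6 can be wrapped in a small shell function so that the service response, which includes the GUID, is kept for later use. This is only a minimal sketch and is not part of the AvroSchemaShredder repo: the function names and the SHREDDER_BASE variable are made up for illustration, and the exact format of the response body is an assumption, since it is not documented here.

```shell
# Base URL of the AvroSchemaShredder service (host, port and path from step 6).
# SHREDDER_BASE is a hypothetical variable name used only in this sketch.
SHREDDER_BASE="${SHREDDER_BASE:-http://sandbox.hortonworks.com:8090/schemaShredder}"

# Print the storeSchema endpoint URL.
store_schema_url() {
  printf '%s/storeSchema' "$SHREDDER_BASE"
}

# POST a schema file to the service and print the raw response body.
# Assumption: the response contains the GUID of the registered schema.
register_schema() {
  schema_file="$1"
  if [ ! -f "$schema_file" ]; then
    echo "schema file not found: $schema_file" >&2
    return 1
  fi
  curl --silent --show-error --fail -u admin:admin \
    -H "Content-Type: application/json" \
    -X POST --data "@${schema_file}" \
    "$(store_schema_url)"
}
```

From /root/AvroSchemaShredder, the response could then be captured with RESPONSE=$(register_schema schema/schema_1.json) and inspected with echo "$RESPONSE".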

Now any field of any registered schema can be searched and tagged. Schemas can be associated with Kafka topics, allowing discovery of streaming data sources on those topics. Also, notice that the curl REST call returned a GUID. That GUID can be used to access the schema that was registered. This means that a message read from a Kafka topic can be deserialized automatically based on the "fingerprint" associated with that message. This could be achieved using a standard client that depends on the Avro Schema Registry to deserialize messages.

  1. To retrieve the Avro-compliant schema notation:
    1. Get the GUID that the curl command returned after the sample schema was registered
    2. curl -u admin:admin -X GET http://sandbox.hortonworks.com:8090/schemaShredder/getSchema/{GUID}
    3. The response should be an Avro-compliant schema descriptor
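
The retrieval steps above can be scripted as a small shell function that takes the GUID as its argument. This is a hedged sketch rather than anything shipped with the prototype: the function names and the SHREDDER_BASE variable are hypothetical, and the GUID value must be the one returned when the schema was registered.

```shell
# Base URL of the AvroSchemaShredder service (host, port and path from the curl call above).
# SHREDDER_BASE is a hypothetical variable name used only in this sketch.
SHREDDER_BASE="${SHREDDER_BASE:-http://sandbox.hortonworks.com:8090/schemaShredder}"

# Print the getSchema endpoint URL for a given GUID.
get_schema_url() {
  printf '%s/getSchema/%s' "$SHREDDER_BASE" "$1"
}

# GET the schema registered under the given GUID and print the response body.
fetch_schema() {
  guid="$1"
  if [ -z "$guid" ]; then
    echo "usage: fetch_schema GUID" >&2
    return 1
  fi
  curl --silent --show-error --fail -u admin:admin \
    -X GET "$(get_schema_url "$guid")"
}
```

For example, fetch_schema "$GUID" > schema_from_atlas.json would save the returned descriptor to a file, where GUID holds the value returned at registration time.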

This prototype does not handle schema validation or compatibility enforcement. It also does not do any caching to optimize performance or leverage Kafka for asynchronous notification. However, it does demonstrate how the described capabilities can be achieved.

Repo: https://community.hortonworks.com/content/repo/51366/avro-schema-shredder.html
