Community Articles

vvaks · 08-15-2016

This article is a companion to the article "Avro Schema Registry with Apache Atlas for Streaming Data Management".

https://community.hortonworks.com/articles/51379/avro-schema-registry-with-apache-atlas-for-streami....

The article explores how an Avro schema registry can bring data governance to streaming data and the benefits that come with it. This tutorial demonstrates the implementation of this concept and some of the resulting features.

Download HDP 2.5 Sandbox
Modify the hosts file on the local machine to resolve sandbox.hortonworks.com to 127.0.0.1
SSH to the Sandbox (ssh root@sandbox.hortonworks.com -p 2222)
1. ambari-admin-password-reset
  1. Make sure to set the Ambari password to "admin"
Log into Ambari and start the following services (http://sandbox.hortonworks.com:8080/)
1. HBase
2. Log Search
3. Kafka
4. Atlas
From the SSH console:
1. git clone https://github.com/vakshorton/AvroSchemaShredder
2. cd /root/AvroSchemaShredder
3. chmod 755 install.sh
4. ./install.sh
5. java -jar AvroSchemaShredder-jar-with-dependencies.jar
Open a second SSH session to the Sandbox (ssh root@sandbox.hortonworks.com -p 2222)
1. cd /root/AvroSchemaShredder
2. curl -u admin:admin -d @schema/schema_1.json -H "Content-Type: application/json" -X POST http://sandbox.hortonworks.com:8090/schemaShredder/storeSchema
  1. This curl command makes a REST API call to the AvroSchemaShredder service, which parses the sample Avro schema and stores it in Atlas.
Log into Atlas: http://sandbox.hortonworks.com:21000 (user: admin, password: admin)
Search for "avro_schema". The search should return the list of schemas that were created when the schemas were registered through the REST service call.
1. Click into one of the schemas and note the available information about the top-level record
2. The record will have a "fields" attribute that contains links to other sub-elements and, in some cases, other schemas
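
The registration call from the second SSH session can also be wrapped in a small script so that the service's response (which carries the GUID) is easy to capture. The following is a minimal sketch, not part of the AvroSchemaShredder repo: the endpoint, credentials and Content-Type header are the ones used in the steps above, while the function names `store_schema_url` and `register_schema` and the `BASE_URL` and `AUTH` variables are hypothetical. The response body is printed as-is because its exact format is not described here.

```shell
#!/usr/bin/env bash
# Sketch: register an Avro schema file with the AvroSchemaShredder service.
set -euo pipefail

# Endpoint and credentials taken from the steps above; override via environment if needed.
BASE_URL="${BASE_URL:-http://sandbox.hortonworks.com:8090/schemaShredder}"
AUTH="${AUTH:-admin:admin}"

# Hypothetical helper: print the storeSchema URL.
store_schema_url() {
  printf '%s/storeSchema\n' "${BASE_URL%/}"
}

# Hypothetical helper: POST a schema file and print the raw response body.
register_schema() {
  local schema_file="$1"
  [ -r "$schema_file" ] || { echo "cannot read $schema_file" >&2; return 1; }
  curl --silent --show-error --fail -u "$AUTH" \
       -H "Content-Type: application/json" \
       -X POST -d @"$schema_file" "$(store_schema_url)"
}

# Only call the service when a schema file is passed on the command line.
if [ "$#" -ge 1 ]; then
  register_schema "$1"
fi
```

Saved as, say, register.sh in /root/AvroSchemaShredder, it could be run as `./register.sh schema/schema_1.json > response.txt` to keep the response for later use.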

Any field of any registered schema can now be searched and tagged. Schemas can be associated with Kafka topics, allowing discovery of streaming data sources on those topics. Also, notice that the curl REST call returned a GUID, which can be used to access the schema that was registered. This means that a message read from a Kafka topic can be deserialized automatically based on the "fingerprint" associated with that message. This could be achieved with a standard client that relies on the Avro Schema Registry to deserialize messages.
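
To make the fingerprint idea concrete, here is a rough sketch of the lookup such a client would perform before decoding a message. It rests on assumptions that the prototype does not define: each message is taken to be a text line of the form fingerprint, tab, payload, with the schema GUID used as the fingerprint, and the helper names are hypothetical. A real client would handle binary Avro payloads and do the actual decoding with an Avro library. Only the getSchema endpoint comes from this article.

```shell
#!/usr/bin/env bash
# Sketch: resolve the schema for a message from its fingerprint.
# ASSUMPTION: messages look like "<GUID><TAB><payload>"; this framing is illustrative only.
set -euo pipefail

BASE_URL="${BASE_URL:-http://sandbox.hortonworks.com:8090/schemaShredder}"
TAB=$'\t'

# Hypothetical helper: the fingerprint is the text before the first tab.
message_fingerprint() {
  printf '%s\n' "${1%%"$TAB"*}"
}

# Hypothetical helper: the payload is the text after the first tab.
message_payload() {
  printf '%s\n' "${1#*"$TAB"}"
}

# Print the getSchema URL a consumer would call before decoding the payload.
schema_url_for_message() {
  printf '%s/getSchema/%s\n' "${BASE_URL%/}" "$(message_fingerprint "$1")"
}
```

The point of the design is that the message itself only needs to carry a short identifier. The full schema stays in the registry and is fetched on demand.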

To retrieve the Avro-compliant schema notation:
1. Get the GUID that the curl command returned when the sample schema was registered
2. curl -u admin:admin -X GET http://sandbox.hortonworks.com:8090/schemaShredder/getSchema/{GUID}
3. The response should be an Avro-compliant schema descriptor
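
The retrieval steps above can be sketched as a small script that takes the GUID and writes the returned schema descriptor to a file. The endpoint and credentials are the ones from the steps. The function names `get_schema_url` and `fetch_schema`, the `BASE_URL` and `AUTH` variables, and the output file argument are hypothetical and used only for illustration.

```shell
#!/usr/bin/env bash
# Sketch: fetch a registered schema by GUID and save it to a file.
set -euo pipefail

# Endpoint and credentials taken from the steps above; override via environment if needed.
BASE_URL="${BASE_URL:-http://sandbox.hortonworks.com:8090/schemaShredder}"
AUTH="${AUTH:-admin:admin}"

# Hypothetical helper: print the getSchema URL for a GUID, rejecting an empty GUID.
get_schema_url() {
  local guid="$1"
  [ -n "$guid" ] || { echo "GUID must not be empty" >&2; return 1; }
  printf '%s/getSchema/%s\n' "${BASE_URL%/}" "$guid"
}

# Hypothetical helper: GET the schema and write the response body to out_file.
fetch_schema() {
  local guid="$1" out_file="$2" url
  url="$(get_schema_url "$guid")" || return 1
  curl --silent --show-error --fail -u "$AUTH" -X GET -o "$out_file" "$url"
}

# Only call the service when both a GUID and an output file are given.
if [ "$#" -ge 2 ]; then
  fetch_schema "$1" "$2"
fi
```

For example, `./fetch.sh {GUID} schema_1.avsc`, with {GUID} replaced by the value returned at registration, would leave the descriptor in schema_1.avsc.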

This prototype does not handle schema validation or compatibility enforcement. It also does not do any caching to optimize performance or leverage Kafka for asynchronous notification. However, it does demonstrate how the described capabilities can be achieved.

Repo: https://community.hortonworks.com/content/repo/51366/avro-schema-shredder.html

Cloudera Community

Apache Atlas as an Avro Schema Registry Test Drive