Data is becoming the new precious resource. As the world produces more and more data, business units find more ways to monetize it. Data that used to be retained for a short time, or not at all, is now persisted long term. It is gathered from a growing number of sources, not necessarily from within the organization that uses it. It is also increasingly generated by machines, so its volume, velocity, and variety grow at an overwhelming rate. Many tools now help an organization address the challenges this proliferation creates. However, many organizations have focused on volume and velocity while neglecting the challenges created by missing or inconsistent structure.

To unlock the value of all that data, an organization must first apply a consistent set of guidelines for governing it. Getting value from new data sources often requires imposing schemas on unstructured or semi-structured data, because the new data usually has to be combined with existing structured data before it is useful. Schemas can also be important for security, since sensitive pieces of data are often mixed into data sets that are generally considered non-sensitive. Finally, business units generally do not create the technologies that monetize the data. That job falls to many different engineering groups that are often decentralized. To build the tools that harvest value from data, engineering teams need to agree on how that data should be used, modified, and enriched. Consider a scenario where two engineering teams are working on requirements from two different business units, and neither knows about the other's work. When team A wants to evolve the schema of some data set, it must be sure that the change will not disrupt the work of team B. This is challenging, since team A may not know that team B is using the same data or what team B is doing with it. In addition, team B will likely derive a new data set from the existing data. That new data set may be exactly what team A needs to deliver what the business has asked for. Team A needs to be able to discover that team B has produced a new data set from the one that both teams were using.

It used to be that data was primarily stored in siloed relational databases in a structured format. The very existence of data was predicated on the existence of a well-defined schema. In the new world of Big Data platforms, data is often stored without a schema, and in some cases the data is a stream of messages in a queueing system. Data governance tools like Apache Atlas can help with the management of data sets and of the processes that evolve them. The flexibility of Atlas enables the creation of new managed Types that can be used to govern data sets from just about any data source. In fact, as of Hortonworks Data Platform 2.5, Atlas is used to visualize and track cross-component lineage of data ingested via Apache Hive, Apache Sqoop, Apache Falcon, Apache Storm, Apache Kafka, and, in the future, Apache NiFi. Schemas for Hive tables are stored and governed, which covers many data-at-rest use cases. It makes a lot of sense to manage schemas for streaming data sources within Atlas as well. Kafka topics are captured as part of Storm topologies, but currently only configuration information is available. An Avro Schema Registry, combined with the existing governance capabilities of Atlas, would extend the benefits of data governance to streaming data sets.

To extend the concept of schema to streaming data sets, a serialization format with a built-in concept of schema is required. Apache Avro is a commonly used serialization format for streaming data. It is extremely efficient for writes and includes a self-describing schema as part of its specification. The Avro schema specification allows for schema evolution that is backward or forward compatible. Each message can be serialized with its schema so that an independent downstream consumer is able to deserialize the message. Instead of the full schema, it is also possible to pass a "fingerprint" that uniquely identifies the schema. This is useful when the schema is very large. However, using a fingerprint with messages that will travel through multiple Kafka topics requires that the consumer be able to look up the schema that the fingerprint refers to. Atlas can be used not only to store Avro schemas but also to make them searchable and useful for data governance, discovery, and security. The first step to using Atlas as an Avro Schema Registry is to add new Types that align with the Avro Schema specification.
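To make the fingerprint idea concrete, here is a minimal Python sketch. The Avro specification defines fingerprints over a "Parsing Canonical Form" of the schema; the normalization below (re-serializing the JSON with sorted keys and no whitespace) is a simplified stand-in and an assumption of this example, so its digests will not match those produced by a real Avro library. The `Truck` schema and the function name `fingerprint` are hypothetical.

```python
import hashlib
import json


def fingerprint(schema_json: str) -> str:
    """Return a hex SHA-256 digest of a normalized form of an Avro schema."""
    # Parsing and re-serializing drops insignificant whitespace, and sorting
    # keys makes the result independent of attribute order. This is a
    # simplification, NOT Avro's Parsing Canonical Form.
    parsed = json.loads(schema_json)
    normalized = json.dumps(parsed, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Hypothetical schema used only for illustration.
SCHEMA_A = """
{"type": "record", "name": "Truck", "namespace": "example",
 "fields": [{"name": "id", "type": "string"},
            {"name": "speed", "type": "int"}]}
"""

# Same schema, different whitespace and attribute order.
SCHEMA_A_REORDERED = (
    '{"name":"Truck","namespace":"example","type":"record",'
    '"fields":[{"type":"string","name":"id"},{"type":"int","name":"speed"}]}'
)

# A genuinely different schema: the type of "speed" changes from int to long.
SCHEMA_B = SCHEMA_A.replace('"int"', '"long"')

if __name__ == "__main__":
    print(fingerprint(SCHEMA_A) == fingerprint(SCHEMA_A_REORDERED))
    print(fingerprint(SCHEMA_A) == fingerprint(SCHEMA_B))
```

A producer could attach such a digest to each message instead of the full schema, provided every consumer can resolve the digest back to the schema through a shared registry.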

Avro Schema supports the following types:

  • Records
  • Enums
  • Arrays
  • Maps
  • Unions
  • Fixed
  • Primitives
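
As an illustration of how these types nest, the following Python sketch defines a single hypothetical record schema (`TruckEvent`, invented for this example) that uses each kind of type, and walks it to report which kinds appear. The walker is a simplified reading of the Avro schema JSON layout and ignores details such as logical types and aliases.

```python
import json

PRIMITIVES = {"null", "boolean", "int", "long", "float", "double", "bytes", "string"}

# Hypothetical schema that exercises every kind of type in the list above.
SCHEMA = json.loads("""
{
  "type": "record", "name": "TruckEvent", "namespace": "example",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "status",
     "type": {"type": "enum", "name": "Status", "symbols": ["MOVING", "STOPPED"]}},
    {"name": "route", "type": {"type": "array", "items": "double"}},
    {"name": "tags", "type": {"type": "map", "values": "string"}},
    {"name": "driver", "type": ["null", "string"], "default": null},
    {"name": "checksum", "type": {"type": "fixed", "name": "MD5", "size": 16}}
  ]
}
""")


def kinds(schema, found=None):
    """Collect the kinds of Avro types that appear anywhere in a schema."""
    if found is None:
        found = set()
    if isinstance(schema, str):
        # A bare string is a primitive name or a reference to a named type.
        found.add("primitive" if schema in PRIMITIVES else "reference")
    elif isinstance(schema, list):
        # A JSON array denotes a union of the listed branches.
        found.add("union")
        for branch in schema:
            kinds(branch, found)
    elif isinstance(schema, dict):
        type_name = schema["type"]
        if type_name == "record":
            found.add("record")
            for field in schema["fields"]:
                kinds(field["type"], found)
        elif type_name in ("enum", "fixed"):
            found.add(type_name)
        elif type_name == "array":
            found.add("array")
            kinds(schema["items"], found)
        elif type_name == "map":
            found.add("map")
            kinds(schema["values"], found)
        else:
            # Object form of a simple type, e.g. {"type": "string"}.
            kinds(type_name, found)
    return found


if __name__ == "__main__":
    print(sorted(kinds(SCHEMA)))
```

A set of Atlas Types modeled on Avro would need to represent the same nesting: records that contain fields, whose types may in turn be enums, arrays, maps, unions, fixed types, or primitives.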

Using the Atlas API, it is possible to create types that exhibit the same kinds of attributes and nesting structure. The second required component is a service that is capable of parsing the JSON representation of an Avro Schema and translating it into the new Atlas Avro Types. After registering the schema, the service should return a fingerprint (GUID) that will act as the claim check for that schema on deserialization. The service should also handle schema validation and compatibility enforcement. This set of capabilities would allow automatic deserialization of messages from a Kafka topic.
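
The responsibilities of such a service can be sketched in a few lines of Python. This is a toy, in-memory stand-in and not an Atlas integration: the class name `InMemorySchemaRegistry`, its methods, and the example schemas are all hypothetical, and a real service would persist schemas as Atlas entities rather than in a dictionary. The compatibility check implements only one simplified rule (a field added to a record must declare a default so that data written with the previous schema can still be read) and ignores the rest of Avro's schema resolution rules.

```python
import json
import uuid


class InMemorySchemaRegistry:
    """Toy registry: validates, checks compatibility, and hands out GUIDs."""

    def __init__(self):
        self._by_id = {}    # GUID string -> parsed schema
        self._latest = {}   # full schema name -> GUID of the latest version

    def register(self, schema_json: str) -> str:
        """Register a schema and return the GUID that acts as its claim check."""
        schema = json.loads(schema_json)
        self._validate(schema)
        full_name = schema.get("namespace", "") + "." + schema["name"]
        previous_id = self._latest.get(full_name)
        if previous_id is not None:
            self._check_backward_compatible(self._by_id[previous_id], schema)
        schema_id = str(uuid.uuid4())
        self._by_id[schema_id] = schema
        self._latest[full_name] = schema_id
        return schema_id

    def lookup(self, schema_id: str) -> dict:
        """Resolve a GUID back to the schema, as a consumer would."""
        return self._by_id[schema_id]

    @staticmethod
    def _validate(schema):
        if not isinstance(schema, dict) or schema.get("type") != "record":
            raise ValueError("this sketch accepts only top-level record schemas")
        if "name" not in schema or not isinstance(schema.get("fields"), list):
            raise ValueError("a record needs a name and a list of fields")

    @staticmethod
    def _check_backward_compatible(old, new):
        # Simplified rule: every field added in the new version needs a default.
        old_names = {field["name"] for field in old["fields"]}
        for field in new["fields"]:
            if field["name"] not in old_names and "default" not in field:
                raise ValueError(f"new field {field['name']!r} has no default")


# Hypothetical schema versions used only for illustration.
V1 = '{"type":"record","name":"Truck","namespace":"example","fields":[{"name":"id","type":"string"}]}'
V2 = ('{"type":"record","name":"Truck","namespace":"example","fields":['
      '{"name":"id","type":"string"},{"name":"speed","type":"int","default":0}]}')
V3_BAD = ('{"type":"record","name":"Truck","namespace":"example","fields":['
          '{"name":"id","type":"string"},{"name":"driver","type":"string"}]}')
```

A producer would call `register` once per schema version and attach the returned GUID to its messages; a consumer would call `lookup` with that GUID to obtain the schema needed for deserialization. Registering `V3_BAD` after `V1` raises `ValueError` because the added field has no default.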

While just having an Avro Schema Registry is valuable for streaming use cases, using Atlas as the underlying store provides substantial additional value. Data discovery becomes much easier, since all of the fields in each Avro Schema can be individually indexed. This means that a user can search for the name of a field and determine the schema and Kafka topic where it can be found. In many use cases the messages flowing through the Kafka topic flow into a Hive table, HDFS location, or some NoSQL store. Engineering teams can use the cross-component lineage visualization in Atlas to understand the effects that schema evolution will have downstream. Atlas also provides the ability to apply tags and business taxonomies. These capabilities make it much easier to curate, understand, and control how streaming data is deployed and secured. For example, Apache Atlas integrates with Apache Ranger (an authorization system) to enable tag-based policies. This capability allows column-level authorization for data managed by Apache Hive, based on tags applied to the metadata in Atlas. Apache Ranger is also currently able to secure Kafka topics based on source IP or user name (in Kerberized clusters). Tag-based policies are not yet available for Kafka topics. However, it should be possible to reuse the same tag sync subsystem used to implement tag-based policies in Hive. Tags can also be used to deprecate older schemas or to prevent evolution of certain schemas through the Registry API. Finally, because Atlas uses HBase and Solr under the covers, enterprise requirements like HA and DR capabilities do not need to be re-implemented.
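
The field-level discovery described above amounts to an inverted index from field name to the schemas and topics that contain it. The Python sketch below illustrates the idea with plain dictionaries. It does not use Atlas or Solr, the topic and schema names are hypothetical, and it indexes only the top-level fields of each record.

```python
from collections import defaultdict


def build_field_index(registrations):
    """Map each field name to the (schema name, topic) pairs that contain it."""
    index = defaultdict(set)
    for topic, schema in registrations:
        for field in schema["fields"]:
            index[field["name"]].add((schema["name"], topic))
    return index


def find_field(index, field_name):
    """Return the sorted (schema name, topic) pairs that declare field_name."""
    return sorted(index.get(field_name, set()))


# Hypothetical topic-to-schema registrations used only for illustration.
REGISTRATIONS = [
    ("truck_events", {
        "type": "record", "name": "TruckEvent",
        "fields": [{"name": "driver_id", "type": "string"},
                   {"name": "speed", "type": "int"}],
    }),
    ("driver_alerts", {
        "type": "record", "name": "DriverAlert",
        "fields": [{"name": "driver_id", "type": "string"},
                   {"name": "message", "type": "string"}],
    }),
]

if __name__ == "__main__":
    index = build_field_index(REGISTRATIONS)
    print(find_field(index, "driver_id"))
```

Searching for `driver_id` returns both schemas and their topics, which is the kind of answer a team would need before changing or removing that field.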

It is clear that data governance is becoming an absolutely essential component of an enterprise data management platform. Whether the data is streaming or at rest, both business and technology organizations need to discover, understand, govern, and secure that data. Combining the capabilities of existing data governance tools like Apache Atlas with schema-aware data formats like Apache Avro (Kafka) and Apache ORC (Hive/Pig/Spark) can make managing Big Data that much easier.