Member since: 03-24-2016
Posts: 184
Kudos Received: 239
Solutions: 39
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3248 | 10-21-2017 08:24 PM |
| | 1964 | 09-24-2017 04:06 AM |
| | 6477 | 05-15-2017 08:44 PM |
| | 2205 | 01-25-2017 09:20 PM |
| | 7285 | 01-22-2017 11:51 PM |
08-15-2016
03:47 AM
6 Kudos
This article is a companion to "Avro Schema Registry with Apache Atlas for Streaming Data Management" (https://community.hortonworks.com/articles/51379/avro-schema-registry-with-apache-atlas-for-streami.html), which explores how an Avro schema registry can bring data governance to streaming data and the benefits that come with it. This tutorial demonstrates an implementation of that concept and some of the resulting features.
- Download the HDP 2.5 Sandbox
- Modify the hosts file on the local machine to resolve sandbox.hortonworks.com to 127.0.0.1
- SSH to the Sandbox (ssh root@sandbox.hortonworks.com -p 2222)
- Run ambari-admin-password-reset and make sure to set the Ambari password to "admin"
- Log into Ambari (http://sandbox.hortonworks.com:8080/) and start the following services: HBase, Log Search, Kafka, Atlas

From the SSH console:

- git clone https://github.com/vakshorton/AvroSchemaShredder
- cd /root/AvroSchemaShredder
- chmod 755 install.sh
- ./install.sh
- java -jar AvroSchemaShredder-jar-with-dependencies.jar
- Open a second SSH session to the Sandbox (ssh root@sandbox.hortonworks.com -p 2222)
- cd /root/AvroSchemaShredder
- curl -u admin:admin -d @schema/schema_1.json -H "Content-Type: application/json" -X POST http://sandbox.hortonworks.com:8090/schemaShredder/storeSchema
The curl command makes a REST API call to the AvroSchemaShredder service, which parses the sample Avro schema and stores it in Atlas. Log into Atlas at http://sandbox.hortonworks.com:21000 (user: admin, password: admin) and search for "avro_schema". The search should return the schemas that were created when the registration request was made via the REST call.

Click into one of the schemas and notice the available information about the top-level record. The record has a "fields" attribute that contains links to other sub-elements and, in some cases, to other schemas. Any field of any registered schema can now be searched and tagged, and schemas can be associated with Kafka topics, allowing discovery of the streaming data sources on those topics. Also notice that the curl call returned a GUID. That GUID can be used to access the registered schema, which means a message can be automatically deserialized from a Kafka topic based on the "fingerprint" associated with it, for example by a standard client that relies on the Avro Schema Registry to deserialize messages. To retrieve the Avro-compliant schema notation, take the GUID that the curl command returned when the sample schema was registered and call:

curl -u admin:admin -X GET http://sandbox.hortonworks.com:8090/schemaShredder/getSchema/{GUID}

The response should be an Avro-compliant schema descriptor. This prototype does not handle schema validation or compatibility enforcement, nor does it do any caching to optimize performance or leverage Kafka for asynchronous notification. However, it does demonstrate how the described capabilities can be achieved. Repo: https://community.hortonworks.com/content/repo/51366/avro-schema-shredder.html
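As a further illustration, here is a minimal Java sketch (not part of the AvroSchemaShredder repo) of how a consumer could use the returned GUID to fetch the schema and deserialize an Avro-encoded message. The class name and the assumption that the endpoint returns a plain Avro schema JSON string are mine; the host, port, and credentials match the Sandbox defaults used above.

// Hypothetical consumer-side sketch: fetch a registered schema by GUID and use it
// to deserialize a binary-encoded Avro record. Assumes the getSchema endpoint returns
// the Avro schema JSON and that the message bytes were written with that schema.
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Scanner;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DecoderFactory;

public class SchemaRegistryConsumerSketch {

    // Pull the schema descriptor for a given GUID from the registry service.
    static String fetchSchema(String guid) throws Exception {
        URL url = new URL("http://sandbox.hortonworks.com:8090/schemaShredder/getSchema/" + guid);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        String auth = Base64.getEncoder()
                .encodeToString("admin:admin".getBytes(StandardCharsets.UTF_8));
        conn.setRequestProperty("Authorization", "Basic " + auth);
        try (InputStream in = conn.getInputStream();
             Scanner s = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
            return s.hasNext() ? s.next() : "";
        }
    }

    // Deserialize a raw Avro payload using the schema looked up by GUID.
    static GenericRecord deserialize(String guid, byte[] avroBytes) throws Exception {
        Schema schema = new Schema.Parser().parse(fetchSchema(guid));
        GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
        return reader.read(null, DecoderFactory.get().binaryDecoder(avroBytes, null));
    }
}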
08-13-2016
08:59 PM
6 Kudos
Data is becoming the new precious resource. As the world produces more and more data, business units find ever more ways to monetize it. Data that used to be retained briefly or not at all is now persisted long term, gathered from a growing number of sources that are not necessarily inside the organization that uses it, and increasingly generated by machines. As a result, the volume, velocity, and variety of data proliferate at an overwhelming rate.

There are now plenty of tools that help an organization address the challenges imposed by this proliferation. However, many organizations have focused on volume and velocity while neglecting the challenges created by missing or inconsistent structure. To truly unlock the power of all that data, an organization must first apply a consistent set of governance guidelines. Getting value from new data sources often requires imposing schemas on unstructured or semi-structured data, because new data usually has to be combined with existing structured data to be useful. Schemas are also important for security, since sensitive bits of data are often mixed into data sets that are generally considered non-sensitive. Finally, business units generally do not build the technologies that monetize the data; that job falls to many different, often decentralized, engineering groups. To build tools that harvest value from data effectively, those teams need to agree on how the data should be used, modified, and enriched. Consider a scenario in which two engineering teams work on requirements from two different business units with no knowledge of each other's work. When team A wants to evolve the schema of a data set, it must be sure the change will not disrupt the work of team B. That is difficult, because team A may not even know that team B uses the same data or what team B does with it. In addition, team B will likely derive a new data set from the existing one, and that new data set may be exactly what team A needs to deliver what the business has asked for. Team A needs to be able to discover that team B has produced a new data set from the one both teams were using.

Data used to be stored primarily in siloed relational databases in a structured format; the very existence of the data was predicated on a well-defined schema. In the new world of Big Data platforms, data is often stored without a schema, and in some cases it is a stream of messages in a queueing system. Data governance tools like Apache Atlas can help manage data sets and the processes that evolve them. The flexibility of Atlas enables the creation of new managed types that can govern data sets from just about any data source. In fact, as of Hortonworks Data Platform 2.5, Atlas is used to visualize and track cross-component lineage of data ingested via Apache Hive, Apache Sqoop, Apache Falcon, Apache Storm, Apache Kafka, and, in the future, Apache NiFi. Schemas for Hive tables are stored and governed, covering many data-at-rest use cases, so it makes a lot of sense to manage schemas for streaming data sources within Atlas as well. Kafka topics are captured as part of Storm topologies, but currently only configuration information is available. The concept of an Avro Schema Registry, combined with the existing governance capabilities of Atlas, would extend the benefits of data governance to streaming data sets.
Extending the concept of schema to streaming data sets requires a serialization format with a built-in notion of schema. Apache Avro is a commonly used serialization format for streaming data. It is extremely efficient for writes and includes a self-describing schema as part of its specification, and the Avro schema specification allows for schema evolution that is backward or forward compatible. Each message can be serialized together with its schema so that an independent downstream consumer is able to deserialize it. Instead of the full schema, it is also possible to pass a "fingerprint" that uniquely identifies the schema, which is useful when the schema is very large. However, using a fingerprint with messages that travel through multiple Kafka topics requires that consumers be able to look up the schema the fingerprint refers to. Atlas can be used not only to store Avro schemas but also to make them searchable and useful for data governance, discovery, and security.
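To make the fingerprint idea concrete, here is a minimal Java sketch that uses Avro's built-in 64-bit Rabin fingerprint to tag each serialized message; in the registry described here, the GUID returned by Atlas at registration time would play the same role. The Event schema and the class name are invented for the example and are not part of any referenced project.

import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

import org.apache.avro.Schema;
import org.apache.avro.SchemaNormalization;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class FingerprintedMessageSketch {

    // Example schema for illustration only; any registered schema would work the same way.
    static final Schema SCHEMA = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
      + "{\"name\":\"id\",\"type\":\"string\"},{\"name\":\"value\",\"type\":\"long\"}]}");

    // Prepend the 64-bit fingerprint of the writer schema to the Avro payload so a
    // downstream consumer can look the schema up instead of carrying it in every message.
    static byte[] serializeWithFingerprint(GenericRecord record) throws Exception {
        long fingerprint = SchemaNormalization.parsingFingerprint64(record.getSchema());
        ByteArrayOutputStream payload = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(payload, null);
        new GenericDatumWriter<GenericRecord>(record.getSchema()).write(record, encoder);
        encoder.flush();
        return ByteBuffer.allocate(8 + payload.size())
                .putLong(fingerprint)
                .put(payload.toByteArray())
                .array();
    }

    public static void main(String[] args) throws Exception {
        GenericRecord event = new GenericData.Record(SCHEMA);
        event.put("id", "sensor-42");
        event.put("value", 1234L);
        byte[] message = serializeWithFingerprint(event);  // ready to publish to a Kafka topic
        System.out.println("message bytes: " + message.length);
    }
}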
The first step in using Atlas as an Avro Schema Registry is to add new types that align with the Avro schema specification. Avro schemas support records, enums, arrays, maps, unions, fixed types, and primitives; using the Atlas API, it is possible to create types that exhibit the same kinds of attributes and nesting structure. The second required component is a service capable of parsing an Avro schema's JSON representation and translating it into the new Atlas Avro types. After registering the schema, the service should return a fingerprint (GUID) that acts as the claim check for that schema on deserialization. The service should also handle schema validation and compatibility enforcement. Together, this set of capabilities would allow automatic deserialization of messages from a Kafka topic.

While an Avro Schema Registry by itself is valuable for streaming use cases, using Atlas as the underlying store provides substantial additional value. Data discovery becomes much easier, since every field in each Avro schema can be individually indexed; a user can search for the name of a field and determine the schema and Kafka topic where it can be found. In many use cases the messages flowing through a Kafka topic land in a Hive table, an HDFS location, or a NoSQL store, and engineering teams can use the cross-component lineage visualization in Atlas to understand the effects that schema evolution will have downstream. Atlas also provides the ability to apply tags and business taxonomies, which makes it much easier to curate, understand, and control how streaming data is deployed and secured.
For example, Apache Atlas integrates with Apache Ranger (the authorization system) to enable tag-based policies. This allows column-level authorization for data managed by Apache Hive based on tags applied to the metadata in Atlas. Apache Ranger is also currently able to secure Kafka topics based on source IP or user name (in Kerberized clusters). Tag-based policies are not yet available for Kafka topics, but it should be possible to reuse the same tag sync subsystem that implements tag-based policies for Hive. Tags can also be used to deprecate older schemas or to prevent evolution of certain schemas through the registry API. Finally, because Atlas uses HBase and Solr under the covers, enterprise requirements like HA and DR do not need to be re-implemented.

It is clear that data governance is becoming an absolutely essential component of an enterprise data management platform. Whether the data is streaming or at rest, both business and technology organizations need to discover, understand, govern, and secure that data. Combining the capabilities of existing data governance tools like Apache Atlas with schema-aware data formats like Apache Avro (Kafka) and Apache ORC (Hive/Pig/Spark) can make managing Big Data that much easier.
07-30-2016
01:40 AM
1 Kudo
@mrizvi As you know, HDP 2.5 is a tech preview of an HDP version that is approaching release. Thus, the included bits only cover the use cases outlined in the tutorials that come with it: https://github.com/hortonworks/tutorials/tree/hdp-2.5/tutorials/hortonworks You can preview this feature using a different Sandbox and tutorial: http://hortonworks.com/hadoop-tutorial/cross-component-lineage-apache-atlas/ It uses an earlier version of Atlas, but you can get a feel for how the cross-component lineage capability will work in HDP 2.5.
07-29-2016
05:46 PM
@Manoj Dhake The best official explanation is not very detailed: http://atlas.incubator.apache.org/TypeSystem.html It's basically just:

### Traits
- Similar to Scala traits, but more like decorators
- Traits get applied to instances, not classes
- This satisfies the classification mechanism (roughly)
- A class instance can have any number of traits
- Example: security clearance. Any Person class could have it, so we add it as a mixin to the Person class; the security clearance trait has a level attribute
- Traits are labels, and each label can have its own attributes
- The reason for doing this: once the security clearance trait is modeled, it can be prescribed to other things too, and you can then search for anything that has security clearance level = 1, for instance

### On instances
- Class, trait, and struct all have bags of attributes
- You can get the name of the type associated with an attribute
- You can get or set the attribute in that bag for each instance

If this answers your question, would you mind accepting the answer? Also, I provided answers to many of the other questions you asked over the last couple of weeks. Could you check those answers and accept them if they were helpful, or let me know what else I can clarify? I want to make sure I answered your questions. Thanks in advance.
07-26-2016
07:53 PM
@Vivek Dhagat
1. The Atlas data model is described here: http://atlas.incubator.apache.org/TypeSystem.html See this thread for examples of how to create entities: https://community.hortonworks.com/questions/41409/how-to-use-the-function-of-data-classification-of.html See this thread for examples of how to create traits (tags): https://community.hortonworks.com/questions/33501/how-to-create-attribute-sets-and-collections-using.html
2. Try the cross-component lineage lab outlined here: http://hortonworks.com/hadoop-tutorial/cross-component-lineage-apache-atlas/ It runs a Sqoop job to import data from MySQL into Hive, which should create a series of metadata entities, including the entity that represents the MySQL table, the Sqoop process, the Hive table, and all of the lineage involved in the import process. This capability will be available in HDP 2.5.
07-26-2016
07:37 PM
@Manoj Dhake You can let tags (traits) inherit from other tags just like class types can inherit from other class types (class types are what entities are based on). So you would define the Medical tag:
{
"enumTypes": [],
"structTypes": [],
"traitTypes": [
{
"superTypes": [],
"hierarchicalMetaTypeName": "org.apache.atlas.typesystem.types.TraitType",
"typeName": "Medical",
"typeDescription": "",
"attributeDefinitions": []
}
],
"classTypes": []
}
Then define the child tags with Medical as the superType:
{
"enumTypes": [],
"structTypes": [],
"traitTypes": [
{
"superTypes": ["Medical"],
"hierarchicalMetaTypeName": "org.apache.atlas.typesystem.types.TraitType",
"typeName": "Person",
"typeDescription": "",
"attributeDefinitions": []
},
{
"superTypes": ["Medical"],
"hierarchicalMetaTypeName": "org.apache.atlas.typesystem.types.TraitType",
"typeName": "Drugs",
"typeDescription": "",
"attributeDefinitions": []
},
{
"superTypes": ["Medical"],
"hierarchicalMetaTypeName": "org.apache.atlas.typesystem.types.TraitType",
"typeName": "PII",
"typeDescription": "",
"attributeDefinitions": []
}
],
"classTypes": []
}
Now when you assign the PII, Drugs, or Person tag to an entity such as a field, the entity will also inherit the Medical tag and its associated attributes. To answer your second question, tag attributes can be given values at assignment time to capture information related to the tag. Say you wanted to not only apply the Drugs tag but also record the drug name for that patient: you could add a DrugName attribute to the Drugs trait type (via the UI or the API). When you then assign the Drugs tag to a patient in the UI, you will also be able to fill in the attributes defined as part of the tag. So in this case, you would assign the Drugs tag, click to define the attribute, and enter a value such as "Aspirin".
07-26-2016
07:06 PM
@Russell Anderson Sorry it took a while to respond; I have not been back to this question in a while. Could you ask the questions in a separate thread so that the answer is easier for others to find? I will do my best to answer.
07-22-2016
06:43 PM
@Manoj Dhake When you create a new trait/tag in the UI, you are actually creating a new type. When you associate it with an entity, you are creating an instance of the trait as a STRUCT that only exists in the context of that entity. As you know, you can delete instances of a tag that are assigned to an entity. However, to delete the tag so that it no longer shows up in the Atlas UI, you would need to delete the type. At the moment, types cannot be deleted, since there may be entities or even other types that depend on them. You could probably delete the trait directly from HBase, but you might compromise the integrity of the entire Atlas system; you would also have to clear any dependents that are indexed in Solr or graphed in Titan. I am sure these are all issues the community will have to solve in implementing the JIRA you referenced. To answer the follow-up question you asked in the comments section, you can get the instance of a trait only from the entity where the trait is associated, as follows:

curl -u admin:admin -X GET http://sandbox.hortonworks.com:21000/api/atlas/entities/fadaca14-7e58-4a2e-b04d-95a3010ce45b/traits
{"requestId":"qtp921483514-312 - bb0f1276-c8fc-499b-ab5d-0e4fde11c796","results":["PII"],"count":1}

As you can see, this entity has a tag called PII, and the tag does not have a GUID because it is a STRUCT based on the PII type that represents the PII trait. You can easily delete the STRUCT associated with an entity through the UI or the REST API, but there is no way to remove the type from which the trait is spawned, for the reasons listed above.
07-19-2016
04:22 PM
Could you please ask the questions in a separate thread with a few more details on what you are trying to achieve? I will do my best to answer.
07-11-2016
03:03 PM
5 Kudos
For a long time, when there was a big job to do, people relied on horses, whether the job required heavy pulling, speed, or anything in between. However, not all horses were fit for every task. Certain breeds were valued for their incredible speed and endurance, especially when an important letter had to be delivered; others were prized for their ability to carry and pull large payloads, whether a fully armored knight or huge stone blocks. Today we rarely rely on horses and much more on technology to get many of the same kinds of jobs done. Very high volume streaming data is increasingly common in all lines of business because of the value and utility it often carries. Horses will be of little help with this type of workload, but luckily there is a whole host of tools for dealing with streaming data. However, just as with horses, choosing the right streaming tool for a particular use case is critical to the success of the project.

Consider a use case where a directory full of log files needs to be broken down into individual events, filtered/altered, turned into JSON, and then sent on to a Kafka topic. This is exactly the kind of requirement Apache NiFi was designed for. All you would have to do is string together the ListHDFS, FetchHDFS, SplitText, ExtractText, AttributesToJSON, and finally PutKafka processors. This NiFi flow would distribute each file in the target directory across the NiFi cluster, extract and alter the events, output each event as JSON, and send them to a Kafka topic. Notice that not a single line of code was required to solve the use case.

The same use case can be solved with Spark Streaming, but it would require a lot of code and an intricate understanding of how and where Spark stores and processes DStreams. This article, http://allegro.tech/2015/08/spark-kafka-integration.html, does a great job of outlining how to achieve the same result, but it took several iterations by a team of engineers familiar with Spark Streaming. The article explains the importance of understanding which instructions execute on the driver and which execute on the executors. It also describes an elegant approach that uses a factory pattern to distribute uninstantiated Kafka producer templates and to make sure the templates are only instantiated by the executors, thus avoiding the dreaded "Not Serializable" exception (a minimal sketch of this pattern appears below). That is a lot of work to solve such a basic use case.

Spark is one of the leading tools for complex computation and aggregation on very large volumes of data. It is extremely well suited for machine learning, time series/stateful calculation, aggregation, graph processing, and iterative computation. However, due to its highly distributed nature, simple event processing that only requires event routing, data transformation, data enrichment, and data orchestration is harder to achieve. Conversely, Apache NiFi is not the right tool for most complex computation and processing use cases, because it was designed to reduce the effort required to create, manipulate, and control streams of events as dynamically as possible. It is a graphical, UI-based distributed framework that is easy to extend and can perform most simple event processing tasks out of the box with very little effort or prior experience required.

There are, however, many enterprise-class use cases that are best solved by using Spark and NiFi together. Consider one more example: a large organization dealing with millions of IoT-enabled devices across the country needs to apply predictive algorithms to aggregated data in near real time. It needs all of the event data to eventually arrive at two or three (in some cases one) processing centers, while keeping events lost to outages to a minimum. Such an organization will have many smaller data centers with small infrastructure footprints throughout the country. It is not practical to put a Spark Streaming cluster in those data centers, nor is it safe to simply point all of the devices at one or two data centers with heavy processing footprints. One possible approach is to run NiFi at the smaller data centers, spread across a handful of servers, to capture and curate the local event streams. Those streams can then be forwarded as cleaned and secured data sources to the two or three large data centers with large Spark Streaming clusters, which apply all of the heavy and complex processing.
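To make the Spark Streaming side of the comparison concrete, here is a minimal Java sketch of the pattern the referenced article arrives at: the Kafka producer is created inside foreachPartition, on the executors, so no non-serializable producer object has to be shipped from the driver. The broker address, topic name, HDFS path, and class name are placeholders of my own, not taken from the article.

import java.util.Iterator;
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StreamToKafkaSketch {

    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("stream-to-kafka-sketch");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));

        // Source is illustrative; any DStream of already-transformed JSON strings would do.
        JavaDStream<String> jsonEvents = ssc.textFileStream("hdfs:///tmp/incoming-logs");

        jsonEvents.foreachRDD(rdd ->
            rdd.foreachPartition((Iterator<String> events) -> {
                // The producer is created here, on the executor, once per partition,
                // so no non-serializable producer object is shipped from the driver.
                Properties props = new Properties();
                props.put("bootstrap.servers", "sandbox.hortonworks.com:6667");
                props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
                props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
                try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                    while (events.hasNext()) {
                        producer.send(new ProducerRecord<>("events", events.next()));
                    }
                }
            })
        );

        ssc.start();
        ssc.awaitTermination();
    }
}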
This approach addresses the need for failure-group reduction and isolation, minimizes the effort required to manage the distributed data collection infrastructure, allows dynamic updates to event manipulation and routing logic, and provides all of the heavy processing capability needed to deliver the intelligence and reporting the business requires. To conclude the metaphor, Apache Spark Streaming is the heavy war horse and Apache NiFi is the quick and nimble race horse. Some workloads are best suited to one, some to the other, and some will require both working together.