Created 11-29-2017 12:06 PM
Atlas is a governance tool. Two of the key pillars of data governance are accountability and meeting compliance requirements. To establish accountability and traceability, tools usually support lineage information. This helps answer questions such as where the data came from, who modified it, and how it was modified. Compliance requirements for industries like healthcare and finance can be very strict: the origins of the data are required to be known without any ambiguity. Since Atlas claims to help organizations meet their compliance requirements, consider the scenario presented in the attached figure.
In the figure, a process reads a few data items and then writes them to two different databases. Atlas can capture cross-component lineage and will capture the inputs and the outputs of the process.
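The lineage Atlas records for such a process can be thought of as a node with input and output edges to datasets. A minimal sketch of that idea (hypothetical types for illustration only, not Atlas's actual entity model; the names `nightly_etl`, `sales@cluster1`, etc. are made up):

```java
import java.util.List;

// Hypothetical sketch: a process entity referencing its input datasets
// and its output datasets, as in the cross-component lineage Atlas captures.
public class LineageSketch {
    record Dataset(String qualifiedName) {}
    record Process(String name, List<Dataset> inputs, List<Dataset> outputs) {}

    public static void main(String[] args) {
        Dataset src1 = new Dataset("sales@cluster1");
        Dataset src2 = new Dataset("customers@cluster1");
        // One process reads two inputs and writes to two different databases.
        Process etl = new Process("nightly_etl",
                List.of(src1, src2),
                List.of(new Dataset("db1.orders@cluster1"),
                        new Dataset("db2.orders@cluster2")));
        // The lineage question "where did db1.orders come from?" is answered
        // by walking from the output back through the process to its inputs.
        System.out.println(etl.inputs().size() + " inputs -> "
                + etl.outputs().size() + " outputs");
    }
}
```

The point of the structure is that lineage queries are graph walks: outputs link back to the process, and the process links back to its inputs.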
Thanks in advance
Created 12-04-2017 06:05 PM
Thanks for the excellent question. Your observations are valid.
While Atlas does help with meeting compliance requirements, it is only part of the solution. To use a traffic analogy, Atlas is the map (hence the name); it does not deal with the cars on the road (the traffic).
To complete the picture, there needs to be some monitoring of what data gets ingested into the system and whether all of it conforms to the norms set up. Please take a look at this presentation from Data Summit 2017. It explains how to set up a system that helps with governance (the realm of Atlas) and also helps with spotting errors within the data itself.
To summarize: to spot errors in the flow of data itself, you would need some other mechanism. Atlas will not help you in that respect.
About your second question: Atlas consumes notifications from Kafka by spawning a single thread and processing one notification at a time (see NotificationHookConsumer.java & AtlasKafkaConsumer.java). In systems with high throughput, the notifications will queue up in Kafka and you will see a lag in their consumption.
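A minimal sketch of that single-threaded consumption pattern (this is an illustration, not the Atlas source; a `BlockingQueue` stands in for the Kafka topic, so the backlog here models the consumer lag described above):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch: one consumer thread draining notifications one at a time.
// Producers (the hooks) can enqueue faster than the single consumer
// processes, so under high throughput messages back up in the queue --
// in Atlas's case, inside the Kafka topic.
public class SingleThreadedConsumer {
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(1000);
    private final List<String> processed = new ArrayList<>();

    // Hook side: publish a notification.
    public void enqueue(String notification) throws InterruptedException {
        queue.put(notification);
    }

    // Consumer side: a single loop taking one message at a time.
    public void drain() throws InterruptedException {
        while (!queue.isEmpty()) {
            String msg = queue.take();
            processed.add(msg); // stand-in for creating/updating entities
        }
    }

    public int processedCount() { return processed.size(); }
    public int backlog()        { return queue.size(); }

    public static void main(String[] args) throws InterruptedException {
        SingleThreadedConsumer c = new SingleThreadedConsumer();
        for (int i = 0; i < 5; i++) c.enqueue("notification-" + i);
        System.out.println("backlog before drain: " + c.backlog()); // 5
        c.drain();
        System.out.println("processed: " + c.processedCount());     // 5
    }
}
```

Because nothing is dropped, everything produced is eventually consumed; the cost of the single-threaded design is latency, not loss.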
Kafka guarantees durability of messages, and Atlas ensures that it consumes every message produced. If messages are dropped for some reason, you would see that in Atlas's logs. We also test Atlas in high-availability scenarios.
Also, to address the notification message question: I would urge you to use the Atlas V2 client APIs (available both on master and branch-0.8). Kafka does not mandate any message format, since all it understands is bytes, so that should not be a determining criterion for choosing the client API version.
I know this is a lot of text, I hope it helps. Please feel free to reach out if you need clarifications.
Created 12-03-2017 02:18 PM
@Ashutosh Mestry any thoughts?