Support Questions

Atlas Accountability/traceability & REST API Performance vs Kafka

Expert Contributor

Atlas is a governance tool. Two of the key pillars of data governance are accountability and meeting compliance requirements. To establish accountability and traceability, tools usually capture lineage information, which helps answer questions such as where the data came from, who modified it, and how it was modified. Compliance requirements for industries like healthcare and finance can be very strict: the origins of the data must be known without any ambiguity. Since Atlas claims to help organizations meet their compliance requirements, consider the scenario presented in the attached figure.

[Figure: lineage-accountability.png]

In the figure, a process reads a few data items and then writes them to two different databases. Atlas can capture cross-component lineage, so it will record the inputs and the outputs of the process.

  1. How can we determine which input went to which database? There can be a situation where all the records from data item 1 are written to database 2 while the remaining two data items are written to database 1. In such a case the lineage is ambiguous: all I would know is that the data could have come from any of the sources (see the sketch after this list for how such a process is represented). Will that level of information be enough to meet compliance requirements?
  2. My second question is about performance. Currently Kafka does not support Atlas V2, so when developing the Spark Atlas addon I used the REST API to post the entities. Since I am also handling Spark Streaming, the number of entity notifications can be high. Could I run into scalability issues in such a scenario? Approximately what rate can the REST API handle before messages are dropped?
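To make the ambiguity in the first question concrete, below is a minimal sketch of how a process entity carries lineage in the Atlas V2 model. The type names (spark_process, hdfs_path, rdbms_table) and qualifiedName values are hypothetical placeholders, not taken from the actual addon; the point is that inputs and outputs are two independent lists on the process entity, with no per-record mapping between them.

```java
import java.util.Arrays;
import java.util.Collections;

import org.apache.atlas.model.instance.AtlasEntity;
import org.apache.atlas.model.instance.AtlasObjectId;

public class LineageSketch {
    // Helper: reference an existing entity by type name and unique attribute.
    private static AtlasObjectId ref(String typeName, String qualifiedName) {
        return new AtlasObjectId(typeName,
                Collections.<String, Object>singletonMap("qualifiedName", qualifiedName));
    }

    public static void main(String[] args) {
        AtlasEntity process = new AtlasEntity("spark_process"); // hypothetical type
        process.setAttribute("qualifiedName", "etl.job1@cluster");
        process.setAttribute("name", "job1");

        // Three inputs and two outputs, recorded as independent lists.
        process.setAttribute("inputs", Arrays.asList(
                ref("hdfs_path", "/data/item1@cluster"),
                ref("hdfs_path", "/data/item2@cluster"),
                ref("hdfs_path", "/data/item3@cluster")));
        process.setAttribute("outputs", Arrays.asList(
                ref("rdbms_table", "db1.records@cluster"),
                ref("rdbms_table", "db2.records@cluster")));

        // Nothing in this entity says which input records landed in which
        // database: the lineage graph simply fans in and fans out here.
    }
}
```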

Thanks in advance

1 ACCEPTED SOLUTION

Expert Contributor

@Arsalan Siddiqi

Thanks for the excellent question. Your observations are valid.

While Atlas does help with meeting compliance requirements, it is only part of the solution. To use a traffic analogy, Atlas is the map (hence the name) and does not deal with the cars on the road (the traffic).

To complete the picture, there needs to be some monitoring of what data gets ingested into the system and whether all of it conforms to the norms that have been set up. Please take a look at this presentation from Data Summit 2017. It explains how a system can be set up that helps with governance (the realm of Atlas) and also helps with spotting errors within the data itself.

To summarize: to spot errors in the flow of the data itself, you would need some other mechanism. Atlas will not help you in that respect.

Regarding your second question: Atlas consumes notifications from Kafka by spawning a single thread and processing one notification at a time (see NotificationHookConsumer.java and AtlasKafkaConsumer.java). In systems with high throughput, notifications will queue up in Kafka and you will see a lag in their consumption.
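This is not the actual Atlas source, but a simplified sketch of that consumption pattern using the plain Kafka consumer API: a single thread polls and handles one notification at a time, so bursts show up as consumer lag rather than dropped messages. ATLAS_HOOK is the topic Atlas hooks publish to; the group id and the handler below are placeholders.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SingleThreadedHookConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "atlas");                     // placeholder group id
        props.put("enable.auto.commit", "false");           // commit only after processing
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("ATLAS_HOOK"));
            while (true) {
                // One thread, one notification at a time: a burst of producers
                // only grows the lag; Kafka retains the backlog durably.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    process(record.value());
                    consumer.commitSync();                  // mark it consumed afterwards
                }
            }
        }
    }

    private static void process(String notification) {
        // Placeholder: parse the message and update the metadata store.
        System.out.println("processing: " + notification);
    }
}
```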

Kafka guarantees durability of messages, and Atlas ensures that it consumes every message Kafka delivers. If messages were dropped for some reason, you would see evidence of it in Atlas's logs. We also test Atlas in high-availability scenarios.

Also, to address the notification message format question: I would urge you to use the Atlas V2 client APIs (available on both master and branch-0.8). Kafka does not mandate any message format, since all it understands is bytes, so that should not be the determining criterion for choosing a client API version.
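For reference, entity submission through the V2 client looks roughly like the following. This is a sketch assuming basic authentication against a locally running Atlas server; the endpoint, credentials, and attribute values are placeholders.

```java
import org.apache.atlas.AtlasClientV2;
import org.apache.atlas.model.instance.AtlasEntity;
import org.apache.atlas.model.instance.AtlasEntity.AtlasEntityWithExtInfo;
import org.apache.atlas.model.instance.EntityMutationResponse;

public class V2ClientSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint and credentials.
        AtlasClientV2 client = new AtlasClientV2(
                new String[] { "http://localhost:21000" },
                new String[] { "admin", "admin" });

        AtlasEntity path = new AtlasEntity("hdfs_path");
        path.setAttribute("qualifiedName", "/data/item1@cluster");
        path.setAttribute("name", "item1");
        path.setAttribute("path", "/data/item1");

        // The client wraps the V2 REST endpoints, so this is still an HTTP
        // POST under the hood, just with the V2 entity model.
        EntityMutationResponse response = client.createEntity(new AtlasEntityWithExtInfo(path));
        System.out.println(response);
    }
}
```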

I know this is a lot of text; I hope it helps. Please feel free to reach out if you need clarification.


2 REPLIES

Expert Contributor

@Ashutosh Mestry any thoughts?
