Is it possible to expose dataset sample data (e.g. the first 10 rows) using Atlas?

New Contributor

Hello,

I am new to Atlas and am wondering whether there is a way to make Atlas expose, in addition to a dataset's metadata, a sample of its data (the first 10 rows, for example).

This is for data governance: the metadata alone is often not descriptive enough from a functional point of view, and reading a sample of the data would make a dataset's content easier to understand.

If there is no existing solution, please help me decide where to invest my effort: developing this within Atlas itself, or using a third-party tool that accesses Hive, Sqoop, or HBase directly to get the sample data.

Thanks in advance.

1 ACCEPTED SOLUTION

Expert Contributor

There are two ways I'd suggest off the top of my head.

  1. The simplest might be to add a metadata tag with an attribute that holds a small sample in CSV, JSON or HTML format, and populate it via the API or the Atlas Kafka topics. For example, you could use HDF NiFi to periodically sample each table in Hive, format the data, and populate the attribute.
  2. For a more integrated approach, you might consider using the DataPlane DSS data profiler framework to add a "sample" profile that could be stored alongside the other profiler metadata and surfaced in DSS.
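
As a rough illustration of option 1, here is a minimal Python sketch that formats a handful of rows as CSV and builds the request that would attach them to an entity as a tag attribute through the Atlas v2 REST endpoint for entity classifications. The host name, the tag name `data_sample`, and its attribute `sample_csv` are all hypothetical, and the sketch assumes that tag type has already been defined in Atlas with a string attribute. Fetching the rows from Hive and authenticating against Atlas are left out.

```python
import csv
import io
import json
import urllib.request
from itertools import islice

# Hypothetical values: adjust to your own cluster and tag definition.
ATLAS_URL = "http://atlas.example.com:21000"
TAG_NAME = "data_sample"        # assumed to exist already as a classification type
TAG_ATTRIBUTE = "sample_csv"    # assumed string attribute on that type


def rows_to_csv(header, rows, limit=10):
    """Format the header plus at most `limit` rows as a CSV string."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    for row in islice(rows, limit):
        writer.writerow(row)
    return buf.getvalue()


def build_classification(sample_csv):
    """Build the JSON body: a list holding one classification with the sample."""
    return [{"typeName": TAG_NAME, "attributes": {TAG_ATTRIBUTE: sample_csv}}]


def build_request(entity_guid, payload):
    """Build (but do not send) the POST that adds the tag to an entity."""
    url = f"{ATLAS_URL}/api/atlas/v2/entity/guid/{entity_guid}/classifications"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To use it, you would pass in the rows from something like `SELECT * FROM my_table LIMIT 10`, add whatever authentication your Atlas instance requires, and send the request with `urllib.request.urlopen(...)`. Keeping the sample small matters here, since the whole sample is stored as a single attribute value.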

3 REPLIES

New Contributor

Thank you for your reply. I think I will go with the first option and create a metadata tag for each dataset.

Expert Contributor