
How to manage Metadata for Files in HDFS?

New Contributor

We have files being pushed into HDFS using curl and the WebHDFS interface. The files mostly contain structured data, and a lot of metadata is available at ingestion time: fields, data types, file descriptions, etc. There is no specific requirement to add these files to Hive.
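For context, the ingestion described above is a plain WebHDFS upload. A minimal sketch of the first request in Python (the host, port, path, and user name are all hypothetical placeholders):

```python
# Sketch of the WebHDFS CREATE call an ingestion script performs.
# Hostnames, the HDFS path, and the user are made-up placeholders.
from urllib.parse import urlencode

def webhdfs_create_url(namenode, port, hdfs_path, user):
    """Build the step-1 URL of the two-step WebHDFS file create.

    WebHDFS CREATE is two requests: a PUT to the NameNode with no body,
    which answers 307 with the DataNode location, then a PUT of the
    actual file bytes to that returned location.
    """
    query = urlencode({"op": "CREATE", "user.name": user, "overwrite": "true"})
    return f"http://{namenode}:{port}/webhdfs/v1{hdfs_path}?{query}"

url = webhdfs_create_url("namenode.example.com", 50070,
                         "/data/incoming/orders.csv", "etl")
print(url)
```

The metadata question below is about what to do with the field and file information that is already in hand at the moment this call is made.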

As I understand it, Atlas is more aligned with Sqoop/Hive-based data ingestion. Can we add the file-specific metadata to Atlas somehow at ingestion time, alongside the curl/WebHDFS upload? For example: HDFS location, file name, fields, data types, etc.

An alternative might be to use Falcon and attach free-form tags to the metadata, but that would require changes to the way data is currently ingested, unless we can schedule a curl script in Falcon.

Any ideas or suggestions... Thanks!

1 ACCEPTED SOLUTION


@Amit Jain - As mentioned by @bsaini, HDFS is not officially part of the current Atlas roadmap. It would be good to raise a JIRA and gather votes for it; that is the best way to push this forward.

While that happens, as a community member you can always write your own types. Remember, Atlas has an open API for creating your own type system to model anything you want.
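That open type-system API could be exercised along these lines. This is only a sketch: it assumes a recent Atlas build exposing the v2 REST typedefs endpoint, and the type name and attribute names are invented for illustration — adapt them to your own metadata.

```python
import json

# Hypothetical custom Atlas type describing a file landed via WebHDFS.
# The type name and attributes are illustrative, not from a real model.
typedef = {
    "entityDefs": [{
        "name": "ingested_file",
        "superTypes": ["DataSet"],
        "attributeDefs": [
            {"name": "hdfsPath", "typeName": "string",
             "isOptional": False, "cardinality": "SINGLE"},
            {"name": "fields", "typeName": "array<string>",
             "isOptional": True, "cardinality": "SET"},
            {"name": "fileDescription", "typeName": "string",
             "isOptional": True, "cardinality": "SINGLE"},
        ],
    }]
}

payload = json.dumps(typedef)
# Registering it would be a POST to the Atlas server (untested sketch,
# host/port/credentials are placeholders):
# requests.post("http://atlas-host:21000/api/atlas/v2/types/typedefs",
#               data=payload,
#               headers={"Content-Type": "application/json"},
#               auth=("admin", "admin"))
```

Once the type exists, the ingestion script can create one entity of that type per file it uploads.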

I have created a small utility based on this, called Atlas CLI.

https://github.com/shivajid/atlas/tree/master/codesamples/atlas

A good code example that the developers always work with is QuickStart.java:

https://github.com/apache/incubator-atlas/blob/master/webapp/src/main/java/org/apache/atlas/examples...

IHTH


5 REPLIES

Contributor

Not a perfect option, but here's something you might want to consider, depending on how you're receiving that metadata. You could write a script that parses the incoming metadata and formats it into something you could post to an internal wiki using its API. For example, if you had some kind of data dictionary and a file description, you could use them to create a wiki page that includes the description at the top, notes about when the data and metadata were last updated, and then a table of columns, data types, and definitions. Using Semantic MediaWiki is a great way to make that metadata structured and easily searchable/reportable in the wiki. It's certainly a user-friendly (if not necessarily programmer-friendly) solution.
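A rough sketch of that idea in Python. The input structure and page layout are invented for illustration, and the actual posting step (the MediaWiki `action=edit` API) is left as a comment since it depends on your wiki:

```python
# Turn a parsed data dictionary into a MediaWiki page body.
# The data-dictionary shape and page layout are hypothetical.
from datetime import date

def build_wiki_page(description, columns):
    """columns: list of (name, data_type, definition) tuples."""
    lines = [description, "",
             f"''Metadata last updated: {date.today().isoformat()}''", ""]
    lines += ['{| class="wikitable"', "! Column !! Type !! Definition"]
    for name, dtype, definition in columns:
        lines += ["|-", f"| {name} || {dtype} || {definition}"]
    lines.append("|}")
    return "\n".join(lines)

page = build_wiki_page(
    "Daily orders feed landed in /data/incoming.",
    [("order_id", "bigint", "Unique order key"),
     ("amount", "decimal(10,2)", "Order total in USD")],
)
# The resulting text could then be submitted through the MediaWiki
# edit API, or enriched with Semantic MediaWiki annotations.
```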


@Amit Jain

It seems a no-brainer for HDFS metadata to be part of Atlas, and I am hopeful that sometime in the future it will be. However, it does not seem to be on the immediate roadmap. I see there is a patch available in the community that needs more work:

https://issues.apache.org/jira/browse/ATLAS-164

So, here are your options as of today:

1) Use a partner product. Here is one that works with HDFS:

http://www.waterlinedata.com/prod

Here is an article that explains it in more detail - http://hortonworks.com/hadoop-tutorial/manage-your-data-lake-more-efficiently-with-waterline-data-in...

2) Build a custom solution for your environment. If I were solving this issue, I would do the following:

One-time setup:
1. Create an HBase table (using Phoenix) to store the file name and other metadata attributes as needed, including a status column. (HDFS_METADATA)

Changes to the script that ingests the data:
1. Run an UPSERT query to add an entry to the HDFS_METADATA table with status = P (Pending)
2. Copy the file to HDFS
3. Run another query to update the status to C (Complete)

This HBase table can be used for querying metadata for any file.
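The upsert-then-complete flow above could be scripted along these lines. The table and column names mirror the HDFS_METADATA example; the actual connection (e.g. via the Phoenix Query Server) is left as a comment since it depends on your cluster:

```python
# Build the Phoenix statements for the metadata bookkeeping steps.
# Table/column names follow the HDFS_METADATA example; in Phoenix an
# update is just an UPSERT on the same primary key.

def pending_upsert(file_name, hdfs_path):
    # Step 1: record the file with status 'P' before copying it.
    return ("UPSERT INTO HDFS_METADATA (FILE_NAME, HDFS_PATH, STATUS) "
            f"VALUES ('{file_name}', '{hdfs_path}', 'P')")

def complete_update(file_name):
    # Step 3: flip the status to 'C' once the HDFS copy succeeds.
    return ("UPSERT INTO HDFS_METADATA (FILE_NAME, STATUS) "
            f"VALUES ('{file_name}', 'C')")

sql1 = pending_upsert("orders.csv", "/data/incoming/orders.csv")
sql2 = complete_update("orders.csv")
# Executing them might use the Phoenix Query Server driver, e.g.
# (untested, URL is a placeholder):
#   conn = phoenixdb.connect("http://queryserver:8765/", autocommit=True)
#   conn.cursor().execute(sql1)
```

In a real script, prefer bind parameters over string formatting to avoid quoting problems in file names.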

Here is a visualization tool that lets you see HDFS disk usage. If you go down the path of building something custom, you may be able to make use of it to make the output really interesting:

https://github.com/tarnfeld/hdfs-du

Hope this helps.


Contributor

The "Download Atlas jars" link is not working. Can you please verify it?

Master Mentor

@Amit Jain, has this been resolved? Can you post your solution or accept the best answer?