I attended a panel discussion at Hadoop World last week (September 2015) and
I’d like to share my notes.
The topics of interest were the future of data governance, the actors involved
in data governance, and the tools to use. Full details of the talk are here: http://strataconf.com/big-data-conference-ny-2015/public/schedule/detail/43493
One early question asked what the future holds for data
governance in Hadoop. An interesting answer was that collecting data in
the future will require greater consent from individuals. This echoed the
morning’s keynote speaker, who urged the same practice. The theme of that
keynote was that too much data is being collected right now, mostly without the
knowledge of the individual, and eventually there will be negative consequences
for all. Back on the panel, another answer was that data governance of the
future will help data analysts make sense of the data inside their data
lake. For example, audit logs will be automatically parsed to determine which
columns to use when joining two datasets.
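To make that idea concrete, here is a rough sketch of what such parsing could look like. This is my own illustration, not a tool anyone on the panel described; the audit-log format and the helper name suggest_join_keys are assumptions.

```python
import re
from collections import Counter

# Hypothetical sketch: mine a SQL audit log for the column pairs that past
# queries joined on, giving an analyst a ranked hint of likely join keys.
JOIN_ON = re.compile(r"JOIN\s+.*?\s+ON\s+([\w.]+)\s*=\s*([\w.]+)", re.IGNORECASE)

def suggest_join_keys(audit_log_lines):
    """Return (column_pair, count) tuples, most frequently joined first."""
    counts = Counter()
    for line in audit_log_lines:
        for left, right in JOIN_ON.findall(line):
            counts[tuple(sorted((left, right)))] += 1
    return counts.most_common()

# Example: queries like
#   "SELECT ... FROM orders o JOIN customers c ON o.customer_id = c.id"
# would rank ('c.id', 'o.customer_id') at the top.
```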
Personally, I wish I had had such a capability at a prior company, where I worked with datasets
created by others that lacked details of how to join them together. The last prediction suggested
that we will use metadata to automatically tune the way a cluster
operates. For example, when processing a dataset that is tagged as high
priority, extra capacity would be allocated. I know that Cloudbreak and Falcon are well on their
way to doing exactly this, and I am hopeful that other tools will follow.
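As a thought experiment, the mapping from tags to resources could be as simple as the sketch below. The priority tag format, queue names, and capacity shares are invented for illustration; this is not how Cloudbreak or Falcon actually implement it.

```python
# Hypothetical sketch: translate a dataset's governance tags into the
# scheduler queue and rough share of cluster capacity a job should request.
PRIORITY_TO_QUEUE = {
    "high": ("priority", 0.6),   # (queue name, share of cluster)
    "normal": ("default", 0.3),
    "low": ("batch", 0.1),
}

def placement_for(dataset_tags):
    """Pick a queue and capacity share based on a 'priority:<level>' tag."""
    for tag in dataset_tags:
        if tag.startswith("priority:"):
            level = tag.split(":", 1)[1]
            if level in PRIORITY_TO_QUEUE:
                return PRIORITY_TO_QUEUE[level]
    return PRIORITY_TO_QUEUE["normal"]

# Example: placement_for(["pii", "priority:high"]) -> ("priority", 0.6)
```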
At one point a panelist mentioned that 90% of data
governance is about people and 10% involves technology, which sounds
counterintuitive at first. As I thought about
it, I realized that at another talk earlier in the day, called “Goldman Sachs
Data Lake”, Billy Newport implicitly made the same point. In describing the flow of data in and out of
the lake, Billy talked about the many human roles involved: data curators to
tag data, data owners to approve consumption, auditors to review logs, and
engineers to deal with failures.
After the discussion, I asked a panelist about the
specific tools they recommend for solving data governance problems. Apache Atlas, Waterline, and Alation were mentioned. It was the first time I had heard of Alation;
I know Atlas well, and I had previously seen an impressive demo of the product offered by
Waterline. The Waterline offering
executes a MapReduce job to find and analyze text files within a cluster and presents
a UI with lineage, taxonomy, and statistics.
A data quality analyst can use the UI to find anomalies, see sample
data, and track the problem to the source. For example, during the demo, a popup next
to the lastUpdated column showed that it contained mostly date values, but also a
few integers. Another useful feature is aimed at
data curators: they can use the UI to tag datasets and organize those tags
into taxonomies through a drag-and-drop interface.
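For readers curious what that kind of column profiling amounts to, here is a minimal sketch of the idea. It is my own simplification, not Waterline’s implementation, and the date formats it checks are assumptions.

```python
from collections import Counter
from datetime import datetime

# Minimal sketch of per-column type profiling: classify each value and report
# the mix of types, so a column that is "mostly dates plus a few integers"
# stands out as an anomaly worth tracing back to the source.
def classify(value):
    for fmt in ("%Y-%m-%d", "%Y-%m-%d %H:%M:%S"):
        try:
            datetime.strptime(value, fmt)
            return "date"
        except ValueError:
            pass
    try:
        int(value)
        return "integer"
    except ValueError:
        return "string"

def profile_column(values):
    """Return type counts, e.g. Counter({'date': 9998, 'integer': 2})."""
    return Counter(classify(v) for v in values)
```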
These are some of the highlights from a thought-provoking
discussion about the application of data governance principles to Hadoop.