I attended a panel discussion at Hadoop World last week (September 2015) and
I’d like to share my notes.
The topics of interest were the future of data governance, the actors involved
in data governance, and the tools to use. Full details of the talk are here: http://strataconf.com/big-data-conference-ny-2015/public/schedule/detail/43493
One early question asked what the future holds for data
governance in Hadoop. An interesting answer was that collecting data in
the future will require greater consent from individuals. This echoed the
morning’s keynote speaker, who urged the same practice. The theme of that
keynote was that too much data is being collected right now, mostly without the
knowledge of the individual, and eventually there will be negative consequences
for all. Back on the panel, another answer was that data governance of the
future will help data analysts make sense of the data inside their data
lake. For example, audit logs will be automatically parsed to determine which
columns to use when joining two datasets.
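To make that idea concrete, here is a rough sketch of what such parsing could look like. This is my own illustration, not a tool anyone on the panel described; the audit-log format and the helper name suggest_join_keys are assumptions.

```python
import re
from collections import Counter

# Hypothetical sketch: mine a SQL audit log for the column pairs that past
# queries joined on, giving an analyst a ranked hint of likely join keys.
JOIN_ON = re.compile(r"JOIN\s+.*?\s+ON\s+([\w.]+)\s*=\s*([\w.]+)", re.IGNORECASE)

def suggest_join_keys(audit_log_lines):
    """Return (column_pair, count) tuples, most frequently joined first."""
    counts = Counter()
    for line in audit_log_lines:
        for left, right in JOIN_ON.findall(line):
            counts[tuple(sorted((left, right)))] += 1
    return counts.most_common()

# Example: queries like
#   "SELECT ... FROM orders o JOIN customers c ON o.customer_id = c.id"
# would rank ('c.id', 'o.customer_id') at the top.
```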
Personally, I wish I had had such a capability at a prior company, where I worked with datasets
created by others that lacked details of how to join them together. The last prediction suggested
that we will use metadata to automatically tune the way a cluster
operates. For example, when processing a dataset that is tagged as high
priority, extra capacity would be allocated. I know that Cloudbreak and Falcon are well on their
way to doing exactly this, and I am hopeful that other tools will follow.
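As a thought experiment, the mapping from tags to resources could be as simple as the sketch below. The priority tag format, queue names, and capacity shares are invented for illustration; this is not how Cloudbreak or Falcon actually implement it.

```python
# Hypothetical sketch: translate a dataset's governance tags into the
# scheduler queue and rough share of cluster capacity a job should request.
PRIORITY_TO_QUEUE = {
    "high": ("priority", 0.6),   # (queue name, share of cluster)
    "normal": ("default", 0.3),
    "low": ("batch", 0.1),
}

def placement_for(dataset_tags):
    """Pick a queue and capacity share based on a 'priority:<level>' tag."""
    for tag in dataset_tags:
        if tag.startswith("priority:"):
            level = tag.split(":", 1)[1]
            if level in PRIORITY_TO_QUEUE:
                return PRIORITY_TO_QUEUE[level]
    return PRIORITY_TO_QUEUE["normal"]

# Example: placement_for(["pii", "priority:high"]) -> ("priority", 0.6)
```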
At one point a panelist mentioned that 90% of data
governance is about people and 10% involves technology, which sounds
counterintuitive at first. As I thought about
it, I realized that at another talk earlier in the day, called “Goldman Sachs
Data Lake”, Billy Newport implicitly made the same point. In describing the flow of data in and out of
the lake, Billy talked about the many human roles involved: data curators to
tag data, data owners to approve consumption, auditors to review logs, and
engineers to deal with failures.
After the discussion, I asked a panelist about the
specific tools they recommend for solving data governance problems. Apache Atlas, Waterline, and Alation were mentioned. It was the first time I had heard of Alation;
I know Atlas well, and I had previously seen an impressive demo of the product offered by
Waterline. The Waterline offering
executes a MapReduce job to find and analyze text files within a cluster and presents
a UI with lineage, taxonomy, and statistics.
A data quality analyst can use the UI to find anomalies, see sample
data, and track the problem to the source. For example, during the demo, a popup next
to the lastUpdated column showed that it contained mostly date values, but also a
few integers. Another useful feature is aimed at
data curators: they can use the UI to tag datasets and organize those tags
into taxonomies through a drag-and-drop interface.
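For readers curious what that kind of column profiling amounts to, here is a minimal sketch of the idea. It is my own simplification, not Waterline’s implementation, and the date formats it checks are assumptions.

```python
from collections import Counter
from datetime import datetime

# Minimal sketch of per-column type profiling: classify each value and report
# the mix of types, so a column that is "mostly dates plus a few integers"
# stands out as an anomaly worth tracing back to the source.
def classify(value):
    for fmt in ("%Y-%m-%d", "%Y-%m-%d %H:%M:%S"):
        try:
            datetime.strptime(value, fmt)
            return "date"
        except ValueError:
            pass
    try:
        int(value)
        return "integer"
    except ValueError:
        return "string"

def profile_column(values):
    """Return type counts, e.g. Counter({'date': 9998, 'integer': 2})."""
    return Counter(classify(v) for v in values)
```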
These are some of the highlights from a thought-provoking
discussion about the application of data governance principles to Hadoop.