dconnolly · ‎09-01-2016

Self Service Hadoop – well some starting points

Say you want to get started with Big Data and, at the same time, empower relatively savvy end users who have long been stuck in the frustrating world of desktop data management. This simple article will hopefully help with a few options.

The diagram below gives some perspective on the sources and mechanisms for ingesting, manipulating, and using your data assets. Following the numbers from 1 to 6, you can see that you have many options for working with your data. This article concentrates on a simple end-user example: ingesting data, understanding it, and using it through a self-service approach. It is really a way to get started and to help promote the value of a modern data architecture as your team matures.

An end user wants to self-serve some data from their desktop or server into HDFS, query and understand that data from their existing tools, and work with it alongside what the tech staff is ingesting through other vehicles. This quick example shows how to use the Ambari Hive view to upload data, provide some structure, and create a Hive table that many available tools can use. It also gives a very brief starting thought on how Atlas can help organize and track the what, where, how, etc. of your assets.

1. Go to the Ambari Hive view. The views are listed at the top right of the Ambari dashboard when you click the icon that looks like a table. (There are also other views, such as the HDFS file view, Zeppelin, etc.)

Here is where you select the Ambari view

2. Once in the Ambari view, you can click on the upload table tab.

This is what the Ambari view looks like. There are a lot of options here, some more tech focused, but it is very functional.

3. Within that tab you can select a CSV, with or without headers, from local storage or HDFS.

4. Then you can change the column names and/or types if necessary.

5. Then you create the Hive table in the Hive database you want.

This is the table tab, where I selected a CSV (geolocation) with headers from my hard drive
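To make steps 3 to 5 a little more concrete, here is a minimal Python sketch of the same idea: read a CSV that has a header row, guess a type for each column, and produce a CREATE TABLE statement for an ORC table. This is not how the Hive view is implemented; the function names, the three-type guess (INT, DOUBLE, STRING) and the sample columns are all assumptions made for illustration. It only builds the statement and does not load any data.

```python
import csv
import io


def infer_hive_type(values):
    """Pick the narrowest of INT, DOUBLE, STRING that fits every value."""
    def all_parse(cast):
        try:
            for v in values:
                cast(v)
            return True
        except ValueError:
            return False

    if values and all_parse(int):
        return "INT"
    if values and all_parse(float):
        return "DOUBLE"
    return "STRING"


def build_create_table(csv_text, table, database="default"):
    """Build a CREATE TABLE statement from CSV text whose first row is a header."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    columns = []
    for i, name in enumerate(header):
        col_values = [row[i] for row in data]
        columns.append(f"  `{name.strip().lower()}` {infer_hive_type(col_values)}")
    return (
        f"CREATE TABLE `{database}`.`{table}` (\n"
        + ",\n".join(columns)
        + "\n) STORED AS ORC"
    )


# Hypothetical sample rows; the real geolocation file has its own columns.
sample = (
    "truckid,city,latitude,longitude\n"
    "A1,Santa Rosa,38.44,-122.71\n"
    "A2,Willits,39.40,-123.35\n"
)
ddl = build_create_table(sample, "geolocation")
print(ddl)
```

The guessed types are only a starting point, which is why step 4 lets you change the column names and types before the table is created.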

6. Once the Hive table is created, you can use any third-party tool (Tableau, for example), the Ambari Hive view, Excel, Zeppelin, etc. to work with the table.

Here is the Hive table geolocation (stored in ORC format) in the default Hive database, queried in the Hive view
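For users who prefer a script to a GUI tool, the same kind of quick look at the table can be sketched in Python. The helper below is hypothetical and works with any DB-API 2.0 cursor. To keep the sketch runnable anywhere, it is exercised against an in-memory SQLite table standing in for Hive, with made-up columns.

```python
import sqlite3


def preview_table(cursor, table, limit=5):
    """Return (column_names, rows) for the first `limit` rows of `table`.

    Works with any DB-API 2.0 cursor. The table name is interpolated into
    the SQL text, so it is checked to contain only simple identifier characters.
    """
    if not table.replace("_", "").replace(".", "").isalnum():
        raise ValueError(f"unexpected table name: {table!r}")
    cursor.execute(f"SELECT * FROM {table} LIMIT {int(limit)}")
    names = [col[0] for col in cursor.description]
    return names, cursor.fetchall()


# SQLite stands in for a HiveServer2 connection so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE geolocation (truckid TEXT, city TEXT)")  # hypothetical columns
cur.executemany(
    "INSERT INTO geolocation VALUES (?, ?)",
    [("A1", "Santa Rosa"), ("A2", "Willits")],
)

names, rows = preview_table(cur, "geolocation", limit=1)
print(names, rows)
```

Against a real cluster, the cursor would come from a HiveServer2 client library instead of SQLite (PyHive is one option, assuming it is installed and HiveServer2 is reachable from your machine); the helper itself would stay the same.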

7. OK, one more detail that may help you. Once the geolocation table is created from the Hive view upload, there is no reason you cannot tie it into a taxonomy in Atlas, tag columns, add details, see lineage, etc. A few screen prints follow to give perspective. This is a larger topic, but it will help the team locate, organize, secure, and track data assets.

Bottom part of the Atlas screen.

A good understanding of the latest Atlas release can be found in the Hadoop Summit presentations listed below.

Atlas: three sessions at Hadoop Summit that will help. This is the link to all the sessions, if interested: http://hadoopsummit.org/san-jose/agenda/

a. What the #$* is a Business Catalog and why you need it

Video - https://www.youtube.com/watch?v=BtAkztkcZwU

Slides - http://www.slideshare.net/HadoopSummit/what-the-is-a-business-catalog-and-why-you-need-it

b. Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the Enterprise

Video - https://www.youtube.com/watch?v=ID6qnoLCQzk

Slides - http://www.slideshare.net/HadoopSummit/top-three-big-data-governance-issues-and-how-apache-atlas-res...

c. Extend Governance in Hadoop with Atlas Ecosystem

Video - https://www.youtube.com/watch?v=7nx6hzhM4Xs

Slides - http://www.slideshare.net/HadoopSummit/extend-governance-in-hadoop-with-atlas-ecosystem-waterline-at...