Created on 09-07-2017 01:58 PM - edited 08-17-2019 11:17 AM
A quick tutorial on how to mask columns in Hive for regulatory purposes
This tutorial will cover how to apply tags to Atlas entities and subsequently leverage tagging policies in Ranger to mask Personally Identifiable Information (PII). Atlas serves as a common metadata store designed to exchange metadata both within and outside of the Hadoop stack. It features a simple user interface and a REST API for ease of access and integration. The Atlas-Ranger paradigm unites data classification with policy enforcement. Figures will be used as a graphical aid. Steps will be provided in between figures via bullet points.
This tutorial assumes the user has successfully installed Ranger, enabled Ranger audit to Solr, installed Atlas, installed Hive, configured Atlas to work with Hive, and configured Atlas to work with Ranger. It also assumes the user has dummy data for development purposes. For detailed instructions on how to accomplish these steps please review our HDP development documentation: HDP Developer Guide: Data Governance.
You can create the dummy data we will use in this tutorial via the following commands. Ensure you have the proper user privileges to write files to your local environment and copy files into HDFS.
Statement for creating the employee table
create table employee (ssn string, name string, location string) row format delimited fields terminated by ',' stored as textfile;
Command for generating our dummy data
printf "111-111-111,Danny,San Jose\n222-222-222,Mike,Santa Clara\n333-333-333,Robert,Fremont\n345-444-134,James,San Diego\n759-3264-876,Tom,Santa Barbara\n253-856-987,Bobby,Holmdel\n675-883-996,Pat,San Clemente\n859-654-774,John,San Francisco\n568-412-778,Jane,Fremont\n546-889-778,Richard,San Jose\n554-996-332,Jeannine,Toms River\n587-665-124,Alex,New York" > employeedata.txt
Command for copying the data into the Hive warehouse
hdfs dfs -copyFromLocal employeedata.txt /apps/hive/warehouse/employee
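Before copying the file into HDFS, it's worth sanity-checking the generated data locally. The sketch below repeats the printf from above and verifies that every record has the expected three comma-separated fields; the warehouse path in the final step is the default HDP layout.

```shell
# Generate the dummy data (same printf as above)
printf "111-111-111,Danny,San Jose\n222-222-222,Mike,Santa Clara\n333-333-333,Robert,Fremont\n345-444-134,James,San Diego\n759-3264-876,Tom,Santa Barbara\n253-856-987,Bobby,Holmdel\n675-883-996,Pat,San Clemente\n859-654-774,John,San Francisco\n568-412-778,Jane,Fremont\n546-889-778,Richard,San Jose\n554-996-332,Jeannine,Toms River\n587-665-124,Alex,New York" > employeedata.txt

# Every record should have exactly three comma-separated fields
awk -F',' 'NF != 3 { bad++ } END { print NR " records, " bad+0 " malformed" }' employeedata.txt
# -> 12 records, 0 malformed

# Then copy into the Hive warehouse directory (run on a cluster node):
# hdfs dfs -copyFromLocal employeedata.txt /apps/hive/warehouse/employee
```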
Step 1: Creating our Hive table and populating data
Figure 1: The Hive UI and the create employee table statement
Figure 2: The CLI command to generate our dummy data
Figure 3: The HDFS command to transfer our data to the Hive warehouse
Figure 4: The result of "select * from employee"
Step 2: Creating a tag in the Atlas UI
Figure 1: The Atlas UI and the Creation of a Tag
Figure 2: Navigating to our hive_table entities and searching for the employee table
In this case, let's select the employee table, which contains an ssn column. After clicking on the table we wish to work with, we can navigate to the column of choice, ssn (Figure 3).
Figure 3: Employee hive_table entity and subsequent columns
Figure 4: Adding a tag via the tagging + button under ssn(hive_column)
Figure 5: The assignment of our PII tag to the employee.ssn column
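The same tagging can also be scripted against Atlas's v2 REST API rather than done in the UI. The sketch below is an outline, not a verified script: the Atlas URL and credentials are placeholders for your cluster, and the column GUID must be looked up first (in the UI or via Atlas search). When no cluster is configured it simply prints the tag-definition payload.

```shell
ATLAS_URL="${ATLAS_URL:-}"   # placeholder, e.g. http://atlas-host:21000
AUTH="admin:admin"           # placeholder credentials

# Classification (tag) definition for PII: no attributes, no supertypes
TAG_DEF='{"classificationDefs":[{"name":"PII","description":"Personally Identifiable Information","superTypes":[],"attributeDefs":[]}],"entityDefs":[],"enumDefs":[],"structDefs":[]}'

if [ -n "$ATLAS_URL" ]; then
  # Create the PII tag
  curl -s -u "$AUTH" -H 'Content-Type: application/json' \
       -X POST "$ATLAS_URL/api/atlas/v2/types/typedefs" -d "$TAG_DEF"
  # Attach the tag to the ssn column entity by GUID (hypothetical GUID placeholder)
  # curl -s -u "$AUTH" -H 'Content-Type: application/json' \
  #      -X POST "$ATLAS_URL/api/atlas/v2/entity/guid/<ssn-column-guid>/classifications" \
  #      -d '[{"typeName":"PII"}]'
else
  echo "$TAG_DEF"   # no cluster configured; just show the payload
fi
```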
Step 3: Creating a Tag Based Policy in Ranger
Figure 1: Accessing the Ranger Tag Based Policies
Figure 2: Creating a new tag service called PII_Tagging_Policy
Figure 3: Navigating to the Tagging Policy we just created
Figure 4: Adding a new masking policy within our PII_Tagging_Policy
Figure 5: Applying our Atlas Tag and Ranger Masking Policy
Figure 6: Navigating to the resource based policies page and the hive service folder
Figure 7: Applying our Ranger tagging policy to the Hive Resource
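For automation, Ranger exposes the same policy model over its public REST API. The sketch below assumes the `service/public/v2/api/policy` endpoint and the tag service named PII_Tagging_Policy created above; the host, credentials, and group are placeholders, and the JSON shape (`policyType: 1` for masking, component-prefixed types such as `hive:select` and `hive:MASK_HASH` in tag-based policies) should be checked against your Ranger version. When no cluster is configured it just prints the payload.

```shell
RANGER_URL="${RANGER_URL:-}"   # placeholder, e.g. http://ranger-host:6080
AUTH="admin:admin"             # placeholder credentials

# Tag-based masking policy: hash any Hive column tagged PII
POLICY='{
  "service": "PII_Tagging_Policy",
  "name": "mask-PII-columns",
  "policyType": 1,
  "isEnabled": true,
  "resources": { "tag": { "values": ["PII"], "isExcludes": false, "isRecursive": false } },
  "dataMaskPolicyItems": [
    {
      "accesses": [ { "type": "hive:select", "isAllowed": true } ],
      "groups": ["public"],
      "dataMaskInfo": { "dataMaskType": "hive:MASK_HASH" }
    }
  ]
}'

if [ -n "$RANGER_URL" ]; then
  curl -s -u "$AUTH" -H 'Content-Type: application/json' \
       -X POST "$RANGER_URL/service/public/v2/api/policy" -d "$POLICY"
else
  echo "$POLICY"   # no cluster configured; just show the payload
fi
```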
Step 4: Verifying our Users cannot access PII
Figure 1: The Hive UI and our masked ssn column; notice the hash applied to our PII data
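The check in Figure 1 can also be reproduced from the shell by running the query as a restricted user through beeline. The JDBC URL and username below are placeholders for your cluster; with the policy in effect, the ssn values should come back hashed. When no cluster is configured the sketch just prints the query.

```shell
HIVE_JDBC_URL="${HIVE_JDBC_URL:-}"   # placeholder, e.g. jdbc:hive2://hiveserver:10000/default
QUERY="select ssn, name, location from employee"

if [ -n "$HIVE_JDBC_URL" ]; then
  # Run as a user covered by the masking policy; ssn returns hashed values
  beeline -u "$HIVE_JDBC_URL" -n user1 -e "$QUERY"
else
  echo "$QUERY"   # no cluster configured; just show the query
fi
```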
Congratulations! You're now able to implement a tagging policy that masks Personally Identifiable Information so your organization can comply with regulatory guidelines. Many similar tasks follow the same pattern as the policy you've just developed. For a full listing of features, please refer to the developer guide above or post questions to the Hortonworks Community Connection. Suggestions for other tutorials are also highly encouraged.