Member since
08-03-2017
14
Posts
17
Kudos Received
0
Solutions
05-16-2018
04:51 PM
“In meetings philosophy might work,
[but in] the field practicality works.” –Amit Kalantri Short Description: A quick tutorial on how to query cryptocurrency transactions with Spark Streaming Abstract: This tutorial covers how once can use Nifi to stream public cryptocurrency transactional data to Kafka and consequently query the transactional stream with Spark Streaming. We will be using the Pyspark API for demonstration purposes. It is strongly recommended that one use Java or Scala for production grade applications; as most of the generally available (GA) features in Spark are published to support those languages foremost. Figures are used as a user graphical aid. Steps are highlighted by bullet points. What is Spark? Why is it well suited for Streaming at scale? Apache Spark is a distributed processing engine for data science and data engineering at scale. It features several out of the box libraries for machine learning, map reduce, graph analysis, micro batching, and structured query language. It is well suited to poll streams of data due to its language-integrated application programming interface (API) and fault tolerant behavior. Users can write stream jobs the same way they would batch and recover both lost work and operator state without extra code. The Spark team has conveniently encapsulated these concerns into Spark’s core API. With Spark Streaming, one can ingest data from multiple sources (Kafka, Kinesis, Flume, and TCP sockets) and process data with abstracted functions like map, reduce, join, and window. Results can subsequently be pushed to HDFS, databases, and dashboards for meaningful insights. Simply put, Spark Streaming provides a very convenient way for Hadoop developers to grab data in realtime and persist it to their system of choice. What is micro batching? “Micro batching is defined as the procedure in which the incoming stream of messages is processed by dividing them into group of small batches. This helps to achieve the performance benefits of batch processing; however, at the same time, it helps to keep the latency of processing of each message minimal”. –Apache Spark 2.x for Java Developers , Sumit Kumar & Sourav Gulati, O’Reilly Media Inc. In Spark Streaming a micro batch is a sequence of resilient distributed datasets (RDDs) packaged into an abstraction known as a Discretized Streams (Streams). Users can batch DStreams on time intervals, with batches ranging between 1-10 seconds. Determining the batch size really depends on the needs of the downstream system consumer and overall applicability of how fast a consumer can write the batches to disk or present information to a web browser. Where are we getting our data? We will be subscribing to a Satori public data channel for global cryptocurrency transactions. According to Satori one can, “Get streaming data for cryptocurrency market prices as well as exchange rates between alternative cryptocurrencies. Data is streamed from exchanges around the globe, where crypto-coins are traded in the US Dollar, Canadian Dollar, British Pound, Japanese Yen, Russian Ruble, etc.” -https://www.satori.com/livedata/channels/cryptocurrency-market-data Note: This tutorial assumes the user has successfully installed Spark2 on HDP and Nifi and Kafka on HDF. It also assumes the user has some familiarity with Nifi and Kafka; but it isn’t required to successfully complete the tutorial. For detailed instructions on how to accomplish these steps please consult our HDP and HDF user documentation: HDP User Guide-Apache Spark HDF Installation Guide HDF User Guide-Apache Kafka One can also use the Hortonworks Sandbox environments to complete this tutorial:https://hortonworks.com/products/sandbox/ Tools: One can create the application logic for this tutorial by cloning my github repository. Feel free to make contributions by forking the repo and submitting pull requests. git clone https://github.com/patalwell/SparkStreamingKafka.git
... View more
Labels:
02-23-2018
07:44 PM
@Magnus Runesson This is something that should be supported: http://atlas.apache.org/Bridge-Hive.html Let me try and replicate.
... View more
09-07-2017
01:58 PM
9 Kudos
Short Description:
A quick tutorial on how to mask columns in Hive for regulatory purposes
Abstract:
This tutorial will cover how to apply tags to Atlas entities and subsequently leverage tagging policies in Ranger to mask Personally Identifiable Information (PII). Atlas serves as a common metadata store designed to exchange metadata both within and outside of the Hadoop stack. It features a simple user interface and a REST API for ease of access and integration. The Atlas-Ranger paradigm unites data classification with policy enforcement. Figures will be used as a graphical aid. Steps will be provided in between figures via bullet points.
Note:
This tutorial assumes the user has successfully installed Ranger, enabled Ranger audit to Solr, installed Atlas, installed Hive, configured Atlas to work with Hive, and configured Atlas to work with Ranger. It also assumes the user has dummy data for development purposes. For detailed instructions on how to accomplish these steps please review our HDP development documentation:
HDP Developer Guide: Data Governance.
Tools:
You can create the dummy data we will use in this tutorial via the following commands. Ensure you have the proper user privileges to write files to your local environment and copy files into HDFS.
Statement for creating the employee table
create table employee (ssn string, name string, location string)
row format delimited
fields terminated by ','
stored as textfile;
Statement for creating our dummy data
printf "111-111-111,Danny,San Jose\n222-222-222,Mike,Santa Clara\n333-333-333,Robert,Fremont\n345-444-134,James,San Diego\n759-3264-876,Tom,Santa Barbra\n253-856-987,Bobby,Holmdel\n675-883-996,Pat,San Clemente\n859-654-774,John,San Francisco\n568-412-778,Jane,Fremont\n546-889-778,Richard,San Jose\n554-996-332,Jeannine,Toms River\n587-665-124,Alex,New York" > employeedata.txt
Statement for transfer data to the Hive Warehouse
hdfs dfs -copyFromLocal employeedata.txt /apps/hive/warehouse/employee
How to Mask Columns in Hive with Atlas and Ranger
Step 1: Creating our Hive table and populating data
Log into Ambari and navigate to the Hive UI. Submit the create table statement from the tools above. Figure 1
Figure 1: The HIVE UI and create employee table statement
Generate our dummy data. Figure 2
Figure 2: The CLI command to generate our dummy data
Transfer our dummy data to the Hive warehouse and our employee table. Figure 3
Figure 3: The HDFS command to transfer our data to the HIve Warehouse
Ensure our data was properly populated into our Hive table via a select * from employee; SQL command. Figure 4
Figure 4: The "select * from employee" Hive table.
Step 2: Creating a tag in the Atlas UI
Now that we have a table with PII we can create a tag in atlas and leverage the event related metadata we created during the creation of our Hive table.
Log into Atlas UI and navigate to tags. Click the create tag button. Let's create a tag called PII and give it a description. Figure 1
Figure 1: The Atlas UI and the Creation of a Tag
Now that we have our PII tag, we can append our tag to an Atlas entity. In this case we want to search for hive_table; which is an entity atlas supports out of the box. Once we've found our entity, we can then search for the columns we wish to task and subsequently mask. Figure 2
Figure 2: Navigating to our hive_table entities and searching for the employee table
In this case let's select the employee table; which contains a ssn column. After we've clicked on the table we wish to work with; we can navigate to our columns of choice ssn.
Figure 3
Figure 3: Employee hive_table entity and subsequent columns
Once we've clicked on the ssn column under the key columns and value ssn we can add a tag via the button underneath the name of the Atlas hive_column entity. Figure 4
Let's select PII, the tag we previously created and click add. After we've tagged the ssn (hive_column) entity, our tags key will show a blue PII tag to let the user know a tag has been applied to the entity.
Figure 4: Adding a tag via the tagging + button under ssn(hive_column)
We can now navigate back to the tags page of our UI and see that our PII tag has been applied to the ssn column. Figure 5
This navigation and classification paradigm allows the user to mask multiple columns in a variety of datasets. This is just a simple example; but one can leverage this notion to apply tagging to other conventions. e.g the dev_group only has access to website_user_tables, the data_science_group only has access to specific tables, and the admin_group can assign create, update, alter SQL commands to users or groups that should have those privileges and assign select SQL commands to anyone who is not allowed to alter data.
Figure 5: The assignment of our PII tag to the employee.ssn column
Step 3: Creating a Tag Based Policy in Ranger
Now that we've created a table in Hive and assigned a tag in Atlas, we can create a tag based policy in Ranger.
Log into the Ranger UI and navigate to access manager >> tag based policies. Figure 1
Figure 1: Accessing the Ranger Tag Based Policies
Click the + button to create a new tag policy. Let's name ours PII_Tagging_Policy, add a description, and click Add. Figure 2
Figure 2: Creating a new tagging Service called PII_Tagging_Policy
After adding the new tag based policy, navigate back to the tag based policies service manager and click the tagging policy we just created. There are two tabs within our tagging policy: Access and Masking. Figure 3
Figure 3: Navigating to the Tagging Policy we just created
Since we want to apply a mask to the columns we've tagged in Atlas with PII we are going to create a masking policy, so navigate to the masking tab and select Add New Policy. Figure 4
Figure 4: Adding a new masking policy within our PII_Tagging_Policy
Let's name our new policy PII_Masking_Policy, and select the Atlas Tag we previously created.
Be sure to select the user our group you want to apply this tagging policy to. In this case, I am applying the policy to a single user, admin. Our mask conditions will be for Hive and we will select a Hash Mask
This is where Ranger integrates with Atlas and applies a hashing policy to the column entities we selected within Atlas. The Ranger tag policy will follow our PII tag on all of our Hive datasets; and we can now manage our how our policy is applied via Atlas. Figure 5 Be sure to include the relevant user or group you'd like apply this policy to. In this case, I choose the user admin.
Figure 5: Applying our Atlas Tag and Ranger Masking Policy
We still need to ensure Ranger applies this policy to our Hive resource within Ranger so let's navigate to access manager>> resource based policies and edit our hive service. In this case, we're editing the woolford_hdp_hive service. Click the pencil icon next to the service based policy name. Figure 6
If you do not have a hive service, you can create a new service and apply the proper settings.
Figure 6: Navigating to the resource based policies page and the hive service folder
Edit the hive service by selecting the Ranger tag policy we just created "PII_Tagging_Policy". There is a drop down menu next to the Select Tag Service property. Figure 7
Save your changes.
Figure 7: Applying our Ranger tagging policy to the Hive Resource Step 4: Verifying our Users cannot access PII
Now that we've created a tag, applied that tag to an Atlas entity, (our Hive column ssn from the employees table), created a Ranger tagging policy that leverages that tag and masks the information within the tagged column, and applied that tagging policy to our Hive service we can validate our policy from the Hive UI.
Navigate to the Hive UI and fire off a "select * from employee;" query.
You will now see a mask over the ssn column. Figure 1
Figure 1: The Hive UI and our masked ssn column, notice the Hash over our PII data
Conclusion:
Congratulations; you're now able to successfully implement a tagging policy that masks Personally Identifiable Information so your organization can comply with regulatory guidelines. You can also perform a number of other tasks that are very similar to the policy you've just developed. For a full listing of features please refer to the developer guide above or post questions to the Hortonworks Community Connection. Suggestions for other tutorials are also highly encouraged.
... View more
Labels: