HDP Security/Governance/GDPR Demo kit

Summary

How to quickly set up the HDP Security/Governance/GDPR (HortoniaBank) demo. It can be deployed either on AWS using a prebuilt AMI or on your own setup using a script.

What's included

  • Single node HDP 3.1 or 2.6.5 including:
    • Kerberos - for authentication (via local MIT KDC)
    • Ranger - for authorization (via both resource/tag based policies for access and masking)
    • Atlas - for governance (classification/lineage/search)
    • Zeppelin - for running/visualizing Hive queries
    • NiFi - to show how lineage is captured for events ingested from the edge
    • Knox - for SSO and proxying
  • HortoniaBank artifacts
  • Demo hive tables
  • Demo tags/attributes and lineage in Atlas
  • Demo Zeppelin notebooks to walk through demo scenario
  • Ranger policies across HDFS, Hive, HBase, Kafka, Atlas to showcase:
    • Tag based access policies across HDFS/Hive/HBase/Kafka
    • Row level filtering on co-mingled datasets in Hive
    • Dynamic tag based masking in Hive columns
    • Time bound Ranger policies
    • Hive UDF execution authorization
    • Classifications (tags) in Atlas
    • Tag propagation
    • Data lineage in Atlas for HDFS, Hive, NiFi, HBase
    • GDPR Scenarios around consent and data erasure
    • Single sign-on and proxying across Ambari/Ranger/Atlas/Zeppelin

Option 1: Steps to deploy on your own setup

Follow README from github: https://github.com/abajwa-hw/masterclass/blob/master/ranger-atlas

Caveats of using the script:

  • This only deploys the HDP 3.1 version of the demo
  • To skip setup of SSO/proxy, set enable_knox_sso_proxy=false before running (see the sketch below)
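
For example, a minimal sketch, assuming the deploy script picks this flag up from the environment (check the README for the exact invocation):

# assumption: the script reads enable_knox_sso_proxy from the environment
export enable_knox_sso_proxy=false
# ...then run the deploy script as described in the README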

Option 2: Steps to launch prebuilt AMI on AWS

  • 1. Open the Amazon AWS console page in your browser by clicking here and sign in with your credentials. Once signed in, you can close this browser tab.

  • 2. Select the AMI from the ‘N. California’ region by clicking one of the options below:
    • For HDP 3.1 with Knox SSO: click here
    • For HDP 3.1 without Knox SSO: click here
    • For HDP 3.0.1 with NiFi and Knox SSO: click here.
    • For HDP 3.0 without NiFi or Knox SSO: click here
    • For HDP 2.6.5 with NiFi and Knox SSO: click here
    • For HDP 2.6.5 with only NiFi: click here
  • Now choose instance type: select ‘m4.2xlarge’ and click Next

Note: if you choose a smaller instance type than the above recommendation, not all services may come up.

  • 3. Configure Instance Details: leave the defaults and click ‘Next’

  • 4. Add storage: keep the default of 800 GB and click ‘Next’

  • 5. Optionally, add a name or any other tags you like. Then click ‘Next’

  • 6. Configure security group: create a new security group and select ‘All traffic’ to open all ports. For production usage, a more restrictive security group policy is strongly encouraged; for instance, only allow traffic from your company’s IP range. Then click ‘Review and Launch’
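
If you later want to tighten the security group from the command line, here is a hedged AWS CLI sketch (the security group ID and CIDR are placeholders for your own values):

# allow all TCP ports only from your company's IP range (placeholder group ID and CIDR)
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 0-65535 --cidr 203.0.113.0/24
# remember to also remove the wide-open 0.0.0.0/0 rule from the group in the EC2 console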

  • 7. Review your settings and click Launch

  • 8. Create and download a new key pair (or choose an existing one). Then click ‘Launch instances’

  • 9. Click the shown link under ‘Your instances are now launching’

  • 10. This opens the EC2 dashboard that shows the details of your launched instance


  • 11. Make note of your instance’s ‘Public IP’ (which will be used to access your cluster) and the ‘Owner’ id (which will be the default password). If the ‘Public IP’ is blank, wait 1-2 minutes for this to be populated
  • For Knox SSO, you will need to create an entry for demo.hortonworks.com (pointing to the Public IP from above) in your local laptop's hosts file (see the example after step 12 below). For instructions on how to do this see here
  • 12. After 5-10 minutes, open the below URL in your browser to access Ambari’s console:
    • If you selected an AMI that has Knox SSO, use either Firefox or an Incognito Chrome window to navigate to http://demo.hortonworks.com:8080, where you will be shown the Knox sign-on login page. Enter user: admin, password: BadPass#1
    • If not using Knox SSO, log in to http://<PUBLIC IP>:8080 as the admin user, using your ‘Owner’ id as the password (you can find your owner id on the instance details page, as highlighted above)
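
For example, on macOS or Linux you can add the hosts-file entry mentioned in step 11 like this (run it on your laptop, not on the VM):

# map demo.hortonworks.com to the instance's Public IP in /etc/hosts
echo "<PUBLIC IP> demo.hortonworks.com" | sudo tee -a /etc/hosts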

  • 13. At this point, Ambari may still be in the process of starting all the services. You can tell by the presence of the blue ‘op’ notification near the top left of the page. If so, just wait until it is done.

(Optional) You can also monitor the startup using the log as below:

  • Open an SSH session into the VM using your key and the public IP, e.g. from OSX:

ssh -i ~/.ssh/mykey.pem centos@<publicIP>

  • Tail the startup log:

tail -f /var/log/hdp_startup.log

  • Once you see “cluster is ready!” you can proceed
  • 14. Once the blue ‘op’ notification disappears and all the services (except Hive) show a green check mark, the cluster is fully up.
  • NOTE: There will be 2 instances of Hive installed: HiveServer and HiveServerInteractive. HiveServerInteractive takes up most of the cluster resources, so HiveServer may not come up. This error can be ignored as the demo only makes use of HiveServerInteractive (i.e. Hive 2.x). You may also notice that YARN memory is 100% in use: this is also normal and can be ignored.

These Hive-related alerts can be ignored (see the note above).

If any other services fail to start, use the Actions > Start All button to start them.
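
Alternatively, services can be started over Ambari's REST API; a hedged sketch (password, public IP, and cluster name are placeholders — the cluster name is shown in the Ambari UI):

# equivalent of Actions > Start All via the Ambari REST API
curl -u admin:<password> -H 'X-Requested-By: ambari' -X PUT \
  -d '{"RequestInfo":{"context":"Start All Services"},"Body":{"ServiceInfo":{"state":"STARTED"}}}' \
  http://<PUBLIC IP>:8080/api/v1/clusters/<cluster name>/services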

Accessing cluster URLs (when Knox SSO enabled)

https://demo.hortonworks.com:8443/gateway/ui/ambari

https://demo.hortonworks.com:8443/gateway/ui/ranger

https://demo.hortonworks.com:8443/gateway/ui/atlas

https://demo.hortonworks.com:8443/gateway/ui/yarn

https://demo.hortonworks.com:8443/gateway/ui/yarnuiv2

https://demo.hortonworks.com:8443/gateway/manager/admin-ui/

https://demo.hortonworks.com:8443/gateway/ui/hdfs/?host=http://demo.hortonworks.com:50070

https://demo.hortonworks.com:8443/gateway/ui/nifi/

https://demo.hortonworks.com:8443/gateway/ui/zeppelin/

Demo user logins:

  • ivanna_eu_hr = BadPass#1
  • joe_analyst = BadPass#1
  • etl_user = BadPass#1
  • scott_intern = BadPass#1

Accessing cluster URLs (without Knox SSO)

1. Access Ambari at port 8080. The Ambari admin password will have been reset to your Amazon owner id (as described above); this is for security purposes, but you can change it if desired.

2. Access Ranger at port 6080. Ranger login is admin/BadPass#1

3. Access Atlas on port 21000. Atlas login is admin/BadPass#1

4. Access Zeppelin on port 9995. Zeppelin user logins are:

  • ivanna_eu_hr = BadPass#1
  • joe_analyst = BadPass#1
  • etl_user = BadPass#1

5. (Optional) To kinit via the terminal:

kinit admin/admin (password: hadoop)
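
To confirm the ticket was granted, you can run klist afterwards:

# list the current Kerberos ticket cache
klist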

Demo walkthrough

  1. Once services are up, open the Ranger UI and also log in to Zeppelin as ivanna_eu_hr. Find her notebook by searching for "hortonia" in the text field under the ‘Notebook’ section and select the notebook called “HortoniaBank - Ivana EU HR”.

  2. On first launch of the notebook, you may be prompted to choose interpreters. You can keep the defaults, but make sure you click the Save button.

Run through the notebook cells using the Play button in the top right of each cell (or Shift+Enter).

This notebook highlights:

a) Row-level filtering: Ivana can only see European data (even though she is querying the worldwide customer table). This is enabled by a row-level filter policy in Ranger's Hive plugin.

b) It also shows that, since Ivana is part of the HR group, there are no policies limiting her access: she can see raw passwords, national IDs, credit card numbers, MRN numbers, birthdays, etc.
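
If you want to reproduce the row-level filtering outside Zeppelin, here is a hedged beeline sketch (the database/table names are illustrative placeholders — use the ones from the notebook, and take the HiveServer2 Interactive JDBC URL from Ambari's Hive summary page):

# run the same kind of query as ivanna_eu_hr; only EU rows should come back
kinit ivanna_eu_hr        # demo password: BadPass#1
beeline -u "<HiveServer2 Interactive JDBC URL>" -e "SELECT country, COUNT(*) FROM <db>.<customer table> GROUP BY country;"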

3. Once you successfully run the notebook, you can open Ranger Audits and the YARN UI to show the policies, confirm that the queries ran as her, and see that row filtering occurred (notice the ROW_FILTER access type).

4. Log out of Zeppelin

5. Log back in as joe_analyst. Find his notebook by searching for "hortonia" in the text field under the ‘Notebook’ section and select the notebook called “HortoniaBank - Joe Analyst”.

On first launch of the notebook, you may be prompted to choose interpreters. You can keep the defaults, but make sure you click the Save button.

6. Run through the notebook. This notebook shows:

a) MRN/password columns masked via a tag-based policy in Ranger.

b) Dynamic column-level masking: address, national ID, and credit card number are masked via Hive column policies specified in Ranger. Notice that the birthday and age columns are masked using a custom mask.

c) A prohibition policy where zipcode, insuranceid, and bloodtype cannot be combined in a single query.

d) Tag-based policies:

Attempts to access any object tagged with EXPIRES_ON after its expiry date will be denied. As we will show later, the fed_tax column of the tax_2015 table is tagged in Atlas as EXPIRES_ON with an expiry date in 2016, so querying it should not be allowed.

Attempts to access objects tagged with PII will also be denied, as the policy allows only HR. As we will show later, the ssn column of the tax_2015 table is tagged as PII in Atlas.

Attempts to access the cost_savings.claim_savings table as an analyst will fail because there is a policy requiring a minimum data quality score of 60% for analysts. As we will see, this table is tagged in Atlas as having a score of 51%.

The bottom of the notebook also shows how tag-based policies can be used with Kafka topics, HBase tables, or HDFS directories: those tagged as SENSITIVE are accessible by ivanna_eu_hr but not by joe_analyst.
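
These tag-based denials can also be reproduced from beeline; a hedged sketch using the tables named above (the JDBC URL is a placeholder, and the database holding tax_2015 may differ in your setup):

# expected to be denied for joe_analyst: fed_tax is tagged EXPIRES_ON (expired) and ssn is tagged PII
kinit joe_analyst         # demo password: BadPass#1
beeline -u "<HiveServer2 Interactive JDBC URL>" -e "SELECT ssn, fed_tax FROM tax_2015 LIMIT 5;"
# expected to be denied: claim_savings carries a DATA_QUALITY score below the 60% threshold
beeline -u "<HiveServer2 Interactive JDBC URL>" -e "SELECT * FROM cost_savings.claim_savings LIMIT 5;"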

7. Confirm using Ranger audits and the YARN RM UI that the queries ran as joe_analyst. Notice that column names, masking types, IPs, and policy IDs were captured, and that tags (e.g. DATA_QUALITY or PII) are captured along with their attributes.

8. Similarly, you can log out of Zeppelin, log back in as etl_user, and run his notebook to see how EU customers who had not provided consent for marketing were filtered out for Ivanna but are visible to etl_user.

The notebook also covers how a customer's withdrawal of consent goes into effect, as well as customers making erasure requests. The key thing to notice is how we are able to make updates/deletes to an immutable filesystem using the new Hive ACID capabilities.
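
For reference, the consent and erasure updates in the notebook rely on standard Hive ACID DML; a minimal sketch (table and column names are illustrative placeholders, not the notebook's exact ones):

# withdraw consent for one customer, then honor a 'right to erasure' request (ACID UPDATE/DELETE on a transactional Hive table)
beeline -u "<HiveServer2 Interactive JDBC URL>" -e "UPDATE <db>.<customer table> SET marketing_consent = false WHERE customer_id = <id>; DELETE FROM <db>.<customer table> WHERE customer_id = <id>;"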

Exploring Atlas

9. Log in to Atlas and look for the Hive columns tagged as EXPIRES_ON.

To see the table name, you can select Table in the Column dropdown

Now notice the table name is also displayed

Selecting one of the columns and opening the Tag tab shows the attributes of the tag (expiry_date) and their values.

To save this search, you can click the “Save As” button near bottom left:

Similarly, you can query for Hive tables tagged with DATA_QUALITY...

...and click on it to confirm the quality score associated with this table is less than 60%
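
The same kind of search can also be issued against Atlas's REST API; a hedged sketch using the v2 basic-search endpoint (shown against the direct Atlas port; use http://<PUBLIC IP>:21000 if you are not using the hosts-file entry):

# find hive_column entities carrying the EXPIRES_ON classification
curl -u admin:BadPass#1 "http://demo.hortonworks.com:21000/api/atlas/v2/search/basic?typeName=hive_column&classification=EXPIRES_ON"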

10. Use Atlas to query for hive_table entities and pick provider_summary to show lineage and impact.

You can use the Audits tab to see audits on this table

You can use the Schema tab to inspect the table schema.

Lineage for NiFi flow:

A sample NiFi flow has been provided that reads tweets, converts them to pipe-delimited format, and merges them into files on HDFS. To see the lineage created by NiFi, search for hdfs_path in Atlas and select a file from /tmp/twitter_staging to view its NiFi-generated lineage.
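
To confirm the merged files exist on HDFS before looking them up in Atlas, you can list the staging directory from an SSH session on the VM (kinit first, as shown earlier):

# list the tweet files NiFi merged into HDFS
hdfs dfs -ls /tmp/twitter_staging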


11. Finally, navigate to the Atlas home page and notice the option to create a new entity.

A number of out-of-the-box entity types are available to create.

Selecting an entity type (e.g. hdfs_path) shows the required and optional fields you would need to provide to manually create the new entity.
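
Entities can also be created programmatically; a minimal hedged sketch against the Atlas v2 entity API (the path, name, and qualifiedName suffix are illustrative values):

# manually create an hdfs_path entity via REST (illustrative attribute values)
curl -u admin:BadPass#1 -H 'Content-Type: application/json' -X POST \
  http://demo.hortonworks.com:21000/api/atlas/v2/entity \
  -d '{"entity":{"typeName":"hdfs_path","attributes":{"qualifiedName":"/tmp/example@demo","name":"example","path":"/tmp/example"}}}'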

AMI Limitations

  • HiveServer2Interactive takes up YARN resources so HiveServer2 may not come up. This will not impact the running of the demo.

Questions?

In case of questions or issues:

  • 1. Search our Hortonworks Community Connection forum. For example, to find all Demo Kit related posts, access this url
  • 2. If you were not able to find a solution, please post a new question using the tag “partner-demo-kit” here. Please try to be as descriptive as possible when asking questions by providing:
    • Detailed description of the problem
    • Steps to reproduce the problem
    • Environment details, e.g.:
      • Instance type used (e.g. m4.2xlarge)
      • Storage used (e.g. 500 GB)
      • etc.
    • Relevant log file snippets
