This article explains how to quickly set up the Cloudera Security/Governance/GDPR (Worldwide Bank) demo using Cloudera Data Platform - Private Cloud Base (formerly known as CDP Data Center). It can be deployed either on AWS using the provided AMI or on your own setup via the provided script.
# Install git (needed to clone the setup repo)
yum install -y git

# Set up the KDC
curl -sSL https://gist.github.com/abajwa-hw/bca3d23fe146c3ebd59a9b5fd19480a3/raw | sudo -E sh

# Set up a kerberized single-node CDP cluster
git clone https://github.com/fabiog1901/SingleNodeCDPCluster.git
cd SingleNodeCDPCluster
./setup_krb.sh gcp templates/wwbank_krb.json

# Set up the Worldwide Bank demo using the provided script
curl -sSL https://raw.githubusercontent.com/abajwa-hw/masterclass/master/ranger-atlas/setup-dc-703.sh | sudo -E bash
# For the AMI option: SSH into the instance and monitor the setup log until startup completes
ssh -i ~/.ssh/mykey.pem centos@<publicIP>
tail -f /var/log/cdp_startup.log
CDP URLs:
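Once setup completes, the service UIs should be reachable at URLs like the following (a typical single-node layout assuming default ports, which can vary by release; substitute your instance's public IP):

Cloudera Manager: http://<publicIP>:7180
Ranger: http://<publicIP>:6080
Atlas: http://<publicIP>:31000
Hue: http://<publicIP>:8888
Zeppelin: http://<publicIP>:8885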
hive.execute("SELECT surname, streetaddress, country, age, password, nationalid, ccnumber, mrn, birthday FROM worldwidebank.us_customers").show(10)
hive.execute("select zipcode, insuranceid, bloodtype from worldwidebank.ww_customers").show(10)
hive.execute("select * from cost_savings.claim_savings").show(10)
Confirm using Ranger audits that the queries ran as joe_analyst. Notice that column names, masking types, IPs, and policy IDs were captured, and that tags (such as DATA_QUALITY or PII) are captured along with their attributes. Notice also that these audits were captured for operations across Hive, HBase, Kafka, and HDFS. Similarly, you can log in to Zeppelin as etl_user and run his notebook as well.
This notebook shows how an admin would use Hive ACID capabilities to handle GDPR scenarios such as a customer's right to erasure.
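As a minimal sketch of an erasure request handled with a Hive ACID DELETE via beeline (the etl_user keytab path follows the demo's keytab naming but is an assumption, the nationalid value is a placeholder, and the table is assumed to be transactional):

# Authenticate as a user allowed to modify the data (keytab path is an assumption)
kinit -kt /etc/security/keytabs/etl_user.keytab etl_user/$(hostname -f)@CLOUDERA.COM

# Delete the requester's records; the nationalid value here is a placeholder
beeline -u "jdbc:hive2://$(hostname -f):10000/default;principal=hive/$(hostname -f)@CLOUDERA.COM" \
  -e "DELETE FROM worldwidebank.us_customers WHERE nationalid = '<requester-national-id>';"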
Alternatively, you can log in to Hue as joe_analyst, select Query > Editor > Hive, and click Saved queries to run Joe's sample queries via Hive.
You can also switch the editor to Impala and run Joe's sample queries there, to show the tag-based access policies working for Impala.
In CDP 7.1.1, Impala also supports column-based masking.
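A quick way to confirm this from the command line (a sketch assuming you are kinit'd as joe_analyst and an Impala daemon runs on the same host):

# Run one of Joe's queries through Impala; masked columns should come back masked here too
impala-shell -k -i $(hostname -f) -q "SELECT surname, nationalid, ccnumber FROM worldwidebank.us_customers LIMIT 10;"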
Alternatively, you can log in to Hue as ivanna_eu_hr and click Saved queries to run Ivanna's sample queries via Hive.
To run secure SparkSQL queries (using Hive Warehouse Connector):
# Authenticate as joe_analyst using his keytab
kinit -kt /etc/security/keytabs/joe_analyst.keytab joe_analyst/$(hostname -f)@CLOUDERA.COM

# Launch spark-shell with the Hive Warehouse Connector, pointing at HiveServer2
spark-shell --jars /opt/cloudera/parcels/CDH/jars/hive-warehouse-connector-assembly*.jar --conf spark.sql.hive.hiveserver2.jdbc.url="jdbc:hive2://$(hostname -f):10000/default;" --conf "spark.sql.hive.hiveserver2.jdbc.url.principal=hive/$(hostname -f)@CLOUDERA.COM" --conf spark.security.credentials.hiveserver2.enabled=false
import com.hortonworks.hwc.HiveWarehouseSession
import com.hortonworks.hwc.HiveWarehouseSession._

// Build a HiveWarehouseSession so SQL runs through HiveServer2 and Ranger policies apply
val hive = HiveWarehouseSession.session(spark).build()
hive.execute("select * from cost_savings.claim_savings").show(10)
These steps are also available as a script on the cluster: /tmp/masterclass/ranger-atlas/HortoniaMunichSetup/run_spark_sql.sh
If you encounter a Thrift exception in Zeppelin, it's likely that the session expired. Scroll to the top and click the gear icon (near the top right) to display the notebook's interpreters, then restart the JDBC one.
Appendix:
The following are some older AMI links (for HDP releases):
Created on 02-04-2021 09:13 PM
@antonio_r Glad to hear it! Thanks for providing the details to help the next person. Have updated the article with the link to the networking prereqs. Enjoy your new cluster!
Created on 02-23-2021 04:09 AM
@antonio_r Could you share where the datasets/data used by the script are located, please?
Created on 02-23-2021 07:53 AM
@Rajesh2622 you can find the CSV files containing the data under https://github.com/abajwa-hw/masterclass/tree/master/ranger-atlas/HortoniaMunichSetup/data (e.g. ww_customers_data.csv)
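For example, to grab the repo and peek at the data locally (a quick sketch):

git clone https://github.com/abajwa-hw/masterclass.git
head -5 masterclass/ranger-atlas/HortoniaMunichSetup/data/ww_customers_data.csv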
Created on 04-12-2021 03:15 PM
@abajwa I tried to use the given 7.3/7.4 AMI and received the message below.
You currently do not have a Cloudera Data Platform License. Please upload a valid license to access Cloudera Manager Admin Console. Your cluster is still functional, and all of your data remains intact.
Do you know how to get the 60-day trial version added to the given AMI file?
Thanks in advance
Created on 04-22-2021 01:59 PM
@abajwa Hi, thanks for your help in the past.
Now I have a new question: I want to try adding the Amundsen open source data catalog to the environment to see how it exposes all the datasets that you've populated. It depends on the availability of LDAP or similar to recognize the user who's viewing the data in the system.
Is there some local LDAP or other identity service included in this demo environment?
Thanks for any pointers, -Antonio