Created on 06-03-2020 07:40 PM - edited on 08-08-2021 10:00 PM by subratadas
This article explains how to quickly set up the Cloudera Security/Governance/GDPR (Worldwide Bank) demo using Cloudera Data Platform - Private Cloud Base (formerly known as CDP Data Center). It can be deployed either on AWS using the provided AMI or on your own setup via the provided script.
#Install git
yum install -y git
#Set up the KDC
curl -sSL https://gist.github.com/abajwa-hw/bca3d23fe146c3ebd59a9b5fd19480a3/raw | sudo -E sh
#Set up a single-node kerberized CDP cluster
git clone https://github.com/fabiog1901/SingleNodeCDPCluster.git
cd SingleNodeCDPCluster
./setup_krb.sh gcp templates/wwbank_krb.json
#Set up the Worldwide Bank demo using the script
curl -sSL https://raw.githubusercontent.com/abajwa-hw/masterclass/master/ranger-atlas/setup-dc-703.sh | sudo -E bash
#For the AMI option: SSH to the instance and watch the automated startup complete
ssh -i ~/.ssh/mykey.pem centos@<publicIP>
tail -f /var/log/cdp_startup.log
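Once the startup log indicates setup is complete, you can optionally sanity-check Kerberos by authenticating as one of the demo users (the keytab path below is the one referenced later in this article):
#Authenticate as joe_analyst using the keytab created by the demo setup
kinit -kt /etc/security/keytabs/joe_analyst.keytab joe_analyst/$(hostname -f)@CLOUDERA.COM
#Confirm a valid ticket was granted
klist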
CDP URLs
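The exact list varies by release, but assuming the default non-TLS ports, the main UIs are typically reachable at:
#Cloudera Manager
http://<publicIP>:7180
#Ranger
http://<publicIP>:6080
#Atlas
http://<publicIP>:31000
#Hue
http://<publicIP>:8888
#Zeppelin
http://<publicIP>:8885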
hive.execute("SELECT surname, streetaddress, country, age, password, nationalid, ccnumber, mrn, birthday FROM worldwidebank.us_customers").show(10)
hive.execute("select zipcode, insuranceid, bloodtype from worldwidebank.ww_customers").show(10)
hive.execute("select * from cost_savings.claim_savings").show(10)
Confirm using the Ranger audits that the queries ran as joe_analyst. Notice that the column names, masking types, IPs, and policy IDs were captured, and that tags (such as DATA_QUALITY or PII) are captured along with their attributes. Also notice that these audits were captured for operations across Hive, HBase, Kafka, and HDFS.
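If you prefer the command line to the Ranger UI, recent Ranger releases can also serve these audit events over REST. A rough sketch only; the endpoint, the requestUser query parameter, and the admin credentials here are assumptions to adjust for your setup:
#Pull the latest access-audit events for joe_analyst from Ranger (assumed endpoint and credentials)
curl -s -u admin:admin "http://$(hostname -f):6080/service/assets/accessAudit?requestUser=joe_analyst&pageSize=10"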
Similarly, you can log in to Zeppelin as etl_user and run his notebook as well. This notebook shows how an admin would handle GDPR scenarios (such as honoring a customer's right to erasure) using Hive ACID capabilities:
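For reference, such scenarios reduce to standard Hive ACID DELETE/UPDATE statements. A minimal sketch, assuming the demo tables are transactional and using a hypothetical nationalid value (the JDBC URL mirrors the spark-shell one used later in this article):
#Erase a customer record (right to be forgotten); the nationalid value is hypothetical
beeline -u "jdbc:hive2://$(hostname -f):10000/default;principal=hive/$(hostname -f)@CLOUDERA.COM" \
  -e "DELETE FROM worldwidebank.us_customers WHERE nationalid = '123-45-6789'"
#Or anonymize in place instead of deleting
beeline -u "jdbc:hive2://$(hostname -f):10000/default;principal=hive/$(hostname -f)@CLOUDERA.COM" \
  -e "UPDATE worldwidebank.us_customers SET surname = 'REDACTED' WHERE nationalid = '123-45-6789'"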
Alternatively, you can log in to Hue as joe_analyst and select Query > Editor > Hive, and click Saved queries to run Joe's sample queries via Hive:
You can also switch the editor to Impala to run Joe's sample queries via Impala to show tag-based access policy working for Impala:
In CDP 7.1.1, Impala also supports column-based masking:
Alternatively, you can log in to Hue as ivanna_eu_hr and click Saved queries to run Ivanna's sample queries via Hive:
To run secure SparkSQL queries (using Hive Warehouse Connector):
#Authenticate as joe_analyst before launching the Spark session
kinit -kt /etc/security/keytabs/joe_analyst.keytab joe_analyst/$(hostname -f)@CLOUDERA.COM
#Start spark-shell with the Hive Warehouse Connector jar, pointing at HiveServer2
spark-shell --jars /opt/cloudera/parcels/CDH/jars/hive-warehouse-connector-assembly*.jar --conf spark.sql.hive.hiveserver2.jdbc.url="jdbc:hive2://$(hostname -f):10000/default;" --conf "spark.sql.hive.hiveserver2.jdbc.url.principal=hive/$(hostname -f)@CLOUDERA.COM" --conf spark.security.credentials.hiveserver2.enabled=false
//Build a HiveWarehouseSession on top of the SparkSession
import com.hortonworks.hwc.HiveWarehouseSession
import com.hortonworks.hwc.HiveWarehouseSession._
val hive = HiveWarehouseSession.session(spark).build()
//Queries submitted through HWC go through HiveServer2, so the Ranger policies apply
hive.execute("select * from cost_savings.claim_savings").show(10)
#The same flow is also available as a helper script:
/tmp/masterclass/ranger-atlas/HortoniaMunichSetup/run_spark_sql.sh
In case you encounter a Thrift exception like the following, it's likely that the session has expired:
Just scroll to the top of the notebook and click the gears icon (near the top right) to display the interpreter bindings, then restart the JDBC interpreter:
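The same restart can also be scripted against Zeppelin's REST API (a sketch only; the port and open access are assumptions, and the interpreter setting ID is a placeholder):
#List interpreter settings to find the JDBC interpreter's id
curl -s http://$(hostname -f):8885/api/interpreter/setting
#Restart it (replace <settingId> with the id found above)
curl -s -X PUT http://$(hostname -f):8885/api/interpreter/setting/restart/<settingId>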
Appendix:
The following are some older AMI links (for HDP releases):
Created on 08-19-2020 07:02 AM
@abajwa, I have tried to use the given AMI and received the message below.
You currently do not have a Cloudera Data Platform License.
Please upload a valid license to access Cloudera Manager Admin Console. Your cluster is still functional, and all of your data remains intact.
Do you know how to get the 60-day trial version added to the given AMI?
Thanks in advance
Created on 08-19-2020 06:29 PM
@AkhilTech thanks for your question. We just updated the article to include a link to a new AMI based on CDP 7.1.3. Alternatively, you can use the script to deploy instead, which will give you a new trial license each time. To request a permanent license, you can contact our sales team: https://www.cloudera.com/contact-sales.html
Created on 08-20-2020 03:40 AM
@abajwa, thanks! I am able to use the new AMI.
Created on 10-20-2020 06:07 PM - edited 10-20-2020 11:20 PM
I have a problem with the AMI
I'm using 0b3b57b9fa9a742ee, i.e. the CDP-DC 7.1.3 with WorldWideBank demo v4 image.
The machine has started but it is showing:
Are those images obsolete?
Created on 01-27-2021 12:14 PM - edited 01-27-2021 12:15 PM
Trying to get this working in a VMware VM on an on-premises server. I'm running into a number of issues which I will try to troubleshoot, but one thing that would help a lot, I think, is stating at the outset how much disk space is required. I have tried a couple of times and run into issues that seem to be related to not enough space on root or some other file system. I did find in a README that I probably need at least 100 GB; it would have been useful to know that before I created the VM.
Sorry to complain but eager to get this working, finally, after a failed attempt about six months ago.
Created on 02-03-2021 01:06 PM
@boulder the AMI comes with a trial license for CM which expires after 90 days. At that point, the services are all up, but in order to open CM you would need to add a license.
Note: We just updated the article with an updated 7.1.4 AMI which has a fresh trial.
You can also use the script option to spin up a fresh cluster, which gives a new trial license each time.
@antonio_r thanks, yes, you'd need a VM with roughly the same specs as an m4.4xlarge. I have updated the article to include specs for option #1 as well.
Created on 02-03-2021 01:15 PM - edited 02-03-2021 01:24 PM
Thanks for the clarification and for your efforts overall.
I tried the self-install on a CentOS EC2 instance and that mostly worked. A number of the services report health issues (HDFS shows up in CM as not starting, for example), but surprisingly things seem to work anyway. I can run queries in the notebooks, for example, and the Ranger permissions apply.
I will revisit my home environment VM and see if I have better luck. (I gave it ample resources, 24 virtual CPUs, 96 GB RAM, and 150 GB storage, but I still seemed to hit issues.)
Are there any unusual features that the networking in EC2 provides vs. a vanilla VM? For example, the EC2 environment answers at an internal AWS IP address as well as the public-facing one. Do the services communicate with each other in some way that may require that? Do I need to reproduce that for my self-hosted VM to work? Do I, for example, need more than one virtual NIC?
Created on 02-03-2021 01:27 PM
@antonio_r after running the script, I noticed some services show a weird state (even though they are up). You can restart the "Cloudera Management Service" (scroll down to the bottom of the list of services, under ZooKeeper); that usually fixes it for me.
A vanilla VM should work the same way; I have installed using the script on our internal OpenStack environment without issues. You should not require both internal and public IPs. Just make sure the networking is set up as required by Hadoop: https://docs.cloudera.com/cloudera-manager/7.1.1/installation/topics/cdpdc-configure-network-names.h...
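In practice, the steps in that doc boil down to something like the following on the VM (the hostname and IP here are hypothetical placeholders):
#Set a fully qualified hostname
sudo hostnamectl set-hostname cdp.demo.local
#Map the VM's IP to the FQDN and short name in /etc/hosts
echo "192.168.1.50 cdp.demo.local cdp" | sudo tee -a /etc/hosts
#Verify; note that host queries DNS rather than /etc/hosts, so this check may need a local DNS server
host -v -t A $(hostname)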
Created on 02-03-2021 01:28 PM - edited 02-03-2021 03:08 PM
Thanks for the suggestion, I will investigate at the next opportunity (probably over the weekend).
I am QUITE sure that I did not have the "Configure Network Names" steps done correctly when I tried several months ago, but I just couldn't figure out how to fix it. This should help quite a bit.
Created on 02-04-2021 07:44 PM - edited 02-04-2021 07:48 PM
Just make sure the networking is set up as required by Hadoop: https://docs.cloudera.com/cloudera-manager/7.1.1/installation/topics/cdpdc-configure-network-names.h...
That was the missing detail! For anyone else who tries this in VMware:
a) Do the network setup as described above on your host before you start the steps in this document
b) At the end of the networking instructions, I struggled with:
"Run host -v -t A $(hostname) and verify that the output matches the hostname command. The IP address should be the same as reported by ifconfig for eth0 (or bond0)..."
I found that I needed my own DNS server within the environment to make the networking finally behave as described. I set up an instance of dnsmasq in my CentOS Linux environment; it's compact, lightweight, included with CentOS, and took about three minutes to configure, following the instructions here:
https://brunopaz.dev/blog/setup-a-local-dns-server-for-your-projects-on-linux-with-dnsmasq
About five lines of config and I was off to the races 🙂
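For anyone who wants a concrete starting point, a minimal dnsmasq setup along those lines might look like this (the hostname, IP, and upstream resolver are hypothetical; adapt them to your VM):
#Install dnsmasq from the CentOS repos
sudo yum install -y dnsmasq
#Minimal config: an A/PTR record for the cluster host, plus an upstream resolver
sudo tee /etc/dnsmasq.d/cdp.conf <<'EOF'
host-record=cdp.demo.local,192.168.1.50
server=8.8.8.8
domain-needed
bogus-priv
EOF
sudo systemctl enable --now dnsmasq
#Point the VM at the local resolver
echo "nameserver 127.0.0.1" | sudo tee /etc/resolv.conf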