Member since: 09-29-2015 | Posts: 67 | Kudos Received: 115 | Solutions: 7
03-14-2016
05:54 AM
11 Kudos
Summary
Enabling SSL encryption for the Web UIs that make up Hadoop is a tedious process that requires planning, learning to use security tools, and lots of mouse clicks through Ambari's UI. This article aims to simplify the process by presenting a semi-automated, start-to-finish example that enables SSL for the below Web UIs in the Hortonworks Sandbox:
- Ambari
- HBase
- Oozie
- Ranger
- HDFS

Planning
There is no substitute for reading the documentation. If you plan on enabling SSL in a production cluster, then make sure you are familiar with SSL concepts and the communication paths between each HDP component. In addition, plan on cluster downtime. Here are some concepts that you should know well:
Certificate Authority (CA): a company that others trust to sign certificates for a fee. On a Mac you can view the list of CAs that your computer trusts by opening the "Keychain Access" application and clicking on "System Roots". If you don't want to pay one of these companies to sign your certificates, you can generate your own CA; just be aware that Google Chrome and other browsers will then present a privacy warning.
Server SSL certificate: files that prove the identity of something, in our case HDP services. Usually there is one certificate per hostname, and it is signed by a CA. A certificate has two pieces: a private key and a public key. The private key stays on the server and is used to decrypt and sign traffic; the public certificate is distributed to clients so that they can verify the server's identity and encrypt messages to it.
Java private keystore: when Java-based HDP services need to encrypt messages, they need a place to find the private key part of a server's SSL certificate. This keystore holds those private keys. It should be kept secure so that attackers cannot impersonate the service; for this reason, each HDP component in this article has its own private keystore.
Java trust keystore: just like my Mac has a list of CAs that it trusts, a Java process on a Linux machine needs the same. This keystore will usually hold the public CA's certificate and any intermediate CA certificates. If a certificate was signed by a CA that you created yourself, then also add the public part of the server's SSL certificate to this keystore. (A brief keytool sketch follows this list.)
Ranger plugins: Ranger plugins communicate with the Ranger Admin server over SSL. What is important to understand is where each plugin executes, and thus where server SSL certificates are needed: for HDFS it is the NameNodes, for HBase the RegionServers, and for YARN the ResourceManagers. When you create server SSL certificates, use the hostnames where the plugins execute.
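To make the keystore concepts above concrete, here is a minimal sketch of how a per-host private keystore and a shared trust keystore might be created with keytool. The aliases, passwords, and paths are illustrative only and are not necessarily what the script below uses.

# Create a private keystore holding this host's key pair (illustrative paths and passwords)
keytool -genkeypair -alias "$(hostname -f)" -keyalg RSA -keysize 2048 \
  -dname "CN=$(hostname -f), OU=Consulting, O=Hortonworks" \
  -keystore /etc/security/serverKeys/keystore.jks \
  -storepass changeit -keypass changeit

# Import the CA certificate into a trust keystore so Java processes trust anything it signed
keytool -importcert -alias myOwnCA -file /tmp/security/ca.crt -noprompt \
  -keystore /etc/security/serverKeys/truststore.jks -storepass changeit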
Enable SSL on HDP Sandbox
This part is rather easy. Install the HDP 2.4 Sandbox and follow the steps below. If you use an older version of the Sandbox, note that you'll need to change the Ambari password used in the script.
Download my script
wget "https://raw.githubusercontent.com/vzlatkin/EnableSSLinHDP/master/enable-ssl.sh"
Stop all services via Ambari (manually stop HDFS or turn off Maintenance Mode)
Execute:
/bin/bash enable-ssl.sh --all
Start all services via Ambari, which is now running on port 8443
Go to the Ranger Admin UI and edit the HDFS and HBase services to set the Common Name for Certificate to sandbox.hortonworks.com

Enable SSL in production
There are two big reasons why enabling SSL in production can be more difficult than in a sandbox:
- If Hadoop components run in Highly Available mode. The solution in most cases is to create a single server SSL certificate and copy it to all HA servers. However, for Oozie you'll need a special server SSL certificate with CN=*.domainname.com
- If you are using Public CAs to sign server SSL certificates. Besides the extra time needed for the CA to sign your certificates, you may also need additional steps to add intermediate CA certificates to the various Java trust stores, and you may have to find a CA that can sign non-FQDN server SSL certificates for Oozie HA
If you are using Ranger to secure anything besides HBase and HDFS then you will need to make changes to the script to enable extra plugins.
The steps are similar to enabling SSL in the Sandbox:
Download my script
wget "https://raw.githubusercontent.com/vzlatkin/EnableSSLinHDP/master/enable-ssl.sh"
Make changes to these variables inside the script to reflect your cluster layout. The script uses these variables to generate certificates and copy them to all machines where they are needed. Below is an example for my three-node cluster.
server1="example1.hortonworks.com"
server2="example2.hortonworks.com"
server3="example3.hortonworks.com"
OOZIE_SERVER_ONE=$server2
NAMENODE_SERVER_ONE=$server1
RESOURCE_MANAGER_SERVER_ONE=$server3
HISTORY_SERVER=$server1
HBASE_MASTER_SERVER_ONE=$server2
RANGER_ADMIN_SERVER=$server1
ALL_NAMENODE_SERVERS="${NAMENODE_SERVER_ONE} $server2"
ALL_OOZIE_SERVERS="${OOZIE_SERVER_ONE} $server3"
ALL_HBASE_MASTER_SERVERS="${HBASE_MASTER_SERVER_ONE} $server3"
ALL_HBASE_REGION_SERVERS="$server1 $server2 $server3"
ALL_REAL_SERVERS="$server1 $server2 $server3"
ALL_HADOOP_SERVERS="$server1 $server2 $server3"
export AMBARI_SERVER=$server1
AMBARI_PASS=xxxx
CLUSTER_NAME=cluster1
If you are going to pay a Public CA to sign your server SSL certificates, then copy them to /tmp/security and name them as follows:
ca.crt
example1.hortonworks.com.crt
example1.hortonworks.com.key
example2.hortonworks.com.crt
example2.hortonworks.com.key
example3.hortonworks.com.crt
example3.hortonworks.com.key
hortonworks.com.crt
hortonworks.com.key
The last certificate is needed for Oozie if you have Oozie HA enabled. The CN of that certificate should be CN=*.domainname.com, as described here.

If you are NOT going to use a Public CA to sign your certificates, then change these lines in the script to be relevant to your organization:
/C=US/ST=New York/L=New York City/O=Hortonworks/OU=Consulting/CN=HortonworksCA
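For reference, here is a rough sketch of how a self-signed CA with a subject like the one above, plus a wildcard certificate for Oozie HA, might be generated with openssl. The file names match the convention listed earlier, but the exact options the script uses may differ.

# Create a self-signed CA (the subject is illustrative; match it to your organization)
openssl req -new -x509 -nodes -days 3650 -newkey rsa:2048 \
  -keyout ca.key -out ca.crt \
  -subj "/C=US/ST=New York/L=New York City/O=Hortonworks/OU=Consulting/CN=HortonworksCA"

# Create a wildcard key and CSR for Oozie HA, then sign it with the CA
openssl req -new -nodes -newkey rsa:2048 \
  -keyout hortonworks.com.key -out hortonworks.com.csr \
  -subj "/C=US/ST=New York/L=New York City/O=Hortonworks/OU=Consulting/CN=*.hortonworks.com"
openssl x509 -req -in hortonworks.com.csr -CA ca.crt -CAkey ca.key \
  -CAcreateserial -days 730 -out hortonworks.com.crt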
Stop all services via Ambari
Execute:
/bin/bash enable-ssl.sh --all
Start all services via Ambari, which is now running on port 8443
Go to the Ranger Admin UI and edit the HDFS and HBase services to set the Common Name for Certificate to the $NAMENODE_SERVER_ONE and $HBASE_MASTER_SERVER_ONE values that you specified in the script above
If you choose not to enable SSL for some components, or decide to modify the script to include others (please send me a patch), then be aware of these dependencies:
- Setting up the Ambari trust store is required before enabling SSL encryption for any other component
- Before you enable HBase SSL encryption, enable Hadoop SSL encryption

Validation tips
View and verify SSL certificate being used by a server
openssl s_client -connect ${OOZIE_SERVER_ONE}:11443 -showcerts < /dev/null
View Oozie jobs through command-line
oozie jobs -oozie https://${OOZIE_SERVER_ONE}:11443/oozie
View certificates stored in a Java keystore
keytool -list -storepass password -keystore /etc/hadoop/conf/hadoop-private-keystore.jks
View Ranger policies for HDFS
cat example1.hortonworks.com.key example1.hortonworks.com.crt >> example1.hortonworks.com.pem
curl --cacert /tmp/security/ca.crt --cert /tmp/security/example1.hortonworks.com.pem "https://example1.hortonworks.com:6182/service/plugins/policies/download/cluster1_hadoop?lastKnownVersion=3&pluginId=hdfs@example1.hortonworks.com-cluster1_hadoop"
Validate that Ranger plugins can connect to the Ranger Admin server by searching for util.PolicyRefresher in the HDFS NameNode and HBase RegionServer log files
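For example, on a typical HDP layout that check looks roughly like this (the log paths below are the usual defaults and may differ on your cluster):

# Look for policy refreshes from the HDFS Ranger plugin on a NameNode
grep util.PolicyRefresher /var/log/hadoop/hdfs/hadoop-hdfs-namenode-*.log | tail -5

# The same check for the HBase Ranger plugin on a RegionServer
grep util.PolicyRefresher /var/log/hbase/hbase-hbase-regionserver-*.log | tail -5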
References
- GitHub repo
- Documentation to enable SSL for Ambari
- Oozie: HDP documentation and Oozie documentation on apache.org
- Enable SSL encryption for Hadoop components
- Documentation for Ranger
03-06-2016
11:35 PM
1 Kudo
Thanks. I'll watch this JIRA for progress: https://issues.apache.org/jira/browse/HIVE-10924
03-02-2016
03:59 AM
Problems fixed. There is no longer a step to chroot the Solr directory in ZooKeeper.
03-01-2016
05:46 PM
@Artem Ervits Thanks for giving this tutorial a try. If you are getting the errors on an HDP Sandbox, would you send me the .vmdk file? I'll take a look and see what needs to change in the tutorial.
03-01-2016
12:52 AM
Yes, I should have added a link to GitHub: https://github.com/vzlatkin/DoctorsNotes
03-01-2016
12:22 AM
12 Kudos
Summary
Because patients visit many doctors, trends in their ailments and complaints may be difficult to identify. The steps in this article will help you address exactly this problem by creating a TagCloud of the most frequent complaints per patient. Below is a sample:
We will generate random HL7 MDM^T02 (v2.3) messages that contain a doctor's note about a fake patient and that patient's fake complaint to their doctor. Apache NiFi will be used to parse these messages and send them to Apache Solr. Finally Banana is used to create the visual dashboard.
In the middle of the dashboard is a TagCloud where the more frequently mentioned symptoms for a selected patient appear larger than others. Because this project relies on randomly generated data, some interesting results are possible. In this case, I got lucky and all the symptoms seem related to the patient's most frequent complaint: Morning Alcohol Drinking. The list of all possible symptoms comes from Google searches.
Summary of steps
Download and install the HDP Sandbox
Download and install the latest NiFi release
Download the HL7 message generator
Create a Solr dashboard to visualize the results
Create and execute a new NiFi flow
Detailed Step-by-step guide
1. Download and install the HDP Sandbox
Download the latest (2.3 as of this writing) HDP Sandbox here. Import it into VMware or VirtualBox, start the instance, and update the DNS entry on your host machine to point to the new instance's IP. On Mac, edit /etc/hosts; on Windows, edit %systemroot%\system32\drivers\etc\hosts as administrator. Add a line similar to the one below:
192.168.56.102 sandbox sandbox.hortonworks.com
2. Download and install the latest NiFi release
Follow the directions here. These are the steps that I executed for 0.5.1:
wget http://apache.cs.utah.edu/nifi/0.5.1/nifi-0.5.1-bin.zip -O /tmp/nifi-0.5.1-bin.zip
cd /opt/
unzip /tmp/nifi-0.5.1-bin.zip
useradd nifi
chown -R nifi:nifi /opt/nifi-0.5.1/
perl -pe 's/run.as=.*/run.as=nifi/' -i /opt/nifi-0.5.1/conf/bootstrap.conf
perl -pe 's/nifi.web.http.port=8080/nifi.web.http.port=9090/' -i /opt/nifi-0.5.1/conf/nifi.properties
/opt/nifi-0.5.1/bin/nifi.sh start
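Optionally, confirm that NiFi started before moving on. The log location below is NiFi's default, and the port matches the change made above:

# Watch the application log until the web server reports that it is listening
tail -f /opt/nifi-0.5.1/logs/nifi-app.log

# Or check that the UI answers on the new port
curl -sI http://sandbox.hortonworks.com:9090/nifi | head -1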
3. Download the HL7 message generator
A big thank you to HAPI for their excellent library for parsing and creating HL7 messages, on which my code relies. The generator creates a very simple MDM^T02 message that includes an in-line note from a doctor. MDM stands for Medical Document Management, and T02 specifies that this is a message for a new document. For more details about this message type, read this document. Here is a sample message for Beatrice Cunningham:
MSH|^~\&|||||20160229002413.415-0500||MDM^T02|7|P|2.3
EVN|T02|201602290024
PID|1||599992601||cunningham^beatrice^||19290611|F
PV1|1|O|Burn center^60^71
TXA|1|CN|TX|20150211002413||||||||DOC-ID-10001|||||AU||AV
OBX|1|TX|1001^Reason For Visit: |1|Evaluated patient for skin_scaling. ||||||F
As a prerequisite to executing the code, we need to install Java 8. Execute this on the Sandbox:
yum -y install java-1.8.0-openjdk.x86_64
Now, download the pre-built jar file that contains the HL7 generator and execute it to create a single message in /tmp/hl7-messages. I chose to store the jar file in /var/ftp/pub because my IDE uploads files there during code development. If you change this directory, also change it in the NiFi flow.
mkdir -p /var/ftp/pub
cd /var/ftp/pub
wget https://raw.githubusercontent.com/vzlatkin/DoctorsNotes/master/target/hl7-generator-1.0-SNAPSHOT-shaded.jar
mkdir -p /tmp/hl7-messages/
/usr/lib/jvm/jre-1.8.0/bin/java -cp hl7-generator-1.0-SNAPSHOT-shaded.jar com.hortonworks.example.Main 1 /tmp/hl7-messages
chown -R nifi:nifi /tmp/hl7-messages/
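If you want more data for the dashboard, the generator can be run again with a larger count. This assumes, based on the single-message example above, that the first argument is the number of messages to create.

# Generate 100 fake HL7 messages instead of one (assumes the first argument is the message count)
cd /var/ftp/pub
/usr/lib/jvm/jre-1.8.0/bin/java -cp hl7-generator-1.0-SNAPSHOT-shaded.jar com.hortonworks.example.Main 100 /tmp/hl7-messages
chown -R nifi:nifi /tmp/hl7-messages/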
4. Create a Solr dashboard to visualize the results
Now we need to configure Solr to ignore some words that don't add value. We do this by modifying stopwords.txt
cat <<EOF > /opt/hostname-hdpsearch/solr/server/solr/configsets/data_driven_schema_configs/conf/stopwords.txt
adjustments
Admitted
because
blood
changes
complained
Discharged
Discussed
Drew
Evaluated
for
hospital
me
medication
of
patient
Performed
Prescribed
Reason
Recommended
Started
tests
The
to
treatment
visit
Visited
was
EOF
Next, we download the custom dashboard and start Solr in cloud mode
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk.x86_64
wget "https://raw.githubusercontent.com/vzlatkin/DoctorsNotes/master/other/Chronic%20Symptoms%20(Solr).json" -O /opt/hostname-hdpsearch/solr/server/solr-webapp/webapp/banana/app/dashboards/default.json
/opt/hostname-hdpsearch/solr/bin/solr start -c -z localhost:2181
/opt/hostname-hdpsearch/solr/bin/solr create -c hl7_messages -d data_driven_schema_configs -s 1 -rf 1
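Before building the NiFi flow, it is worth confirming that the collection was created. A quick check against Solr's default port on the Sandbox might look like this:

# The hl7_messages collection should appear in the list of SolrCloud collections
curl "http://sandbox.hortonworks.com:8983/solr/admin/collections?action=LIST&wt=json"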
5. Create and execute a new NiFi flow
Start by downloading this NiFi template to your host machine.
To import the template, open the NiFi UI
Next, open Templates manager:
Click "Browse", then find the template on your local machine, click "Import", and close the Template Window.
Drag and drop to instantiate a new template:
Double-click the new process group called HL7, and start all of the processors. To do so, hold down the Shift key and select all of the processors on the screen. Then click the "Start" button:
Here is a quick walk-through of the processors, starting in the top-left corner. First, the ListFile processor gets a directory listing from /tmp/hl7-messages. Second, the FetchFile processor reads each file one by one, passes the contents to the next step, and deletes the file if successful. Third, the text file is parsed as an HL7-formatted message. Next, the UpdateAttribute and AttributesToJSON processors get the contents ready for insertion into Solr. Finally, the PutSolrContentStream processor adds new documents via the Solr REST API. The remaining two processors at the very bottom are for spawning the custom Java code and logging details for troubleshooting.
Conclusion
Now open the Banana UI. You should see a dashboard that looks similar to the screenshot in the beginning of this article. You can see how many messages have been processed by clicking the link in the top-right panel called "Filter By".
Troubleshooting
If you are not seeing any data in Solr/Banana, reload the page. Also perform a search via this page to validate that results are being indexed by Solr correctly.
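You can also query the collection directly to confirm that documents are arriving; again, this assumes Solr's default port on the Sandbox:

# Return the total document count plus one sample document from the collection
curl "http://sandbox.hortonworks.com:8983/solr/hl7_messages/select?q=*:*&rows=1&wt=json"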
Full source code is located in GitHub.
02-14-2016
05:26 PM
11 Kudos
For a UI showing the biggest consumers of space in HDFS, install and configure Twitter's HDFS-DU. For a quick visual representation of HDFS disk usage with no extra tools required, use this script:

#!/usr/bin/env bash
max_depth=5
largest_root_dirs=$(hdfs dfs -du -s '/*' | sort -nr | perl -ane 'print "$F[1] "')
printf "%15s %s\n" "bytes" "directory"
for ld in $largest_root_dirs; do
printf "%15.0f %s\n" $(hdfs dfs -du -s $ld| cut -d' ' -f1) $ld
all_dirs=$(hdfs dfs -ls -R $ld | egrep '^dr........' | perl -ane "scalar(split('/',\$_)) <= $max_depth && print \"\$F[7]\n\"" )
for d in $all_dirs; do
line=$(hdfs dfs -du -s $d)
size=$(echo $line | cut -d' ' -f1)
parent_dir=${d%/*}
child=${d##*/}
if [ -n "$parent_dir" ]; then
leading_dirs=$(echo $parent_dir | perl -pe 's/./-/g; s/^.(.+)$/\|$1/')
d=${leading_dirs}/$child
fi
printf "%15.0f %s\n" $size $d
done
done
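Save the script (the file name below is just an example) and run it as a user that can read all of HDFS:

su - hdfs -c "bash /tmp/hdfs-du.sh"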
Sample output:

          bytes directory
480376973 /hdp
480376973 |---/apps
480376973 |--------/2.3.4.0-3485
98340772 |---------------------/hive
210320342 |---------------------/mapreduce
97380893 |---------------------/pig
15830286 |---------------------/sqoop
58504680 |---------------------/tez
24453973 /user
0 |----/admin
3629715 |----/ambari-qa
3440200 |--------------/.staging
653010 |-----------------------/job_1454293069490_0001
02-10-2016
12:46 AM
1 Kudo
I found the documentation on how to do this without downtime: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#DataNode_Hot_Swap_Drive The only challenge that I encountered was the :port: in the command; it is the dfs.datanode.ipc.address parameter from hdfs-site.xml. My full command looked like this:

su - hdfs -c "hdfs dfsadmin -reconfig datanode sandbox.hortonworks.com:8010 start"
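The progress of the reconfiguration can then be followed with the status subcommand of the same tool:

su - hdfs -c "hdfs dfsadmin -reconfig datanode sandbox.hortonworks.com:8010 status"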
02-09-2016
02:12 AM
@Ancil McBarnett is there any way to do this without downtime? Could you add a disk drive into a hot-swappable bay, add it to the DataNode's list of directories, force a rebalance, and then remove one of the old drives?
02-07-2016
11:08 PM
1 Kudo
I'm looking to generate reports on workflow performance in Oozie: failures, durations, users, etc. What is the best way to do so while Oozie is still using DerbyDB and not MySQL (i.e., prior to moving Oozie to MySQL)?
I stopped the Oozie service and used the sqlline tool that ships with Phoenix:
su - oozie -c "/usr/hdp/current/oozie-server/bin/oozie-stop.sh"
java -cp .:/usr/hdp/2.3.2.0-2950/oozie/libserver/derby-10.10.1.1.jar:/usr/hdp/2.3.2.0-2950/phoenix/bin/../phoenix-4.4.0.2.3.2.0-2950-thin-client.jar sqlline.SqlLine -d org.apache.derby.jdbc.EmbeddedDriver -u jdbc:derby:/hadoop/oozie/data/oozie-db -n none -p none --color=true --fastConnect=false --verbose=true --isolation=TRANSACTION_READ_COMMITTED
0: jdbc:derby:/hadoop/oozie/data/oozie-db> CALL SYSCS_UTIL.SYSCS_EXPORT_TABLE ('OOZIE','WF_ACTIONS','WF_ACTIONS.del',null,null,null);
0: jdbc:derby:/hadoop/oozie/data/oozie-db> !outputformat vertical
0: jdbc:derby:/hadoop/oozie/data/oozie-db> !tables
0: jdbc:derby:/hadoop/oozie/data/oozie-db> SELECT STATUS, WF_ID, TYPE, NAME, EXECUTION_PATH, ERROR_MESSAGE FROM OOZIE.WF_ACTIONS;