Member since
09-29-2015
67
Posts
115
Kudos Received
7
Solutions
My Accepted Solutions
Title | Views | Posted |
---|---|---|
1282 | 01-05-2016 05:03 PM | |
1965 | 12-31-2015 07:02 PM | |
1818 | 11-04-2015 03:38 PM | |
2249 | 10-19-2015 01:42 AM | |
1359 | 10-15-2015 02:22 PM |
03-14-2016
05:54 AM
11 Kudos
Summary
Enabling SSL encryption for the Web UIs that make up Hadoop is a tedious process that requires planning, learning to use security tools, and lots of mouse clicks through Ambari's UI. This article aims to simplify the process by presenting a semi-automated, start-to-finish example that enables SSL for the below Web UIs in the Hortonworks Sandbox:
Ambari HBase Oozie Ranger HDFS Planning
There is no substitute for reading the
documentation. If you plan on enabling SSL in a production cluster, then make sure you are familiar with SSL concepts and the communication paths between each HDP component. In addition, plan on cluster downtime. Here are some concepts that you should know well:
Certificate Authority (CA) A Certificate Authority is a company that others trust that signs certificates for a fee. On a Mac you can view a list of CAs that your computer trusts by opening up the "Keychain Access" application and clicking on "System Roots". If you don't want to pay one of these companies to sign your certificates, then you can generate your own CA, just beware the Google Chrome and other browsers will present you with a privacy warning.
Server SSL certificate These are files that prove the identity of a something, in our case: HDP services. Usually there is one certificate per hostname, and it is signed by a CA. There are two pieces of a certificate: the private and public keys. A private key is needed to encrypt a message and a public certificate is needed to decrypt the same message.
Java private keystore When Java HDP services need to encrypt messages, they need a place to look for the private key part of a server's SSL certificate. This keystore holds those private keys. It should be kept secure so that attackers cannot impersonate the service. For this reason, each HDP component in this article has its own private keystore.
Java trust keystore Just like my Mac has a list of CAs that it trusts, a Java process on a Linux machine needs the same. This keystore will usually hold the Public CA's certificate and any intermediary CA certificates. If a certificate was signed with a CA that you created yourself then also add the public part of a server's SSL certificate into this keystore.
Ranger plugins Ranger plugins communicate with Ranger Admin server over SSL. What is important to understand is where each plugin executes and thus where server SSL certificates are needed. For HDFS, the execution is on the NameNodes, for HBase, it is on the RegionServers, for YARN, it is on the ResourceManagers. When you create server SSL certificates use the hostnames where the plugins execute.
Enable SSL on HDP Sandbox
This part is rather easy. Install the HDP 2.4 Sandbox and follow the below steps. If you use an older version of the Sandbox note that you'll need to change the Ambari password used in the script.
Download my script
wget "https://raw.githubusercontent.com/vzlatkin/EnableSSLinHDP/master/enable-ssl.sh"
Stop all services via Ambari (manually stop HDFS or Turn Off Maintenance Mode) Execute:
/bin/bash enable-ssl.sh --all
Start all services via Ambari, which is now running on port 8443 Goto Ranger Admin UI and edit HDFS and HBase services to set the Common Name for Certificate to sandbox.hortonworks.com Enable SSL in production
There are two big reasons why enabling SSL in production can be more difficult than in a sandbox:
If Hadoop components run in Highly Available mode. The solution for most instances is to create a single server SSL certificate and copy it to all HA servers. However, for Oozie you'll need a special server SSL certificate with CN=*.domainname.com If using Public CAs to sign server SSL certificates. Besides adding time to the process that is needed for the CA to sign your certificates you may also need additional steps to add intermediate CA certificates to the various Java trust stores and finding a CA that can sign non-FQDN server SSL certificates for Oozie HA
If you are using Ranger to secure anything besides HBase and HDFS then you will need to make changes to the script to enable extra plugins.
The steps are similar to enabling SSL in Sanbox:
Download my script
wget "https://raw.githubusercontent.com/vzlatkin/EnableSSLinHDP/master/enable-ssl.sh"
Make changes to these variables inside of the script to reflect your cluster layout. The script uses these variables to generate certificates and copy them to all machines where they are needed. Below is an example for my three node cluster.
server1="example1.hortonworks.com"
server2="example2.hortonworks.com"
server3="example3.hortonworks.com"
OOZIE_SERVER_ONE=$server2
NAMENODE_SERVER_ONE=$server1
RESOURCE_MANAGER_SERVER_ONE=$server3
HISTORY_SERVER=$server1
HBASE_MASTER_SERVER_ONE=$server2
RANGER_ADMIN_SERVER=$server1
ALL_NAMENODE_SERVERS="${NAMENODE_SERVER_ONE} $server2"
ALL_OOZIE_SERVERS="${OOZIE_SERVER_ONE} $server3"
ALL_HBASE_MASTER_SERVERS="${HBASE_MASTER_SERVER_ONE} $server3"
ALL_HBASE_REGION_SERVERS="$server1 $server2 $server3"
ALL_REAL_SERVERS="$server1 $server2 $server3"
ALL_HADOOP_SERVERS="$server1 $server2 $server3"
export AMBARI_SERVER=$server1
AMBARI_PASS=xxxx
CLUSTER_NAME=cluster1
If you are going to pay a Public CA to sign your server SSL certificates then copy them to /tmp/security and name them as such:
ca.crt
example1.hortonworks.com.crt
example1.hortonworks.com.key
example2.hortonworks.com.crt
example2.hortonworks.com.key
example3.hortonworks.com.crt
example3.hortonworks.com.key
hortonworks.com.crt
hortonworks.com.key
The last certificate is needed for Oozie if you have Oozie HA enabled. The CN of that certificate should be CN=*.domainname.com as described hereIf you are NOT going to use a Public CA to sign your certificates, then change these lines in the script to be relevant to your organization:
/C=US/ST=New York/L=New York City/O=Hortonworks/OU=Consulting/CN=HortonworksCA
Stop all services via Ambari Execute:
/bin/bash enable-ssl.sh --all
Start all services via Ambari, which is now running on port 8443 Goto Ranger Admin UI and edit HDFS and HBase services to set the Common Name for Certificate to $NAMENODE_SERVER_ONE and $HBASE_MASTER_SERVER_ONE that you specified in the above script
If you chose not to enable SSL for some components or decide to modify the script to include others (please send me a patch) then be aware of these dependencies:
Setting up Ambari trust store is required before enabling SSL encryption for any other component Before you enable HBase SSL encryption, enable Hadoop SSL encryption Validation tips
View and verify SSL certificate being used by a server
openssl s_client -connect ${OOZIE_SERVER_ONE}:11443 -showcerts < /dev/null
View Oozie jobs through command-line
oozie jobs -oozie https://${OOZIE_SERVER_ONE}:11443/oozie
View certificates stored in a Java keystore
keytool -list -storepass password -keystore /etc/hadoop/conf/hadoop-private-keystore.jks
View Ranger policies for HDFS
cat example1.hortonworks.com.key example1.hortonworks.com.crt >> example1.hortonworks.com.pem
curl --cacert /tmp/security/ca.crt --cert /tmp/security/example1.hortonworks.com.pem "https://example1.hortonworks.com:6182/service/plugins/policies/download/cluster1_hadoop?lastKnownVersion=3&pluginId=hdfs@example1.hortonworks.com-cluster1_hadoop"
Validate that Ranger plugins can connect to Ranger admin server by searching for util.PolicyRefresher in HDFS NameNode and HBase RegionServer log files
References
GitHub repo Documentation to enable SSL for Ambari Oozie HDP documentation and Oozie documentation on apache.org Enable SSL encryption for Hadoop components Documentation for Ranger
... View more
03-06-2016
11:35 PM
1 Kudo
Thanks. I'll watch this JIRA for progress: https://issues.apache.org/jira/browse/HIVE-10924
... View more
03-02-2016
03:59 AM
Problems fixed. There is no longer a step to chroot Solr directory in Zookeeper.
... View more
03-01-2016
05:46 PM
@Artem Ervits Thanks for giving this tutorial a try. If you are getting the errors on an HDP Sandbox, would you send me the .vmdk file? I'll take a look and see what needs to change in the tutorial.
... View more
03-01-2016
12:52 AM
Yes, I should have added a link to GitHub: https://github.com/vzlatkin/DoctorsNotes
... View more
03-01-2016
12:22 AM
12 Kudos
Summary
Because patients visit many doctors, trends in their ailments and complaints may be difficult to identify. The steps in this article will help you address exactly this problem by creating a TagCloud of the most frequent complaints per patient. Below is a sample:
We will generate random HL7 MDM^T02 (v2.3) messages that contain a doctor's note about a fake patient and that patient's fake complaint to their doctor. Apache NiFi will be used to parse these messages and send them to Apache Solr. Finally Banana is used to create the visual dashboard.
In the middle of the dashboard is a TagCloud where the more frequently mentioned symptoms for a selected patient appear larger than others. Because this project relies on randomly generated data, some interesting results are possible. In this case, I got lucky and all the symptoms seem related to the patient's most frequent complaint: Morning Alcohol Drinking. The list of all possible symptoms comes from Google searches.
Summary of steps
Download and install the HDP Sandbox
Download and install the latest NiFi release
Download the HL7 message generator
Create a Solr dashboard to visualize the results
Create and execute a new NiFi flow
Detailed Step-by-step guide
1. Download and install the HDP Sandbox
Download the latest (2.3 as of this writing) HDP Sandbox here. Import it into VMware or VirtualBox, start the instance, and update the DNS entry on your host machine to point to the new instance’s IP. On Mac, edit /etc/hosts, on Windows, edit %systemroot%\system32\drivers\etc\ as administrator and add a line similar to the below:
192.168.56.102 sandbox sandbox.hortonworks.com
2. Download and install the latest NiFi release
Follow the directions here. These were the steps that I executed for 0.5.1
wget http://apache.cs.utah.edu/nifi/0.5.1/nifi-0.5.1-bin.zip -O /tmp/nifi-0.5.1-bin.zip
cd /opt/
unzip /tmp/nifi-0.5.1-bin.zip
useradd nifi
chown -R nifi:nifi /opt/nifi-0.5.1/
perl -pe 's/run.as=.*/run.as=nifi/' -i /opt/nifi-0.5.1/conf/bootstrap.conf
perl -pe 's/nifi.web.http.port=8080/nifi.web.http.port=9090/' -i /opt/nifi-0.5.1/conf/nifi.properties
/opt/nifi-0.5.1/bin/nifi.sh start
3. Download the HL7 message generator
A big thank you to HAPI for their excellent library to parse and create HL7 messages on which my code relies. The generator creates a very simple MDM^T02 that includes an in-line note from a doctor. MDM stands for Medical Document Management, and T02 specifies that this is a message for a new document. For more details about this message type read this document. Here is a sample message for Beatrice Cunningham:
MSH|^~\&|||||20160229002413.415-0500||MDM^T02|7|P|2.3
EVN|T02|201602290024
PID|1||599992601||cunningham^beatrice^||19290611|F
PV1|1|O|Burn center^60^71
TXA|1|CN|TX|20150211002413||||||||DOC-ID-10001|||||AU||AV
OBX|1|TX|1001^Reason For Visit: |1|Evaluated patient for skin_scaling. ||||||F
As a pre-requisite to executing the code, we need to install Java 8. Execute this on the Sandbox:
yum -y install java-1.8.0-openjdk.x86_64
Now, download the pre-build jar file that has the HL7 generator and execute it to create a single message in /tmp/hl7-messages. I chose to store the jar file in /var/ftp/pub because my IDE uploads files during code development. If you change this directory, also change it in the NiFi flow.
mkdir -p /var/ftp/pub
cd /var/ftp/pub
wget https://raw.githubusercontent.com/vzlatkin/DoctorsNotes/master/target/hl7-generator-1.0-SNAPSHOT-shaded.jar
mkdir -p /tmp/hl7-messages/
/usr/lib/jvm/jre-1.8.0/bin/java -cp hl7-generator-1.0-SNAPSHOT-shaded.jar com.hortonworks.example.Main 1 /tmp/hl7-messages
chown -R nifi:nifi /tmp/hl7-messages/
4. Create a Solr dashboard to visualize the results
Now we need to configure Solr to ignore some words that don't add value. We do this by modifying stopwords.txt
cat <<EOF > /opt/hostname-hdpsearch/solr/server/solr/configsets/data_driven_schema_configs/conf/stopwords.txt
adjustments
Admitted
because
blood
changes
complained
Discharged
Discussed
Drew
Evaluated
for
hospital
me
medication
of
patient
Performed
Prescribed
Reason
Recommended
Started
tests
The
to
treatment
visit
Visited
was
EOF
Next, we download the custom dashboard and start Solr in cloud mode
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk.x86_64
wget "https://raw.githubusercontent.com/vzlatkin/DoctorsNotes/master/other/Chronic%20Symptoms%20(Solr).json" -O /opt/hostname-hdpsearch/solr/server/solr-webapp/webapp/banana/app/dashboards/default.json
/opt/hostname-hdpsearch/solr/bin/solr start -c -z localhost:2181
/opt/hostname-hdpsearch/solr/bin/solr create -c hl7_messages -d data_driven_schema_configs -s 1 -rf 1
5. Create and execute a new NiFi flow
Start by downloading this NiFi template to your host machine.
To import the template, open the NiFi UI
Next, open Templates manager:
Click "Browse", then find the template on your local machine, click "Import", and close the Template Window.
Drag and drop to instantiate a new template:
Double click the new process group called HL7, and start all of the processes. To do so, hold down the Shift-key, and select all of the processes on the screen. Then click the "Start" button:
Here is a quick walk through of the processes starting in the top-left corner. First, we use ListFile process to get a directory listing from /tmp/hl7-messages. Second, the FetchFile process reads each file one-by-one, passes the contents to the next step, and deletes if successful. Third, the text file is parsed as an HL7 formatted message. Next, the UpdateAttribute and AttributesToJSON processes get the contents ready for insertion into Solr. Finally, we use the PutSolrContentStream process to add new documents via Solr REST API. The remaining two processes on the very bottom are for spawning the custom Java code and logging details for troubleshooting.
Conclusion
Now open the Banana UI. You should see a dashboard that looks similar to the screenshot in the beginning of this article. You can see how many messages have been processed by clicking the link in the top-right panel called "Filter By".
Troubleshooting
If you are not seeing any data in Solr/Banana, then reload the page. Also perform a search via this page to validate that results are being indexed via Solr correctly.
Full source code is located in GitHub.
... View more
Labels:
02-14-2016
05:26 PM
11 Kudos
For a UI showing the biggest consumers of space in HDFS install and configure Twitter's HDFS-DU. For a quick visual representation of HDFS disk usage with no extra tools required, use this script: #!/usr/bin/env bash
max_depth=5
largest_root_dirs=$(hdfs dfs -du -s '/*' | sort -nr | perl -ane 'print "$F[1] "')
printf "%15s %s\n" "bytes" "directory"
for ld in $largest_root_dirs; do
printf "%15.0f %s\n" $(hdfs dfs -du -s $ld| cut -d' ' -f1) $ld
all_dirs=$(hdfs dfs -ls -R $ld | egrep '^dr........' | perl -ane "scalar(split('/',\$_)) <= $max_depth && print \"\$F[7]\n\"" )
for d in $all_dirs; do
line=$(hdfs dfs -du -s $d)
size=$(echo $line | cut -d' ' -f1)
parent_dir=${d%/*}
child=${d##*/}
if [ -n "$parent_dir" ]; then
leading_dirs=$(echo $parent_dir | perl -pe 's/./-/g; s/^.(.+)$/\|$1/')
d=${leading_dirs}/$child
fi
printf "%15.0f %s\n" $size $d
done
done
Sample output: bytes directory
480376973 /hdp
480376973 |---/apps
480376973 |--------/2.3.4.0-3485
98340772 |---------------------/hive
210320342 |---------------------/mapreduce
97380893 |---------------------/pig
15830286 |---------------------/sqoop
58504680 |---------------------/tez
24453973 /user
0 |----/admin
3629715 |----/ambari-qa
3440200 |--------------/.staging
653010 |-----------------------/job_1454293069490_0001
... View more
Labels:
02-10-2016
12:46 AM
1 Kudo
I found the documentation on how to do this without downtime: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#DataNode_Hot_Swap_Drive The only challenge that I encountered was the :port: in the command. It is the dfs.datanode.ipc.address parameter from hdfs-site.xml. My full command looked like this su - hdfs -c "hdfs dfsadmin -reconfig datanode sandbox.hortonworks.com:8010 start"
... View more
02-09-2016
02:12 AM
@Ancil McBarnett is there anyway to do this without downtime? Could you add a disk drive into a hot-swappable bay, add it to DataNode's list of directories, force a rebalance, and remove one of the old drives?
... View more
02-07-2016
11:08 PM
1 Kudo
I'm looking to generate reports on workflow performance in Oozie: failures, durations, users, etc... What is the best way to do so if Oozie is currently using DerbyDB and not MySQL? (Prior to moving Oozie to MySQL.)
I stopped the Oozie service and used the Phoenix sqlline tool:
su - oozie -c "/usr/hdp/current/oozie-server/bin/oozie-stop.sh"
java -cp .:/usr/hdp/2.3.2.0-2950/oozie/libserver/derby-10.10.1.1.jar:/usr/hdp/2.3.2.0-2950/phoenix/bin/../phoenix-4.4.0.2.3.2.0-2950-thin-client.jar sqlline.SqlLine -d org.apache.derby.jdbc.EmbeddedDriver -u jdbc:derby:/hadoop/oozie/data/oozie-db -n none -p none --color=true --fastConnect=false --verbose=true --isolation=TRANSACTION_READ_COMMITTED
0: jdbc:derby:/hadoop/oozie/data/oozie-db> CALL SYSCS_UTIL.SYSCS_EXPORT_TABLE ('OOZIE','WF_ACTIONS','WF_ACTIONS.del',null,null,null);
0: jdbc:derby:/hadoop/oozie/data/oozie-db> !outputformat vertical
0: jdbc:derby:/hadoop/oozie/data/oozie-db> !tables
0: jdbc:derby:/hadoop/oozie/data/oozie-db> SELECT STATUS, WF_ID, TYPE, NAME, EXECUTION_PATH, ERROR_MESSAGE FROM OOZIE.WF_ACTIONS;
... View more
Labels: