About vzlatkin

vzlatkin · ‎05-30-2016

Predicting stock portfolio losses using Monte Carlo simulation in Spark Summary Have you ever asked yourself: what is the most money my stock holdings could lose in a single day? If you own stock through a 401k, a personal trading account, or employer provided stock options then you should absolutely ask yourself this question. Now think about how to answer it. Your first guess maybe to pick a random number, say 20%, and assume that is the worst case scenario. While simple, this is likely to be wildly inaccurate and certainly doesn’t take into account the positive impacts of a diversified portfolio. Surprisingly, a good estimate is hard to calculate. Luckily, financial institutions have to do this for their stock portfolios (called Value at Risk (VaR)), and we can apply their methods to individual portfolios. In this article we will run a Monte Carlo simulation using real trading data to try to quantify what can happen to your portfolio. You should now go to your broker website (Fidelity, E*Trade, etc...) and get a list of stocks that you own and the % that each holding represents of the total portfolio. How it works The Monte Carlo method is one that uses repeated sampling to predict a result. As a real-world example, think about how you might predict where your friend is aiming while throwing a dart at a dart board. If you were following the Monte Carlo method, you'd ask your friend to throw a 100 darts with the same aim, and then you'd make a prediction based on the largest cluster of darts. To predict stock returns we are going to pick 1,000,000 previous trading dates at random and see what happened to on those dates. The end result is going to be some aggregation of those results. We will download historical stock trading data from Yahoo Finance and store them into HDFS. Then we will create a table in Spark like the below and pick a million random dates from it. GS AAPL GE OIL 2015-01-05 -3.12% -2.81% -1.83% -6.06% 2015-01-06 -2.02% -0.01% -2.16% -4.27% 2015-01-07 +1.48% +1.40% +0.04% +1.91% 2015-01-08 +1.59% +3.83% +1.21% +1.07% Table 1: percent change per day by stock symbol We combine the column values with the same proportions as your trading account. For example, if on Jan 5th 2015 you equaliy invested all of your money in GS, AAPL, GE, and OIL then you would have lost % loss on 2015-01-05 = -3.12*(1/4) - 2.81*(1/4) - 1.83*(1/4) - 6.06*(1/4) At the end of a Monte Carlo simulation we have 1,000,000 values that represent the possible gains and losses. We sort the results and take the 5th percentile, 50th percentile, and 95th percentile to represent the worst-case, average case, and best case scenarios. When you run the below, you'll see this in the output In a single day, this is what could happen to your stock holdings if you have $1000 invested $ % worst case -33 -3.33% most likely scenario -1 -0.14% best case 23 2.28% The code on GitHub also has examples of: How to use Java 8 Lambda Expressions Executing Hive SQL with Spark RDD objects Unit testing Spark code with hadoop-mini-clusters Detailed Step-by-step guide 1. Download and install the HDP Sandbox Download the latest (2.4 as of this writing) HDP Sandbox here. Import it into VMware or VirtualBox, start the instance, and update the DNS entry on your host machine to point to the new instance’s IP. On Mac, edit /etc/hosts, on Windows, edit %systemroot%\system32\drivers\etc\ as administrator and add a line similar to the below: 192.168.56.102 sandbox sandbox.hortonworks.com 2. Download code and prerequisites Log into the Sandbox and execute: useradd guest su - hdfs -c "hdfs dfs -mkdir /user/guest; hdfs dfs -chown guest:hdfs /user/guest; " yum install -y java-1.8.0-openjdk-devel.x86_64 #update-alternatives --install /usr/lib/jvm/java java_sdk /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.91-0.b14.el6_7.x86_64 100 cd /tmp git clone https://github.com/vzlatkin/MonteCarloVarUsingRealData.git 3. Update list of stocks that you own Update companies_list.txt with the list of companies that you own in your stock portfolio and either the portfolio weight (as %/100) or the dollar amount. You should be able to get this information from your broker's website (Fidelity, Scottrade, etc...). Take out any extra commas (,) if you are copying and pasting from the web. The provided sample looks like this: Symbol,Weight or dollar amount (must include $) GE,$250 AAPL,$250 GS,$250 OIL,$250 4. Download historical trading data for the stocks you own Execute: cd /tmp/MonteCarloVarUsingRealData/ /bin/bash downloadHistoricalData.sh # Downloading historical data for GE # Downloading historical data for AAPL # Downloading historical data for GS # Downloading historical data for OIL # Saved to /tmp/stockData/ 5. Run the MonteCarlo simulation Execute: su - guest -c " /usr/hdp/current/spark-client/bin/spark-submit --class com.hortonworks.example.Main --master yarn-client --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 --queue default /tmp/MonteCarloVarUsingRealData/target/monte-carlo-var-1.0-SNAPSHOT.jar hdfs:///tmp/stockData/companies_list.txt hdfs:///tmp/stockData/*.csv" Interpreting the Results Below is the result of a sample portfolio that has $1,000 invested equally between Apple, GE, Goldman Sachs, and an ETF that holds crude oil. It says that with 95% certainty, the most that the portfolio can go down in a single day is $33. In addition, there is a 5% chance that the portfolio will gain $23 in a single day. Most of the time, the portfolio will lose $1 per day. In a single day, this is what could happen to your stock holdings if you have $1000 invested $ % worst case -33 -3.33% most likely scenario -1 -0.14% best case 23 2.28%

vzlatkin · ‎05-27-2016

@Constantin Stanca Would you share some of the specific optimizations mentioned in your article? "performance could be improved by ... using the operating system side optimization to take advantage of the most recent hardware NUMA capable."

vzlatkin · ‎04-16-2016

Source article no longer exists. I used this: http://www.r-bloggers.com/interactive-data-science-with-r-in-apache-zeppelin-notebook/

vzlatkin · ‎03-14-2016

Summary Enabling SSL encryption for the Web UIs that make up Hadoop is a tedious process that requires planning, learning to use security tools, and lots of mouse clicks through Ambari's UI. This article aims to simplify the process by presenting a semi-automated, start-to-finish example that enables SSL for the below Web UIs in the Hortonworks Sandbox: Ambari HBase Oozie Ranger HDFS Planning There is no substitute for reading the documentation. If you plan on enabling SSL in a production cluster, then make sure you are familiar with SSL concepts and the communication paths between each HDP component. In addition, plan on cluster downtime. Here are some concepts that you should know well: Certificate Authority (CA) A Certificate Authority is a company that others trust that signs certificates for a fee. On a Mac you can view a list of CAs that your computer trusts by opening up the "Keychain Access" application and clicking on "System Roots". If you don't want to pay one of these companies to sign your certificates, then you can generate your own CA, just beware the Google Chrome and other browsers will present you with a privacy warning. Server SSL certificate These are files that prove the identity of a something, in our case: HDP services. Usually there is one certificate per hostname, and it is signed by a CA. There are two pieces of a certificate: the private and public keys. A private key is needed to encrypt a message and a public certificate is needed to decrypt the same message. Java private keystore When Java HDP services need to encrypt messages, they need a place to look for the private key part of a server's SSL certificate. This keystore holds those private keys. It should be kept secure so that attackers cannot impersonate the service. For this reason, each HDP component in this article has its own private keystore. Java trust keystore Just like my Mac has a list of CAs that it trusts, a Java process on a Linux machine needs the same. This keystore will usually hold the Public CA's certificate and any intermediary CA certificates. If a certificate was signed with a CA that you created yourself then also add the public part of a server's SSL certificate into this keystore. Ranger plugins Ranger plugins communicate with Ranger Admin server over SSL. What is important to understand is where each plugin executes and thus where server SSL certificates are needed. For HDFS, the execution is on the NameNodes, for HBase, it is on the RegionServers, for YARN, it is on the ResourceManagers. When you create server SSL certificates use the hostnames where the plugins execute. Enable SSL on HDP Sandbox This part is rather easy. Install the HDP 2.4 Sandbox and follow the below steps. If you use an older version of the Sandbox note that you'll need to change the Ambari password used in the script. Download my script wget "https://raw.githubusercontent.com/vzlatkin/EnableSSLinHDP/master/enable-ssl.sh" Stop all services via Ambari (manually stop HDFS or Turn Off Maintenance Mode) Execute: /bin/bash enable-ssl.sh --all Start all services via Ambari, which is now running on port 8443 Goto Ranger Admin UI and edit HDFS and HBase services to set the Common Name for Certificate to sandbox.hortonworks.com Enable SSL in production There are two big reasons why enabling SSL in production can be more difficult than in a sandbox: If Hadoop components run in Highly Available mode. The solution for most instances is to create a single server SSL certificate and copy it to all HA servers. However, for Oozie you'll need a special server SSL certificate with CN=*.domainname.com If using Public CAs to sign server SSL certificates. Besides adding time to the process that is needed for the CA to sign your certificates you may also need additional steps to add intermediate CA certificates to the various Java trust stores and finding a CA that can sign non-FQDN server SSL certificates for Oozie HA If you are using Ranger to secure anything besides HBase and HDFS then you will need to make changes to the script to enable extra plugins. The steps are similar to enabling SSL in Sanbox: Download my script wget "https://raw.githubusercontent.com/vzlatkin/EnableSSLinHDP/master/enable-ssl.sh" Make changes to these variables inside of the script to reflect your cluster layout. The script uses these variables to generate certificates and copy them to all machines where they are needed. Below is an example for my three node cluster. server1="example1.hortonworks.com" server2="example2.hortonworks.com" server3="example3.hortonworks.com" OOZIE_SERVER_ONE=$server2 NAMENODE_SERVER_ONE=$server1 RESOURCE_MANAGER_SERVER_ONE=$server3 HISTORY_SERVER=$server1 HBASE_MASTER_SERVER_ONE=$server2 RANGER_ADMIN_SERVER=$server1 ALL_NAMENODE_SERVERS="${NAMENODE_SERVER_ONE} $server2" ALL_OOZIE_SERVERS="${OOZIE_SERVER_ONE} $server3" ALL_HBASE_MASTER_SERVERS="${HBASE_MASTER_SERVER_ONE} $server3" ALL_HBASE_REGION_SERVERS="$server1 $server2 $server3" ALL_REAL_SERVERS="$server1 $server2 $server3" ALL_HADOOP_SERVERS="$server1 $server2 $server3" export AMBARI_SERVER=$server1 AMBARI_PASS=xxxx CLUSTER_NAME=cluster1 If you are going to pay a Public CA to sign your server SSL certificates then copy them to /tmp/security and name them as such: ca.crt example1.hortonworks.com.crt example1.hortonworks.com.key example2.hortonworks.com.crt example2.hortonworks.com.key example3.hortonworks.com.crt example3.hortonworks.com.key hortonworks.com.crt hortonworks.com.key The last certificate is needed for Oozie if you have Oozie HA enabled. The CN of that certificate should be CN=*.domainname.com as described hereIf you are NOT going to use a Public CA to sign your certificates, then change these lines in the script to be relevant to your organization: /C=US/ST=New York/L=New York City/O=Hortonworks/OU=Consulting/CN=HortonworksCA Stop all services via Ambari Execute: /bin/bash enable-ssl.sh --all Start all services via Ambari, which is now running on port 8443 Goto Ranger Admin UI and edit HDFS and HBase services to set the Common Name for Certificate to $NAMENODE_SERVER_ONE and $HBASE_MASTER_SERVER_ONE that you specified in the above script If you chose not to enable SSL for some components or decide to modify the script to include others (please send me a patch) then be aware of these dependencies: Setting up Ambari trust store is required before enabling SSL encryption for any other component Before you enable HBase SSL encryption, enable Hadoop SSL encryption Validation tips View and verify SSL certificate being used by a server openssl s_client -connect ${OOZIE_SERVER_ONE}:11443 -showcerts < /dev/null View Oozie jobs through command-line oozie jobs -oozie https://${OOZIE_SERVER_ONE}:11443/oozie View certificates stored in a Java keystore keytool -list -storepass password -keystore /etc/hadoop/conf/hadoop-private-keystore.jks View Ranger policies for HDFS cat example1.hortonworks.com.key example1.hortonworks.com.crt >> example1.hortonworks.com.pem curl --cacert /tmp/security/ca.crt --cert /tmp/security/example1.hortonworks.com.pem "https://example1.hortonworks.com:6182/service/plugins/policies/download/cluster1_hadoop?lastKnownVersion=3&[email protected]_hadoop" Validate that Ranger plugins can connect to Ranger admin server by searching for util.PolicyRefresher in HDFS NameNode and HBase RegionServer log files References GitHub repo Documentation to enable SSL for Ambari Oozie HDP documentation and Oozie documentation on apache.org Enable SSL encryption for Hadoop components Documentation for Ranger

vzlatkin · ‎03-02-2016

Problems fixed. There is no longer a step to chroot Solr directory in Zookeeper.

vzlatkin · ‎03-01-2016

@Artem Ervits Thanks for giving this tutorial a try. If you are getting the errors on an HDP Sandbox, would you send me the .vmdk file? I'll take a look and see what needs to change in the tutorial.

vzlatkin · ‎03-01-2016

Yes, I should have added a link to GitHub: https://github.com/vzlatkin/DoctorsNotes

vzlatkin · ‎03-01-2016

Summary Because patients visit many doctors, trends in their ailments and complaints may be difficult to identify. The steps in this article will help you address exactly this problem by creating a TagCloud of the most frequent complaints per patient. Below is a sample: We will generate random HL7 MDM^T02 (v2.3) messages that contain a doctor's note about a fake patient and that patient's fake complaint to their doctor. Apache NiFi will be used to parse these messages and send them to Apache Solr. Finally Banana is used to create the visual dashboard. In the middle of the dashboard is a TagCloud where the more frequently mentioned symptoms for a selected patient appear larger than others. Because this project relies on randomly generated data, some interesting results are possible. In this case, I got lucky and all the symptoms seem related to the patient's most frequent complaint: Morning Alcohol Drinking. The list of all possible symptoms comes from Google searches. Summary of steps Download and install the HDP Sandbox Download and install the latest NiFi release Download the HL7 message generator Create a Solr dashboard to visualize the results Create and execute a new NiFi flow Detailed Step-by-step guide 1. Download and install the HDP Sandbox Download the latest (2.3 as of this writing) HDP Sandbox here. Import it into VMware or VirtualBox, start the instance, and update the DNS entry on your host machine to point to the new instance’s IP. On Mac, edit /etc/hosts, on Windows, edit %systemroot%\system32\drivers\etc\ as administrator and add a line similar to the below: 192.168.56.102 sandbox sandbox.hortonworks.com 2. Download and install the latest NiFi release Follow the directions here. These were the steps that I executed for 0.5.1 wget http://apache.cs.utah.edu/nifi/0.5.1/nifi-0.5.1-bin.zip -O /tmp/nifi-0.5.1-bin.zip cd /opt/ unzip /tmp/nifi-0.5.1-bin.zip useradd nifi chown -R nifi:nifi /opt/nifi-0.5.1/ perl -pe 's/run.as=.*/run.as=nifi/' -i /opt/nifi-0.5.1/conf/bootstrap.conf perl -pe 's/nifi.web.http.port=8080/nifi.web.http.port=9090/' -i /opt/nifi-0.5.1/conf/nifi.properties /opt/nifi-0.5.1/bin/nifi.sh start 3. Download the HL7 message generator A big thank you to HAPI for their excellent library to parse and create HL7 messages on which my code relies. The generator creates a very simple MDM^T02 that includes an in-line note from a doctor. MDM stands for Medical Document Management, and T02 specifies that this is a message for a new document. For more details about this message type read this document. Here is a sample message for Beatrice Cunningham: MSH|^~\&|||||20160229002413.415-0500||MDM^T02|7|P|2.3 EVN|T02|201602290024 PID|1||599992601||cunningham^beatrice^||19290611|F PV1|1|O|Burn center^60^71 TXA|1|CN|TX|20150211002413||||||||DOC-ID-10001|||||AU||AV OBX|1|TX|1001^Reason For Visit: |1|Evaluated patient for skin_scaling. ||||||F As a pre-requisite to executing the code, we need to install Java 8. Execute this on the Sandbox: yum -y install java-1.8.0-openjdk.x86_64 Now, download the pre-build jar file that has the HL7 generator and execute it to create a single message in /tmp/hl7-messages. I chose to store the jar file in /var/ftp/pub because my IDE uploads files during code development. If you change this directory, also change it in the NiFi flow. mkdir -p /var/ftp/pub cd /var/ftp/pub wget https://raw.githubusercontent.com/vzlatkin/DoctorsNotes/master/target/hl7-generator-1.0-SNAPSHOT-shaded.jar mkdir -p /tmp/hl7-messages/ /usr/lib/jvm/jre-1.8.0/bin/java -cp hl7-generator-1.0-SNAPSHOT-shaded.jar com.hortonworks.example.Main 1 /tmp/hl7-messages chown -R nifi:nifi /tmp/hl7-messages/ 4. Create a Solr dashboard to visualize the results Now we need to configure Solr to ignore some words that don't add value. We do this by modifying stopwords.txt cat <<EOF > /opt/hostname-hdpsearch/solr/server/solr/configsets/data_driven_schema_configs/conf/stopwords.txt adjustments Admitted because blood changes complained Discharged Discussed Drew Evaluated for hospital me medication of patient Performed Prescribed Reason Recommended Started tests The to treatment visit Visited was EOF Next, we download the custom dashboard and start Solr in cloud mode export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk.x86_64 wget "https://raw.githubusercontent.com/vzlatkin/DoctorsNotes/master/other/Chronic%20Symptoms%20(Solr).json" -O /opt/hostname-hdpsearch/solr/server/solr-webapp/webapp/banana/app/dashboards/default.json /opt/hostname-hdpsearch/solr/bin/solr start -c -z localhost:2181 /opt/hostname-hdpsearch/solr/bin/solr create -c hl7_messages -d data_driven_schema_configs -s 1 -rf 1 5. Create and execute a new NiFi flow Start by downloading this NiFi template to your host machine. To import the template, open the NiFi UI Next, open Templates manager: Click "Browse", then find the template on your local machine, click "Import", and close the Template Window. Drag and drop to instantiate a new template: Double click the new process group called HL7, and start all of the processes. To do so, hold down the Shift-key, and select all of the processes on the screen. Then click the "Start" button: Here is a quick walk through of the processes starting in the top-left corner. First, we use ListFile process to get a directory listing from /tmp/hl7-messages. Second, the FetchFile process reads each file one-by-one, passes the contents to the next step, and deletes if successful. Third, the text file is parsed as an HL7 formatted message. Next, the UpdateAttribute and AttributesToJSON processes get the contents ready for insertion into Solr. Finally, we use the PutSolrContentStream process to add new documents via Solr REST API. The remaining two processes on the very bottom are for spawning the custom Java code and logging details for troubleshooting. Conclusion Now open the Banana UI. You should see a dashboard that looks similar to the screenshot in the beginning of this article. You can see how many messages have been processed by clicking the link in the top-right panel called "Filter By". Troubleshooting If you are not seeing any data in Solr/Banana, then reload the page. Also perform a search via this page to validate that results are being indexed via Solr correctly. Full source code is located in GitHub.

vzlatkin · ‎02-14-2016

For a UI showing the biggest consumers of space in HDFS install and configure Twitter's HDFS-DU. For a quick visual representation of HDFS disk usage with no extra tools required, use this script: #!/usr/bin/env bash max_depth=5 largest_root_dirs=$(hdfs dfs -du -s '/*' | sort -nr | perl -ane 'print "$F[1] "') printf "%15s %s\n" "bytes" "directory" for ld in $largest_root_dirs; do printf "%15.0f %s\n" $(hdfs dfs -du -s $ld| cut -d' ' -f1) $ld all_dirs=$(hdfs dfs -ls -R $ld | egrep '^dr........' | perl -ane "scalar(split('/',\$_)) <= $max_depth && print \"\$F[7]\n\"" ) for d in $all_dirs; do line=$(hdfs dfs -du -s $d) size=$(echo $line | cut -d' ' -f1) parent_dir=${d%/*} child=${d##*/} if [ -n "$parent_dir" ]; then leading_dirs=$(echo $parent_dir | perl -pe 's/./-/g; s/^.(.+)$/\|$1/') d=${leading_dirs}/$child fi printf "%15.0f %s\n" $size $d done done Sample output: bytes directory 480376973 /hdp 480376973 |---/apps 480376973 |--------/2.3.4.0-3485 98340772 |---------------------/hive 210320342 |---------------------/mapreduce 97380893 |---------------------/pig 15830286 |---------------------/sqoop 58504680 |---------------------/tez 24453973 /user 0 |----/admin 3629715 |----/ambari-qa 3440200 |--------------/.staging 653010 |-----------------------/job_1454293069490_0001

vzlatkin · ‎02-10-2016

I found the documentation on how to do this without downtime: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#DataNode_Hot_Swap_Drive The only challenge that I encountered was the :port: in the command. It is the dfs.datanode.ipc.address parameter from hdfs-site.xml. My full command looked like this su - hdfs -c "hdfs dfsadmin -reconfig datanode sandbox.hortonworks.com:8010 start"

Online	Offline
Last Visited	‎04-23-2020 07:35 PM

Member Since	‎09-29-2015 09:15 PM
Last Visited	‎04-23-2020 07:35 PM
Posts	67
Kudos received	113

Cloudera Community

Predicting stock portfolio losses using Monte Carl...

Re: Apache Spark Performance Improvement on NUMA C...

Re: Apache Zeppelin and SparkR

Quickly enable SSL encryption for Hadoop component...

Re: Visualize patients' complaints to their doctor...

Re: Visualize patients' complaints to their doctor...

Re: Visualize patients' complaints to their doctor...

Visualize patients' complaints to their doctors us...

How to identify what is consuming space in HDFS

Re: How to Move or Change HDFS DataNode Directorie...