About khaslbeck

khaslbeck · ‎03-29-2018

An article on the challenges of predicting machine failures in the field, and the solutions to them. The full details can be found here: https://github.com/kirkhas/zeppelin-notebooks/tree/master/Preventive_maintenance

Step #1 Feature Selection
Step #2 Geolocation
Step #3
- Scythe is a time-series library authored by Kirk Haslbeck for these purposes
- Needed to resample the data into trips or route segments (Scythe Resample)
- Needed to step-interpolate the miles since last service into levels such as 4K and 5K, making it less of a continuous regression
Step #4 - Indexing and OneHotEncoding to the rescue. Found that one particular "Make" was more problematic than most.
ROC Curve - A near-perfect model
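As a rough illustration of the step interpolation and one-hot encoding ideas in Steps #3 and #4, here is a minimal plain-Python sketch. It does not use Scythe or Spark, and the function names, readings and makes are hypothetical values chosen only for the example.

```python
from bisect import bisect_right


def step_interpolate(observations, query_times):
    """Hold the most recent observed value at each query time.

    observations: list of (time, value) pairs sorted by time.
    Returns None for query times before the first observation.
    """
    times = [t for t, _ in observations]
    values = [v for _, v in observations]
    result = []
    for q in query_times:
        i = bisect_right(times, q) - 1  # index of the last observation at or before q
        result.append(values[i] if i >= 0 else None)
    return result


def one_hot(values):
    """Map each categorical value to a 0/1 vector, one slot per category."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    vectors = [[1 if index[v] == j else 0 for j in range(len(categories))]
               for v in values]
    return categories, vectors


# Hypothetical readings: (day, miles since last service)
readings = [(0, 4000), (10, 5000), (20, 0)]
filled = step_interpolate(readings, [0, 5, 10, 15, 20, 25])

# Hypothetical vehicle makes
categories, encoded = one_hot(["A", "B", "A", "C"])
```

Holding the last known value keeps the feature at discrete levels instead of drawing a line between readings, and the one-hot vectors let a model weight each make independently.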

khaslbeck · ‎08-29-2016

Can it run both Spark 1.6.1 and Spark 2.0, or just Spark 2.0?

khaslbeck · ‎08-24-2016

Brandon Wilson has a great article that shows how to use the "CACHE TABLE" command in Tableau. However, more recent drivers have come out, and you can now connect directly to the Thrift server using a spark-sql driver. This example uses HDP 2.5 and SimbaSparkOdbc.

First, pull up a Tableau connection and select the Thrift server. I also had to open the VirtualBox port 10015.

Next, if you don't have the driver, Tableau will take you to a page where you can download a spark-sql driver; inside that package, choose this driver.

Once you establish a valid connection, you will see Tableau flag the connections based on the driver. Below you will see the Hive connection from Brandon's article and now the new Spark connection.

Next, using the CACHE command, enter the below into Tableau's initial SQL box.

Finally, check Spark's storage for the warehouse/crimes table in memory, or any table of your choosing for that matter.

Some visuals from Tableau.

khaslbeck · ‎07-06-2016

Query JSON using Spark

Imagine you are ingesting JSON messages, but each one has different tag names or even a different structure. This is very common because JSON is a flexible nested structure. However, we commonly interact with data in a flat, table-like structure using SQL. The decision becomes whether to parse the dynamic data into a physical schema (on write) or apply a schema at runtime (on read). Ultimately the decision will likely be made based on the number of writes vs. reads.

However, there is one major advantage to using Spark to apply schema on read to JSON events: it removes the parsing step. Typically you have to hand-code all the tags in the JSON messages and map each one to a schema column. This may require meeting with upstream teams or third parties to get the DDL/XSD or schema definition. It also doesn't protect you from messages you haven't seen or new tags being added to existing JSON structures. Spark's schema on read handles all of this and also flattens the structure into a SQL-queryable table.

In the example below there are 3 different JSON messages, each with different tags and structures. If the goal is to normalize the data for a specific reporting or data science task, you may be better off defining a physical schema where items like price and strikePrice are converged to a common column that makes sense in both contexts. However, if your goal is to process or serve messages like a message bus, or if you find it better to query stocks separately from options because the attributes should not be interpreted and you do not want to become the author of the data you are processing, then this could be an ideal approach. (A non-authoritative, low-maintenance approach that is queryable.)

{"tradeId":"123", "assetClass":"stock", "transType":"buy", "price":"22.34", "stockAttributes":{ "5avg":"20.12","52weekHi":"27.56" } }
{"tradeId":"456", "assetClass":"future", "transType":"sell", "strikePrice":"40.00", "contractType": "forward", "account":{ "city":"Columbus","state":"Ohio", "zip":"21000" } }
{"tradeId":"789", "assetClass":"option", "transType":"buy", "strikePrice":"35.75", "account":{ "accountType":"retail","city":"Columbus","state":"Ohio" } }

1.0 The below image shows the 3 different JSON messages (stock, future, option) with different attributes and structures.
2.0 Here you can query all of the data or any segment of the data using SQL.

Full code on zephub - code link

Pros:
- Data tags and structure are always in sync with the provider
- No data loss
- No parsing layer (code effort), faster time to market
- No authoring, naming or defining columns

Cons:
- SQL reads will be slower than on a physically flattened and written table
- Deserialization cost, and it can't benefit from modern-day columnar operations
- Compression - "don't use JSON" video from summit https://www.youtube.com/watch?v=tB28rPTvRiI&feature=youtu.be&t=20m3s
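To make the schema-on-read idea concrete without a Spark cluster, here is a minimal sketch in plain Python. It is not the Spark code from the notebook: it simply derives the column set from the three sample messages at read time, flattens nested tags into dotted column names, and then filters rows. The `flatten` helper and the dotted naming are assumptions made for illustration.

```python
import json

# The three sample messages from the post
RAW_MESSAGES = [
    '{"tradeId":"123", "assetClass":"stock", "transType":"buy", "price":"22.34", '
    '"stockAttributes":{ "5avg":"20.12","52weekHi":"27.56" } }',
    '{"tradeId":"456", "assetClass":"future", "transType":"sell", "strikePrice":"40.00", '
    '"contractType": "forward", "account":{ "city":"Columbus","state":"Ohio", "zip":"21000" } }',
    '{"tradeId":"789", "assetClass":"option", "transType":"buy", "strikePrice":"35.75", '
    '"account":{ "accountType":"retail","city":"Columbus","state":"Ohio" } }',
]


def flatten(record, prefix=""):
    """Flatten nested objects into dotted column names, e.g. account.city."""
    flat = {}
    for key, value in record.items():
        name = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat


rows = [flatten(json.loads(m)) for m in RAW_MESSAGES]

# The schema is the union of every tag seen, discovered at read time
columns = sorted({c for r in rows for c in r})

# Every row gets every column; tags missing from a message become None (SQL NULL)
table = [{c: r.get(c) for c in columns} for r in rows]

# Roughly: SELECT tradeId FROM trades WHERE account.city = 'Columbus'
columbus_trades = [r["tradeId"] for r in table if r["account.city"] == "Columbus"]
```

No tag was named in advance, so a new tag in a later message would simply show up as a new column.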

khaslbeck · ‎06-10-2016

Predict Stock Portfolio Gains Using Monte Carlo

Why? Why create yet another VaR example? To demonstrate VaR running on a modern architecture that has no vertical limit. This is a functional, immutable, scalable interpretation of a basic technique commonly used in finance.

Code available here and on github. https://github.com/kirkhas/zeppelin-notebooks/
Link to Vlad's article for the history of Monte Carlo and VaR - https://community.hortonworks.com/articles/36321/predicting-stock-portfolio-losses-using-monte-carl.html

Some modifications from the original posting include:
- Scala calling the Yahoo API directly, removing the need for shell scripting and adding interoperability between variables.
- All data loaded dynamically in memory, removing the need to store files (which inherently adds manual customizations to a generic process).
- Code all in Zeppelin for readability.
- Visualizations in Zeppelin.
- Inputs built in using Zeppelin forms so the user can interact with the model.
- Percentiles not only on what's at risk each day but also on final portfolio value.

Figure 1.0 shows the risk you would take on each day holding these 3 stocks.
Figure 2.0 shows what a reasonable projected outcome might be after holding this position for 100 days.

Check out the code, it has a lot more visuals. Key takeaway: "You should have purchased shares of HDP in mid Feb 2016!"

Code View https://www.zeppelinhub.com/viewer/notebooks/aHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2tpcmtoYXMvemVwcGVsaW4tbm90ZWJvb2tzL21hc3Rlci9Nb250ZUNhcmxvVmFyL25vdGUuanNvbg
Report View https://www.zeppelinhub.com/viewer/notebooks/aHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2tpcmtoYXMvemVwcGVsaW4tbm90ZWJvb2tzL21hc3Rlci9Nb250ZUNhcmxvVmFyL1JlcG9ydFZpZXcvbm90ZS5qc29u
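For readers who want the gist of the technique before opening the notebook, here is a minimal Monte Carlo sketch in plain Python. It is not the notebook's Scala code: the normally distributed daily returns, the starting value and the return and volatility parameters are all assumptions picked for illustration, and no market data is loaded.

```python
import math
import random


def simulate_final_values(start_value, mean_daily_return, daily_volatility,
                          days, trials, seed=42):
    """Simulate `trials` random price paths and return the sorted final values."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    finals = []
    for _ in range(trials):
        value = start_value
        for _ in range(days):
            # Assumption: daily returns are independent and normally distributed
            value *= 1.0 + rng.gauss(mean_daily_return, daily_volatility)
        finals.append(value)
    return sorted(finals)


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted list."""
    rank = max(1, math.ceil(pct / 100.0 * len(sorted_values)))
    return sorted_values[rank - 1]


# Hypothetical inputs: a 10,000 position held for 100 days
START = 10_000.0
finals = simulate_final_values(START, mean_daily_return=0.0005,
                               daily_volatility=0.02, days=100, trials=2000)

p5 = percentile(finals, 5)
p50 = percentile(finals, 50)
p95 = percentile(finals, 95)

# 95% value at risk over the holding period: the loss at the 5th percentile outcome
var_95 = START - p5
```

The 5th, 50th and 95th percentiles of the final values give the kind of projected-outcome range described for Figure 2.0, while `var_95` is the amount at risk at the 95% level under these assumed parameters.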

khaslbeck · ‎05-11-2016

You can now visualize any Zeppelin notebook using the Zeppelinhub viewer. https://www.zeppelinhub.com/viewer

Personal likes:
1. No need to sign up or register, just paste a link.
2. I've been posting my Zeppelin notebooks to github, but everyone who wants to visualize or interact with them needs to download them, move them to their environment, and import them into their instance of Zeppelin. Not anymore, just paste the link.
3. Less of a need to take screenshots and create a PowerPoint, just send the hyperlink.

Examples:
Stock Variance Notebook github - https://github.com/kirkhas/zeppelin-notebooks/blob/master/stock-variance/note.json
vs
Stock Variance Notebook zephub - https://www.zeppelinhub.com/viewer/notebooks/aHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2tpcmtoYXMvemVwcGVsaW4tbm90ZWJvb2tzL21hc3Rlci9zdG9jay12YXJpYW5jZS9ub3RlLmpzb24

Credit Card Fraud Transactions git - https://github.com/vakshorton/CreditCardTransactionMonitor/blob/master/Zeppelin/notebook/2BGDWYZV9/note.json
vs
https://www.zeppelinhub.com/viewer/notebooks/aHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL3Zha3Nob3J0b24vQ3JlZGl0Q2FyZFRyYW5zYWN0aW9uTW9uaXRvci9tYXN0ZXIvWmVwcGVsaW4vbm90ZWJvb2svMkJHRFdZWlY5L25vdGUuanNvbg
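Judging from the example links in this post, the last segment of a viewer link appears to be the notebook's raw.githubusercontent.com URL, base64-encoded with the trailing "=" padding removed. The sketch below rebuilds a viewer link from a GitHub blob URL on that assumption. It is inferred from the links alone, not from any documented Zeppelinhub API, and the helper names are hypothetical.

```python
import base64

VIEWER_PREFIX = "https://www.zeppelinhub.com/viewer/notebooks/"


def github_blob_to_raw(blob_url):
    """Convert a github.com .../blob/... URL to its raw.githubusercontent.com form."""
    prefix = "https://github.com/"
    if not blob_url.startswith(prefix) or "/blob/" not in blob_url:
        raise ValueError("expected a https://github.com/<user>/<repo>/blob/... URL")
    path = blob_url[len(prefix):].replace("/blob/", "/", 1)
    return "https://raw.githubusercontent.com/" + path


def viewer_link(blob_url):
    """Build a viewer link, assuming the token is unpadded base64 of the raw URL."""
    raw_url = github_blob_to_raw(blob_url)
    token = base64.urlsafe_b64encode(raw_url.encode("utf-8")).decode("ascii")
    return VIEWER_PREFIX + token.rstrip("=")


link = viewer_link(
    "https://github.com/kirkhas/zeppelin-notebooks/blob/master/stock-variance/note.json"
)
```

For the Stock Variance notebook, this reproduces the zephub link listed in the examples, which is the only evidence behind the encoding assumption.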

khaslbeck · ‎05-11-2016

In addition to the HWX install guides online, this is a great best practices article for groups that want to consider some design options prior to install. http://hortonworks.com/blog/best-practices-in-hdfs-authorization-with-apache-ranger/

Last Visited	‎08-02-2018 08:10 PM

Member Since	‎02-23-2016 02:08 AM
Posts	51
Kudos received	90

Cloudera Community

Preventive Maintenance - Machine Cost Avoidance

Re: How to install and run Spark 2.0 on HDP 2.5 Sa...

Tableau on Spark Cache via ThriftServer

JSON to SQL using Spark

Predicting Stock Portfolio Gains using Monte Carlo...

Zeppelinhub Viewer

Best Practices In HDFS Authorization with Apache R...