Member since: 09-29-2015
Posts: 32
Kudos Received: 55
Solutions: 2
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 4666 | 11-26-2015 10:19 PM |
 | 3726 | 11-05-2015 03:22 AM |
12-08-2015 05:26 AM · 3 Kudos
From what we have witnessed in the field and during customer testing, SparkSQL (1.4.x at the time of testing) was generally 50% to 200% faster when querying small datasets, where by small we mean anything under 100 GB. That is usually great for data discovery, data wrangling, testing things out, or even running production use cases where the datasets are numerous but relatively small.

The bigger the table, especially when joins are not used effectively or when you are scanning a single big table, the more Tez was able to shine. If you are in the BI space, where SLAs are required and you can't afford a query to break and start over, Tez is rigid and stable, and the bigger the dataset, the better its performance gets compared to Spark. At around 250 GB you will see very similar execution times, though of course this depends on how big the cluster is, how much memory is allocated, and so on.

In general, my personal opinion is that we shouldn't compare the two at this time, as each shines in a separate context: Tez might be needed at some stage, while Spark may be the better fit for smaller datasets. As mentioned, this was based on Spark 1.4.x; I would love to re-run the tests, especially after the new cube functionality in Spark 1.5. Hope this was helpful.
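If you want to reproduce this kind of comparison yourself, a minimal sketch is to time the same query under each engine. Here my_table is a hypothetical table name, and this assumes both Hive on Tez and SparkSQL are installed on the cluster:

# Hive on Tez: force the Tez execution engine for this session
time hive -e "SET hive.execution.engine=tez; SELECT COUNT(*) FROM my_table;"
# SparkSQL: the same query through the Spark SQL CLI against the Hive metastore
time spark-sql -e "SELECT COUNT(*) FROM my_table;"

Same query, same data, two engines; the wall-clock difference gives you a first rough feel for the crossover point described above.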
11-30-2015 09:29 PM · 3 Kudos
One of the first cases we see with HBase is loading it up with data. Most of the time we have data available in some format like CSV, and we would like to load it into HBase. Let's take a quick look at what the procedure looks like.

Let's examine our example data by looking at the simple structure I have for an industrial sensor:

id, temp:in, temp:out, vibration, pressure:in, pressure:out
5842, 50, 30, 4, 240, 340
First of all, make sure HBase is started on your Sandbox.

Creating the HBase Table
Log in as root to the HDP Sandbox and switch to the hbase user: root> su - hbase

Enter the HBase shell by typing: hbase> hbase shell

Create the example table by typing: hbase(main):001:0> create 'sensor','temp','vibration','pressure'

Let's make sure the table was created by typing: hbase(main):001:0> list
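Since list only shows the table name, you can also examine the actual structure (the column families) with the standard describe command:

hbase(main):001:0> describe 'sensor'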
Now exit the shell by typing 'exit', and let's load some data.

Loading the Data
Let's put the hbase.csv file in HDFS. You may SCP it to the cluster first using the following command: macbook-ned> scp hbase.csv root@sandbox.hortonworks.com:/home/hbase

Now put it in HDFS using the following command: hbase> hdfs dfs -copyFromLocal hbase.csv /tmp
We shall now execute the ImportTsv statement as follows. Note that the first CSV field (id) becomes the row key, so it maps to HBASE_ROW_KEY rather than to a column of its own: hbase> hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=, -Dimporttsv.columns="HBASE_ROW_KEY,temp:in,temp:out,vibration,pressure:in,pressure:out" sensor hdfs://sandbox.hortonworks.com:8020/tmp/hbase.csv
Once the MapReduce job is completed, return to the HBase shell and execute: hbase(main):001:0> scan 'sensor'

You should now see the data in the table.

Remarks
The ImportTsv statement generates a massive amount of logs, so make sure you have enough space in /var/log. On a real cluster it is always better to have the logs mounted on a separate partition, to avoid an operational stop because logs filled up the disk.
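For larger files, ImportTsv also has a two-step bulk-load mode that writes HFiles directly and then moves them into the table, instead of pushing every row through the region servers. A minimal sketch using the same table and CSV as above (the /tmp/sensor_hfiles output directory is an arbitrary choice and must not exist beforehand):

hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=, -Dimporttsv.columns="HBASE_ROW_KEY,temp:in,temp:out,vibration,pressure:in,pressure:out" -Dimporttsv.bulk.output=/tmp/sensor_hfiles sensor /tmp/hbase.csv
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /tmp/sensor_hfiles sensor

The first command generates HFiles under /tmp/sensor_hfiles; the second moves them into the sensor table in one shot.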
11-26-2015 10:19 PM · 2 Kudos
OK, here is the latest: the R interpreter for Zeppelin has not been merged into the latest Zeppelin distribution yet; however, you can use it now from here: https://github.com/apache/incubator-zeppelin/pull/208. All the best 🙂
11-20-2015 05:15 AM
This should now be solved: starting with Zeppelin 0.5.5, you don't need to rebuild for different Spark/Hadoop versions... enjoy 🙂
11-11-2015 10:33 PM
@azeltov@hortonworks.com you can, as long as you modify the ZEPPELINHUB_API_TOKEN and you have a direct internet connection from the Sandbox.
11-08-2015 11:06 PM · 5 Kudos
Introduction

Hive is one of the most commonly used databases on Hadoop; its user base keeps doubling every year thanks to amazing enhancements and the addition of Tez and Spark, which let Hive bypass the MapReduce era in favour of in-memory execution and changed how people use Hive. In this blog post I will show you how to connect the SQuirreL SQL Client to Hive. The concept is similar for any other client out there: as long as you use open-source libraries matching the ones listed here, you should be fine.

Prerequisites

Download the Hortonworks Sandbox with HDP 2.2.4 and the SQuirreL SQL Client.

Step 1

Follow the SQuirreL documentation and run it on your Mac or PC.

Step 2

Follow the Hortonworks HDP installation guide for VirtualBox, VMware or Hyper-V and start up the virtual instance.

Step 3

Once HDP is up and running, connect to it using SSH as shown on the console. Once connected, you need to download some JAR files in order to establish the connection.

Step 4

If you are using macOS, while connected to your HDP instance, search for each of the JARs below using the command: root> find / -name JAR_FILE

Once you find the needed file, copy it to your laptop/PC using SCP: root> scp JAR_FILE yourMacUser@yourIPAddress:/PATH_TO_JARS

The files you should look for are the following (versions will differ based on which Sandbox you are running, but different versions are unlikely to cause a problem):
commons-logging-1.1.3.jar
hive-exec-0.14.0.2.2.4.2-2.jar
hive-jdbc-0.14.0.2.2.4.2-2.jar
hive-service-0.14.0.2.2.4.2-2.jar
httpclient-4.2.5.jar
httpcore-4.2.5.jar
libthrift-0.9.0.jar
slf4j-api-1.7.5.jar
slf4j-log4j12-1.7.5.jar
hadoop-common-2.6.0.2.2.4.2-2.jar

If you are running Windows, you may need to install WinSCP to grab the files from their locations.

Step 5

Once all the JARs above are downloaded to your local machine, open SQuirreL, go to Drivers and add a new driver:

Name: Hive Driver (could be anything else you want)
Example URL: jdbc:hive2://localhost:10000/default
Class Name: org.apache.hive.jdbc.HiveDriver
Go to Extra Class Paths and add all the JARs you downloaded. You may change the port number or IP address if you are not running with the defaults.

Step 6

Log in to your Hadoop Sandbox and verify that HiveServer2 is running using: netstat -anp | grep 10000
If nothing is running, you can start HiveServer2 manually: hive> hiveserver2

Step 7

Once you verify HiveServer2 is up and running, you are ready to test the connection in SQuirreL by creating a new Alias as follows. You are now ready to connect; once the connection is successful, you should get a screen like this.

Step 8 (Optional)

With your first Hive query, SQuirreL can be buggy and complain about memory and heap size. If this ever occurs and you are on a Mac, right-click on the app icon --> Show Package Contents --> open Info.plist and add the following snippet:

<key>Java</key>
<dict>
<key>VMOptions</key>
<array>
<string>-Xms128m</string>
<string>-Xmx512m</string>
</array>
</dict>

Now you can enjoy...
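As a quick sanity check outside SQuirreL, you can also verify the JDBC path end-to-end with Beeline from inside the Sandbox. A minimal sketch, assuming the default HiveServer2 port and the hive user (adjust credentials to your setup):

beeline -u "jdbc:hive2://localhost:10000/default" -n hive -e "SHOW TABLES;"

If this prints the table list, the server side is fine and any remaining issue is in the SQuirreL driver setup.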
11-08-2015 10:56 PM · 12 Kudos
Introduction

Apache Zeppelin (incubating at the time of writing this post) is one of my favourite tools to position and present to anyone interested in analytics. It is 100% open source, with an intelligent international team behind it in Korea (NFLabs, moving to San Francisco soon). It is mainly based on an interpreter concept that allows any language or data-processing backend to be plugged into Apache Zeppelin. It is very similar to IPython/Jupyter, except that the UI is arguably more appealing and the set of supported interpreters is richer. At the time of writing this blog, Zeppelin supported:
Apache Hive QL, Apache Spark (SQL, Scala and Python), Apache Flink, Postgres, Pivotal HAWQ, Shell, Apache Tajo, AngularJS, Apache Cassandra, Apache Ignite, Apache Phoenix, Apache Geode, Apache Kylin and Apache Lens.

With this rich set of interpreters, onboarding platforms like Apache Hadoop, or data lake concepts where data sits consolidated in one place, becomes much easier: different organizational units with different skill sets can access the data and perform their day-to-day duties on it, such as data discovery, queries, data modelling, data streaming and finally data science using Apache Spark.

Apache Zeppelin Overview

With the notebook-style editor and the ability to save notebooks on the fly, you can end up with some really cool notebooks, whether you are a data engineer, a data scientist or a BI specialist.

Dataset showing the health expenditure of the Australian Government over time by state.

Zeppelin also has clean, basic visualization views integrated with it, and it gives you control over what you want to include in your graph by dragging and dropping fields into your visualization, as below:
The sum of government healthcare budget expenditure in Australia by state.

Also, when you are done with your awesome notebook story, you can easily create a report out of it and either print it or send it out.

Car accident fatalities related to drink driving, showing the most fatal days on the streets and the most fatal car accident types while under the influence.

Playing with Zeppelin

If you have never played with Zeppelin before, then visit this link for a quick way to start working with it using the latest Hortonworks tutorial. We are including Zeppelin as part of HDP as a technical preview, and official support may follow; check it out here. Try out the different interpreters and see how Zeppelin interacts with Hadoop.

Zeppelin Hub

I was recently given access to the beta version of Hub. Hub is supposed to make life in organizations easier when it comes to sharing notebooks between different departments or people within the organization. Let's assume an organization has Marketing, BI and Data Science practices. The three departments overlap in the datasets they use, so there is no need anymore for each department to work completely isolated from the others: they can share their experience, brag about their notebooks, and work together on the same notebook when it is complicated or when different skills are required.

Zeppelin Hub UI

Let's have a deeper look at Hub...

Hub Instances

An instance is backed by a Zeppelin installation somewhere (server, laptop, Hadoop, etc.). Every time you create a new instance, a new token is generated. This token should be added in your local Zeppelin installation in the file /incubator_zeppelin/conf/zeppelin-env.sh, e.g. export ZEPPELINHUB_API_TOKEN="f41d1a2b-98f8-XXXX-2575b9b189"
Once the token is added, you will be able to see the notebooks online whenever you connect to Hub (http://zeppelin.hub.com).

Hub Spaces

Once an instance is added, you will be able to see all the notebooks for each instance. Since every space is either a department or a category of notebooks that needs to be shared across certain people, you can easily drag and drop notebooks into spaces, making them shared across that specific space.

Adding a Notebook to a Space

Showing a Notebook inside Zeppelin Hub

Very cool! Since it is beta, there is still much work to be done, like executing notebooks from Hub directly, resizing and formatting, and some other minor issues. I am sure the all-stars team @nflabs will make it happen very soon, as they always have. If you are interested in playing with the beta, you may request access on the Apache Zeppelin website here.

Hortonworks and Apache Zeppelin

Hortonworks is heavily adopting Apache Zeppelin, as shown by the contributions made to the product and to Apache Ambari. @ali, one of the rockstars at Hortonworks, created an Apache Zeppelin View for Ambari, which gives Zeppelin authentication and allows users to have a single pane of glass for uploading datasets via the HDFS view in Apache Ambari Views and other operational needs.
Apache Ambari with Zeppelin View integration

Apache Zeppelin notebook editor from Apache Ambari

If you want to integrate Zeppelin in Ambari with Apache Spark as well, just follow the steps at this link.

Hortonworks Gallery for Apache Zeppelin

Recently we published a gallery where anyone can contribute and share their notebooks publicly; all you need to do is grab the notebook folder and upload it. Check it out here. If you are not sure how to start, a great way is to take a look at the Hortonworks Gallery for Apache Zeppelin, where you can get a 360° view of different ways to create notebooks.

Helium Project

Helium is a revolutionary change in Zeppelin: it allows you to integrate almost any standard HTML, CSS or JavaScript as a visualization or a view inside Zeppelin. A Helium application consists of a view, an algorithm and access to a resource. You can get more information on Helium here.
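One practical footnote on the Hub Instances step above: after adding the token, the Zeppelin daemon typically needs a restart so the new environment variable is picked up. A minimal sketch, using the placeholder token and the conf path mentioned earlier:

# In /incubator_zeppelin/conf/zeppelin-env.sh (token value is the placeholder from above):
export ZEPPELINHUB_API_TOKEN="f41d1a2b-98f8-XXXX-2575b9b189"
# Restart Zeppelin so the token takes effect:
/incubator_zeppelin/bin/zeppelin-daemon.sh restart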
11-05-2015 03:27 AM
Makes perfect sense. I wonder if it will be backward compatible though; right now I ended up with different Zeppelin folders pointed at different Spark versions.
11-05-2015 03:22 AM
Would copying and modifying the interpreter file under the /incubator-zeppelin/interpreter folder help?