Member since: 09-29-2015

32 Posts · 55 Kudos Received · 2 Solutions
        My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 6929 | 11-26-2015 10:19 PM |
|  | 4982 | 11-05-2015 03:22 AM |

12-08-2015 05:26 AM · 3 Kudos

From what we have seen in the field and in customer testing, SparkSQL (1.4.x at the time of testing) was generally 50%-200% faster when querying small datasets, by which I mean anything under 100 GB. That makes it great for data discovery, data wrangling, trying things out, or even production use cases where there are many datasets but each is relatively small. Tez shines as tables get bigger, especially when joins are not used effectively or you are scanning one big table, and in the BI space where SLAs apply and you can't afford a query to fail and start over: it is rigid and stable, and the bigger the dataset, the better its performance gets relative to Spark. At around 250 GB you will see very similar execution times, though this of course depends on the cluster size, how much memory is allocated, and so on. In general, my personal opinion is that we shouldn't compare the two head-to-head at this point, since each shines in a different context: Tez may be needed at some scale, while Spark may be the better fit for smaller datasets. As mentioned, this was based on Spark 1.4.x; I would love to re-run the tests, especially with the new cube functionality in Spark 1.5. Hope this was helpful.
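If you want to reproduce this kind of comparison yourself, a minimal sketch is to run the same statement through each engine and time it; the table name below is a hypothetical placeholder:

# Hive on Tez: force the engine, then run the query
sandbox> hive -e "set hive.execution.engine=tez; select count(*) from my_table;"
# SparkSQL (1.4.x): the same statement through the spark-sql shell
sandbox> spark-sql -e "select count(*) from my_table;"

The comparison is only meaningful when both engines see the same Hive metastore tables and comparable cluster resources.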
						
					

11-30-2015 09:29 PM · 3 Kudos

One of the first use cases we see with HBase is loading it up with data: most of the time we have data available in some format like CSV, and we would like to load it into HBase. Let's take a quick look at what the procedure looks like. First, let's examine our example data by looking at the simple structure I have for an industrial sensor:

id, temp:in, temp:out, vibration, pressure:in, pressure:out
5842, 50, 30, 4, 240, 340

First of all, make sure HBase is started on your Sandbox.

Creating the HBase Table

Log in as root to the HDP Sandbox and switch to the hbase user:

root> su - hbase

Enter the HBase shell by typing:

hbase> hbase shell

Create the example table by typing:

hbase(main):001:0> create 'sensor','temp','vibration','pressure'

Let's make sure the table was created by listing the tables:

hbase(main):001:0> list
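If you also want to examine the table structure (column families and their settings) rather than just the table name, the shell's describe command does that:

hbase(main):002:0> describe 'sensor'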
Now exit the shell by typing 'exit', and let's load some data.

Loading the Data

Let's put the hbase.csv file into HDFS. You may SCP it to the cluster first using the following command:

macbook-ned> scp hbase.csv root@sandbox.hortonworks.com:/home/hbase

Now put it in HDFS using the following command:

hbase> hdfs dfs -copyFromLocal hbase.csv /tmp

We shall now execute the ImportTsv statement as follows. Note that the first CSV field, id, becomes the row key, so it is mapped as HBASE_ROW_KEY rather than as a column of its own:

hbase> hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=, -Dimporttsv.columns="HBASE_ROW_KEY,temp:in,temp:out,vibration,pressure:in,pressure:out" sensor /tmp/hbase.csv

Once the MapReduce job is completed, return to the HBase shell and execute:

hbase(main):001:0> scan 'sensor'

You should now see the data in the table.
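To spot-check a single row after the load, you can also fetch it directly by row key; using the sample record from earlier:

hbase(main):002:0> get 'sensor', '5842'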
Remarks

The ImportTsv statement generates a massive amount of logs, so make sure you have enough space in /var/log. On a real cluster it is always better to mount logs on a separate partition, to avoid an operational stop because logs have filled it up.
						
					

11-26-2015 10:19 PM · 2 Kudos

OK, here is the latest: the R interpreter for Zeppelin has not been merged into the latest Zeppelin distribution yet; however, you can use it now from https://github.com/apache/incubator-zeppelin/pull/208. All the best 🙂
						
					

11-20-2015 05:15 AM

This should now be solved: starting with Zeppelin 0.5.5 you don't need to rebuild for different Spark/Hadoop versions... enjoy 🙂
						
					

11-11-2015 10:33 PM

@azeltov@hortonworks.com you can, as long as you modify the ZEPPELINHUB_API_TOKEN and you have a direct internet connection from the Sandbox.
						
					

11-08-2015 11:06 PM · 5 Kudos

							 
Introduction

Hive is one of the most commonly used databases on Hadoop; its user base is doubling every year thanks to amazing enhancements and the addition of Tez and Spark, which let Hive bypass the MapReduce era in favour of in-memory execution and changed how people use Hive. In this blog post I will show you how to connect the SQuirreL SQL Client to Hive; the concept is similar for any other client out there, and as long as you use open-source libraries matching the ones listed here, you should be fine.

Prerequisite

Download the Hortonworks Sandbox with HDP 2.2.4 and the SQuirreL SQL Client.

Step 1

Follow the SQuirreL documentation and run it on your Mac or PC.

Step 2

Follow the Hortonworks HDP installation on VirtualBox, VMware or Hyper-V and start up the virtual instance.

Step 3

Once HDP is up and running, connect to it using SSH as shown on the console. Once connected, you need to download some JAR files in order to establish the connection.

Step 4

If you are using macOS, while connected to your HDP instance simply search for each of the following JARs using the command:

root> find / -name JAR_FILE

Once you find the file you need, copy it to your laptop/PC using SCP:

root> scp JAR_FILE yourMacUser@yourIPAddress:/PATH_TO_JARS

The files you should look for are the following (versions will differ based on which Sandbox you are running, but different versions are unlikely to cause a problem):

commons-logging-1.1.3.jar
hive-exec-0.14.0.2.2.4.2-2.jar
hive-jdbc-0.14.0.2.2.4.2-2.jar
hive-service-0.14.0.2.2.4.2-2.jar
httpclient-4.2.5.jar
httpcore-4.2.5.jar
libthrift-0.9.0.jar
slf4j-api-1.7.5.jar
slf4j-log4j12-1.7.5.jar
hadoop-common-2.6.0.2.2.4.2-2.jar

If you are running Windows, you might need to install WinSCP in order to grab the files from their locations.

Step 5

Once all the JARs above are downloaded to your local machine, open up SQuirreL, go to Drivers and add a new driver:

Name: Hive Driver (could be anything else you want)
Example URL: jdbc:hive2://localhost:10000/default
Class Name: org.apache.hive.jdbc.HiveDriver

Go to Extra Class Paths and add all the JARs you downloaded. You may change the port number or IP address if you are not running with the defaults.

Step 6

Log in to your Hadoop Sandbox and verify that HiveServer2 is running using:

netstat -anp | grep 10000

If nothing is running, you can start hiveserver2 manually:

hive> hiveserver2

Step 7

Once you verify hiveserver2 is up and running, you are ready to test the connection in SQuirreL by creating a new Alias. You are now ready to connect; once the connection is successful, you should see the query screen.

Step 8 (Optional)

With your first Hive query, SQuirreL can be buggy and complain about memory and heap size. If this ever occurs on a Mac, right-click the app icon --> Show Package Contents --> open Info.plist and add the following snippet:

<key>Java</key>
<dict>
    <key>VMOptions</key>
    <array>
        <string>-Xms128m</string>
        <string>-Xmx512m</string>
    </array>
</dict>

Now you can enjoy...
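Before involving SQuirreL at all, it can also help to confirm that the JDBC endpoint answers from inside the Sandbox; a minimal check using beeline (assuming the default Sandbox setup, where the hive user needs no password):

beeline -u "jdbc:hive2://localhost:10000/default" -n hive -e "show tables;"

If this returns the table list, any remaining connection problems are on the SQuirreL/driver side rather than with HiveServer2.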
						
					

11-08-2015 10:56 PM · 12 Kudos

Introduction

Apache Zeppelin (an Apache Incubator project at the time of writing) is one of my favourite tools, one I try to position and present to anyone interested in analytics. It is 100% open source, with an intelligent international team behind it in Korea (NFLabs, moving to San Francisco soon). It is built mainly around an interpreter concept that allows any language or data-processing backend to be plugged into Apache Zeppelin. It is very similar to IPython/Jupyter, except that the UI is arguably more appealing and the set of supported interpreters is richer; at the time of writing this blog, Zeppelin supported:

Apache Hive QL
Apache Spark (SQL, Scala and Python)
Apache Flink
Postgres
Pivotal HAWQ
Shell
Apache Tajo
AngularJS
Apache Cassandra
Apache Ignite
Apache Phoenix
Apache Geode
Apache Kylin
Apache Lens

With this rich set of interpreters, onboarding platforms like Apache Hadoop, or data lake concepts in general, becomes much easier: data sits consolidated in one place, and different organizational units with different skill sets can access it and perform their day-to-day duties on it, from data discovery, queries, data modelling and data streaming all the way to data science with Apache Spark.

Apache Zeppelin Overview

With the notebook-style editor and the ability to save notebooks on the fly, you can end up with some really cool notebooks, whether you are a data engineer, a data scientist or a BI specialist.

[Screenshot: dataset showing the health expenditure of the Australian government over time, by state]

Zeppelin also has clean, basic visualization views integrated with it, and it gives you control over what to include in your graph by dragging and dropping fields into your visualization, as below:

[Screenshot: the sum of government healthcare budget expenditure in Australia, by state]

When you are done with your awesome notebook story, you can easily create a report out of it and either print it or send it out.

[Screenshot: car accident fatalities related to drink-driving, showing the most fatal days on the streets and the most fatal car accident types]

Playing with Zeppelin

If you have never played with Zeppelin before, visit the latest Hortonworks tutorial for a quick way to start working with it. We are including Zeppelin as part of HDP as a technical preview, and official support may follow. Check it out here, and try the different interpreters and how they interact with Hadoop.

Zeppelin Hub

I was recently given access to the beta version of Hub. Hub is meant to make life in organizations easier when it comes to sharing notebooks between different departments or people. Let's assume an organization has Marketing, BI and Data Science practices; the three departments overlap in the datasets they use, so there is no need for each department to work completely isolated from the others: they can share their experience, brag about their notebooks, and work together on the same notebook when it is complicated or when different skills are required.

[Screenshot: Zeppelin Hub UI]

Let's have a deeper look at Hub...

Hub Instances

An Instance is backed by a Zeppelin installation somewhere (a server, a laptop, Hadoop, etc.). Every time you create a new Instance, a new token is generated; this token should be added to your local Zeppelin installation in /incubator-zeppelin/conf/zeppelin-env.sh, e.g.:

export ZEPPELINHUB_API_TOKEN="f41d1a2b-98f8-XXXX-2575b9b189"

Once the token is added, you will be able to see the notebooks online whenever you connect to Hub (http://zeppelin.hub.com).

Hub Spaces

Once an instance is added, you will be able to see all the notebooks for each instance. Since every Space is either a department or a category of notebooks that needs to be shared with certain people, you can simply drag and drop notebooks into Spaces to share them with that specific Space.

[Screenshot: adding a notebook to a Space]

[Screenshot: showing a notebook inside Zeppelin Hub]

Very cool! Since it is a beta, there is still work to be done, such as executing notebooks directly from Hub, resizing and formatting, and some other minor issues; I am sure the all-star team at NFLabs will make it happen very soon, as they always have. If you are interested in playing with the beta, you may request access on the Apache Zeppelin website here.

Hortonworks and Apache Zeppelin

Hortonworks is heavily adopting Apache Zeppelin, as shown by its contributions to the product and to Apache Ambari. @ali, one of the rock stars at Hortonworks, created an Apache Zeppelin view for Ambari, which gives Zeppelin authentication and gives users a single pane of glass, alongside the HDFS view in Apache Ambari Views for uploading datasets and other operational needs.

[Screenshot: Apache Ambari with Zeppelin view integration]

[Screenshot: Apache Zeppelin notebook editor from Apache Ambari]

If you want to integrate Zeppelin in Ambari with Apache Spark as well, just follow the steps on this link.

Hortonworks Gallery for Apache Zeppelin

We recently published a gallery where anyone can contribute and share their notebooks publicly; all you need to do is grab the notebook folder and upload it. Check it out here. If you are not sure how to start, a great way is to look through the Hortonworks Gallery for Apache Zeppelin, where you will get a 360-degree view of the different ways to create notebooks.

Helium

Project Helium is a revolutionary change in Zeppelin: it allows you to integrate almost any standard HTML, CSS or JavaScript as a visualization or view inside Zeppelin. A Helium application consists of a view, an algorithm and access to the resource; you can get more information on Helium here.
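As a concrete sketch of the Instance setup above, assuming a local incubator-zeppelin checkout in your home directory and a placeholder token:

# append the Hub token to Zeppelin's environment (the token below is a placeholder)
cd ~/incubator-zeppelin
echo 'export ZEPPELINHUB_API_TOKEN="YOUR-TOKEN-HERE"' >> conf/zeppelin-env.sh
# restart so the daemon picks up the new environment
bin/zeppelin-daemon.sh restart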
						
					

11-05-2015 03:27 AM

Makes perfect sense; I wonder if it will be backward compatible, though. Right now I have ended up with different Zeppelin folders pointed at different Spark versions.
						
					

11-05-2015 03:22 AM

Would copying and modifying the interpreter file under the /incubator-zeppelin/interpreter folder help?
						
					