Member since 07-18-2016

- 94 Posts
- 94 Kudos Received
- 20 Solutions

My Accepted Solutions

| Title | Views | Posted |
|---|---|---|
|  | 3290 | 08-11-2017 06:04 PM |
|  | 2983 | 08-02-2017 11:22 PM |
|  | 13832 | 07-10-2017 03:36 PM |
|  | 19120 | 03-17-2017 01:27 AM |
|  | 15881 | 02-24-2017 05:35 PM |

10-13-2016 01:13 PM (1 Kudo)

@Deepak Subhramanian I'd recommend upgrading your Python version to 2.7 or higher (preferably Anaconda). I was able to recreate your error, and it was resolved when I upgraded from Python 2.6 to Anaconda Python 2.7. Let me know if this does the trick for you!
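As a quick sanity check, here is a minimal sketch (not from the original thread; the app name is just a placeholder) that prints the Python version seen by the driver and by the executors:

# Minimal sketch: confirm which Python the driver and the executors are using,
# so you can verify every node is on 2.7+ (e.g., Anaconda) rather than 2.6.
import platform
from pyspark import SparkContext

sc = SparkContext(appName="python-version-check")   # not needed inside the pyspark shell

print("Driver Python:  %s" % platform.python_version())

worker_versions = (sc.parallelize(range(sc.defaultParallelism), sc.defaultParallelism)
                     .map(lambda _: platform.python_version())
                     .distinct()
                     .collect())
print("Worker Python: %s" % worker_versions)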
						
					
09-29-2016 07:00 PM (2 Kudos)

@Artem Ervits, @Randy Gelhausen Based on your discussion, I added a few additional lines to the GitHub repo mentioned above. The code will now return older versions of a cell. The maximum number of snapshot versions to fetch is specified in the props file.
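As an illustration only (the repo itself is Scala/Spark): the same idea of capping how many versions of a cell you read back looks roughly like this in Python with the happybase client. The host, table, row key, and column below are placeholders, and an HBase Thrift server is assumed to be running.

# Hypothetical sketch: fetch up to MAX_VERSIONS values of one cell, with timestamps.
import happybase

MAX_VERSIONS = 3   # in the repo, this limit comes from the props file

connection = happybase.Connection('your-hbase-thrift-host')   # placeholder host
table = connection.table('myTable')                           # placeholder table

# Up to MAX_VERSIONS (value, timestamp) pairs for the cell, newest version first
cells = table.cells(b'rowkey1', b'cf:col', versions=MAX_VERSIONS, include_timestamp=True)
for value, timestamp in cells:
    print(timestamp, value)

connection.close()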
						
					
09-29-2016 05:37 PM (1 Kudo)

For Python: I'd recommend installing Anaconda Python 2.7 on all nodes of your cluster. If your developer would like to manually add Python files/scripts, he can use the --py-files argument as part of the spark-submit statement. As an alternative, you can also reference Python scripts/files from within your PySpark code using addPyFile, such as sc.addPyFile("mymodule.py"). Just as an FYI, PySpark will run fine if you have Python 2.6 installed, but you will not be able to use the more recent packages.

For R: As @lgeorge mentioned, you will want to install R (and all required packages) on each node of your cluster. Also make sure your JAVA_HOME environment variable is set; then you should be able to launch SparkR.
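To make the --py-files / addPyFile option concrete, here is a minimal sketch (the file name and the transform() function inside mymodule.py are hypothetical placeholders for your own code):

# Minimal sketch: ship your own Python module to the executors with addPyFile.
from pyspark import SparkContext

sc = SparkContext(appName="py-files-example")   # not needed inside the pyspark shell

sc.addPyFile("mymodule.py")   # a .zip containing several modules works the same way
# Equivalent at submit time:  spark-submit --py-files mymodule.py your_script.py

def apply_module(x):
    import mymodule               # import inside the function so it resolves on the executors
    return mymodule.transform(x)  # transform() is a placeholder function in mymodule.py

print(sc.parallelize([1, 2, 3]).map(apply_module).collect())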
						
					
09-22-2016 02:49 PM

That worked great, thanks @Josh Elser! I was not looking at the correct jar. Just as a reference for others, here's the Spark command that I got to work correctly (note: there are multiple ways to add the jar to the classpath):

spark-submit --class com.github.zaratsian.SparkHBase.SparkHBaseBulkLoad --jars /tmp/SparkHBaseExample-0.0.1-SNAPSHOT.jar /usr/hdp/current/phoenix-client/phoenix-client.jar /tmp/props
						
					
09-22-2016 02:25 PM (1 Kudo)

I am running a Spark job in HDP 2.5 against an HBase table, but am getting the following error (below). I've tried a few different ways to include the ServerRpcControllerFactory class in my pom, and I also tried moving jars around, but with no luck. Does anyone have suggestions for including ServerRpcControllerFactory as part of my project? Thanks!

Exception in thread "main" java.io.IOException: java.lang.reflect.InvocationTargetException
	at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:240)
	at org.apache.hadoop.hbase.client.ConnectionManager.createConnection(ConnectionManager.java:420)
	at org.apache.hadoop.hbase.client.ConnectionManager.createConnection(ConnectionManager.java:413)
	at org.apache.hadoop.hbase.client.ConnectionManager.getConnectionInternal(ConnectionManager.java:291)
	at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:184)
	at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:150)
	at com.github.zaratsian.SparkHBase.SparkHBaseBulkLoad$.main(SparkHBaseBulkLoad.scala:80)
	at com.github.zaratsian.SparkHBase.SparkHBaseBulkLoad.main(SparkHBaseBulkLoad.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
	at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:238)
	... 16 more
Caused by: java.lang.UnsupportedOperationException: Unable to find org.apache.hadoop.hbase.ipc.controller.ServerRpcControllerFactory
	at org.apache.hadoop.hbase.util.ReflectionUtils.instantiateWithCustomCtor(ReflectionUtils.java:36)
	at org.apache.hadoop.hbase.ipc.RpcControllerFactory.instantiate(RpcControllerFactory.java:58)
	at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.createAsyncProcess(ConnectionManager.java:2242)
	at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.<init>(ConnectionManager.java:690)
	at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.<init>(ConnectionManager.java:630)
	... 21 more
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.ipc.controller.ServerRpcControllerFactory
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:264)
	at org.apache.hadoop.hbase.util.ReflectionUtils.instantiateWithCustomCtor(ReflectionUtils.java:32)
 
						
					
Labels: Apache HBase, Apache Spark

09-19-2016 04:24 PM (3 Kudos)

							 
This Scala code will create a DataFrame and load it into Hive. Hope this helps!

// Create dummy data and load it into a DataFrame
case class rowschema(id:Int, record:String)
val df = sqlContext.createDataFrame(Seq(rowschema(1,"record1"), rowschema(2,"record2"), rowschema(3,"record3")))
df.registerTempTable("tempTable")

// Create a new Hive table and load tempTable into it
sqlContext.sql("create table newHiveTable as select * from tempTable")
						
					
09-06-2016 03:41 PM (2 Kudos)

Hey Laia, you're close, but it looks like a couple of arguments are out of order when you configure the indexer and the initial RandomForest object. label_idx is not visible to the RandomForest object because the order of execution is off, so it is not in the DataFrame ("does not exist"). If you change the order it should work. I'd recommend decoupling the indexer and the rf object and executing them as stages of the pipeline. Here's the code that I got to work; I also added a few lines at the bottom to show the predictions and accuracy (feel free to modify to fit your requirements). Let me know if this helps.

val unparseddata = sc.textFile("hdfs:///tmp/your_data.csv")

val data = unparseddata.map { line =>
  val parts = line.split(',').map(_.toDouble)
  LabeledPoint(parts.last % 2, Vectors.dense(parts.slice(0, parts.length - 1)))
}.toDF()

val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

val nFolds: Int = 10
val NumTrees: Int = 3

val indexer = new StringIndexer().setInputCol("label").setOutputCol("label_idx")
val rf = new RandomForestClassifier().setNumTrees(NumTrees).setFeaturesCol("features").setLabelCol("label_idx")
val pipeline = new Pipeline().setStages(Array(indexer, rf))

val paramGrid = new ParamGridBuilder().build()
val evaluator = new MulticlassClassificationEvaluator().setLabelCol("label").setPredictionCol("prediction")

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(nFolds)

val model = cv.fit(trainingData)
val predictions = model.transform(testData)

// Show model predictions
predictions.show()

val accuracy = evaluator.evaluate(predictions)
println("Accuracy:   " + accuracy)
println("Error Rate: " + (1.0 - accuracy))
						
					
08-29-2016 06:35 PM (1 Kudo)

I agree with Vasilis and Enis, both Scrapy and Nutch would be great projects to check out. Parsing the HTML (i.e., extracting price, item names, date, etc.) is one of the more challenging parts of web crawling. These projects have some built-in functionality for this, and you may also want to check out html.parser or lxml.

NOTE: If you want to ship external Python packages to all nodes in your cluster via PySpark, you will need to reference the .py or .zip file in your code, such as sc.addPyFile("xx.zip").

Once you have parsed the data, the records could be sent to HDFS, or stored in HBase, Hive, etc.
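As a small illustration of the parsing step (a sketch only; the "item-name" and "item-price" class names are made up and would need to match the site you crawl), Python's built-in html.parser can be used like this:

# Minimal sketch: pull item names and prices out of raw HTML with the stdlib parser.
from html.parser import HTMLParser   # Python 3; on Python 2: from HTMLParser import HTMLParser

class ItemParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.current_class = None
        self.items = []

    def handle_starttag(self, tag, attrs):
        # Remember the class attribute of the tag we just entered
        self.current_class = dict(attrs).get("class")

    def handle_data(self, data):
        # Keep the text only for the elements we care about
        if self.current_class in ("item-name", "item-price") and data.strip():
            self.items.append((self.current_class, data.strip()))
        self.current_class = None

page = '<div class="item-name">Widget</div><div class="item-price">9.99</div>'
parser = ItemParser()
parser.feed(page)
print(parser.items)   # [('item-name', 'Widget'), ('item-price', '9.99')]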
						
					
08-18-2016 03:29 PM

Thanks for the response Randy and Enis, very helpful! I was able to get this working and placed the GitHub project here: https://github.com/zaratsian/SparkHBaseExample
						
					
08-08-2016 09:08 PM (4 Kudos)

Hortonworks and SAS have partnered to create two new Apache NiFi processors. These processors allow data/events to be streamed between Hortonworks DataFlow (HDF) and SAS Event Stream Processing.

Why does this matter?

HDF, powered by Apache NiFi, Kafka, and Storm, is an integrated system for real-time dataflow management and streaming analytics on-premises or in the cloud. SAS Event Stream Processing is a real-time, low-latency, high-throughput event processing solution that can deploy SAS machine learning models. By integrating these technologies, organizations now have the option of deploying their SAS models in real time within the Hortonworks platform. This offers flexible deployment options for your streaming analytics projects, while providing powerful analytics from SAS.

How does this integration work?

There are two new processors that can be added to NiFi:

- ListenESP: This processor initiates a listener within NiFi that receives events from the SAS Event Stream Processing data stream.
- PutESP: This processor sends events from NiFi to the SAS Event Stream Processing data stream.

Setup and configuration:

- Download and install Hortonworks DataFlow.
- Copy the SAS .nar file to $NIFI_HOME/lib (this .nar file is provided by SAS when SAS Event Stream Processing is purchased).
- Edit $NIFI_HOME/conf/nifi.properties and change the web HTTP port to 31005 (nifi.web.http.port=31005) or another available port of your choice.
- Start NiFi by running $NIFI_HOME/bin/nifi.sh run
- Open a browser and go to http://$HOST:31005/nifi

NOTE: For this to work, SAS Event Stream Processing must be purchased and have a valid license.

Once the .nar file has been added, you will have access to the two processors within NiFi. Data events are shared using an Avro schema. A basic example of a NiFi dataflow using both a ListenESP and a PutESP processor is shown in Figure 1.

Within the PutESP processor, you'll notice a few parameters (shown in Figure 2):

- Pub/Sub Host: Hostname or IP of the server running SAS Event Stream Processing.
- Pub/Sub Port: Pub/sub port of the SAS Event Stream Processing engine.
- Project: SAS Event Stream Processing project name.
- Continuous Query: Name of the continuous query within the SAS Event Stream Processing project.
- Source Window: Source window within SAS Event Stream Processing where events from NiFi can be injected.

The ListenESP processor has similar parameters (shown in Figure 3).

For more information, check out Hortonworks DataFlow (HDF) powered by Apache NiFi and SAS Event Stream Processing.
						
					