Member since 07-18-2016

- 94 Posts
- 94 Kudos Received
- 20 Solutions

My Accepted Solutions

| Title | Views | Posted |
|---|---|---|
|  | 3290 | 08-11-2017 06:04 PM |
|  | 2983 | 08-02-2017 11:22 PM |
|  | 13832 | 07-10-2017 03:36 PM |
|  | 19120 | 03-17-2017 01:27 AM |
|  | 15881 | 02-24-2017 05:35 PM |

10-13-2016 01:13 PM (1 Kudo)

@Deepak Subhramanian I'd recommend upgrading your Python version to 2.7 or higher (preferably Anaconda). I was able to recreate your error, and it was resolved when I upgraded from Python 2.6 to Anaconda Python 2.7. Let me know if this does the trick for you!
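As a quick sanity check, here is a minimal sketch (not from the original thread; the app name is just a placeholder) that prints the Python version seen by the driver and by the executors:

# Minimal sketch: confirm which Python the driver and the executors are using,
# so you can verify every node is on 2.7+ (e.g., Anaconda) rather than 2.6.
import platform
from pyspark import SparkContext

sc = SparkContext(appName="python-version-check")   # not needed inside the pyspark shell

print("Driver Python:  %s" % platform.python_version())

worker_versions = (sc.parallelize(range(sc.defaultParallelism), sc.defaultParallelism)
                     .map(lambda _: platform.python_version())
                     .distinct()
                     .collect())
print("Worker Python: %s" % worker_versions)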
						
					
09-29-2016 07:00 PM (2 Kudos)

@Artem Ervits, @Randy Gelhausen Based on your discussion, I added a few additional lines to the GitHub repo mentioned above. The code will now return older versions of a cell. The maximum number of snapshot versions to fetch is specified in the props file.
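As an illustration only (the repo itself is Scala/Spark): the same idea of capping how many versions of a cell you read back looks roughly like this in Python with the happybase client. The host, table, row key, and column below are placeholders, and an HBase Thrift server is assumed to be running.

# Hypothetical sketch: fetch up to MAX_VERSIONS values of one cell, with timestamps.
import happybase

MAX_VERSIONS = 3   # in the repo, this limit comes from the props file

connection = happybase.Connection('your-hbase-thrift-host')   # placeholder host
table = connection.table('myTable')                           # placeholder table

# Up to MAX_VERSIONS (value, timestamp) pairs for the cell, newest version first
cells = table.cells(b'rowkey1', b'cf:col', versions=MAX_VERSIONS, include_timestamp=True)
for value, timestamp in cells:
    print(timestamp, value)

connection.close()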
						
					
09-29-2016 05:37 PM (1 Kudo)

For Python: I'd recommend installing Anaconda Python 2.7 on all nodes of your cluster. If your developer would like to manually add Python files/scripts, he can use the --py-files argument as part of the spark-submit statement. As an alternative, you can also reference Python scripts/files from within your PySpark code using addPyFile, such as sc.addPyFile("mymodule.py"). Just as an FYI, PySpark will run fine if you have Python 2.6 installed, but you will not be able to use the more recent packages.

For R: As @lgeorge mentioned, you will want to install R (and all required packages) on each node of your cluster. Also make sure your JAVA_HOME environment variable is set; then you should be able to launch SparkR.
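To make the --py-files / addPyFile option concrete, here is a minimal sketch (the file name and the transform() function inside mymodule.py are hypothetical placeholders for your own code):

# Minimal sketch: ship your own Python module to the executors with addPyFile.
from pyspark import SparkContext

sc = SparkContext(appName="py-files-example")   # not needed inside the pyspark shell

sc.addPyFile("mymodule.py")   # a .zip containing several modules works the same way
# Equivalent at submit time:  spark-submit --py-files mymodule.py your_script.py

def apply_module(x):
    import mymodule               # import inside the function so it resolves on the executors
    return mymodule.transform(x)  # transform() is a placeholder function in mymodule.py

print(sc.parallelize([1, 2, 3]).map(apply_module).collect())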
						
					
09-22-2016 02:49 PM

That worked great, thanks @Josh Elser! I was not looking at the correct jar. Just as a reference for others, here's the Spark command that I got to work correctly (note: there are multiple ways to add the jar to the classpath):

spark-submit --class com.github.zaratsian.SparkHBase.SparkHBaseBulkLoad --jars /tmp/SparkHBaseExample-0.0.1-SNAPSHOT.jar /usr/hdp/current/phoenix-client/phoenix-client.jar /tmp/props
						
					
09-22-2016 02:25 PM (1 Kudo)

I am running a Spark job in HDP 2.5 against an HBase table, but am getting the following error (below). I've tried a few different ways to include the ServerRpcControllerFactory class in my pom, and I also tried moving jars around, but with no luck. Does anyone have suggestions for including ServerRpcControllerFactory as part of my project? Thanks!

Exception in thread "main" java.io.IOException: java.lang.reflect.InvocationTargetException
	at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:240)
	at org.apache.hadoop.hbase.client.ConnectionManager.createConnection(ConnectionManager.java:420)
	at org.apache.hadoop.hbase.client.ConnectionManager.createConnection(ConnectionManager.java:413)
	at org.apache.hadoop.hbase.client.ConnectionManager.getConnectionInternal(ConnectionManager.java:291)
	at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:184)
	at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:150)
	at com.github.zaratsian.SparkHBase.SparkHBaseBulkLoad$.main(SparkHBaseBulkLoad.scala:80)
	at com.github.zaratsian.SparkHBase.SparkHBaseBulkLoad.main(SparkHBaseBulkLoad.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
	at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:238)
	... 16 more
Caused by: java.lang.UnsupportedOperationException: Unable to find org.apache.hadoop.hbase.ipc.controller.ServerRpcControllerFactory
	at org.apache.hadoop.hbase.util.ReflectionUtils.instantiateWithCustomCtor(ReflectionUtils.java:36)
	at org.apache.hadoop.hbase.ipc.RpcControllerFactory.instantiate(RpcControllerFactory.java:58)
	at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.createAsyncProcess(ConnectionManager.java:2242)
	at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.<init>(ConnectionManager.java:690)
	at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.<init>(ConnectionManager.java:630)
	... 21 more
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.ipc.controller.ServerRpcControllerFactory
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:264)
	at org.apache.hadoop.hbase.util.ReflectionUtils.instantiateWithCustomCtor(ReflectionUtils.java:32)
 
						
					
Labels: Apache HBase, Apache Spark

09-19-2016 04:24 PM (3 Kudos)

							 
This Scala code will create a DataFrame and load it into Hive. Hope this helps!

// Create dummy data and load it into a DataFrame
case class rowschema(id:Int, record:String)
val df = sqlContext.createDataFrame(Seq(rowschema(1,"record1"), rowschema(2,"record2"), rowschema(3,"record3")))
df.registerTempTable("tempTable")

// Create a new Hive table and load tempTable into it
sqlContext.sql("create table newHiveTable as select * from tempTable")
						
					
09-06-2016 03:41 PM (2 Kudos)

Hey Laia, you're close, but it looks like a couple of arguments are out of order when you configure the indexer and the initial RandomForest object. label_idx is not visible to the RandomForest object because the order of execution is off, so it is not in the DataFrame ("does not exist"). If you change the order it should work. I'd recommend decoupling the indexer and the rf object and executing them as stages of the pipeline. Here's the code that I got to work; I also added a few lines at the bottom to show the predictions and accuracy (feel free to modify to fit your requirements). Let me know if this helps.

val unparseddata = sc.textFile("hdfs:///tmp/your_data.csv")

val data = unparseddata.map { line =>
  val parts = line.split(',').map(_.toDouble)
  LabeledPoint(parts.last % 2, Vectors.dense(parts.slice(0, parts.length - 1)))
}.toDF()

val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

val nFolds: Int = 10
val NumTrees: Int = 3

val indexer = new StringIndexer().setInputCol("label").setOutputCol("label_idx")
val rf = new RandomForestClassifier().setNumTrees(NumTrees).setFeaturesCol("features").setLabelCol("label_idx")
val pipeline = new Pipeline().setStages(Array(indexer, rf))

val paramGrid = new ParamGridBuilder().build()
val evaluator = new MulticlassClassificationEvaluator().setLabelCol("label").setPredictionCol("prediction")

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(nFolds)

val model = cv.fit(trainingData)
val predictions = model.transform(testData)

// Show model predictions
predictions.show()

val accuracy = evaluator.evaluate(predictions)
println("Accuracy:   " + accuracy)
println("Error Rate: " + (1.0 - accuracy))
						
					
08-29-2016 06:35 PM (1 Kudo)

I agree with Vasilis and Enis, both Scrapy and Nutch would be great projects to check out. Parsing the HTML (i.e., extracting price, item names, date, etc.) is one of the more challenging parts of web crawling. These projects have some built-in functionality for this, and you may also want to check out html.parser or lxml.

NOTE: If you want to ship external Python packages to all nodes in your cluster via PySpark, you will need to reference the .py or .zip file in your code, such as sc.addPyFile("xx.zip").

Once you have parsed the data, the records could be sent to HDFS, or stored in HBase, Hive, etc.
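As a small illustration of the parsing step (a sketch only; the "item-name" and "item-price" class names are made up and would need to match the site you crawl), Python's built-in html.parser can be used like this:

# Minimal sketch: pull item names and prices out of raw HTML with the stdlib parser.
from html.parser import HTMLParser   # Python 3; on Python 2: from HTMLParser import HTMLParser

class ItemParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.current_class = None
        self.items = []

    def handle_starttag(self, tag, attrs):
        # Remember the class attribute of the tag we just entered
        self.current_class = dict(attrs).get("class")

    def handle_data(self, data):
        # Keep the text only for the elements we care about
        if self.current_class in ("item-name", "item-price") and data.strip():
            self.items.append((self.current_class, data.strip()))
        self.current_class = None

page = '<div class="item-name">Widget</div><div class="item-price">9.99</div>'
parser = ItemParser()
parser.feed(page)
print(parser.items)   # [('item-name', 'Widget'), ('item-price', '9.99')]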
						
					
08-18-2016 03:29 PM

Thanks for the response Randy and Enis, very helpful! I was able to get this working and placed the GitHub project here: https://github.com/zaratsian/SparkHBaseExample
						
					
08-08-2016 09:08 PM (4 Kudos)

Hortonworks and SAS have partnered to create two new Apache NiFi processors. These processors allow data/events to be streamed between Hortonworks DataFlow (HDF) and SAS Event Stream Processing.

Why does this matter?

HDF, powered by Apache NiFi, Kafka, and Storm, is an integrated system for real-time dataflow management and streaming analytics on-premises or in the cloud. SAS Event Stream Processing is a real-time, low-latency, high-throughput event processing solution that can deploy SAS machine learning models. By integrating these technologies, organizations now have the option of deploying their SAS models in real time within the Hortonworks platform. This offers flexible deployment options for your streaming analytics projects, while providing powerful analytics from SAS.

How does this integration work?

There are two new processors that can be added to NiFi:

- ListenESP: This processor initiates a listener within NiFi that receives events from the SAS Event Stream Processing data stream.
- PutESP: This processor sends events from NiFi to the SAS Event Stream Processing data stream.

Setup and configuration:

- Download and install Hortonworks DataFlow.
- Copy the SAS .nar file to $NIFI_HOME/lib (this .nar file is provided by SAS when SAS Event Stream Processing is purchased).
- Edit $NIFI_HOME/conf/nifi.properties and change the web HTTP port to 31005 (nifi.web.http.port=31005) or another available port of your choice.
- Start NiFi by running $NIFI_HOME/bin/nifi.sh run
- Open a browser and go to http://$HOST:31005/nifi

NOTE: For this to work, SAS Event Stream Processing must be purchased and have a valid license.

Once the .nar file has been added, you will have access to the two processors within NiFi. Data events are shared using an Avro schema. A basic example of a NiFi dataflow using both a ListenESP and a PutESP processor is shown in Figure 1.

Within the PutESP processor, you'll notice a few parameters (shown in Figure 2):

- Pub/Sub Host: Hostname or IP of the server running SAS Event Stream Processing.
- Pub/Sub Port: Pub/sub port of the SAS Event Stream Processing engine.
- Project: SAS Event Stream Processing project name.
- Continuous Query: Name of the continuous query within the SAS Event Stream Processing project.
- Source Window: Source window within SAS Event Stream Processing where events from NiFi can be injected.

The ListenESP processor has similar parameters (shown in Figure 3).

For more information, check out Hortonworks DataFlow (HDF) powered by Apache NiFi and SAS Event Stream Processing.
						
					