Member since: 07-18-2016
Posts: 94
Kudos Received: 94
Solutions: 20
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 2586 | 08-11-2017 06:04 PM |
 | 2443 | 08-02-2017 11:22 PM |
 | 9782 | 07-10-2017 03:36 PM |
 | 17972 | 03-17-2017 01:27 AM |
 | 14821 | 02-24-2017 05:35 PM |
10-13-2016
01:13 PM
1 Kudo
@Deepak Subhramanian I'd recommend upgrading your Python version to 2.7 or higher (preferably the Anaconda distribution). I was able to recreate your error, and it was resolved when I upgraded from Python 2.6 to Anaconda Python 2.7. Let me know if this does the trick for you!
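As a quick sanity check after the upgrade (a minimal sketch; the exact path depends on where Anaconda is installed on each node):
python -V        # should report 2.7.x
which python     # should resolve to the Anaconda install rather than the system Python 2.6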
09-29-2016
07:00 PM
2 Kudos
@Artem Ervits, @Randy Gelhausen Based on your discussion, I added a few additional lines to the GitHub repo mentioned above. The code will now return older versions of a cell. The maximum number of snapshot versions to fetch is specified in the props file.
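For context, the core of that change looks roughly like the following (a sketch only; the table name, row key, and version count are placeholders, and the real logic lives in the repo above):
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName, CellUtil}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes

val conf = HBaseConfiguration.create()
val connection = ConnectionFactory.createConnection(conf)
val table = connection.getTable(TableName.valueOf("my_table"))   // placeholder table name

// Ask HBase for up to N versions of each cell instead of only the latest
val get = new Get(Bytes.toBytes("my_rowkey"))                    // placeholder row key
get.setMaxVersions(3)                                            // max versions, normally read from the props file
val result = table.get(get)

// Each returned Cell carries its own timestamp, i.e. one entry per version
result.rawCells().foreach { cell =>
  println(Bytes.toString(CellUtil.cloneQualifier(cell)) + " @ " + cell.getTimestamp +
          " = " + Bytes.toString(CellUtil.cloneValue(cell)))
}

table.close()
connection.close()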
09-29-2016
05:37 PM
1 Kudo
For Python: I'd recommend installing Anaconda Python 2.7 on all nodes of your cluster. If your developer would like to manually add Python files/scripts, he can use the --py-files argument as part of the spark-submit statement (see the sketch below). As an alternative, you can also reference Python scripts/files from within your PySpark code using addPyFile, such as sc.addPyFile("mymodule.py"). Just as an FYI, PySpark will run fine if you have Python 2.6 installed, but you will not be able to use the more recent packages.
For R: As @lgeorge mentioned, you will want to install R (and all required packages) on each node of your cluster. Also make sure your JAVA_HOME environment variable is set; then you should be able to launch SparkR.
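A minimal spark-submit sketch for shipping extra Python files to the executors (the script and module names here are placeholders):
spark-submit \
  --master yarn \
  --py-files mymodule.py,helpers.zip \
  my_job.py
Inside the job, sc.addPyFile("mymodule.py") accomplishes the same thing programmatically; either way, the files are distributed so they can be imported on every node.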
09-22-2016
02:49 PM
That worked great, thanks @Josh Elser! I was not looking at the correct jar. Just as a reference for others, here's the spark-submit command that I got to work correctly (note: there are multiple ways to add the jar to the classpath; one alternative is sketched below):
spark-submit --class com.github.zaratsian.SparkHBase.SparkHBaseBulkLoad --jars /tmp/SparkHBaseExample-0.0.1-SNAPSHOT.jar /usr/hdp/current/phoenix-client/phoenix-client.jar /tmp/props
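As one alternative (a sketch only; the paths below follow the HDP layout from the command above but may differ in your environment), the jar that provides ServerRpcControllerFactory can be placed on the driver and executor classpaths explicitly:
spark-submit \
  --class com.github.zaratsian.SparkHBase.SparkHBaseBulkLoad \
  --conf spark.driver.extraClassPath=/usr/hdp/current/phoenix-client/phoenix-client.jar \
  --conf spark.executor.extraClassPath=/usr/hdp/current/phoenix-client/phoenix-client.jar \
  /tmp/SparkHBaseExample-0.0.1-SNAPSHOT.jar /tmp/props
This avoids bundling the Phoenix client jar into the application jar while still making the class visible at runtime.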
09-22-2016
02:25 PM
1 Kudo
I am running a Spark job in HDP 2.5 against an HBase table, but I'm getting the error below. I've tried a few different ways to include the ServerRpcControllerFactory library in my pom, and I also tried moving jars around, but with no luck. Does anyone have suggestions for including ServerRpcControllerFactory as part of my project? Thanks!
Exception in thread "main" java.io.IOException: java.lang.reflect.InvocationTargetException
at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:240)
at org.apache.hadoop.hbase.client.ConnectionManager.createConnection(ConnectionManager.java:420)
at org.apache.hadoop.hbase.client.ConnectionManager.createConnection(ConnectionManager.java:413)
at org.apache.hadoop.hbase.client.ConnectionManager.getConnectionInternal(ConnectionManager.java:291)
at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:184)
at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:150)
at com.github.zaratsian.SparkHBase.SparkHBaseBulkLoad$.main(SparkHBaseBulkLoad.scala:80)
at com.github.zaratsian.SparkHBase.SparkHBaseBulkLoad.main(SparkHBaseBulkLoad.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:238)
... 16 more
Caused by: java.lang.UnsupportedOperationException: Unable to find org.apache.hadoop.hbase.ipc.controller.ServerRpcControllerFactory
at org.apache.hadoop.hbase.util.ReflectionUtils.instantiateWithCustomCtor(ReflectionUtils.java:36)
at org.apache.hadoop.hbase.ipc.RpcControllerFactory.instantiate(RpcControllerFactory.java:58)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.createAsyncProcess(ConnectionManager.java:2242)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.<init>(ConnectionManager.java:690)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.<init>(ConnectionManager.java:630)
... 21 more
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.ipc.controller.ServerRpcControllerFactory
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at org.apache.hadoop.hbase.util.ReflectionUtils.instantiateWithCustomCtor(ReflectionUtils.java:32)
Labels:
- Apache HBase
- Apache Spark
09-19-2016
04:24 PM
3 Kudos
This Scala code will create a DataFrame and load it into Hive. Hope this helps!
// Create dummy data and load it into a DataFrame
case class rowschema(id:Int, record:String)
val df = sqlContext.createDataFrame(Seq(rowschema(1,"record1"), rowschema(2,"record2"), rowschema(3,"record3")))
df.registerTempTable("tempTable")
// Create new Hive Table and load tempTable
sqlContext.sql("create table newHiveTable as select * from tempTable")
09-06-2016
03:41 PM
2 Kudos
Hey Laia, you're close, but it looks like a couple of steps are out of order when you configure the indexer and the initial RandomForest object. label_idx is not visible to the RandomForest object because the order of execution is off, so the column is not yet in the DataFrame (hence "does not exist"). If you change the order it should work. I'd recommend decoupling the indexer from the rf object and executing both as stages of the pipeline. Here's the code that I got to work; I also added a few lines at the bottom to show the predictions and accuracy (feel free to modify to fit your requirements). Let me know if this helps.
// Imports (assuming a spark-shell session on Spark 1.x, where sc and sqlContext already exist)
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import sqlContext.implicits._

// Parse the CSV into a DataFrame of (label, features)
val unparseddata = sc.textFile("hdfs:///tmp/your_data.csv")
val data = unparseddata.map { line =>
  val parts = line.split(',').map(_.toDouble)
  LabeledPoint(parts.last % 2, Vectors.dense(parts.slice(0, parts.length - 1)))
}.toDF()

val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

val nFolds: Int = 10
val NumTrees: Int = 3

// Index the label column, then feed the indexed label to the random forest via the pipeline
val indexer = new StringIndexer().setInputCol("label").setOutputCol("label_idx")
val rf = new RandomForestClassifier().setNumTrees(NumTrees).setFeaturesCol("features").setLabelCol("label_idx")
val pipeline = new Pipeline().setStages(Array(indexer, rf))

val paramGrid = new ParamGridBuilder().build()
val evaluator = new MulticlassClassificationEvaluator().setLabelCol("label").setPredictionCol("prediction")
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(nFolds)

val model = cv.fit(trainingData)
val predictions = model.transform(testData)

// Show model predictions
predictions.show()

val accuracy = evaluator.evaluate(predictions)
println("Accuracy: " + accuracy)
println("Error Rate: " + (1.0 - accuracy))
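If it's helpful, you can also pull out the pipeline that cross-validation selected (a small follow-up sketch, assuming the code above has already run):
// Inspect the winning pipeline and the average cross-validation metric
import org.apache.spark.ml.PipelineModel
val bestPipeline = model.bestModel.asInstanceOf[PipelineModel]
println("Avg CV metric: " + model.avgMetrics.max)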
08-29-2016
06:35 PM
1 Kudo
I agree with Vasilis and Enis, both Scrapy and Nutch would be great projects to check out. Parsing the HTML (i.e., extracting prices, item names, dates, etc.) is one of the more challenging parts of web crawling. These projects have some built-in functionality for it, and you may also want to check out html.parser or lxml.
NOTE: If you want to ship external Python packages to all nodes in your cluster via PySpark, you will need to reference the .py or .zip file in your code, such as sc.addPyFile("xx.zip"). Once you have parsed the data, the records could be sent to HDFS, or stored in HBase, Hive, etc.
08-18-2016
03:29 PM
Thanks for the response Randy and Enis - very helpful! I was able to get this working and placed the github project here: https://github.com/zaratsian/SparkHBaseExample
08-08-2016
09:08 PM
4 Kudos
Hortonworks and SAS have partnered to create two new Apache NiFi processors. These processors allow data/events to be streamed between Hortonworks DataFlow (HDF) and SAS Event Stream Processing.
Why does this matter?
HDF, powered by Apache NiFi, Kafka, and Storm, is an integrated system for real-time dataflow management and streaming analytics, on-premises or in the cloud. SAS Event Stream Processing is a real-time, low-latency, high-throughput event processing solution that can deploy SAS machine learning models. By integrating these technologies, organizations now have the option of deploying their SAS models in real time within the Hortonworks platform. This offers flexible deployment options for your streaming analytics projects while providing powerful analytics from SAS.
How does this integration work?
There are two new processors that can be added to NiFi:
- ListenESP: This processor initiates a listener within NiFi that receives events from the SAS Event Stream Processing data stream.
- PutESP: This processor sends events from NiFi to the SAS Event Stream Processing data stream.
Setup and configuration:
- Download and install Hortonworks DataFlow.
- Copy the SAS .nar file to $NIFI_HOME/lib (this .nar file is provided by SAS when SAS Event Stream Processing is purchased).
- Edit $NIFI_HOME/conf/nifi.properties and change the web HTTP port to 31005 (nifi.web.http.port=31005) or another available port of your choice.
- Start NiFi by running $NIFI_HOME/bin/nifi.sh run
- Open a browser and go to http://$HOST:31005/nifi (the same steps are consolidated as shell commands below).
NOTE: For this to work, SAS Event Stream Processing must be purchased and have a valid license.
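For reference, here are the same setup steps as shell commands (a sketch; $NIFI_HOME and the .nar file name are placeholders for your actual install path and the file supplied by SAS):
# Copy the SAS-provided processor bundle into NiFi's lib directory (file name is a placeholder)
cp sas-esp-processors.nar $NIFI_HOME/lib/
# Set nifi.web.http.port=31005 (or another free port) in the NiFi config
vi $NIFI_HOME/conf/nifi.properties
# Start NiFi, then browse to http://$HOST:31005/nifi
$NIFI_HOME/bin/nifi.sh run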
Once the .nar file has been added, you will have access to the two processors within NiFi. Data events are shared using an Avro schema. Below is a basic example of a NiFi dataflow using both a ListenESP and a PutESP processor (shown in Figure 1).
Within the PutESP processor, you'll notice a few parameters (shown in Figure 2):
- Pub/Sub Host: Hostname or IP of the server running SAS Event Stream Processing.
- Pub/Sub Port: Pub/sub port of the SAS Event Stream Processing engine.
- Project: SAS Event Stream Processing project name.
- Continuous Query: Name of the continuous query within the SAS Event Stream Processing project.
- Source Window: Source window within SAS Event Stream Processing where events from NiFi can be injected.
The ListenESP processor has similar parameters (shown in Figure 3).
For more information, check out Hortonworks DataFlow (HDF) powered by Apache NiFi and SAS Event Stream Processing.