Member since: 09-24-2015
Posts: 98
Kudos Received: 76
Solutions: 18
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 2855 | 08-29-2016 04:42 PM |
| | 5696 | 08-09-2016 08:43 PM |
| | 1743 | 07-19-2016 04:08 PM |
| | 2467 | 07-07-2016 04:05 PM |
| | 7414 | 06-29-2016 08:25 PM |
03-17-2016
08:02 PM
Have you tried using the Spark syntax described here: http://phoenix.apache.org/phoenix_spark.html

import org.apache.spark.SparkContext
import org.apache.phoenix.spark._

val sc = new SparkContext("local", "phoenix-test")
val dataSet = List((1L, "1", 1), (2L, "2", 2), (3L, "3", 3))

sc
  .parallelize(dataSet)
  .saveToPhoenix(
    "OUTPUT_TEST_TABLE",
    Seq("ID", "COL1", "COL2"),
    zkUrl = Some("phoenix-server:2181")
  )
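If I remember the docs correctly, the same page also shows reading a table back out of Phoenix. Here is a rough sketch reusing the sc above (the table and column names are illustrative, not specific to your setup):

import org.apache.phoenix.spark._

// Load the table as an RDD of column-name -> value maps
val rdd = sc.phoenixTableAsRDD(
  "OUTPUT_TEST_TABLE",
  Seq("ID", "COL1", "COL2"),
  zkUrl = Some("phoenix-server:2181")
)
println(rdd.count())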
02-23-2016
03:19 PM
3 Kudos
You can either run Spark natively and create a SparkR context via sparkR.init(), or use RStudio for IDE access. Instructions for both are included here: https://spark.apache.org/docs/latest/sparkr.html
02-17-2016
10:46 PM
2 Kudos
Spark provides PySpark, a Python API that wraps Spark's Scala-based libraries, and it also provides a REPL interface for the Python interpreter. If you launch pyspark, you will be able to import whatever Python libraries you have installed locally, i.e. normal Python imports should work. Specifically (from the docs):

PySpark requires Python 2.6 or higher. PySpark applications are executed using a standard CPython interpreter in order to support Python modules that use C extensions. We have not tested PySpark with Python 3 or with alternative Python interpreters, such as PyPy or Jython.

By default, PySpark requires python to be available on the system PATH and uses it to run programs; an alternate Python executable may be specified by setting the PYSPARK_PYTHON environment variable in conf/spark-env.sh (or .cmd on Windows). All of PySpark's library dependencies, including Py4J, are bundled with PySpark and automatically imported.

Standalone PySpark applications should be run using the bin/pyspark script, which automatically configures the Java and Python environment using the settings in conf/spark-env.sh or .cmd. The script automatically adds the bin/pyspark package to the PYTHONPATH.
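For example, to point PySpark at a specific interpreter, you could add a line like the following to conf/spark-env.sh (the interpreter path here is just an illustration, not a required location):

# Use this Python executable instead of whatever "python" resolves to on the PATH
export PYSPARK_PYTHON=/usr/local/bin/python2.7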
02-17-2016
04:19 PM
1 Kudo
From the Storm TruckEvents tutorial, here is the HBase pom reference for including the HBase jar files in the Storm deploy jar. Note that this example is for HDP 2.2:
...
<hbase.version>0.98.0.2.1.1.0-385-hadoop2</hbase.version>
...
<!-- HBase Dependencies -->
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-client</artifactId>
<version>${hbase.version}</version>
<exclusions>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>log4j-over-slf4j</artifactId>
</exclusion>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
</exclusion>
</exclusions>
</dependency>
Therefore, you need to include both the HBase jars in the uber Storm jar and the config file, as shown in Ali's answer above.
02-07-2016
01:38 AM
2 Kudos
Leave storm/lib vanilla and package the higher versions of those libs with your topology jar, using the Maven Shade plugin to relocate the necessary packages and avoid conflicts. Here is an example from a Storm topology that relocates Guava (com.google.common):

<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>${version.shade}</version>
<configuration>
<relocations>
<relocation>
<pattern>com.google.common</pattern>
<shadedPattern>com.cisco.com.google.common</shadedPattern>
</relocation>
</relocations>
</configuration>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
<artifactSet>
<excludes>
<exclude>org.datanucleus</exclude>
</excludes>
</artifactSet>
<promoteTransitiveDependencies>true</promoteTransitiveDependencies>
<transformers>
<transformer
implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"
/>
</transformers>
</configuration>
</execution>
</executions>
</plugin>
02-04-2016
08:03 PM
1 Kudo
This tutorial skipped one set of instructions needed to eliminate use of the YARN history server. These are the required steps: Ensure “spark-defaults.conf” doesn’t have any YARN history service related properties enabled. If this tech preview is installed on a node where Spark was already present, there may be Spark properties set that relate to YARN ATS. Make sure you have disabled the following properties in your “spark-defaults.conf” file by adding a '#' in front of each setting:
#spark.history.provider org.apache.spark.deploy.yarn.history.YarnHistoryProvider
#spark.history.ui.port 18080
#spark.yarn.historyServer.address sandbox.hortonworks.com:18080
#spark.yarn.services org.apache.spark.deploy.yarn.history.YarnHistoryService
01-27-2016
09:49 PM
3 Kudos
Workaround for Hive query OutOfMemory errors: Please note that in some cases (such as when running the Hortonworks Sandbox on a Microsoft Azure 'A4' VM), some of the Hive queries will produce OutOfMemory (Java heap) errors. As a workaround, you can adjust some Hive-Tez config parameters using the Ambari console. Go to the Services -> Hive page, click on the 'Configs' tab, and make the following changes:
1) Scroll down to the Optimization section and increase the Tez Container Size from 200 to 512.
Param: “hive.tez.container.size” Value: 512
2) Click on the “Advanced” tab to show extra settings, scroll down to the parameter “hive.tez.java.opts”, and increase the Java heap max size from 200 MB to 512 MB.
Param: “hive.tez.java.opts” Value: “-server -Xmx512m -Djava.net.preferIPv4Stack=true”
01-27-2016
09:47 PM
1 Kudo
Some users are getting OutOfMemory errors when running the "Getting Started with HDP" tutorial on the Hortonworks website: http://hortonworks.com/hadoop-tutorial/hello-world-an-introduction-to-hadoop-hcatalog-hive-and-pig/#section_1 What is the suggested workaround, especially when running in a limited-memory environment like the Sandbox?
Labels:
- Apache Hive
01-21-2016
10:52 PM
1 Kudo
This is not a complete answer, but I would like to add that, by default, Kafka brokers write to local storage (not HDFS) and therefore benefit from fast local disk (SSD) and/or multiple spindles to parallelize writes to partitions. I don't know of a formula to calculate this, but try to maximize I/O throughput to disk, and allocate a number of spindles up to the number of available CPUs per node. Many Hadoop architectures don't really specify an allocation for local storage (beyond the OS disk), so it is something to be aware of.
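For example, a broker's server.properties can point log.dirs at one directory per physical disk so partition writes are spread across spindles (the mount points below are just placeholders):

# One log directory per physical disk; Kafka spreads partitions across them
log.dirs=/disk1/kafka-logs,/disk2/kafka-logs,/disk3/kafka-logs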
01-07-2016
04:10 PM
Actually, many BI vendors, including Tableau, have announced a Spark connector over JDBC, which should presumably be able to leverage data loaded into RDDs in memory. If you load data into an RDD via Spark Streaming and then either register it as a table (rdd.registerTempTable) or convert it to a DataFrame (rdd.toDF), you should be able to query that data over a JDBC connection and display it in a dashboard. Here is info on the Tableau connector, including a video at the bottom of the page: https://www.google.com/url?sa=t&rct=j&q=&esrc=s&so...
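As a rough sketch of the Spark side (Spark 1.x APIs; the DStream, case class, and table name below are illustrative, not from Tableau's docs), each micro-batch can be registered as a queryable temp table:

import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.dstream.DStream

// Illustrative record type; replace with your own schema
case class Event(id: Long, value: String)

def publish(events: DStream[Event]): Unit = {
  events.foreachRDD { rdd =>
    // Reuse a single SQLContext per SparkContext
    val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
    import sqlContext.implicits._
    // Convert the micro-batch to a DataFrame and expose it as a temp table
    rdd.toDF().registerTempTable("events")
  }
}

To make that temp table reachable from Tableau over JDBC, the same application would also need to expose it through the Spark Thrift server (e.g. HiveThriftServer2.startWithContext in Spark 1.x), since temp tables only live inside the context that created them.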