Member since: 09-24-2015
Posts: 98
Kudos Received: 76
Solutions: 18
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 2855 | 08-29-2016 04:42 PM |
| | 5696 | 08-09-2016 08:43 PM |
| | 1743 | 07-19-2016 04:08 PM |
| | 2467 | 07-07-2016 04:05 PM |
| | 7414 | 06-29-2016 08:25 PM |
03-17-2016
08:02 PM
Have you tried using the Spark syntax described here: http://phoenix.apache.org/phoenix_spark.html

import org.apache.spark.SparkContext
import org.apache.phoenix.spark._

val sc = new SparkContext("local", "phoenix-test")
val dataSet = List((1L, "1", 1), (2L, "2", 2), (3L, "3", 3))

sc
  .parallelize(dataSet)
  .saveToPhoenix(
    "OUTPUT_TEST_TABLE",
    Seq("ID", "COL1", "COL2"),
    zkUrl = Some("phoenix-server:2181")
  )
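If I remember the docs correctly, the same page also shows reading a table back out of Phoenix. Here is a rough sketch reusing the sc above (the table and column names are illustrative, not specific to your setup):

import org.apache.phoenix.spark._

// Load the table as an RDD of column-name -> value maps
val rdd = sc.phoenixTableAsRDD(
  "OUTPUT_TEST_TABLE",
  Seq("ID", "COL1", "COL2"),
  zkUrl = Some("phoenix-server:2181")
)
println(rdd.count())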
02-23-2016
03:19 PM
3 Kudos
You can either run Spark natively and create a SparkR context via sparkR.init(), or use RStudio for IDE access. Instructions for both are included here: https://spark.apache.org/docs/latest/sparkr.html
02-17-2016
10:46 PM
2 Kudos
Spark provides PySpark, a Python API that wraps Spark's Scala-based libraries, and it also provides a REPL interface for the Python interpreter. If you launch pyspark, you will be able to import whatever Python libraries you have installed locally, i.e. normal Python imports should work. Specifically (from the docs):

PySpark requires Python 2.6 or higher. PySpark applications are executed using a standard CPython interpreter in order to support Python modules that use C extensions. We have not tested PySpark with Python 3 or with alternative Python interpreters, such as PyPy or Jython.

By default, PySpark requires python to be available on the system PATH and uses it to run programs; an alternate Python executable may be specified by setting the PYSPARK_PYTHON environment variable in conf/spark-env.sh (or .cmd on Windows). All of PySpark's library dependencies, including Py4J, are bundled with PySpark and automatically imported.

Standalone PySpark applications should be run using the bin/pyspark script, which automatically configures the Java and Python environment using the settings in conf/spark-env.sh or .cmd. The script automatically adds the bin/pyspark package to the PYTHONPATH.
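For example, to point PySpark at a specific interpreter, you could add a line like the following to conf/spark-env.sh (the interpreter path here is just an illustration, not a required location):

# Use this Python executable instead of whatever "python" resolves to on the PATH
export PYSPARK_PYTHON=/usr/local/bin/python2.7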
02-17-2016
04:19 PM
1 Kudo
From the Storm TruckEvents tutorial, here is the HBase pom reference for including the HBase jar files in the Storm deploy jar. Note that this example is for HDP 2.2:
...
<hbase.version>0.98.0.2.1.1.0-385-hadoop2</hbase.version>
...
<!-- HBase Dependencies -->
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-client</artifactId>
<version>${hbase.version}</version>
<exclusions>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>log4j-over-slf4j</artifactId>
</exclusion>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
</exclusion>
</exclusions>
</dependency>
Therefore, you need to include both the HBase jars in the uber Storm jar and the config file, as shown in Ali's answer above.
02-07-2016
01:38 AM
2 Kudos
Leave storm/lib vanilla and package the higher versions of those libs with your topology jar, using the Maven Shade plugin to relocate the necessary packages and avoid conflicts. Here is an example from a Storm topology that relocates Guava (com.google.common):

<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>${version.shade}</version>
<configuration>
<relocations>
<relocation>
<pattern>com.google.common</pattern>
<shadedPattern>com.cisco.com.google.common</shadedPattern>
</relocation>
</relocations>
</configuration>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
<artifactSet>
<excludes>
<exclude>org.datanucleus</exclude>
</excludes>
</artifactSet>
<promoteTransitiveDependencies>true</promoteTransitiveDependencies>
<transformers>
<transformer
implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"
/>
</transformers>
</configuration>
</execution>
</executions>
</plugin>
02-04-2016
08:03 PM
1 Kudo
This tutorial skipped one set of instructions needed to eliminate use of the YARN history server. These are the required steps: Ensure “spark-defaults.conf” doesn’t have any YARN history service related properties enabled. If this tech preview is installed on a node where Spark was already present, there may be Spark properties set that relate to YARN ATS. Make sure you have disabled the following properties in your “spark-defaults.conf” file by adding a '#' in front of each setting:
#spark.history.provider org.apache.spark.deploy.yarn.history.YarnHistoryProvider
#spark.history.ui.port 18080
#spark.yarn.historyServer.address sandbox.hortonworks.com:18080
#spark.yarn.services org.apache.spark.deploy.yarn.history.YarnHistoryService
01-27-2016
09:49 PM
3 Kudos
Workaround for Hive query OutOfMemory errors: Please note that in some cases (such as when running the Hortonworks Sandbox on a Microsoft Azure 'A4' VM), some of the Hive queries will produce OutOfMemory (Java heap) errors. As a workaround, you can adjust some Hive-Tez config parameters using the Ambari console. Go to the Services -> Hive page, click on the 'Configs' tab, and make the following changes:
1) Scroll down to the Optimization section and increase the Tez Container Size from 200 to 512.
Param: “hive.tez.container.size” Value: 512
2) Click on the “Advanced” tab to show extra settings, scroll down to the parameter “hive.tez.java.opts”, and increase the Java heap max size from 200 MB to 512 MB.
Param: “hive.tez.java.opts” Value: “-server -Xmx512m -Djava.net.preferIPv4Stack=true”
01-27-2016
09:47 PM
1 Kudo
Some users are getting OutOfMemory errors when running the "Getting Started with HDP" tutorial on the Hortonworks website: http://hortonworks.com/hadoop-tutorial/hello-world-an-introduction-to-hadoop-hcatalog-hive-and-pig/#section_1 What is the suggested workaround, especially when running in a limited-memory environment like the Sandbox?
Labels:
- Apache Hive
01-21-2016
10:52 PM
1 Kudo
This is not a complete answer, but I would like to add that, by default, Kafka brokers write to local storage (not HDFS) and therefore benefit from fast local disk (SSD) and/or multiple spindles to parallelize writes to partitions. I don't know of a formula to calculate this, but try to maximize I/O throughput to disk, and allocate a number of spindles up to the number of available CPUs per node. Many Hadoop architectures don't really specify an allocation for local storage (beyond the OS disk), so it is something to be aware of.
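For example, a broker's server.properties can point log.dirs at one directory per physical disk so partition writes are spread across spindles (the mount points below are just placeholders):

# One log directory per physical disk; Kafka spreads partitions across them
log.dirs=/disk1/kafka-logs,/disk2/kafka-logs,/disk3/kafka-logs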
01-07-2016
04:10 PM
Actually, many BI vendors, including Tableau, have announced a Spark connector over JDBC, which should presumably be able to leverage data loaded into RDDs in memory. If you load data into an RDD via Spark Streaming and then either register it as a table (rdd.registerTempTable) or convert it to a DataFrame (rdd.toDF), you should be able to query that data over a JDBC connection and display it in a dashboard. Here is info on the Tableau connector, including a video at the bottom of the page: https://www.google.com/url?sa=t&rct=j&q=&esrc=s&so...
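As a rough sketch of the Spark side (Spark 1.x APIs; the DStream, case class, and table name below are illustrative, not from Tableau's docs), each micro-batch can be registered as a queryable temp table:

import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.dstream.DStream

// Illustrative record type; replace with your own schema
case class Event(id: Long, value: String)

def publish(events: DStream[Event]): Unit = {
  events.foreachRDD { rdd =>
    // Reuse a single SQLContext per SparkContext
    val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
    import sqlContext.implicits._
    // Convert the micro-batch to a DataFrame and expose it as a temp table
    rdd.toDF().registerTempTable("events")
  }
}

To make that temp table reachable from Tableau over JDBC, the same application would also need to expose it through the Spark Thrift server (e.g. HiveThriftServer2.startWithContext in Spark 1.x), since temp tables only live inside the context that created them.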