Member since: 09-24-2015
Posts: 98
Kudos Received: 76
Solutions: 18
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 1077 | 08-29-2016 04:42 PM |
|  | 1849 | 08-09-2016 08:43 PM |
|  | 520 | 07-19-2016 04:08 PM |
|  | 772 | 07-07-2016 04:05 PM |
|  | 2681 | 06-29-2016 08:25 PM |
04-25-2016
07:24 PM
Hello Aru: According to the Datasets API, you now have a GroupedDataset, not a Dataset. To query the original Dataset (dss), you can first register a temp table and then write a SQL SELECT statement to pull records out into a result, like this:
dss.registerTempTable("people");
// SQL can be run over RDDs that have been registered as tables.
DataFrame teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19");
Alternatively, you can use the built-in Dataset.select() method to pull the requested columns out, and then apply filter, groupBy, map, etc. You should be able to embed a condition (similar to a WHERE clause) within filter(), just as in the DataFrames API. Here is the signature of the Dataset filter() method:
public Dataset<T> filter(FilterFunction<T> func)
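For what it's worth, here is a minimal standalone Scala sketch of that second approach, using a hypothetical Person case class and made-up sample data (the Java form would use the FilterFunction signature quoted above):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
// Hypothetical schema, for illustration only
case class Person(name: String, age: Int)
val sc = new SparkContext(new SparkConf().setAppName("dataset-filter-sketch").setMaster("local[2]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
// Build a small sample Dataset; in your notebook you would already have dss
val dss = Seq(Person("Ann", 15), Person("Bob", 32), Person("Cam", 17)).toDS()
// The predicate inside filter() plays the role of a WHERE clause
val teenagers = dss.filter(p => p.age >= 13 && p.age <= 19).map(_.name)
teenagers.collect().foreach(println)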
... View more
04-18-2016
08:48 PM
Sorry to hear that. First, a few questions:
a) What did you change the "storm.version" to in your "pom.xml" file?
b) Did you get an error during "mvn clean package" (which compiles the target)?
c) Do you see the specified jar file ("Tutorial-1.0-SNAPSHOT.jar") in your target directory?
To troubleshoot: if you put the "pom.xml" file back to its original state, can you recompile and run successfully? If so, then there could be a problem with the "storm.version" string that you entered. Make sure it is a viable version; you can get the proper version from the command line using:
storm version
and look for this line in the output:
Storm 0.10.0.2.4.0.0-169
... View more
04-18-2016
08:27 PM
I am wondering if the zeppelin-server daemon is up and running; it may not be started by default. Assuming you can 'ssh' into the server where zeppelin is installed, try this command to see if it is running:
ps -ef |grep zeppelin-server
You should see one or more processes running, with at least one listed as interpreter.sh, which is the interpreter for the notebook. Also, I believe a notebook is not connected until you select it. To make sure the interpreter is ready to load the notebook, navigate to the Interpreter tab along the top, find the "spark" interpreter, and click its "restart" button. Wait about 5 seconds, then go back and reload the notebook to see if it is now connected. You can also try hitting the browser's "reload" button to kick-start the interpreter.
... View more
04-18-2016
05:33 PM
Thanks @Bernhard Walter - yes, that's exactly what I did.
... View more
04-18-2016
05:32 PM
You can add the following Scala code to any Spark-based Zeppelin notebook to calculate elapsed time (duration). Please note that this code uses (imports) the third-party Joda-Time library. You can place start_time and end_time at any two points between which you want to measure elapsed time.
import org.joda.time._
import org.joda.time.format._
import org.joda.time.format.DateTimeFormat
import org.joda.time.DateTime
import org.joda.time.Days
import org.joda.time.Duration
// this starts the clock
val start_time = DateTime.now()
def getElapsedSeconds(start: DateTime, end: DateTime): Int = {
val elapsed_time = new Duration(start.getMillis(), end.getMillis())
val elapsed_seconds = elapsed_time.toStandardSeconds().getSeconds()
(elapsed_seconds)
}
// this stops the clock
val end_time = DateTime.now()
val elapsed_secs = getElapsedSeconds(start_time, end_time)
// print out elapsed time
println(f"Elapsed time (seconds): ${elapsed_secs}%d")
Here is sample output from this code:
start_time: org.joda.time.DateTime = 2016-04-15T17:28:10.682Z
getElapsedSeconds: (start: org.joda.time.DateTime, end: org.joda.time.DateTime)Int
end_time: org.joda.time.DateTime = 2016-04-15T17:47:50.144Z
elapsed_secs: Int = 1179
Elapsed time (seconds): 1179
... View more
04-18-2016
05:25 PM
Frequently, when running a notebook in Zeppelin, I want to know how long the whole notebook took to run. This is important when tuning Spark, YARN, etc., to understand whether the tuning is successful. Zeppelin prints per-cell timings, but it does not provide the full elapsed time for the notebook. How can I add code to calculate this?
... View more
04-18-2016
05:06 PM
1 Kudo
Zeppelin comes with a long list of interpreters (including Spark/Scala, Python/PySpark, Hive, Cassandra, SparkSQL, Phoenix, Markdown, and Shell), which basically provide language bindings to run the code that you type into a notebook cell. Currently, the list of interpreters does not include Java, so you will need to compile your code and build a jar file first, which can then be submitted to Spark via spark-submit, as described here: http://spark.apache.org/docs/latest/submitting-applications.html
... View more
04-13-2016
07:22 PM
Yes, please include the error or stack trace. If I had to guess, I would say it's because you are not running the "ntpd" time daemon on your local Kafka node. I had trouble connecting to Twitter until I installed and enabled ntpd. Full instructions for installing and enabling it on CentOS are here: http://www.cyberciti.biz/faq/howto-install-ntp-to-synchronize-server-clock/
... View more
04-13-2016
07:07 PM
1 Kudo
Spark allocates memory based on option parameters, which can be passed in multiple ways:
1) via the command line (as you do)
2) via programmatic instructions
3) via the "spark-defaults.conf" file in the "conf" directory under your $SPARK_HOME
Second, there are separate config params for the driver and the executors. This is important because the main difference between "yarn-client" and "yarn-cluster" mode is where the driver lives (either on the client, or on the cluster within the ApplicationMaster). Therefore, we should look at your driver config parameters. It looks like these are your driver-related options from the command line:
--driver-memory 5000m
--driver-cores 2
--conf spark.yarn.driver.memoryOverhead=1024
--conf spark.driver.maxResultSize=5g
--driver-java-options "-XX:MaxPermSize=1000m"
It is possible that the ApplicationMaster is running on a node that does not have enough memory to support your option requests, e.g. the sum of driver memory (5G) and PermSize (1G), plus overhead (1G), does not fit on the node. I would try lowering --driver-memory in 1G steps until you no longer get the OOM error.
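As a rough sketch of option 2 (programmatic configuration), executor memory settings could be set on a SparkConf like this; the values are illustrative only, and driver memory itself generally still has to come from the command line or spark-defaults.conf, because the driver JVM is already running by the time this code executes:
import org.apache.spark.{SparkConf, SparkContext}
// Illustrative values; the master is assumed to be supplied via spark-submit --master
val conf = new SparkConf()
  .setAppName("memory-config-sketch")
  .set("spark.executor.memory", "4g")
  .set("spark.yarn.executor.memoryOverhead", "1024")
val sc = new SparkContext(conf)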
... View more
04-13-2016
06:58 PM
Another option you can try: click on the "Interpreters" link at the top of the Zeppelin page, find the "spark" interpreter, and click on the "restart" button on the right-hand side. Next, make sure that your notebook page shows "Connected" with a green dot, meaning it is talking successfully to the Spark driver.
... View more
03-31-2016
08:24 PM
1 Kudo
One caveat: in case you reboot (reset) your VM/Sandbox, you should enable the 'ntpd' daemon to start on bootup. I had trouble with GetTwitter as mentioned in the post above, even after following the steps to install and enable ntpd, because I had rebooted in the meantime, which turned it off. To enable it on system bootup, run this command:
chkconfig ntpd on
To verify that it took effect, run this command and make sure 'ntpd' is enabled in runlevels 2, 3, 4, and 5:
chkconfig --list | grep ntpd
... View more
03-30-2016
07:11 PM
Which version of the HDP Sandbox are you running (2.4 or 2.3.x)? There is a chance that Zeppelin can't connect to the Spark interpreter, so from the Interpreter page (shown), click on the [restart] button, wait about 5 seconds, then go back and refresh the notebook page. Also, from the notebook page in Zeppelin, click on the gear icon to make sure Spark is one of the listed interpreters.
... View more
03-28-2016
04:05 PM
1 Kudo
In order to run properly on a cluster (using one of the two described cluster modes), Spark needs to distribute any extra jars that are required at runtime. Normally, the Spark driver ships required jars to the nodes for use by the executors, but that doesn't happen by default for user-supplied or third-party jars (pulled in via import statements). Therefore, you have to set one or both of the following parameters, depending on whether the driver and/or the executors need those libs:
# Extra classpath jars
spark.driver.extraClassPath=/home/zeppelin/notebook/jars/guava-11.0.2.jar
spark.executor.extraClassPath=/home/zeppelin/notebook/jars/guava-11.0.2.jar
If you are not sure, set both. Finally, the actual jar files must be copied to the specified location. If that is on the local filesystem, you will have to copy the jar to each node's local filesystem. If you reference it from HDFS, then a single copy will suffice.
... View more
03-24-2016
08:26 PM
From the error message in the stack trace, it looks like you may have mistyped the spark-submit command line. The main class definition is provided by the --class <main-class> parameter, as shown in this syntax definition:
./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
If you have put the string "Dhdp.version=2.3.4.1-10" on the command line, it could lead to the error. The other possibility is that you have entered this string into the "spark-env.sh" file in the $SPARK_HOME/conf directory. Double-check this file and look for any parameter ending in "OPTS", such as "SPARK_DAEMON_JAVA_OPTS". That could be adding something wrong to the spark-submit argument list and lead to the error.
... View more
03-23-2016
05:54 PM
8 Kudos
Zeppelin stores all displayable information in a JSON-format file named "note.json" (by default), located under the notebook home directory, usually /user/zeppelin/notebook. This JSON file includes source code, markup, and output results. The easiest thing to do is:
1) ssh into the machine where the Zeppelin service is running
2) cd to the notebook directory (cd /user/zeppelin/notebook)
3) cd to the specific notebook sub-directory; each notebook is in a separate sub-directory (cd 2A94M5J1Z)
4) edit the note.json file and remove the unwanted results
If you use a good editor (like TextMate or vim) with a JSON plugin to format the contents, you can easily locate the results section and rip it out. Make sure you don't break the integrity of the JSON file itself; you just want to eliminate the inner JSON contents where the superfluous result is stored. Here is an example of a result field from note.json:
"result": {
"code": "SUCCESS",
"type": "HTML",
"msg": "\u003ch2\u003eWelcome to Zeppelin.\u003c/h2\u003e\n\u003ch5\u003eThis is a live tutorial, you can run the code yourself. (Shift-Enter to Run)\u003c/h5\u003e\n"
},
... View more
03-23-2016
12:14 PM
2 Kudos
I assume you are referring to using Spark's MLlib to train a machine learning model. If so, I'm betting people are saying that because you have to launch Spark where the client is installed, which is typically an edge node. The other reason is that if they are using Zeppelin to access Spark, the Zeppelin service and web client would likely be on the management node. However, when you run Spark in one of the cluster modes ("yarn-client" or "yarn-cluster"), the Spark job takes advantage of all the YARN nodes on the cluster. Tuning Spark properly to take advantage of these cluster resources can take some time, and many Spark jobs are not properly tuned. Hope that helps, and that I've understood the question.
... View more
03-22-2016
06:37 PM
Remember that the collector.emit() method makes the current tuple available downstream to other bolts consuming the stream. Therefore, generally speaking, emit() is the last thing your current bolt does (except maybe ack()). In the case you outlined, the current bolt is "feeding" (or routing to) multiple streams (boltReadingStreamId1, boltReadingStreamId2). So each of these downstream bolts will receive tuples in its own window, handed to it as the inputWindow parameter, and processed with:
public void execute(TupleWindow inputWindow) {
for(Tuple tuple: inputWindow.get()) {
// do the windowing computation ...
}
}
Once the downstream bolts receive their tuples in the window, they can access them via several methods, such as:
/*
* The inputWindow gives a view of
* (a) all the events in the window
* (b) events that expired since last activation of the window
* (c) events that newly arrived since last activation of the window
*/
List<Tuple> tuplesInWindow = inputWindow.get();
List<Tuple> newTuples = inputWindow.getNew();
List<Tuple> expiredTuples = inputWindow.getExpired();
... View more
03-17-2016
08:02 PM
Have you tried using the Spark syntax described here: http://phoenix.apache.org/phoenix_spark.html
import org.apache.spark.SparkContext
import org.apache.phoenix.spark._
val sc = new SparkContext("local", "phoenix-test")
val dataSet = List((1L, "1", 1), (2L, "2", 2), (3L, "3", 3))
sc
.parallelize(dataSet)
.saveToPhoenix(
"OUTPUT_TEST_TABLE",
Seq("ID","COL1","COL2"),
zkUrl = Some("phoenix-server:2181")
)
... View more
03-16-2016
09:03 PM
1 Kudo
I can only comment on question #1, but I would say you definitely want to run in one of the YARN modes (yarn-client or yarn-cluster), if only so Oozie can get information on container usage and job completion. You can do this by using the --master command-line argument:
spark-submit --master yarn-client --class com.mypackage.MyClass /user/hue/oozie/workspaces/_hue_-oozie-1-1457990185.41/my.jar
... View more
02-23-2016
03:19 PM
3 Kudos
You can either run Spark natively and declare a SparkR context, via sparkR.init(), or use RStudio for IDE access. Instructions for both are included here: https://spark.apache.org/docs/latest/sparkr.html
... View more
02-22-2016
10:49 PM
The UC Berkeley white paper on RDDs spells it out this way: "Although individual RDDs are immutable, it is possible to implement mutable state by having multiple RDDs to represent multiple versions of a dataset. We made RDDs immutable to make it easier to describe lineage graphs, but it would have been equivalent to have our abstraction be versioned datasets and track versions in lineage graphs." If you look at the output above, your collect() output is initially res6, and the second time it is res8. This, presumably, is a new version of the initial RDD and thus gets a different reference name.
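To illustrate the general idea, here is a rough standalone Scala sketch with made-up data (not the exact session from the question): each transformation produces a new RDD rather than mutating the original, which is how multiple "versions" arise.
import org.apache.spark.{SparkConf, SparkContext}
val sc = new SparkContext(new SparkConf().setAppName("rdd-versions-sketch").setMaster("local[2]"))
val v1 = sc.parallelize(Seq(1, 2, 3))   // original RDD
val v2 = v1.map(_ * 10)                 // a "new version": a separate RDD with its own lineage
// v1 is untouched; v2 is derived from it via the lineage graph
println(v1.collect().mkString(","))     // 1,2,3
println(v2.collect().mkString(","))     // 10,20,30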
... View more
02-17-2016
10:46 PM
2 Kudos
Spark has a PySpark API that acts as a wrapper around Spark's Scala-based libraries. It also provides a REPL interface for the Python interpreter. If you launch pyspark, you will be able to import whatever Python libraries you have installed locally, i.e. Python imports should work. Specifically (from the docs):
PySpark requires Python 2.6 or higher. PySpark applications are executed using a standard CPython interpreter in order to support Python modules that use C extensions. We have not tested PySpark with Python 3 or with alternative Python interpreters, such as PyPy or Jython. By default, PySpark requires python to be available on the system PATH and use it to run programs; an alternate Python executable may be specified by setting the PYSPARK_PYTHON environment variable in conf/spark-env.sh (or .cmd on Windows). All of PySpark's library dependencies, including Py4J, are bundled with PySpark and automatically imported. Standalone PySpark applications should be run using the bin/pyspark script, which automatically configures the Java and Python environment using the settings in conf/spark-env.sh or .cmd. The script automatically adds the bin/pyspark package to the PYTHONPATH.
... View more
02-17-2016
04:19 PM
1 Kudo
From the Storm TruckEvents tutorial, here is the HBase pom reference for including the jar files in the Storm deploy jar. Beware that this is from the HDP 2.2 example:
...
<hbase.version>0.98.0.2.1.1.0-385-hadoop2</hbase.version>
...
<!-- HBase Dependencies -->
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-client</artifactId>
<version>${hbase.version}</version>
<exclusions>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>log4j-over-slf4j</artifactId>
</exclusion>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
</exclusion>
</exclusions>
</dependency>
Therefore, you need to include both the HBase jars in the uber Storm jar and the config file, as shown in Ali's answer above.
... View more
02-07-2016
01:38 AM
2 Kudos
Leave storm/lib vanilla and package the higher versions of those libs with your topology jar, using the Maven Shade plugin to relocate the necessary packages and avoid conflicts. Here is an example from a Storm topology that relocates Guava (com.google.common):
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>${version.shade}</version>
<configuration>
<relocations>
<relocation>
<pattern>com.google.common</pattern>
<shadedPattern>com.cisco.com.google.common</shadedPattern>
</relocation>
</relocations>
</configuration>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
<artifactSet>
<excludes>
<exclude>org.datanucleus</exclude>
</excludes>
</artifactSet>
<promoteTransitiveDependencies>true</promoteTransitiveDependencies>
<transformers>
<transformer
implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"
/>
</transformers>
</configuration>
</execution>
</executions>
</plugin>
... View more
02-04-2016
08:03 PM
1 Kudo
This tutorial skipped one set of instructions for eliminating use of the YARN history server. These are the required steps: ensure "spark-defaults.conf" doesn't have any YARN history-service-related properties enabled. If this tech preview is installed on a node where Spark was already present, there may be Spark properties set related to YARN ATS. Make sure you have disabled the following properties in your "spark-defaults.conf" file by adding a '#' in front of each setting:
#spark.history.provider org.apache.spark.deploy.yarn.history.YarnHistoryProvider
#spark.history.ui.port 18080
#spark.yarn.historyServer.address sandbox.hortonworks.com:18080
#spark.yarn.services org.apache.spark.deploy.yarn.history.YarnHistoryService
... View more
01-27-2016
09:49 PM
3 Kudos
Workaround for Hive query OutOfMemory errors: please note that in some cases (such as when running the Hortonworks Sandbox on a Microsoft Azure VM with an 'A4' machine size), some of the Hive queries will produce OutOfMemory (Java heap) errors. As a workaround, you can adjust some Hive-Tez config parameters using the Ambari console. Go to the Services -> Hive page, click on the 'Configs' tab, and make the following changes:
1) Scroll down to the Optimization section and increase the Tez Container Size from 200 to 512:
Param: "hive.tez.container.size" Value: 512
2) Click on the "Advanced" tab to show extra settings, scroll down to find the parameter "hive.tez.java.opts", and increase the Java max heap size in Hive-Tez Java Opts from 200 MB to 512 MB:
Param: "hive.tez.java.opts" Value: "-server -Xmx512m -Djava.net.preferIPv4Stack=true"
... View more
01-27-2016
09:47 PM
1 Kudo
Some users are getting OutOfMemory errors when running the "Getting Started with HDP" tutorial on the Hortonworks website: http://hortonworks.com/hadoop-tutorial/hello-world-an-introduction-to-hadoop-hcatalog-hive-and-pig/#section_1
What is the suggested workaround, especially when running in a limited-memory environment like the Sandbox?
... View more
01-21-2016
10:52 PM
1 Kudo
This is not a complete answer, but I would like to add that, by default, Kafka brokers write to local storage (not HDFS) and therefore benefit from fast local disks (SSD) and/or multiple spindles to parallelize writes to partitions. I don't know of a formula to calculate this, but try to maximize I/O throughput to disk, and allocate up to as many spindles as there are available CPUs per node. Many Hadoop architectures don't really specify an allocation for local storage (beyond the OS disk), so it is something to be aware of.
... View more
01-07-2016
04:10 PM
Actually, many BI vendors, including Tableau, have announced a Spark connector over JDBC, which should presumably be able to leverage data loaded into RDDs in memory. If you load data via Spark Streaming into an RDD and then either schematize it (rdd.registerTempTable) or convert it to a DataFrame (rdd.toDF), you should be able to query that data over a JDBC connection and display it in a dashboard; a rough sketch of the schematize step is below. Here is info on the Tableau connector, including a video at the bottom of the page: https://www.google.com/url?sa=t&rct=j&q=&esrc=s&so...
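Here is a minimal standalone Scala sketch of that schematize step, with made-up data and names; the same toDF/registerTempTable pattern applies to RDDs produced inside a Spark Streaming foreachRDD block:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
// Hypothetical record type, for illustration only
case class Event(id: Long, value: String)
val sc = new SparkContext(new SparkConf().setAppName("jdbc-sketch").setMaster("local[2]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
// Convert an RDD to a DataFrame and register it as a queryable table
val df = sc.parallelize(Seq(Event(1L, "a"), Event(2L, "b"))).toDF()
df.registerTempTable("events")
// A JDBC client (e.g. via the Spark SQL Thrift server) could then issue SQL like this
sqlContext.sql("SELECT id, value FROM events").show()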
... View more
01-05-2016
07:21 PM
1 Kudo
The error message seems to indicate that you cannot simply enable CBO for your session or connection (i.e. with a SET statement). Rather, it should be enabled "cluster-wide" for the HiveServer2 instance, not per user session. This could be due to how the Calcite optimizer is initialized or executed.
... View more