Member since 09-24-2015 | 98 Posts | 76 Kudos Received | 18 Solutions
My Accepted Solutions
Views | Posted |
---|---|
2875 | 08-29-2016 04:42 PM |
5738 | 08-09-2016 08:43 PM |
1756 | 07-19-2016 04:08 PM |
2498 | 07-07-2016 04:05 PM |
7467 | 06-29-2016 08:25 PM |
06-27-2016
04:51 PM
@alain TSAFACK
I think you need the --files option to pass the Python script to all executor instances. For example:
./bin/spark-submit --class my.main.Class \
  --master yarn-cluster \
  --jars my-other-jar.jar,my-other-other-jar.jar \
  --files return.py \
  my-main-jar.jar \
  app_arg1 app_arg2
06-24-2016
09:28 PM
I was able to run your example on the Hortonworks 2.4 Sandbox (slightly newer than your 2.3.2). However, you drastically increased the memory requirements between your two examples. In "yarn-client" mode you allocate only 512m each to the driver and executor, but in the second example you allocate 4g and 2g, and by requesting 3 executors you will need over 10 GB of RAM. Here is the command I actually ran to replicate the "cluster" deploy mode:
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --num-executors 1 --driver-memory 1024m --executor-memory 1024m --executor-cores 1 lib/spark-examples*.jar 10
... and here is the result in the Yarn application logs:
Log Type: stdout
Log Upload Time: Fri Jun 24 21:19:42 +0000 2016
Log Length: 23
Pi is roughly 3.142752
Therefore, it is possible your job never actually started running because it requested more resources than were available. Please check in the ResourceManager UI that it is not stuck in the 'ACCEPTED' state.
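As a rough check of the "over 10 GB" figure, the sketch below adds up the containers YARN would have to grant. It assumes the 4g is the driver and the 2g is each executor, and that every container carries the Spark 1.x default overhead of max(384 MB, 10% of the heap). Both are assumptions rather than something stated above, and YARN may round each container up further to its minimum allocation size. The function name container_mb is a hypothetical helper.

```shell
# Sketch: estimate the total memory a Spark-on-YARN job asks for.
# Assumption: per-container overhead = max(384 MB, 10% of heap), the
# Spark 1.x default for spark.yarn.{driver,executor}.memoryOverhead.

# Print the container size in MB for a given heap size in MB.
container_mb() {
  heap_mb=$1
  overhead_mb=$(( heap_mb / 10 ))
  if [ "$overhead_mb" -lt 384 ]; then
    overhead_mb=384
  fi
  echo $(( heap_mb + overhead_mb ))
}

num_executors=3
driver_mb=$(container_mb 4096)     # assumed: --driver-memory 4g
executor_mb=$(container_mb 2048)   # assumed: --executor-memory 2g

total_mb=$(( driver_mb + num_executors * executor_mb ))
echo "Estimated YARN memory request: ${total_mb} MB"
```

If the estimate exceeds what the NodeManagers can offer, the application is likely to wait in the queue rather than run.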
06-23-2016
06:31 PM
Agreed, you should at least upgrade the lower HDP version (...2.3.0...) to the newer HDP version (2.3.4.0-3485). It is best to get the default Spark version from the HDP install. Please see Table 1.1 at this link which describes the version associations for HDP, Ambari, and Spark: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_spark-guide/content/ch_introduction-spark.html
06-16-2016
06:48 PM
3 Kudos
Spark includes some Jackson libraries as its own dependencies, including this one: <fasterxml.jackson.version>2.6.5</fasterxml.jackson.version> Therefore, if your additional third-party library bundles a different version of the same library, the two versions will conflict on the classpath and you will get class-loading errors. You can use the Maven Shade plugin to "relocate" the classes in the third-party jar, as described here: https://maven.apache.org/plugins/maven-shade-plugin/examples/class-relocation.html Here is an example of relocating the "com.fasterxml.jackson" library: http://stackoverflow.com/questions/34764732/relocating-fastxml-jackson-classes-to-my-package-fastxml-jackson
06-06-2016
07:15 PM
@Timothy Spann Be aware that Henning's post, while architecturally sound, relies on the "Hive Streaming API", which implies reliance on Hive transaction support. Current advice is not to rely on transactions, at least until the Hive LLAP Tech Preview comes out at the end of June 2016.
05-27-2016
04:04 PM
@Sean Glover The Apache Spark download lets you build Spark in multiple ways, using build flags to include or exclude components: http://spark.apache.org/docs/latest/building-spark.html Without Hive, you can still create a SQLContext, but it will be native to Spark and will not be a HiveContext. Without a HiveContext, you cannot reference the Hive Metastore, use Hive UDFs, etc. Other tools, like the Zeppelin data science notebook, also default to creating a HiveContext (configurable), so they will need the Hive dependencies.
05-25-2016
01:45 PM
1 Kudo
Actually, if you don't specify local mode (--master "local") then you will be running in Standalone mode described here:
Standalone mode: By default, applications submitted to the standalone mode cluster will run in FIFO (first-in-first-out) order, and each application will try to use all available nodes. You can limit the number of nodes an application uses by setting the spark.cores.max configuration property in it, or change the default for applications that don’t set this setting through spark.deploy.defaultCores. Finally, in addition to controlling cores, each application’s spark.executor.memory setting controls its memory use.
Also, I think you have the port wrong for the Monitor web interface. Try using port 4040 instead of 8080, like this: http://<driver-node>:4040
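The two per-application properties named in the quoted passage can be passed at submit time with --conf. The sketch below only prints the command it would run, so it is safe to try without a cluster. The master URL, the core and memory values, and the helper name build_submit_command are all hypothetical; the SparkPi class and examples jar are borrowed from the stock Spark examples.

```shell
# Sketch: print (do not run) a spark-submit command that caps an
# application's cores and sets its executor memory on a standalone cluster.
# Arguments: <master-url> <max-cores> <executor-memory>
build_submit_command() {
  # spark.cores.max limits the total cores this application may take;
  # spark.executor.memory sets the memory used by each executor.
  printf '%s\n' "./bin/spark-submit --master $1 --conf spark.cores.max=$2 --conf spark.executor.memory=$3 --class org.apache.spark.examples.SparkPi lib/spark-examples*.jar 10"
}

# Hypothetical values; replace the master URL with your own.
build_submit_command "spark://master-host:7077" 4 1g
```

Copy the printed line into a shell on a node with Spark installed to actually submit the job.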
05-24-2016
04:43 PM
If you are running with --master yarn (previously, the master was set to "yarn-client" or "yarn-cluster"), then you can discover the state of the Spark job by bringing up the YARN ResourceManager UI. In Ambari, select the YARN service from the left-hand panel, choose "Quick Links", and click on "ResourceManager UI". It will open a web page on port 8088. Click on 'Applications' in the left panel to see applications in all states.
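The same information is also available from the command line. The sketch below wraps the stock yarn application command that ships with Hadoop 2.x; the two function names are hypothetical helpers, not part of YARN, and the snippet assumes the yarn CLI is on the PATH of a cluster node.

```shell
# Sketch: query application state with the `yarn application` CLI.

# List applications in the given comma-separated states
# (defaults to ACCEPTED,RUNNING when no argument is passed).
list_yarn_apps() {
  if ! command -v yarn >/dev/null 2>&1; then
    echo "yarn CLI not found on PATH" >&2
    return 1
  fi
  yarn application -list -appStates "${1:-ACCEPTED,RUNNING}"
}

# Print the status report for a single application ID.
yarn_app_status() {
  if [ -z "$1" ]; then
    echo "usage: yarn_app_status <application-id>" >&2
    return 2
  fi
  yarn application -status "$1"
}
```

For example, list_yarn_apps ACCEPTED shows jobs that are still waiting for resources, and yarn_app_status takes an application ID copied from that listing.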
05-23-2016
06:27 PM
FYI: Here is the quickest way to discover if you have access to your Hive "default" database tables:
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val tables = sqlContext.sql("show tables")
tables.show()
tables: org.apache.spark.sql.DataFrame = [tableName: string, isTemporary: boolean]
+---------+-----------+
|tableName|isTemporary|
+---------+-----------+
|sample_07|      false|
|sample_08|      false|
+---------+-----------+
05-23-2016
06:20 PM
2 Kudos
The Spark History Server UI has a link at the bottom called "Show Incomplete Applications". Click on this link and it will show you the running jobs, such as Zeppelin.