Member since
09-24-2015
98
Posts
76
Kudos Received
18
Solutions
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2877 | 08-29-2016 04:42 PM
 | 5738 | 08-09-2016 08:43 PM
 | 1756 | 07-19-2016 04:08 PM
 | 2500 | 07-07-2016 04:05 PM
 | 7467 | 06-29-2016 08:25 PM
05-04-2016
07:07 PM
@Henry : I think that equation uses the executor memory (in your case, 15G) and outputs the overhead value.
// From Spark's YARN client code: the input is executorMemory, and the output is memoryOverhead
math.max((MEMORY_OVERHEAD_FACTOR * executorMemory).toInt, MEMORY_OVERHEAD_MIN)
The goal is to calculate the overhead as a percentage of the real executor memory, as used by RDDs and DataFrames. I will add that when using Spark on YARN, the YARN configuration settings have to be adjusted and tweaked to match up carefully with the Spark properties (as the referenced blog suggests). You might also want to look at tiered storage to offload RDDs into MEMORY_AND_DISK, etc.
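To make that concrete, here is a small self-contained sketch of the formula, assuming the constants from Spark's YARN code at the time (a 0.10 factor and a 384 MB minimum; the exact values can differ between Spark versions):

```scala
// Sketch of Spark's YARN memory-overhead formula.
// MEMORY_OVERHEAD_FACTOR and MEMORY_OVERHEAD_MIN are assumed values
// (0.10 and 384 MB); check your Spark version's source for the real ones.
object OverheadCalc {
  val MEMORY_OVERHEAD_FACTOR = 0.10
  val MEMORY_OVERHEAD_MIN = 384 // MB

  // executorMemory is in MB, matching spark.executor.memory
  def memoryOverhead(executorMemory: Int): Int =
    math.max((MEMORY_OVERHEAD_FACTOR * executorMemory).toInt, MEMORY_OVERHEAD_MIN)

  def main(args: Array[String]): Unit = {
    // 15g of executor memory, as in the question
    println(OverheadCalc.memoryOverhead(15 * 1024))
  }
}
```

So with a 15G executor you get about 1.5G of overhead, and small executors are floored at the 384 MB minimum.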
04-27-2016
03:47 PM
3 Kudos
First, I'm assuming you are essentially following the MLlib examples here:
https://spark.apache.org/docs/latest/mllib-linear-methods.html
StepSize is one of the hyper-parameters: inputs that are selected by the data scientist rather than learned from the dataset. To find the best hyper-parameter values, you can perform a grid search. For instance, you can use a list of stepSize values, stepSizeList = {0.1, 0.001, 0.000001}, and cycle through each one to see which yields the best model. Here is an article describing hyper-parameter tuning and grid search:
http://blog.dato.com/how-to-evaluate-machine-learning-models-part-4-hyperparameter-tuning
Quote: "For regularization parameters, it’s common to use exponential scale: 1e-5, 1e-4, 1e-3, … 1. Some guess work is necessary to specify the minimum and maximum values."
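As a minimal sketch of that loop, with trainAndEvaluate standing in for your real MLlib training-and-validation step (the toy error function below is purely illustrative, not a real model):

```scala
// Grid search over stepSize: evaluate a model per candidate value and
// keep the one with the lowest validation error.
object GridSearch {
  // Stand-in for training (e.g. LinearRegressionWithSGD) and measuring
  // validation error; replace with your actual pipeline.
  def trainAndEvaluate(stepSize: Double): Double =
    math.abs(math.log10(stepSize) + 3.0) // toy error, lowest near 1e-3

  def bestStepSize(candidates: Seq[Double]): Double =
    candidates.minBy(trainAndEvaluate)

  def main(args: Array[String]): Unit = {
    val stepSizeList = Seq(0.1, 0.001, 0.000001)
    println(s"best stepSize = ${bestStepSize(stepSizeList)}")
  }
}
```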
04-26-2016
07:02 PM
The above 2 answers are very good. One caveat: keep in mind that when reading compressed file formats from disk, Spark partitioning depends on whether the format is splittable. For instance, bzip2 and LZO (if indexed) are splittable, gzip is not, and snappy is generally splittable only inside a container format such as SequenceFile, ORC, or Parquet. Here is documentation about why: http://comphadoop.weebly.com/
04-18-2016
05:06 PM
1 Kudo
Zeppelin comes with a long list of interpreters (including Spark/Scala, Python/PySpark, Hive, Cassandra, SparkSQL, Phoenix, Markdown and Shell), which basically provide language bindings to run the code that you type into a notebook cell. Currently, the list of interpreters does not include Java, so you will need to compile your code first and build a jar file, which can then be submitted to Spark via spark-submit, as described here: http://spark.apache.org/docs/latest/submitting-applications.html
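For reference, a spark-submit invocation for such a jar might look like the following (the jar name and main class here are hypothetical placeholders; substitute your own, and adjust master/deploy-mode to your setup):

```shell
# Hypothetical jar and class names; needs a Spark installation to run.
spark-submit \
  --class com.example.MyApp \
  --master yarn \
  --deploy-mode client \
  my-app.jar arg1 arg2
```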
04-13-2016
07:07 PM
1 Kudo
Spark allocates memory based on option parameters, which can be passed in multiple ways: 1) via the command line (as you do); 2) via programmatic instructions; 3) via the "spark-defaults.conf" file in the "conf" directory under your $SPARK_HOME. Second, there are separate config params for the driver and the executors. This is important, because the main difference between "yarn-client" and "yarn-cluster" mode is where the driver lives (either on the client, or on the cluster within the ApplicationMaster). Therefore, we should look at your driver config parameters. It looks like these are your driver-related options from the command line:
--driver-memory 5000m
--driver-cores 2
--conf spark.yarn.driver.memoryOverhead=1024
--conf spark.driver.maxResultSize=5g
--driver-java-options "-XX:MaxPermSize=1000m"
It is possible that the ApplicationMaster is running on a node that does not have enough memory to support your option requests, i.e. that the sum of driver memory (5G), PermSize (1G), and overhead (1G) does not fit on the node. I would try lowering --driver-memory in 1G steps until you no longer get the OOM error.
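Adding those options up gives a rough footprint for the driver (note that spark.driver.maxResultSize is a cap on collected results, not a separate allocation, so it is not counted here):

```scala
// Back-of-the-envelope driver footprint from the options above (in MB).
object DriverFootprint {
  val driverMemory   = 5000 // --driver-memory 5000m
  val memoryOverhead = 1024 // spark.yarn.driver.memoryOverhead
  val maxPermSize    = 1000 // -XX:MaxPermSize=1000m

  def totalMb: Int = driverMemory + memoryOverhead + maxPermSize

  def main(args: Array[String]): Unit =
    println(s"approximate driver footprint: $totalMb MB") // roughly 7 GB
}
```

So the node hosting the ApplicationMaster needs roughly 7 GB free for this driver alone.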
04-13-2016
06:58 PM
Another option you can try is clicking on the "Interpreters" link at the top of Zeppelin page, find the "spark" interpreter, and click on the "restart" button on the right-hand side. Next, make sure that your notebook page shows "Connected" with a green dot, meaning it is talking successfully with the Spark driver.
03-31-2016
08:24 PM
1 Kudo
One caveat: in case you reboot (reset) your VM/Sandbox, you should enable the 'ntpd' daemon to start on bootup. I had trouble with GetTwitter as mentioned in the post above, even after following the steps to add ntpd and enable it; in the meantime I had rebooted, which turned it off. To enable it on system bootup, run this command:
chkconfig ntpd on
To make sure it was effective, you can run this command and check that 'ntpd' is enabled in run levels 2, 3, 4 and 5:
chkconfig --list | grep ntpd
03-28-2016
04:05 PM
1 Kudo
In order to run properly on a cluster (using one of the 2 described cluster modes), Spark needs to distribute any extra jars that are required at runtime. Normally, the Spark driver sends required jars to the nodes for use by the executors, but that doesn't happen by default for user-supplied or third-party jars (pulled in via import statements). Therefore, you have to set one or two parameters, depending on whether the driver and/or the executors need those libs:
# Extra classpath jars
spark.driver.extraClassPath=/home/zeppelin/notebook/jars/guava-11.0.2.jar
spark.executor.extraClassPath=/home/zeppelin/notebook/jars/guava-11.0.2.jar
If you are not sure, set both. Finally, the actual jar files should be copied to the specified location. If on the local filesystem, you will have to copy them to each node's local fs. If you reference them from HDFS, then a single copy will suffice.
03-23-2016
05:54 PM
8 Kudos
Zeppelin stores all displayable information in a JSON format file named "note.json" (default), located under the notebook home directory, usually /user/zeppelin/notebook. This JSON file includes source code, markup, and output results. The easiest thing to do is:
1. ssh into the machine where the Zeppelin service is running
2. cd to the notebook directory (cd /user/zeppelin/notebook)
3. cd to the specific notebook sub-directory; each notebook is in a separate sub-dir (cd 2A94M5J1Z)
4. edit the note.json file and remove the unwanted results
If you use a good editor (like TextMate or vim) that has a JSON plugin to format the contents, you can easily locate the results section and rip it out. Make sure you don't break the integrity of the JSON file itself; you just want to eliminate the inner JSON contents where the superfluous result is stored. Here is an example of a results field from note.json:
"result": {
  "code": "SUCCESS",
  "type": "HTML",
  "msg": "\u003ch2\u003eWelcome to Zeppelin.\u003c/h2\u003e\n\u003ch5\u003eThis is a live tutorial, you can run the code yourself. (Shift-Enter to Run)\u003c/h5\u003e\n"
},
03-23-2016
12:14 PM
2 Kudos
I assume you are referring to using Spark's MLlib to train a machine learning model. If so, then I'm betting people are saying that because you have to launch Spark where the client is installed, which is typically on an edge node. The other reason is that if they are using Zeppelin to access Spark, then the Zeppelin service and web client would likely be on the management node. However, when you run Spark in one of the YARN modes ("yarn-client" or "yarn-cluster"), the Spark job takes advantage of all the YARN nodes on the cluster. Tuning Spark properly to take advantage of these cluster resources can take some time, and many Spark jobs are not properly tuned. Hope that helps, and that I've understood the question.