Member since
09-24-2015
98
Posts
76
Kudos Received
18
Solutions
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2877 | 08-29-2016 04:42 PM
 | 5738 | 08-09-2016 08:43 PM
 | 1756 | 07-19-2016 04:08 PM
 | 2500 | 07-07-2016 04:05 PM
 | 7467 | 06-29-2016 08:25 PM
05-04-2016
07:07 PM
@Henry : I think that equation uses the executor memory (in your case, 15G) and outputs the overhead value.
// From Spark's YARN client code: the input is executorMemory, and the output is memoryOverhead
math.max((MEMORY_OVERHEAD_FACTOR * executorMemory).toInt, MEMORY_OVERHEAD_MIN)
The goal is to calculate the overhead as a percentage of the real executor memory, as used by RDDs and DataFrames. I will add that when using Spark on YARN, the YARN configuration settings have to be adjusted and tweaked to match up carefully with the Spark properties (as the referenced blog suggests). You might also want to look at tiered storage to offload RDDs into MEMORY_AND_DISK, etc.
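To make that concrete, here is a small self-contained sketch of the formula, assuming the constants from Spark's YARN code at the time (a 0.10 factor and a 384 MB minimum; the exact values can differ between Spark versions):

```scala
// Sketch of Spark's YARN memory-overhead formula.
// MEMORY_OVERHEAD_FACTOR and MEMORY_OVERHEAD_MIN are assumed values
// (0.10 and 384 MB); check your Spark version's source for the real ones.
object OverheadCalc {
  val MEMORY_OVERHEAD_FACTOR = 0.10
  val MEMORY_OVERHEAD_MIN = 384 // MB

  // executorMemory is in MB, matching spark.executor.memory
  def memoryOverhead(executorMemory: Int): Int =
    math.max((MEMORY_OVERHEAD_FACTOR * executorMemory).toInt, MEMORY_OVERHEAD_MIN)

  def main(args: Array[String]): Unit = {
    // 15g of executor memory, as in the question
    println(OverheadCalc.memoryOverhead(15 * 1024))
  }
}
```

So with a 15G executor you get about 1.5G of overhead, and small executors are floored at the 384 MB minimum.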
04-27-2016
03:47 PM
3 Kudos
First, I'm assuming you are essentially following the MLlib examples here:
https://spark.apache.org/docs/latest/mllib-linear-methods.html
StepSize is one of the hyper-parameters: inputs that are selected by the data scientist rather than learned from the dataset. To find the best hyper-parameter values, you can perform a grid search. For instance, you can use a list of stepSize values, stepSizeList = {0.1, 0.001, 0.000001}, and cycle through each one to see which yields the best model. Here is an article describing hyper-parameter tuning and grid search:
http://blog.dato.com/how-to-evaluate-machine-learning-models-part-4-hyperparameter-tuning
Quote: "For regularization parameters, it’s common to use exponential scale: 1e-5, 1e-4, 1e-3, … 1. Some guess work is necessary to specify the minimum and maximum values."
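As a minimal sketch of that loop, with trainAndEvaluate standing in for your real MLlib training-and-validation step (the toy error function below is purely illustrative, not a real model):

```scala
// Grid search over stepSize: evaluate a model per candidate value and
// keep the one with the lowest validation error.
object GridSearch {
  // Stand-in for training (e.g. LinearRegressionWithSGD) and measuring
  // validation error; replace with your actual pipeline.
  def trainAndEvaluate(stepSize: Double): Double =
    math.abs(math.log10(stepSize) + 3.0) // toy error, lowest near 1e-3

  def bestStepSize(candidates: Seq[Double]): Double =
    candidates.minBy(trainAndEvaluate)

  def main(args: Array[String]): Unit = {
    val stepSizeList = Seq(0.1, 0.001, 0.000001)
    println(s"best stepSize = ${bestStepSize(stepSizeList)}")
  }
}
```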
04-26-2016
07:02 PM
The above 2 answers are very good. One caveat: keep in mind that when reading compressed file formats from disk, Spark partitioning depends on whether the format is splittable. For instance, bzip2 and LZO (if indexed) are splittable, gzip is not, and snappy is generally splittable only inside a container format such as SequenceFile, ORC, or Parquet. Here is documentation about why: http://comphadoop.weebly.com/
04-18-2016
05:06 PM
1 Kudo
Zeppelin comes with a long list of interpreters (including Spark/Scala, Python/PySpark, Hive, Cassandra, SparkSQL, Phoenix, Markdown and Shell), which basically provide language bindings to run the code that you type into a notebook cell. Currently, the list of interpreters does not include Java, so you will need to compile your code first and build a jar file, which can then be submitted to Spark via spark-submit, as described here: http://spark.apache.org/docs/latest/submitting-applications.html
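For reference, a spark-submit invocation for such a jar might look like the following (the jar name and main class here are hypothetical placeholders; substitute your own, and adjust master/deploy-mode to your setup):

```shell
# Hypothetical jar and class names; needs a Spark installation to run.
spark-submit \
  --class com.example.MyApp \
  --master yarn \
  --deploy-mode client \
  my-app.jar arg1 arg2
```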
04-13-2016
07:07 PM
1 Kudo
Spark allocates memory based on option parameters, which can be passed in multiple ways: 1) via the command line (as you do); 2) via programmatic instructions; 3) via the "spark-defaults.conf" file in the "conf" directory under your $SPARK_HOME. Second, there are separate config params for the driver and the executors. This is important, because the main difference between "yarn-client" and "yarn-cluster" mode is where the driver lives (either on the client, or on the cluster within the ApplicationMaster). Therefore, we should look at your driver config parameters. It looks like these are your driver-related options from the command line:
--driver-memory 5000m
--driver-cores 2
--conf spark.yarn.driver.memoryOverhead=1024
--conf spark.driver.maxResultSize=5g
--driver-java-options "-XX:MaxPermSize=1000m"
It is possible that the ApplicationMaster is running on a node that does not have enough memory to support your option requests, i.e. that the sum of driver memory (5G), PermSize (1G), and overhead (1G) does not fit on the node. I would try lowering --driver-memory in 1G steps until you no longer get the OOM error.
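Adding those options up gives a rough footprint for the driver (note that spark.driver.maxResultSize is a cap on collected results, not a separate allocation, so it is not counted here):

```scala
// Back-of-the-envelope driver footprint from the options above (in MB).
object DriverFootprint {
  val driverMemory   = 5000 // --driver-memory 5000m
  val memoryOverhead = 1024 // spark.yarn.driver.memoryOverhead
  val maxPermSize    = 1000 // -XX:MaxPermSize=1000m

  def totalMb: Int = driverMemory + memoryOverhead + maxPermSize

  def main(args: Array[String]): Unit =
    println(s"approximate driver footprint: $totalMb MB") // roughly 7 GB
}
```

So the node hosting the ApplicationMaster needs roughly 7 GB free for this driver alone.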
04-13-2016
06:58 PM
Another option you can try is clicking on the "Interpreters" link at the top of Zeppelin page, find the "spark" interpreter, and click on the "restart" button on the right-hand side. Next, make sure that your notebook page shows "Connected" with a green dot, meaning it is talking successfully with the Spark driver.
03-31-2016
08:24 PM
1 Kudo
One caveat: in case you reboot (reset) your VM/Sandbox, you should enable the 'ntpd' daemon to start on bootup. I had trouble with GetTwitter as mentioned in the post above, even after following the steps to add ntpd and enable it; in the meantime I had rebooted, which turned it off. To enable it on system bootup, run this command:
chkconfig ntpd on
To make sure it was effective, you can run this command and check that 'ntpd' is enabled in run levels 2, 3, 4 and 5:
chkconfig --list | grep ntpd
03-28-2016
04:05 PM
1 Kudo
In order to run properly on a cluster (using one of the 2 described cluster modes), Spark needs to distribute any extra jars that are required at runtime. Normally, the Spark driver sends required jars to the nodes for use by the executors, but that doesn't happen by default for user-supplied or third-party jars (pulled in via import statements). Therefore, you have to set one or two parameters, depending on whether the driver and/or the executors need those libs:
# Extra classpath jars
spark.driver.extraClassPath=/home/zeppelin/notebook/jars/guava-11.0.2.jar
spark.executor.extraClassPath=/home/zeppelin/notebook/jars/guava-11.0.2.jar
If you are not sure, set both. Finally, the actual jar files should be copied to the specified location. If on the local filesystem, you will have to copy them to each node's local fs. If you reference them from HDFS, then a single copy will suffice.
03-23-2016
05:54 PM
8 Kudos
Zeppelin stores all displayable information in a JSON format file named "note.json" (default), located under the notebook home directory, usually /user/zeppelin/notebook. This JSON file includes source code, markup, and output results. The easiest thing to do is:
1. ssh into the machine where the Zeppelin service is running
2. cd to the notebook directory (cd /user/zeppelin/notebook)
3. cd to the specific notebook sub-directory; each notebook is in a separate sub-dir (cd 2A94M5J1Z)
4. edit the note.json file and remove the unwanted results
If you use a good editor (like TextMate or vim) that has a JSON plugin to format the contents, you can easily locate the results section and rip it out. Make sure you don't break the integrity of the JSON file itself; you just want to eliminate the inner JSON contents where the superfluous result is stored. Here is an example of a results field from note.json:
"result": {
  "code": "SUCCESS",
  "type": "HTML",
  "msg": "\u003ch2\u003eWelcome to Zeppelin.\u003c/h2\u003e\n\u003ch5\u003eThis is a live tutorial, you can run the code yourself. (Shift-Enter to Run)\u003c/h5\u003e\n"
},
03-23-2016
12:14 PM
2 Kudos
I assume you are referring to using Spark's MLlib to train a machine learning model. If so, then I'm betting people are saying that because you have to launch Spark where the client is installed, which is typically on an edge node. The other reason is that if they are using Zeppelin to access Spark, then the Zeppelin service and web client would likely be on the management node. However, when you run Spark in one of the YARN modes ("yarn-client" or "yarn-cluster"), the Spark job takes advantage of all the YARN nodes on the cluster. Tuning Spark properly to take advantage of these cluster resources can take some time, and many Spark jobs are not properly tuned. Hope that helps, and that I've understood the question.