<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Spark not using Yarn cluster resources in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Spark-not-using-Yarn-cluster-resources/m-p/141693#M104286</link>
    <description>&lt;P&gt;Can you check whether anything in spark_example.py or spark-defaults.conf is overriding the master and deploy-mode properties? &lt;/P&gt;</description>
    <pubDate>Wed, 11 May 2016 18:54:55 GMT</pubDate>
    <dc:creator>pjoseph</dc:creator>
    <dc:date>2016-05-11T18:54:55Z</dc:date>
    <item>
      <title>Spark not using Yarn cluster resources</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-not-using-Yarn-cluster-resources/m-p/141689#M104282</link>
      <description>&lt;P&gt;I'm trying to run a Python script using Spark (1.6.1) on a Hadoop cluster (2.4.2). The cluster was installed, configured, and managed with Ambari (2.2.1.1), and it has 4 nodes (each with a 40 GB disk, 8 cores, and 16 GB RAM). &lt;/P&gt;&lt;P&gt;My script uses the `sklearn` library, so to parallelize it on Spark I use the `spark_sklearn` library (see &lt;A target="_blank" href="https://databricks.com/blog/2016/02/08/auto-scaling-scikit-learn-with-spark.html" rel="nofollow noopener noreferrer"&gt;spark-sklearn-link&lt;/A&gt;). &lt;/P&gt;&lt;P&gt;At this point I tried to run the script with: &lt;/P&gt;&lt;PRE&gt;spark-submit spark_example.py --master yarn --deploy-mode client --num-executors 8 --num-executor-core 4 --executor-memory 2G &lt;/PRE&gt;&lt;P&gt;but it always runs on localhost with only one executor.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="4171-screenshot-2016-05-11-115329.png" style="width: 2538px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/21713i342AC9BCC6FD1F6F/image-size/medium?v=v2&amp;amp;px=400" role="button" title="4171-screenshot-2016-05-11-115329.png" alt="4171-screenshot-2016-05-11-115329.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;From the Ambari dashboard I can also see that only one node of the cluster is consuming resources, and with different configurations (executors, cores) the execution time stays the same. &lt;/P&gt;&lt;P&gt;This is the Yarn UI Nodes screenshot:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="4170-screenshot-2016-05-11-120946.png" style="width: 2550px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/21714iD5F058DE8A3CB214/image-size/medium?v=v2&amp;amp;px=400" role="button" title="4170-screenshot-2016-05-11-120946.png" alt="4170-screenshot-2016-05-11-120946.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Any ideas?
Thanks a lot &lt;/P&gt;</description>
      <pubDate>Mon, 19 Aug 2019 08:13:35 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-not-using-Yarn-cluster-resources/m-p/141689#M104282</guid>
      <dc:creator>pietro_fragnit1</dc:creator>
      <dc:date>2019-08-19T08:13:35Z</dc:date>
    </item>
    <item>
      <title>Re: Spark not using Yarn cluster resources</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-not-using-Yarn-cluster-resources/m-p/141690#M104283</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/10378/pietrofragnito.html" nodeid="10378"&gt;@Pietro Fragnito&lt;/A&gt;&lt;/P&gt;&lt;P&gt;You may also need to check your spark-env.sh file and make sure that MASTER=yarn-client variable is set.&lt;/P&gt;</description>
      <pubDate>Wed, 11 May 2016 18:05:13 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-not-using-Yarn-cluster-resources/m-p/141690#M104283</guid>
      <dc:creator>jyadav</dc:creator>
      <dc:date>2016-05-11T18:05:13Z</dc:date>
    </item>
    <item>
      <title>Re: Spark not using Yarn cluster resources</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-not-using-Yarn-cluster-resources/m-p/141691#M104284</link>
      <description>&lt;P&gt;Where can I find this file? Thanks a lot&lt;/P&gt;</description>
      <pubDate>Wed, 11 May 2016 18:20:14 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-not-using-Yarn-cluster-resources/m-p/141691#M104284</guid>
      <dc:creator>pietro_fragnit1</dc:creator>
      <dc:date>2016-05-11T18:20:14Z</dc:date>
    </item>
    <item>
      <title>Re: Spark not using Yarn cluster resources</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-not-using-Yarn-cluster-resources/m-p/141692#M104285</link>
      <description>&lt;P&gt;OK, I found it, but that parameter isn't set there.&lt;/P&gt;</description>
      <pubDate>Wed, 11 May 2016 18:24:23 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-not-using-Yarn-cluster-resources/m-p/141692#M104285</guid>
      <dc:creator>pietro_fragnit1</dc:creator>
      <dc:date>2016-05-11T18:24:23Z</dc:date>
    </item>
    <item>
      <title>Re: Spark not using Yarn cluster resources</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-not-using-Yarn-cluster-resources/m-p/141693#M104286</link>
      <description>&lt;P&gt;Can you check whether anything in spark_example.py or spark-defaults.conf is overriding the master and deploy-mode properties? &lt;/P&gt;</description>
      <pubDate>Wed, 11 May 2016 18:54:55 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-not-using-Yarn-cluster-resources/m-p/141693#M104286</guid>
      <dc:creator>pjoseph</dc:creator>
      <dc:date>2016-05-11T18:54:55Z</dc:date>
    </item>
    <item>
      <title>Re: Spark not using Yarn cluster resources</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-not-using-Yarn-cluster-resources/m-p/141694#M104287</link>
      <description>&lt;P&gt;OK, then try setting that parameter and running again.&lt;/P&gt;</description>
      <pubDate>Wed, 11 May 2016 19:05:13 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-not-using-Yarn-cluster-resources/m-p/141694#M104287</guid>
      <dc:creator>jyadav</dc:creator>
      <dc:date>2016-05-11T19:05:13Z</dc:date>
    </item>
    <item>
      <title>Re: Spark not using Yarn cluster resources</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-not-using-Yarn-cluster-resources/m-p/141695#M104288</link>
      <description>&lt;P&gt;This is spark-defaults.conf:&lt;/P&gt;&lt;PRE&gt;# Generated by Apache Ambari. Wed May 11 10:32:59 2016

spark.eventLog.dir hdfs:///spark-history
spark.eventLog.enabled true
spark.history.fs.logDirectory hdfs:///spark-history
spark.history.kerberos.keytab none
spark.history.kerberos.principal none
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.ui.port 18080
spark.yarn.containerLauncherMaxThreads 25
spark.yarn.driver.memoryOverhead 384
spark.yarn.executor.memoryOverhead 384
spark.yarn.historyServer.address tesi-vm-3.cloud.ba.infn.it:18080
spark.yarn.max.executor.failures 3
spark.yarn.preserve.staging.files false
spark.yarn.queue default
spark.yarn.scheduler.heartbeat.interval-ms 5000
spark.yarn.submit.file.replication 3
&lt;/PRE&gt;&lt;P&gt;And this is my spark_example.py:&lt;/P&gt;&lt;PRE&gt;from pyspark import SparkContext

import numpy as np
import pandas as pd
from sklearn import grid_search, datasets
from sklearn.svm import SVR
from spark_sklearn import GridSearchCV
import sklearn

import matplotlib.pyplot as plt
plt.switch_backend('agg')
import matplotlib
matplotlib.use('Agg')
matplotlib.style.use('ggplot')
import time
import StringIO

sc = SparkContext(appName="PythonPi")

def show(p):
  img = StringIO.StringIO()
  p.savefig(img, format='svg')
  img.seek(0)
  print "%html &amp;lt;div style='width:1200px'&amp;gt;" + img.buf + "&amp;lt;/div&amp;gt;"

#hourlyElectricity = pd.read_excel('hdfs:///dataset/building_6_ALL_hourly.xlsx')
hourlyElectricity = pd.read_excel('/dataset/building_6_ALL_hourly.xlsx')

#display one dataframe
print hourlyElectricity.head()

hourlyElectricity = hourlyElectricity.set_index(['Data'])
hourlyElectricity.index.name = None
print hourlyElectricity.head()

def addHourlyTimeFeatures(df):
    df['hour'] = df.Ora
    df['weekday'] = df.index.weekday
    df['day'] = df.index.dayofyear
    df['week'] = df.index.weekofyear    
    return df

hourlyElectricity = addHourlyTimeFeatures(hourlyElectricity)
print hourlyElectricity.head()
df_hourlyelect = hourlyElectricity[['hour', 'weekday', 'day', 'week', 'CosHour', 'Occupancy', 'Power']]

hourlyelect_train = pd.DataFrame(data=df_hourlyelect, index=np.arange('2011-01-01 00:00:00', '2011-10-01 00:00:00', dtype='datetime64[h]')).dropna()
hourlyelect_test = pd.DataFrame(data=df_hourlyelect, index=np.arange('2011-10-01 00:00:00', '2011-11-01 00:00:00', dtype='datetime64[h]')).dropna()

XX_hourlyelect_train = hourlyelect_train.drop('Power', axis = 1).reset_index().drop('index', axis = 1)

XX_hourlyelect_test = hourlyelect_test.drop('Power', axis = 1).reset_index().drop('index', axis = 1)

YY_hourlyelect_train = hourlyelect_train['Power']
YY_hourlyelect_test = hourlyelect_test['Power']

# Optimal parameters for the SVR regressor
gamma_range = [0.001,0.0001,0.00001,0.000001,0.0000001]
epsilon_range = [x * 0.1 for x in range(0, 2)]
C_range = range(3000, 8000, 500)

tuned_parameters = {
    'kernel': ['rbf']
    ,'C': C_range
    ,'gamma': gamma_range
    ,'epsilon': epsilon_range
    }

#start monitoring execution time
start_time = time.time()

# search for the best parameters with crossvalidation.

#svr_hourlyelect = GridSearchCV(SVR(C=5000, epsilon=0.0, gamma=1e-05), param_grid = tuned_parameters, verbose = 0)

svr_hourlyelect = GridSearchCV(sc,SVR(), param_grid = tuned_parameters, verbose = 0)

# Fit regression model
y_hourlyelect = svr_hourlyelect.fit(XX_hourlyelect_train, YY_hourlyelect_train).predict(XX_hourlyelect_test)

print("--- %s minutes ---" % ((time.time() - start_time)/60))
print 'Optimum epsilon and kernel for SVR: ', svr_hourlyelect.best_params_
print "The test score R2: ", svr_hourlyelect.score(XX_hourlyelect_test, YY_hourlyelect_test)

print("SVR mean squared error: %.2f" % np.mean((YY_hourlyelect_test - svr_hourlyelect.predict(XX_hourlyelect_test)) ** 2))
&lt;/PRE&gt;&lt;P&gt;I don't think anything is being overridden.&lt;/P&gt;</description>
      <pubDate>Wed, 11 May 2016 19:06:06 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-not-using-Yarn-cluster-resources/m-p/141695#M104288</guid>
      <dc:creator>pietro_fragnit1</dc:creator>
      <dc:date>2016-05-11T19:06:06Z</dc:date>
    </item>
    <item>
      <title>Re: Spark not using Yarn cluster resources</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-not-using-Yarn-cluster-resources/m-p/141696#M104289</link>
      <description>&lt;P&gt;&lt;A href="http://spark.apache.org/docs/latest/submitting-applications.html" target="_blank"&gt;http://spark.apache.org/docs/latest/submitting-applications.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;--deploy-mode cluster&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \  # can be client for client mode
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  1000&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Wed, 11 May 2016 20:40:13 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-not-using-Yarn-cluster-resources/m-p/141696#M104289</guid>
      <dc:creator>TimothySpann</dc:creator>
      <dc:date>2016-05-11T20:40:13Z</dc:date>
    </item>
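The documented example quoted above places every option before the application jar. The command in the original question put the options after the script, and spark-submit forwards anything that follows the application file to the application itself instead of parsing it. A small helper (hypothetical, plain Python, not part of Spark) that assembles an invocation with the options guaranteed to precede the application file:

```python
def build_spark_submit(app_file, app_args=(), **opts):
    """Assemble a spark-submit command line, keeping all options
    in front of the application file where spark-submit parses them."""
    cmd = ["spark-submit"]
    for key, value in sorted(opts.items()):
        # num_executors -> --num-executors
        cmd += ["--" + key.replace("_", "-"), str(value)]
    cmd.append(app_file)       # the application file comes last...
    cmd += list(app_args)      # ...followed only by its own arguments
    return cmd

print(" ".join(build_spark_submit(
    "spark_example.py",
    deploy_mode="client", executor_cores=4,
    executor_memory="2G", master="yarn", num_executors=8)))
# spark-submit --deploy-mode client --executor-cores 4 --executor-memory 2G --master yarn --num-executors 8 spark_example.py
```

The same rule explains the `examples.jar \ 1000` tail in the docs snippet: `1000` is an argument to SparkPi, not to spark-submit.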
    <item>
      <title>Re: Spark not using Yarn cluster resources</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-not-using-Yarn-cluster-resources/m-p/141697#M104290</link>
      <description>&lt;P&gt;OK, thanks! Adding this parameter seems to work for me.&lt;/P&gt;&lt;PRE&gt;#!/usr/bin/env bash

# This file is sourced when running various Spark programs.
# Copy it as spark-env.sh and edit that to configure Spark for your site.

MASTER="yarn-cluster"

# Options read in YARN client mode
SPARK_EXECUTOR_INSTANCES="3" #Number of workers to start (Default: 2)
#SPARK_EXECUTOR_CORES="1" #Number of cores for the workers (Default: 1).
#SPARK_EXECUTOR_MEMORY="1G" #Memory per Worker (e.g. 1000M, 2G) (Default: 1G)
#SPARK_DRIVER_MEMORY="512 Mb" #Memory for Master (e.g. 1000M, 2G) (Default: 512 Mb)
#SPARK_YARN_APP_NAME="spark" #The name of your application (Default: Spark)
#SPARK_YARN_QUEUE="default" #The hadoop queue to use for allocation requests (Default: "default")
#SPARK_YARN_DIST_FILES="" #Comma separated list of files to be distributed with the job.
#SPARK_YARN_DIST_ARCHIVES="" #Comma separated list of archives to be distributed with the job.


# Generic options for the daemons used in the standalone deploy mode


# Alternate conf dir. (Default: ${SPARK_HOME}/conf)
export SPARK_CONF_DIR=${SPARK_CONF_DIR:-{{spark_home}}/conf}


# Where log files are stored.(Default:${SPARK_HOME}/logs)
#export SPARK_LOG_DIR=${SPARK_HOME:-{{spark_home}}}/logs
export SPARK_LOG_DIR={{spark_log_dir}}


# Where the pid file is stored. (Default: /tmp)
export SPARK_PID_DIR={{spark_pid_dir}}


# A string representing this instance of spark.(Default: $USER)
SPARK_IDENT_STRING=$USER


# The scheduling priority for daemons. (Default: 0)
SPARK_NICENESS=0


export HADOOP_HOME=${HADOOP_HOME:-{{hadoop_home}}}
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-{{hadoop_conf_dir}}}


# The java implementation to use.
export JAVA_HOME={{java_home}}


if [ -d "/etc/tez/conf/" ]; then
  export TEZ_CONF_DIR=/etc/tez/conf
else
  export TEZ_CONF_DIR=
fi

&lt;/PRE&gt;&lt;P&gt;&lt;STRONG&gt;PS&lt;/STRONG&gt;: it works well, but it seems the parameters passed via the command line (e.g. --num-executors 8 --num-executor-core 4 --executor-memory 2G) are not taken into account. Instead, if I set the executors in the "spark-env template" field of Ambari, the parameters are taken into account. Anyway, now it works &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Thanks a lot.&lt;/P&gt;</description>
      <pubDate>Thu, 12 May 2016 15:37:35 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-not-using-Yarn-cluster-resources/m-p/141697#M104290</guid>
      <dc:creator>pietro_fragnit1</dc:creator>
      <dc:date>2016-05-12T15:37:35Z</dc:date>
    </item>
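The PS above matches how spark-submit treats its command line: options that appear after the application file are passed through to the application, not consumed by spark-submit, which is why the resource flags in the original command had no effect until they were set elsewhere. A toy sketch of that splitting rule (illustrative only, not Spark's actual parser):

```python
def split_submit_args(argv):
    """Toy model: tokens before the application file belong to
    spark-submit; everything after the first non-option token
    (the application file) is handed to the application."""
    submit_opts, app_file, app_args = [], None, []
    pending_opt = None
    for tok in argv:
        if pending_opt is not None:
            submit_opts += [pending_opt, tok]   # option plus its value
            pending_opt = None
        elif app_file is None and tok.startswith("--"):
            pending_opt = tok
        elif app_file is None:
            app_file = tok                      # first non-option token
        else:
            app_args.append(tok)
    return submit_opts, app_file, app_args

# The original command put the script first...
opts, app, extra = split_submit_args(
    ["spark_example.py", "--master", "yarn", "--num-executors", "8"])
print(opts)    # [] -- spark-submit sees no options at all
print(extra)   # ['--master', 'yarn', '--num-executors', '8'] -- forwarded to the script
```

Under this rule the script fell back to the configured default master, which is consistent with the spark-env.sh fix working while the trailing flags were ignored.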
  </channel>
</rss>

