
Configure number of executors with hive on spark



hey guys

 

I have a 3 node dev cluster running CDH 5.5.1, Parcels

1 NN, 3 DN

Hive on Spark (Standalone)

I am not able to control the number of executors per Hive job (Hive on Spark option).

Here are my safety valve settings in Hive:

 

Hive Client Advanced Configuration Snippet (Safety Valve) for hive-site.xml

(Gateway Default Group)

=========================================================

<property>
  <name>hive.metastore.local</name>
  <value></value>
</property>
<property>
  <name>hive.execution.engine</name>
  <value>spark</value>
</property>
<property>
  <name>spark.executor.cores</name>
  <value>6</value>
</property>
<property>
  <name>spark.executor.instances</name>
  <value>3</value>
</property>
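
For reference, the same properties can also be set per session from the Hive CLI or Beeline right before the query (whether Hive on Spark in standalone mode on CDH 5.5.1 actually honors them this way is exactly what I am unsure about); a rough sketch:

set hive.execution.engine=spark;
-- assumption: Hive on Spark forwards these spark.* session settings to its Spark application
set spark.executor.cores=6;
set spark.executor.instances=3;
select count(*) from my_test_table;  -- placeholder table name, any small test query will do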

 

 

BTW, the above configuration works on my larger 5-node AWS cluster running CDH 5.4.7, Parcels.

 

Also, on 5.5.1, if I start spark-shell I can see 3 executors in the Spark Web UI:

 

http://xx.xxx.xx.xxx:4040/executors/

 

Am I missing some other config?

 

Please advise

 

sanjay


Re: Configure number of executors with hive on spark


ok guys

I deleted the standalone service and added Spark on YARN through Cloudera Manager.

I ran a smoke-test Hive on Spark job and it ran successfully. I know it's slow because I have not done any optimizations yet.

I will configure some parameters to speed this up based on the Cloudera recommendations here and report back:

http://www.cloudera.com/documentation/enterprise/latest/topics/admin_hos_config.html
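
For anyone following along, these are the kinds of Spark-on-YARN properties typically tuned for Hive on Spark (I believe that page covers similar ground). The values below are only placeholders showing where they go, i.e. the same hive-site.xml safety valve as above, not recommendations:

<!-- illustrative sizing only; derive real values from your YARN container limits -->
<property>
  <name>spark.executor.memory</name>
  <value>2g</value>
</property>
<property>
  <name>spark.yarn.executor.memoryOverhead</name>
  <value>512</value> <!-- MB; heap plus overhead must fit inside one YARN container -->
</property>
<property>
  <name>spark.executor.cores</name>
  <value>2</value>
</property>
<property>
  <name>spark.dynamicAllocation.enabled</name>
  <value>true</value> <!-- needs spark.shuffle.service.enabled=true; conflicts with a fixed spark.executor.instances -->
</property>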

 

thanks

sanjay

Re: Configure number of executors with hive on spark


YARN needed tuning before Spark could take full advantage of it.

 

YARN MEMORY CALCULATIONS - 3 node cluster
=======================================

Each of my nodes has 32 GB RAM, 8 cores, and 2 x 2 TB 7200 RPM disks

(souped-up HP8300 Elite desktops)


# of Containers = minimum of (2 * CORES, 1.8 * DISKS, (Total available RAM) / MIN_CONTAINER_SIZE)
= MIN(2 * 8, 1.8 * 2, (32 - 8) / 2) = MIN(16, 3.6, 12) = 3.6, rounded up to 4

MIN_CONTAINER_SIZE = 2GB

RAM-per-Container = maximum of (MIN_CONTAINER_SIZE, (Total Available RAM) / Containers)
= 2 GB (kept at MIN_CONTAINER_SIZE; 24 / 4 would give 6 GB)

yarn.nodemanager.resource.memory-mb = Containers * RAM-per-Container
= 4 * 2 = 8 GB

yarn.scheduler.minimum-allocation-mb = RAM-per-Container
= 2 GB

yarn.scheduler.maximum-allocation-mb = Containers * RAM-per-Container
= 4 * 2 = 8 GB

mapreduce.map.memory.mb = RAM-per-Container
= 2 GB

mapreduce.reduce.memory.mb = 2 * RAM-per-Container
= 2 * 2 = 4 GB

mapreduce.map.java.opts = 0.8 * RAM-per-Container
= 0.8 * 2 GB = 1.6 GB

mapreduce.reduce.java.opts = 0.8 * 2 * RAM-per-Container
= 0.8 * 2 * 2 GB = 3.2 GB

yarn.app.mapreduce.am.resource.mb = 2 * RAM-per-Container
= 2 * 2 = 4 GB

yarn.app.mapreduce.am.command-opts = 0.8 * 2 * RAM-per-Container
= 0.8 * 2 * 2 = 3.2 GB
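
To make the mapping concrete, here is roughly how those figures translate into the actual XML values (the *.mb properties are in MB and the java.opts entries are JVM heap flags; where exactly they are entered in Cloudera Manager depends on your role groups, so treat this as a sketch):

<!-- illustrative translation of the calculation above; memory values in MB -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>2048</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>8192</value>
</property>
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>4096</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx1638m</value> <!-- 0.8 * 2048 MB -->
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx3276m</value> <!-- 0.8 * 4096 MB -->
</property>

The yarn.app.mapreduce.am.* settings follow the same pattern (4096 MB and -Xmx3276m).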

 

 

Hive query on MapReduce = 187.175 seconds

set hive.execution.engine=mr;

SET mapreduce.job.reduces=8;

SET mapreduce.tasktracker.map.tasks.maximum=12;

SET mapreduce.tasktracker.reduce.tasks.maximum=8;

SET mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

SET mapreduce.map.output.compress=true;

select count(*) from cdr.cdr_mjp_uni_srch_usr_hist_view;

 

 

Hive query on Spark = 128.089 seconds

set hive.execution.engine=spark;

select count(*) from cdr.cdr_mjp_uni_srch_usr_hist_view;
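
As a final sanity check, the Hive CLI will echo a property's current value if you run set with just the name, which makes it easy to confirm what the Spark side is actually being given before kicking off a query:

set hive.execution.engine;      -- prints the engine currently in effect
set spark.executor.cores;       -- prints the session value, or reports it as undefined
set spark.executor.instances;
set spark.executor.memory;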