Created on 02-01-2016 02:13 PM - edited 02-01-2016 02:34 PM
Hey guys,
I have a 3-node dev cluster running CDH 5.5.1, Parcels
1 NN, 3 DN
Hive on Spark (Standalone)
I am not able to control the number of executors per Hive job (Hive on Spark option).
Here are my safety valve settings in Hive:
Hive Client Advanced Configuration Snippet (Safety Valve) for hive-site.xml
(Gateway Default Group)
=========================================================
<property>
  <name>hive.metastore.local</name>
  <value></value>
</property>
<property>
  <name>hive.execution.engine</name>
  <value>spark</value>
</property>
<property>
  <name>spark.executor.cores</name>
  <value>6</value>
</property>
<property>
  <name>spark.executor.instances</name>
  <value>3</value>
</property>
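One thing I'm starting to suspect: the Spark docs list spark.executor.instances as a YARN-only property, and with a standalone master the executor count instead falls out of spark.cores.max divided by spark.executor.cores (by default an app gets one executor per worker, with all of its cores). If that's right, something like this per session would be the standalone equivalent (illustrative values, not tested):
-- total cores this app may grab across the cluster
set spark.cores.max=18;
-- cores per executor, so roughly 18 / 6 = 3 executors
set spark.executor.cores=6;
-- heap per executor
set spark.executor.memory=4g;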
BTW, the safety valve configuration above works on my larger 5-node AWS cluster running CDH 5.4.7, Parcels.
Also, on 5.5.1, if I start spark-shell I can see 3 executors in the Spark Web UI:
http://xx.xxx.xx.xxx:4040/executors/
Am I missing some other config?
Please advise.
sanjay
Created 02-01-2016 03:38 PM
OK guys,
I deleted the standalone service and added Spark on YARN through Cloudera Manager.
I ran a smoke-test Hive on Spark job, and it ran successfully. I know it's slow because I have not done any optimizations yet.
I will configure some params to speed this up based on the Cloudera recommendations here, and report back:
http://www.cloudera.com/documentation/enterprise/latest/topics/admin_hos_config.html
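From a first skim of that page, these look like the knobs to tune. A rough sketch of what I plan to try per session (numbers are illustrative, sized for small 2 GB YARN containers, not tested yet):
set hive.execution.engine=spark;
-- cores per executor
set spark.executor.cores=2;
-- executor heap; heap + overhead has to fit inside a YARN container
set spark.executor.memory=1536m;
-- off-heap overhead, in MB (1536 + 512 = the 2048 MB container)
set spark.yarn.executor.memoryOverhead=512;
-- number of executors; honored on YARN (unlike standalone)
set spark.executor.instances=4;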
thanks
sanjay
Created 02-01-2016 06:10 PM
YARN needed tuning before Spark could take advantage of it.
YARN MEMORY CALCULATIONS - 3 node cluster
=======================================
Each of my nodes has 32 GB RAM, 8 cores, and 2 x 2 TB 7200 RPM disks
(souped-up HP 8300 Elite desktops).
MIN_CONTAINER_SIZE = 2 GB (with 8 GB reserved for the OS, leaving 24 GB for YARN)
# of Containers = minimum of (2 * CORES, 1.8 * DISKS, (Total available RAM) / MIN_CONTAINER_SIZE)
= MIN(2 * 8, 1.8 * 2, (32 - 8) / 2) = MIN(16, 3.6, 12) = 3.6, rounded up to 4
RAM-per-Container = maximum of (MIN_CONTAINER_SIZE, (Total Available RAM) / Containers)
= MAX(2, 24 / 4) = 6 GB by the formula, but I stayed at the 2 GB minimum; all the values below use 2 GB
yarn.nodemanager.resource.memory-mb = Containers * RAM-per-Container
= 4 * 2 = 8 GB
yarn.scheduler.minimum-allocation-mb = RAM-per-Container
= 2 GB
yarn.scheduler.maximum-allocation-mb = Containers * RAM-per-Container
= 4 * 2 = 8 GB
mapreduce.map.memory.mb = RAM-per-Container
= 2 GB
mapreduce.reduce.memory.mb = 2 * RAM-per-Container
= 2 * 2 = 4 GB
mapreduce.map.java.opts = 0.8 * RAM-per-Container
= 0.8 * 2 GB = 1.6 GB
mapreduce.reduce.java.opts = 0.8 * 2 * RAM-per-Container
= 0.8 * 2 * 2 GB = 3.2 GB
yarn.app.mapreduce.am.resource.mb = 2 * RAM-per-Container
= 2 * 2 = 4 GB
yarn.app.mapreduce.am.command-opts = 0.8 * 2 * RAM-per-Container
= 0.8 * 2 * 2 GB = 3.2 GB
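For reference, translated into the MB values these properties actually take in Cloudera Manager (yarn-site.xml and mapred-site.xml); this is a straight conversion of the GB figures above:
yarn.nodemanager.resource.memory-mb  = 8192
yarn.scheduler.minimum-allocation-mb = 2048
yarn.scheduler.maximum-allocation-mb = 8192
mapreduce.map.memory.mb              = 2048
mapreduce.reduce.memory.mb           = 4096
mapreduce.map.java.opts              = -Xmx1638m
mapreduce.reduce.java.opts           = -Xmx3276m
yarn.app.mapreduce.am.resource.mb    = 4096
yarn.app.mapreduce.am.command-opts   = -Xmx3276m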
Hive on MapReduce query = 187.175 seconds
set hive.execution.engine=mr;
SET mapreduce.job.reduces=8;
SET mapreduce.tasktracker.map.tasks.maximum=12;
SET mapreduce.tasktracker.reduce.tasks.maximum=8;
SET mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapreduce.map.output.compress=true;
select count(*) from cdr.cdr_mjp_uni_srch_usr_hist_view;
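(Side note: the mapreduce.tasktracker.* settings above are MR1-era and, as far as I know, are ignored under YARN; per-node concurrency falls out of the container math instead: 8192 MB per NodeManager / 2048 MB per map container = 4 concurrent map tasks per node, so 12 across the 3 nodes.)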
Hive on Spark query = 128.089 seconds
set hive.execution.engine=spark;
select count(*) from cdr.cdr_mjp_uni_srch_usr_hist_view;