Created on 08-28-2014 04:54 AM - edited 09-16-2022 02:06 AM
Hi All,
I have Cloudera Enterprise Data Hub Edition 5.1.0 installed on a single system. Due to a requirement, I need to create one extra worker in Spark. Currently it has 1 master and 1 worker running, but I want 1 master and 2 workers. I tried following the CDH guidelines (adding SPARK_WORKER_INSTANCES=2 to spark-env.sh), but it didn't work for me.
I followed the same steps with Spark outside CDH (just downloaded from the Apache website), and there I was able to create an extra worker.
Could someone let me know the steps for creating an extra worker in Spark inside CDH 5.1.0?
Thanks in advance.
Nishikant
Created 08-29-2014 04:00 AM
It doesn't make sense to put two workers on one host. One worker can host many executors, and an executor can run many tasks in parallel. Your default parallelism will be a function of the number of cores, which should be much more than 1. As long as your input has more than one partition, you'll get parallel execution. If not, use repartition() to make more partitions.
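To make the repartition() suggestion concrete, here is a rough PySpark sketch. The helper names (target_partitions, ensure_parallel), the "local[4]" master URL, and the rule of thumb of about 2 tasks per core are my own assumptions for illustration, not anything specific to CDH. It only repartitions when the RDD has too few partitions to keep the cores busy.

```python
def target_partitions(current, cores, tasks_per_core=2):
    """Partition count to aim for: tasks_per_core per core (an assumed
    rule of thumb), and never fewer than the RDD already has."""
    return max(current, cores * tasks_per_core)


def ensure_parallel(rdd, cores, tasks_per_core=2):
    """Repartition rdd only if it has too few partitions for the cores."""
    current = rdd.getNumPartitions()
    wanted = target_partitions(current, cores, tasks_per_core)
    return rdd if wanted == current else rdd.repartition(wanted)


def main():
    # Imported here so the helpers above can be used without Spark installed.
    from pyspark import SparkContext

    # "local[4]" is an assumption for this sketch; in standalone mode you
    # would pass your own master URL instead.
    sc = SparkContext("local[4]", "repartition-sketch")
    try:
        # Deliberately start with a single partition.
        rdd = sc.parallelize(range(100), 1)
        rdd = ensure_parallel(rdd, sc.defaultParallelism)

        # A group-by on the repartitioned data, counting values per key.
        counts = (rdd.map(lambda x: (x % 3, x))
                     .groupByKey()
                     .mapValues(lambda vs: sum(1 for _ in vs))
                     .collect())
        print(rdd.getNumPartitions(), sorted(counts))
    finally:
        sc.stop()
```

Calling main() needs a working PySpark install. The point is that the group-by runs in parallel on a single worker once the input has more than one partition, so a second worker on the same host is not needed for that.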
Created 08-28-2014 05:09 AM
I assume you're working in standalone mode. You can just go to the Spark service in Cloudera Manager, click Instances, click Add Role Instances, and assign other hosts as workers.
You do not need to install Spark. It is already installed. In fact I would not change its configuration files directly unless you're sure you know what you're doing.
Created 08-28-2014 05:48 AM
Created 08-29-2014 12:54 AM
Hi All,
I am not able to create one extra worker in Spark in CDH. I need 2 workers with 1 master in my CDH Spark.
By default, CDH Spark has 1 master and 1 worker, and with this setup I am not able to do a group-by operation on streams. Because of that, I am looking for a minimum of 2 workers.
Thanks in advance
Nishi
Created 09-01-2014 12:45 AM
Created 09-01-2014 02:27 AM
See my message above about modifying roles. You would just set an additional host to be a worker. I'm assuming you are using standalone mode.
Created 09-01-2014 04:08 AM
Created 09-03-2014 12:04 AM
Thank you very much.
It solved my problem.