08-14-2017 06:08 AM - edited 08-14-2017 06:48 AM
I installed Cloudera Manager with the Apache Spark option, taking all defaults.
I have 24 cores, 8 cores per VM, so a 3-VM cluster.
I am not sure whether my cluster is actually running on all 3 VMs? See image below.
Maybe I have to start the 3 nodes explicitly?
I ran a job using spark-submit, and I collect and print the return values.
In each map task I effectively return a String, which is the hostname.
My return values show that everything ran on vm-2. I was expecting 8 tasks to run on vm-1, 8 on vm-2, and 8 on vm-3.
EDIT: I am using HashPartitioner(24), in an attempt to put 1 element on each core.
So I expect 8 tasks to run on each VM, 1 per core.
Any comments will be appreciated. If more detail is needed I can attach it here.
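For reference, a minimal sketch of the pattern described above, in PySpark. The partition count of 24 matches the HashPartitioner(24) mentioned in the EDIT; the driver code is commented out because it assumes an existing SparkContext `sc` and a running cluster.

```python
import socket

def where_am_i(_):
    # Runs inside an executor task; returns the hostname of the
    # machine that task was scheduled on.
    return socket.gethostname()

# Hypothetical driver code (assumes a SparkContext `sc`):
# hosts = (sc.parallelize(range(24), 24)  # 24 partitions, 1 element each
#            .map(where_am_i)
#            .collect())
# print(hosts)  # a single repeated hostname suggests a single-host run
```

If the collected list contains only one hostname, the tasks were all scheduled on one machine, which is the symptom described above.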
08-28-2017 05:38 AM
The Spark install in CDH relies on an existing installation of YARN; the Spark parcel only creates Gateway and History Server instances.
The Gateway (like the gateways for other Hadoop services) gives the host the appropriate configuration to submit jobs as any client would.
The History Server just maintains logs for Spark jobs.
To actually run jobs, you must do one of two things: submit them to YARN (e.g. with --master yarn), so tasks run in containers distributed across the NodeManagers, or run them in local mode, where everything executes on the submitting host.
My wager is that you launched the job from the client on vm-2 in local mode, leading to the result that you saw.
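To get tasks spread across all three VMs, submit to YARN instead. A sketch of such an invocation follows; the executor counts are illustrative for a 3-node, 8-cores-per-node setup, and `my_job.py` is a placeholder for your application.

```shell
# Submit to YARN so tasks are distributed across the NodeManagers.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 3 \
  --executor-cores 8 \
  my_job.py
```

You can then confirm where the tasks actually ran in the YARN ResourceManager UI or the Spark History Server.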