Support Questions


Difference between local[*] vs yarn-cluster vs yarn-client for SparkConf (Java, SparkConf master URL configuration)

New Contributor

My Scenario

I would like to expose a Java microservice (a Spring Boot application) that eventually runs a spark-submit to yield the required results, typically as an on-demand service.

I have been allotted 2 data nodes and 1 edge node for development, and the edge node has the microservice deployed. When I tried yarn-cluster, I got the exception: 'Detected yarn-cluster mode, but isn't running on a cluster. Deployment to YARN is not supported directly by SparkContext. Please use spark-submit.'

Please help me find an ideal way to deal with this. What approach should I look at? Since the service is on demand, I also cannot use YARN client mode, because that would require a second main class, and the only one available is already used by the Spring Boot starter.

Code:

MicroServiceController.java:

@RequestMapping(value = "/transform", method = RequestMethod.POST,
        consumes = MediaType.APPLICATION_JSON_VALUE,
        produces = MediaType.APPLICATION_JSON_VALUE)
public String initiateTransformation(@RequestBody TransformationRequestVO requestVO) {
    PublicationProcessor.run();
    return "SUCCESS";
}

PublicationProcessor.java:

public static void run() {
    try {
        SparkConf sC = new SparkConf()
                .setAppName("NPUB_TRANSFORMATION_US")
                // Setting yarn-cluster on an in-process SparkContext is what
                // triggers the "Please use spark-submit" exception above
                .setMaster("yarn-cluster")
                .set("spark.executor.instances", PropertyBundle.getConfigurationValue("spark.executor.instances"))
                .set("spark.executor.cores", PropertyBundle.getConfigurationValue("spark.executor.cores"))
                .set("spark.driver.memory", PropertyBundle.getConfigurationValue("spark.driver.memory"))
                .set("spark.executor.memory", PropertyBundle.getConfigurationValue("spark.executor.memory"))
                .set("spark.driver.maxResultSize", PropertyBundle.getConfigurationValue("spark.driver.maxResultSize"))
                .set("spark.network.timeout", PropertyBundle.getConfigurationValue("spark.network.timeout"));
        JavaSparkContext jSC = new JavaSparkContext(sC);
        sqlContext = new SQLContext(jSC);  // sqlContext is a field of this class
        processTransformation();
    } catch (Exception e) {
        System.out.println("REQUEST ABORTED..." + e.getMessage());
    }
}


2 REPLIES

Guru

@Faisal R Ahamed, you should use spark-submit to run this application. When submitting, specify --master yarn and --deploy-mode cluster. Setting the master in SparkConf is too late to switch to yarn-cluster mode.

spark-submit --class <classname> --master yarn --deploy-mode cluster <jars> <args>

https://www.mail-archive.com/user@spark.apache.org/msg57869.html
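
If the job must still be triggered on demand from the Spring Boot service, one option is to launch spark-submit programmatically via org.apache.spark.launcher.SparkLauncher, which ships with Spark. Below is a minimal sketch under the assumption that the Spark job is packaged as its own jar with its own main class; the jar path and class name are hypothetical, the reuse of PropertyBundle is illustrative, and SPARK_HOME is assumed to be set in the service's environment:

import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

public class PublicationProcessor {

    public static void run() throws Exception {
        // Spawns a separate spark-submit process in cluster mode, so the
        // Spring Boot JVM never has to host the Spark driver itself.
        SparkAppHandle handle = new SparkLauncher()
                .setAppResource("/path/to/transformation-job.jar")   // hypothetical job jar
                .setMainClass("com.example.TransformationJob")       // hypothetical main class
                .setMaster("yarn")
                .setDeployMode("cluster")
                .setConf("spark.executor.instances",
                        PropertyBundle.getConfigurationValue("spark.executor.instances"))
                .setConf("spark.executor.memory",
                        PropertyBundle.getConfigurationValue("spark.executor.memory"))
                .startApplication();

        // Block until YARN reports a terminal state (FINISHED, FAILED, KILLED).
        while (!handle.getState().isFinal()) {
            Thread.sleep(1000);
        }
    }
}

This also addresses the main-class concern: the Spring Boot starter remains the only main class in the service, while the Spark job ships its entry point in a separate jar.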


local[*]

new SparkConf().setMaster("local[*]")
  • This runs the job in local mode, inside a single JVM
  • It is typically used to test the code on a small amount of data in a local environment
  • It does not provide the advantages of a distributed environment
  • * means one worker thread per available CPU core; a fixed count such as local[2] allocates exactly two threads
  • It helps in debugging the code by applying breakpoints while running from Eclipse or IntelliJ (see the sketch below)
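
For instance, a self-contained local-mode test might look like the following sketch (the app name and the toy data are illustrative):

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class LocalModeExample {
    public static void main(String[] args) {
        // local[*] = one worker thread per available CPU core, all in this JVM
        SparkConf conf = new SparkConf()
                .setAppName("LocalModeExample")
                .setMaster("local[*]");
        // JavaSparkContext is Closeable, so try-with-resources stops it cleanly
        try (JavaSparkContext jsc = new JavaSparkContext(conf)) {
            long evens = jsc.parallelize(Arrays.asList(1, 2, 3, 4, 5))
                    .filter(n -> n % 2 == 0)
                    .count();
            System.out.println("Even numbers: " + evens);
        }
    }
}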

yarn-client

--master yarn --deploy-mode client
  • Yarn client mode: the driver program runs on the machine where you type the submit command, which need not be a node of the YARN cluster. Although the driver runs on the client machine, the tasks still execute on executors inside the cluster's node managers (an example command follows below)
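
By analogy with the cluster-mode command in the first reply, a client-mode submission only changes the deploy mode:

spark-submit --class <classname> --master yarn --deploy-mode client <jars> <args>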

yarn-cluster

--master yarn --deploy-mode cluster
  • This is the most advisable pattern for executing/submitting your Spark jobs in production
  • Yarn cluster mode: the driver program runs inside the YARN ApplicationMaster on one of the cluster's nodes, not on the machine where you typed the submit command, so the client can disconnect once the application is accepted (see the note below)
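
One practical consequence: since the driver runs on a cluster node, its console output does not appear on the machine that submitted the job. After the application finishes, the aggregated logs (including driver output) can be retrieved with the standard YARN CLI:

yarn logs -applicationId <application_id>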