Support Questions
Find answers, ask questions, and share your expertise

Parallel execution in Spark


New Contributor

Please suggest which of the following two approaches is better for achieving parallelism in Spark:

1. Since the source is common to all targets, read it once, then execute the transformations and actions in a foreach loop over a parVector (a parallel vector of configurable inputs passed to the loop; each step is performed based on those inputs). This achieves read once, write many.

2. Trigger a separate Spark job, in cluster mode, for each row in the array of configurable inputs, with all jobs running simultaneously.

Please suggest which approach is the best way to implement parallelism. I believe the second approach is the better of the two.
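For reference, the read-once pattern in approach 1 can be sketched with plain Python stand-ins (`load_source`, `transform`, and `write_target` below are hypothetical placeholders for the actual Spark read, transformation, and write; a real job would cache a DataFrame and fan the targets out, much like a parVector foreach):

```python
from concurrent.futures import ThreadPoolExecutor

def load_source():
    # Stand-in for the single source read (e.g. spark.read ... .cache()).
    return list(range(10))

def transform(data, cfg):
    # Stand-in for the per-target transformation driven by one config entry.
    return [x * cfg["factor"] for x in data]

def write_target(cfg, result):
    # Stand-in for the write/action; here we just return the result.
    return (cfg["name"], result)

def run_read_once(configs):
    source = load_source()               # read once
    with ThreadPoolExecutor() as pool:   # parallel loop over the configs
        return list(pool.map(
            lambda cfg: write_target(cfg, transform(source, cfg)), configs))

configs = [{"name": "t1", "factor": 2}, {"name": "t2", "factor": 3}]
results = run_read_once(configs)
```

The key point is that `load_source()` is called exactly once, while the per-config work runs concurrently.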

Please answer as early as possible; based on the suggestion, we can implement it.

Thanks in advance


Re: Parallel execution in Spark

Hi @Teepika R M

Though both approaches should work, triggering a separate job per config entry means reading the source multiple times unnecessarily, even though Hadoop is designed for multiple reads. The first approach is preferable, since it avoids materializing the same source data in multiple RDDs. Hope it helps!
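The cost difference is easy to see by counting source reads under each scheme. This is a toy illustration, not actual Spark code: `load_source` and `job` are hypothetical stand-ins, and the counter only models how many reads each approach issues, not how the jobs are scheduled:

```python
READS = 0

def load_source():
    # Stand-in for reading the source dataset; counts invocations.
    global READS
    READS += 1
    return list(range(5))

def job(cfg, data=None):
    # One target's transform; reads the source itself if none is supplied.
    src = data if data is not None else load_source()
    return sum(src) * cfg

configs = [1, 2, 3]

# Approach 2: each job reads the source independently -> len(configs) reads.
READS = 0
out_separate = [job(cfg) for cfg in configs]
reads_separate = READS

# Approach 1: read once, share the data across all targets -> 1 read.
READS = 0
shared = load_source()
out_shared = [job(cfg, shared) for cfg in configs]
reads_shared = READS
```

Both schemes produce the same results, but the separate-job scheme pays for one source read per target, which is exactly the redundancy that caching the source in approach 1 avoids.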
