Please suggest which is better among the following two approaches to parallelism in Spark:
1. Since the source is common to all targets, read it once, then execute the transformations and actions in a foreach loop over the parVector (a parallel vector of the configurable inputs passed to the loop; each step is performed based on those inputs). This achieves read once, write many.
2. Trigger a separate Spark job in cluster mode for each row in the array of configurable inputs, all running simultaneously.
Please suggest which approach is the best way of implementing parallelism. I feel the second approach is the better of the two.
Please reply as early as possible; we will implement based on your suggestion.
Thanks in advance
Hi @Teepika R M
Though both approaches should work, triggering a separate job per row means reading the source multiple times unnecessarily, even though Hadoop is designed for multiple reads. I feel the first approach is better, as it avoids materializing the same source data in multiple RDDs. Hope it helps!!
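A minimal sketch of the first approach in Scala, assuming the source is a Parquet dataset and each configurable input is a (filter predicate, target path) pair. The paths, the config shape, and the `filter`-then-write transformation are all placeholders; substitute your real source format and per-target logic:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object ReadOnceWriteMany {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("read-once-write-many").getOrCreate()

    // Read the common source a single time and cache it so every
    // target in the parallel loop reuses the same cached data.
    val source = spark.read.parquet("/data/source")          // hypothetical path
      .persist(StorageLevel.MEMORY_AND_DISK)
    source.count()  // materialize the cache before the loop starts

    // Hypothetical configurable inputs: (predicate, target path).
    // .par turns the Vector into a parallel collection, so each
    // element submits its Spark job concurrently from the driver.
    val configs = Vector(
      ("country = 'US'", "/data/out/us"),
      ("country = 'IN'", "/data/out/in")
    ).par

    configs.foreach { case (predicate, target) =>
      source.filter(predicate).write.mode("overwrite").parquet(target)
    }

    source.unpersist()
    spark.stop()
  }
}
```

Note that the parallel foreach runs on the driver: the jobs share one SparkSession and are scheduled concurrently within a single application, which is what avoids re-reading the source, unlike approach 2, where each job pays the read cost again.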