
Getting Spark stages to run in parallel inside an application

New Contributor

I have a PySpark job launched with spark.master set to 'yarn-client'.  Architecturally, it looks like this:

 

rdd1 = (sc.textFile(file1)
        # misc map and filter operations
        .map(lambda x: [(x[0]), [other stuff]]))

rdd2 = (sc.textFile(file2)
        # misc map and filter operations
        .map(lambda x: [(x[0]), [different other stuff]]))

joined_rdd = rdd1.join(rdd2)
joined_rdd.saveAsTextFile(output_file)

 

I expected to see rdd1 and rdd2 computed in parallel, and then joined_rdd calculated.  However, rdd1, rdd2, and joined_rdd all seem to be computed in series.  I have spun up a lot of executors, and most of them sit unused while rdd1 is being calculated, when they could already be processing rdd2.  This, of course, makes the overall processing time much longer.  Is this expected behavior?  Am I doing something wrong?

 

Thanks!

 

3 Replies

Re: Getting Spark stages to run in parallel inside an application

Master Collaborator
Spark should be able to parallelize these, but it would also require enough executor slots to cover all of the tasks. So if the number of partitions for both is more than your free slots, I don't think it will schedule a second stage.

Have a look at the DAG in the UI to confirm that the sequence of operations is actually parallelizable according to Spark.
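
As a rough sanity check on that slot math, something like the following (rdd1, rdd2, and sc being the names from your job; the property names are standard Spark settings and the defaults here are just placeholders) would print the partition counts next to the total number of task slots:

num_execs = int(sc.getConf().get("spark.executor.instances", "1"))
cores_per_exec = int(sc.getConf().get("spark.executor.cores", "1"))
print("total task slots: %d" % (num_execs * cores_per_exec))
print("rdd1 partitions:  %d" % rdd1.getNumPartitions())
print("rdd2 partitions:  %d" % rdd2.getNumPartitions())
# if rdd1's partitions alone fill all the slots, nothing is left
# over to start rdd2's stage at the same time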

Re: Getting Spark stages to run in parallel inside an application

New Contributor

I was pointing at a small test data set in order to understand Spark's behavior in this situation, having observed the same behavior on larger data sets.  In this case, the files were each roughly one partition in size; I had 100 executors spun up, one did all the work, and 99 sat idle.  Are you saying that you do expect Spark to parallelize rdd1 and rdd2 given adequate resources?  Are there any job parameters I may be setting wrong that you can think of?
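
For completeness, I could presumably widen the input partitioning with something like the following (the partition count of 100 is just an illustrative guess), which would spread a single stage over more of the executors but wouldn't by itself make the two stages overlap:

rdd1 = sc.textFile(file1, minPartitions=100)
rdd2 = sc.textFile(file2, minPartitions=100)
# or, after the fact: rdd1 = rdd1.repartition(100)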

 

I've observed similar behavior when doing things like:

 

rdd.count()
rdd.saveAsTextFile(output_file)

 

where they always execute in series, even though they could be parallelized.  In the above case, you can at least argue that since no shuffle is required for either operation, it makes sense to run both on the same executors.  However, that rationale doesn't seem to apply to the case I reference here.
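
Persisting before the two actions would at least avoid recomputing the whole pipeline for the second one (output_file here is the same placeholder as in my first post), even though the actions themselves still run one after the other:

rdd.persist()                    # keep the computed partitions around
rdd.count()                      # first action materializes the RDD
rdd.saveAsTextFile(output_file)  # second action reuses the cached data
rdd.unpersist()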

 

Thanks for your help,

 

Rob

Re: Getting Spark stages to run in parallel inside an application

Master Collaborator
Yes, actions invoked serially in the driver will execute serially. That's your second example. You can execute them in parallel in the driver with a little bit of code, and that should also work perfectly fine.
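
For example, something along these lines (an untested sketch; rdd and output_file are the names from your posts) submits the two actions from separate driver threads, so Spark can schedule their jobs concurrently given enough free executor slots:

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=2) as pool:
    count_future = pool.submit(rdd.count)
    save_future = pool.submit(rdd.saveAsTextFile, output_file)
    print("count: %d" % count_future.result())
    save_future.result()

(concurrent.futures needs Python 3 or the futures backport; plain threading.Thread works just as well.)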

However, I think (or thought) that an RDD with two parents would have them both evaluated at the same time if possible. You can click through to the job's DAG visualization to get a sense of what it thinks it is doing and in what order. I'd also have to run a little test to confirm or deny whether it works this way.