Choose Spark version from Oozie job (1.6 vs 2.0)

Explorer

I have both Spark 1.6 and 2.0 installed on my cluster. I see in the docs how to manually run a spark-submit job and choose 2.0 here. However, I launch my jobs using Oozie. Is there a way to specify, for a given Oozie workflow spark action, that I want to use the 2.0 engine instead of 1.6?
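
For reference, the manual selection I mean looks like this on HDP, using the SPARK_MAJOR_VERSION switch (a minimal sketch; the script name is a placeholder):

export SPARK_MAJOR_VERSION=2
spark-submit --master yarn --deploy-mode cluster my_app.py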


3 REPLIES

Expert Contributor

In the latest HDP 2.6.x, Oozie works with either Spark 1 or Spark 2, even when the two are deployed side by side on the same cluster.

You can follow these instructions to have Oozie work with different versions of Spark.
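
At a high level, the setup in those instructions looks like this (a sketch, assuming the default sharelib location; the timestamped directory is just an example):

# create a spark2 sharelib next to the existing spark one
hdfs dfs -mkdir /user/oozie/share/lib/lib_20170411215324/spark2
# copy in the Spark 2 jars (plus oozie-sharelib-spark*.jar and the pyspark/py4j zips)
hdfs dfs -put /usr/hdp/current/spark2-client/jars/* /user/oozie/share/lib/lib_20170411215324/spark2/
# refresh the sharelib and confirm spark2 is visible
oozie admin -oozie http://<oozie-host>:11000/oozie -sharelibupdate
oozie admin -oozie http://<oozie-host>:11000/oozie -shareliblist spark2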

Explorer

Thank you dsun! I'm working on these steps today. It seems from the instructions that once the sharelib for spark2 is set up, I can switch a given workflow to point to spark2 by specifying in job.properties:

oozie.action.sharelib.for.spark=spark2

This would imply (I assume) that I can easily point back to using Spark 1.6.3 by specifying:

oozie.action.sharelib.for.spark=spark

Is my assumption correct?
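
For anyone following along, the full job.properties for the spark2 case would look something like this (a sketch; hosts and paths are placeholders):

nameNode=hdfs://hdpcluster
jobTracker=<resourcemanager-host>:8050
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/<me>/apps/spark2-wf
# point the spark action at the spark2 sharelib instead of the default spark one
oozie.action.sharelib.for.spark=spark2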

ACCEPTED SOLUTION

Explorer

OK, just to update: I followed the directions in the link provided by dsun (here) exactly. With HDP 2.6 and Oozie 4.2, this fails due to a known bug (jira). Basically, what works with Spark 1.6 will not work with Spark 2.1 (via Oozie, anyway) because of a change in how Spark handles duplicate files in the distributed cache (see here).

java.lang.IllegalArgumentException: Attempt to add (hdfs://hdpcluster/user/oozie/share/lib/lib_20170411215324/oozie/aws-java-sdk-core-1.10.6.jar) multiple times to the distributed cache.

I've tried removing individual duplicate files, but there are so many (some duplicated between the oozie sharelib and the spark2 sharelib) that I'm afraid removing them all will break 1.6, and with it the ability to run any existing jobs.
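
Before deleting anything, I listed the jars that appear in both sharelibs (a sketch of the approach, assuming the layout above):

hdfs dfs -ls /user/oozie/share/lib/lib_20170411215324/oozie | awk -F/ '{print $NF}' | grep 'jar$' | sort > oozie_jars.txt
hdfs dfs -ls /user/oozie/share/lib/lib_20170411215324/spark2 | awk -F/ '{print $NF}' | grep 'jar$' | sort > spark2_jars.txt
# jars listed by both commands are the distributed-cache collision candidates
comm -12 oozie_jars.txt spark2_jars.txt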

It looks like this may be fixed in Oozie 4.3, but I'm not sure how to upgrade just the Oozie service using Ambari (maybe I'll post another question for that).

EDIT:

After removing all duplicate files shared between the oozie and spark2 sharelibs, I still could not run a Spark2 job from Oozie 4.2. I was getting an ImportError for a custom Python file that the main application .py file imports; it seems Oozie wasn't passing --py-files correctly (again, this worked fine with Spark 1.6).
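
For reference, this is roughly how the dependency is declared in my workflow's spark action (a sketch; names and files are placeholders):

<action name="spark2-job">
    <spark xmlns="uri:oozie:spark-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <master>yarn-cluster</master>
        <name>MyPySparkApp</name>
        <jar>main_app.py</jar>
        <!-- custom_module.py is the file the ImportError complains about under spark2 -->
        <spark-opts>--py-files custom_module.py</spark-opts>
    </spark>
    <ok to="end"/>
    <error to="fail"/>
</action>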

In conclusion, this is experimental at best. Hopefully the next version of HDP will ship with the latest Oozie 4.3.