Created 06-15-2017 08:15 PM
I have both Spark 1.6 and 2.0 installed on my cluster. The docs show how to choose 2.0 when manually running a spark-submit job (here). However, I launch my jobs using Oozie. Is there a way to specify, for a given Oozie workflow's spark action, that I want to use the 2.0 engine instead of 1.6?
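For context on the manual case: outside of Oozie, HDP selects the Spark client version via the SPARK_MAJOR_VERSION environment variable. A minimal sketch (the spark-submit invocation is illustrative and would only run on a cluster, so it is shown commented out):

```shell
# HDP-specific switch: make spark-submit use the Spark 2.x client
# for this shell session instead of the default Spark 1.6.
export SPARK_MAJOR_VERSION=2

# A subsequent submit would then run on the Spark 2 engine, e.g.:
# spark-submit --master yarn --deploy-mode cluster my_app.py
```

Unsetting the variable (or setting it to 1) falls back to the Spark 1.6 client.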
Created 06-15-2017 09:42 PM
In the latest HDP 2.6.x, Oozie works with either Spark 1 or Spark 2, including in side-by-side deployments.
You can follow these instructions to have Oozie work with different versions of Spark.
Created 06-16-2017 01:40 PM
Thank you dsun! I'm working through these steps today. It seems from the instructions that once the sharelib for spark2 is set up, I can switch a given workflow to spark2 by specifying this in job.properties:
oozie.action.sharelib.for.spark=spark2
This would imply (I assume) that I can just as easily point back to Spark 1.6.3 by specifying:
oozie.action.sharelib.for.spark=spark
Is my assumption correct?
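If that assumption holds, switching a single workflow between engines is a one-line change in its job.properties. A minimal sketch, assuming a standard Oozie spark action (host names and paths below are placeholders, not from the cluster in this thread):

```
nameNode=hdfs://hdpcluster
jobTracker=yarnRM:8050
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/me/myapp

# Pick the sharelib (and thus the Spark engine) for the spark action:
#   spark  -> Spark 1.6 sharelib
#   spark2 -> Spark 2.x sharelib
oozie.action.sharelib.for.spark=spark2
```

Only the last property differs between the two engines; everything else in the workflow stays the same.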
Created 06-16-2017 04:38 PM
Ok, just to update: I followed the directions in the link provided by dsun exactly (here). On HDP 2.6 with Oozie 4.2 this fails due to a known bug (jira). Basically, what works with Spark 1.6 will not work with Spark 2.1 (via Oozie, anyway) because of a change in how Spark handles files that appear more than once in the distributed cache (see here):
java.lang.IllegalArgumentException: Attempt to add (hdfs://hdpcluster/user/oozie/share/lib/lib_20170411215324/oozie/aws-java-sdk-core-1.10.6.jar) multiple times to the distributed cache.
I've tried removing some of the duplicate files, but there are so many (some duplicated between the oozie sharelib and the spark2 sharelib) that I'm afraid removing them all would break 1.6 (and with it, the ability to run any existing jobs under 1.6).
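To see exactly which jars collide before deleting anything, the two sharelib listings can be compared offline. A small sketch (the jar paths below are made up for illustration; in practice the lists would come from `hdfs dfs -ls -R` output for each sharelib directory):

```python
from pathlib import PurePosixPath

# Hypothetical listings, e.g. captured from:
#   hdfs dfs -ls -R /user/oozie/share/lib/lib_*/oozie
#   hdfs dfs -ls -R /user/oozie/share/lib/lib_*/spark2
oozie_jars = [
    "/user/oozie/share/lib/lib_20170411215324/oozie/aws-java-sdk-core-1.10.6.jar",
    "/user/oozie/share/lib/lib_20170411215324/oozie/commons-lang-2.6.jar",
]
spark2_jars = [
    "/user/oozie/share/lib/lib_20170411215324/spark2/aws-java-sdk-core-1.10.6.jar",
    "/user/oozie/share/lib/lib_20170411215324/spark2/spark-core_2.11-2.1.0.jar",
]

def duplicate_jar_names(a, b):
    """Return jar file names that appear in both listings."""
    names_a = {PurePosixPath(p).name for p in a}
    names_b = {PurePosixPath(p).name for p in b}
    return sorted(names_a & names_b)

print(duplicate_jar_names(oozie_jars, spark2_jars))
# -> ['aws-java-sdk-core-1.10.6.jar']
```

This at least turns "there are so many" into a concrete list, so duplicates can be reviewed one at a time instead of deleted wholesale.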
Looks like it may be fixed in Oozie 4.3, but I'm not sure how to update just the Oozie service using Ambari (maybe I'll post another question for that).
EDIT:
After removing all the duplicate files shared between the oozie and spark2 sharelibs, I still could not run a Spark2 job from Oozie 4.2. I was getting an ImportError for a custom Python file that the main application .py file imports. It seems Oozie wasn't setting --py-files correctly (again, this worked fine with Spark 1.6).
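One thing worth trying when Oozie doesn't propagate --py-files on its own is passing it explicitly through the spark action's <spark-opts>. A sketch of that workaround (names and paths are hypothetical, and this is not verified to fix the bug described above):

```
<action name="spark-node">
  <spark xmlns="uri:oozie:spark-action:0.1">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <master>yarn-cluster</master>
    <name>MyPySparkJob</name>
    <jar>${nameNode}/user/me/myapp/main.py</jar>
    <spark-opts>--py-files ${nameNode}/user/me/myapp/helpers.py</spark-opts>
  </spark>
  <ok to="end"/>
  <error to="fail"/>
</action>
```

For a PySpark job the <jar> element points at the main .py file, and any modules it imports would be listed in --py-files.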
In conclusion, Spark2 support in Oozie 4.2 is experimental at best. Hopefully the next version of HDP will ship the latest Oozie 4.3.