I am using the latest version of CDH, 5.3.2, with Spark 1.2. Apache released Spark 1.3 today, and I would like to upgrade my Spark. Does anybody have any suggestion on how I could do it?
If you're using Spark on YARN, you don't really need to upgrade anything, since Spark is just another app that runs on YARN. You can run Spark 1.3.0 regardless of what is installed on the cluster. Just get a distribution onto one of the machines and invoke scripts/binaries from the 1.3.0 distribution.
CDH 5.4 is coming pretty soon and has 1.3.0, too.
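A minimal sketch of that approach, assuming a stock 1.3.0 download unpacked on an edge node (the install path, application class, and jar path are placeholders -- adjust to your cluster):

```shell
# Unpack a stock Spark 1.3.0 distribution on one machine:
tar xzf spark-1.3.0-bin-hadoop2.4.tgz -C /opt
export SPARK_HOME=/opt/spark-1.3.0-bin-hadoop2.4

# Point it at the cluster's Hadoop/YARN client configuration:
export HADOOP_CONF_DIR=/etc/hadoop/conf

# Submit to YARN; the Spark version installed on the cluster is irrelevant,
# since the 1.3.0 assembly ships with the application.
$SPARK_HOME/bin/spark-submit \
  --master yarn-cluster \
  --class com.example.MyApp \
  /path/to/my-app.jar
```

Nothing on the cluster nodes needs to change; YARN just runs the containers with whatever Spark assembly the submission brings along.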
I am still doing a POC on a product, and I am trying to benchmark time-to-execution based on a custom build of Spark. I can execute it as a YARN app as you mentioned.
Can you tell me what performance benefits I may be giving up by not properly installing this custom Spark build on the cluster and simply running it on YARN instead?
Just to clarify!
Installing Spark 1.3 will add another 'service' presented to CM.
In other words, when I launch the CM GUI there will be 2 Spark 'services', one for ver 1.2 and another for ver 1.3.
As far as the command line, I have to source the proper environment in order to run the respective Spark Shell.
Is that correct?
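The environment-sourcing part can be sketched like this (both install locations are assumptions -- the parcel path is the usual CDH layout, and /opt/spark-1.3.0 is wherever you unpacked the custom build):

```shell
# Hypothetical install locations -- adjust to match your cluster.
# CDH-managed Spark 1.2 (standard parcel layout):
CDH_SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark
# A custom Spark 1.3.0 build unpacked alongside it:
CUSTOM_SPARK_HOME=/opt/spark-1.3.0

# Pick the custom build for this shell session only;
# other sessions keep using the CDH-managed Spark.
export SPARK_HOME="$CUSTOM_SPARK_HOME"
export PATH="$SPARK_HOME/bin:$PATH"

echo "This session will use: $SPARK_HOME"
```

Running `spark-shell` after this picks up the 1.3.0 binaries, without touching the CDH-managed install.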
I don't think there's any performance difference. The main difference is simply that the custom build isn't supported and isn't necessarily using the config established by CM. No, it does not become another service in CM. The CM "Spark" service covers standalone mode (not YARN) and the History Server.
Thank you for your reply!
1) If Spark 1.3 is not presented as a new service in CM, then which version will the 'Spark' service within CM be?
2) Should BOTH Spark 1.2 and Spark 1.3 co-exist at the OS level, with you sourcing the appropriate environment for whichever one you want?
3) Since you mentioned the JobHistory Server: a Spark job writes into the HDFS directory /tmp/logs/<user-id>/logs/, but it does NOT write to the /user/history/done and /user/history/done_intermediate folders!
Is this because the Spark job runs in standalone mode and not on YARN?
I'm not sure what you mean. You are talking about running a custom build of Spark 1.3; you're on your own there, and it has no relation to CM, of course. CDH ships Spark 1.3 (as of CDH 5.4), and you should use that unless you have a reason not to. CDH itself has only one version of Spark "installed", and you should not modify it; that is the one presented in CM.
I'm not sure what you mean about the logs, either. You can run Spark in standalone or YARN mode; neither is forced on you.
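To make the standalone-vs-YARN point concrete: the cluster manager is chosen per invocation with the --master flag (the standalone master host below is a placeholder):

```shell
# Same spark-shell binary, different cluster managers.

# Run on YARN (client mode, Spark 1.2/1.3 syntax):
spark-shell --master yarn-client

# Run against a Spark standalone master (the mode the CM "Spark" service manages):
spark-shell --master spark://master-host:7077

# Run purely locally, with no cluster at all:
spark-shell --master "local[2]"
```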
Thank you again!
Maybe I didn't make myself clear!
I currently have Spark 1.2 installed; it came with CDH 5.3.1.
There is also a Spark service within CM (this service relates back to Spark 1.2).
Now I need to install Spark 1.3.
Do I remove/uninstall Spark 1.2 and install Spark 1.3?
As for the JobHistory logs:
When I launch a Spark job, logs are created in /tmp/logs/<user-id>/logs but NOT in the /user/history/ folders!
Then, when I open the JobHistory portal (http://<YARN-JobHistory-Server>:19888/jobhistory), it shows no jobs!
Is there a daemon that copies the logs from the /tmp/logs/<user-id>/logs folder to the /user/history/done and /user/history/done_intermediate ones?
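For what it's worth, /tmp/logs/<user-id>/logs is where YARN log aggregation puts container logs (for any YARN app, Spark included), and those are read back with the yarn CLI rather than through the MapReduce JobHistory directories. A sketch, with a placeholder application id:

```shell
# List finished YARN applications to find the Spark app's id:
yarn application -list -appStates FINISHED

# Fetch the aggregated container logs for that application:
yarn logs -applicationId application_1427000000000_0001
```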