Release notes for CDH 5.3 state that CDH's release of Spark 1.2 uses Akka 2.2.3 whereas the standard Apache distribution of Spark 1.2 uses Akka 2.3.4. What are the consequences of this? In particular, we use Spark with Yarn, not Spark standalone, and we build a Spark assembly jar using ...
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=CDH5VERSION ...
and install the resulting Assembly on the cluster. Is there some conflict between CDH 5.3 dependencies and Akka 2.3.4 that we must be aware of??
Marcelo says it traces back to a version conflict with other CDH components in Typesafe Config lib versions, and that in turn means a different Akka dependency. Unfortunately I suspect that Akka 2.2 and 2.3 are not 100% compatible. That said I'm not sure any of us can say it doesn't work. You would know pretty quickly.
Thanks. Just getting around to testing the upgrade. But before getting deep in trouble, some additional questions...
1. We typically build our own Spark assembly to correspond to the CDH Hadoop/Yarn version, the Spark version, and our use of Yarn. I don't see any pre-built versions available as part of the CDH 5.3.1 distribution, but I may be wrong. (I'm looking in the Maven repository.) And it is not clear if the assembly jars that do appear there have been built with Yarn support. So...
2. Exploring the possibilities...
Are you looking for things like
I am actually not sure why there is no 5.3 version of the assembly; I will ask. I think the idea is that this isn't a component one uses via Maven but as a distributed binary, so shouldn't have been published before.
Yes it is always built with YARN support.
Changes vs 1.2.0 are here:
Changes marked as fixed in 1.2.1 ought to be a roughly reliable report from upstream on what changed from 1.2.0 to 1.2.1:
Both are bug fix releases and generally these lists are of interest when you are interested in a particular change. They ought to largely overlap.
Akka 2.2 conflicts with Akka 2.3 in general. The problem with Spark's 2.3 upgrade was that it was not 100% backwards compatible, and this in turn caused conflicts with other things in Hadoop. So tweaking this back to 2.3 has some consequence, but, ones which may not affect you. Also consider shading Akka 2.3.
You should always be able to build from the tarball source in CDH if you want to use precisely the same bits. You can also build from upstream source, and setting hadoop version to 1.2.0-cdh5.3.1 configures most of the things that could vary in component versions vs upstream. You could even run the stock upstream assembly and it would likely work for 99% of cases.
Since you're going off-road anyway, you could indeed just build from upstream source circa 1.2.1 and set the Akka version.
Longer term, I think the idea is to remove Akka from Spark anyway, which makes this a moot point.
1. It is odd that it was built for earlier AND later Cloudera releases. Just not 5.3.x which is our chosen platform. Which is too bad because the Spark version is 1.2.0, and there were several problems with that release not fully corrected until 1.2.1. One of the problems is that spark-yarn and parent-yarn, both of which are needed if one is to repackage SparkSubmit with Yarn support in an application, are NOT distributed as part of Maven Central or in Cloudera's repository.
2. It should be documented that the assembly is built with Yarn support. I would expect it is, but ...
3. The Cloudera patches do not include distribution of yarn-parent or spark-yarn.
4. It looks like Spark 1.2.1 shades Akka 2.3.4. I need to double check, though.
5. I think I'm headed to 1.2.1 with Akka 2.2.3 unless it shades 2.3.4.
Akka is so fundamental for Spark. What's the replacement. I'm busy writing applications. I can't keep up with the Dev conversation. (I think I've got 2500 dev and user emails sitting in my mail box waiting for a quick review, and that's after deleting 3500!!!) ;-)
The assembly was not published for any version after 5.1, actually:
(Those are SNAPSHOT builds. No idea why the SNAPSHOT turned up again for 5.4 but it's ignorable.)
Upstream has not published the assembly in Maven since 1.1.1, so this is just tracking upstream:
You don't want to consume the assembly as a Maven artifact, so it was always unintentional to publish it that way by the upstream project.
Same with spark-yarn. The upstream project did not intend to publish them, and stopped as of 1.2.0:
... but as you can see, it came back due to popular demand. I believe therefore the next CDH will include it again.
yarn-parent is almost the same story except that it went away again in 1.3.0 as it is no longer needed (it was part of supporting old YARN alpha, which was discontinued in 1.3):
CDH definitely supports YARN of course. Spark on YARN is doc'ed throughout, so yeah safe to assume it has to be built for YARN.
Spark 1.2.1 upstream uses a shaded version of Akka 2.3.4:
Akka is "just" a message transport layer for Spark, and I think the idea being batted around is to use some lighter-weight framework for it. You can track the bag of related issues at https://issues.apache.org/jira/browse/SPARK-5293
Indeed you shouldn't have to care about that and part of the changes here are intended to make it irrelevant what Spark does underneath.
Welcome to my world; I actually read every message every day!
PS Marcelo mentions that it actually *is* coming back in 5.4 since Oozie needs to make use of it. It's not published upstream but is just built from the upstream assembly target.
Using the "upstream" source -- I suppose that means Apache Spark's distribution -- we built an assembly as follows:
mvn clean package -Pyarn -Phadoop-2.4 -Dhadoop.version=2.5.0-cdh5.3.1 -DskipTests
When building our applications, we linked the plain vanilla Apache Spark artifacts but the CDH 5.3.1 artifacts for Hadoop, Yarn, Hive, etc... E.g.,
we did NOT change the Akka version in the upstream source.
This seems to work -- with one big exception. With exception of that one problem (unrelated to Akka), the applications build. They run on pseudoclusters and a real just-upgraded CDH 5.3.1 cluster, without regressions, but some differences (some better, some worse) in performance if I replace the default Snappy compression with the formerly default LZF.
So far so good.
But the problem is with API changes in yarn.Client (org.apache.spark.deploy.yarn.Client). This should probably be a new topic, but let me start it here because it is urgent.
Sandy Riza and I have conversed about some of these issues at length in the past.
Here is the use case. The application is modular. There is a GUI plus other related services that occasionally run Spark jobs on Yarn in "cluster" mode. We built some utilities to help this, basically simulating and simplifying what SparkSubmit does. We simplify it because there's only one master, Yarn, and only one mode, cluster. It configures the yarn and application settings given the data and user interaction. Importantly, it needs access to the job's Yarn application id and preferably other Yarn job status reports.
In order to do this, one of my colleagues decided to call and configure yarn.Client directly, because Client can provide the needed job tracking information. However, in 1.2.x yarn.Client is now private. What is the reason for this? I notice that SparkSubmit only invokes yarn.Client by reflection. In my own experiments before I turned this over to my colleague, I did the same. But this leaves the calling module without the application id and without any status information about the progress of the Client.
Using Client in this way has worked for us until now.
Now of course, the Client's job is to request an Application Master from the Resource Manager, and to start the application driver in a Yarn container. I'm not sure why yarn.Client needs to be started by reflection.
Are there alternatives? Our only choice at the moment is to basically rewrite our own Client, which of course depends on other spark-yarn components that seem to have been made private.
I think the short answer is that this was never supposed to be an API for external callers and thus not supported as a stable API. It's why the whole artifact went away, but came back as a "developer API". I'm afraid it's pretty use-at-your-own risk, and the supported way of running Spark apps is only the spark-submit script.
Have you by chance seen the "Launcher" API coming in 1.4? Marcelo created this to be a much more real and fully functionaln programmatic API for submission. It might be what you're looking for, but it's still months away from being in a stable release.
Argh!! That's exactly what we need.
Have you got Marcelo's email or the Jira ticket item that tracks this???