CDH 5.6 & Spark

Contributor

The Spark guide mentions that CDH Spark lacks some features, such as Spark SQL for PySpark and the new spark.ml API. Where can I find more information on the changes that Cloudera made to Apache Spark for CDH 5.6? What is the base version of Spark being used? (CDH 5.5.2 uses Spark 1.5.0, AFAIK.)

Thanks,

Peter

1 ACCEPTED SOLUTION

Master Collaborator
That's not what it says; it says they just aren't supported, typically
because they're not "supported" in Spark either (e.g. experimental
API). Not supported != doesn't work; it just means you can't file a
support ticket for it.

CDH 5.6 = Spark 1.5 + patches, meaning it's like 1.5.2, likely with a
slightly different set of maintenance patches. It might lack
unimportant ones that arguably shouldn't be in a maintenance release, or
might include a critical one that was created after 1.5.2. Generally
speaking there are no other differences; it's just upstream Spark with
some tinkering with versions to make it integrate with other Hadoop
components correctly.
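
To make the "upstream base + patches" point concrete, here is a minimal sketch of splitting a CDH-style version string into its two parts. It assumes the string follows the "<upstream version>-cdh<CDH release>" pattern (the kind of value `sc.version` reports in a shell); the sample string and the helper name `split_cdh_version` are illustrative, not taken from the thread. Note that the upstream part only names the base release and says nothing about which maintenance patches were backported.

```python
import re

# Assumed format: "<upstream version>-cdh<CDH release>", e.g. "1.5.0-cdh5.6.0".
_CDH_VERSION = re.compile(r"^(?P<upstream>\d+(?:\.\d+)*)-cdh(?P<cdh>\d+(?:\.\d+)*)$")


def split_cdh_version(version):
    """Return (upstream_base, cdh_release).

    cdh_release is None when the string has no "-cdh" suffix,
    i.e. it looks like a plain upstream build.
    """
    cleaned = version.strip()
    match = _CDH_VERSION.match(cleaned)
    if match is None:
        return cleaned, None
    return match.group("upstream"), match.group("cdh")


# Illustrative input; on a real cluster the string would come from sc.version.
print(split_cdh_version("1.5.0-cdh5.6.0"))
```

The base version reported this way is only a starting point; the release notes remain the place to check which patches were added or left out.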

The exception is SparkR, which isn't even shipped, partly because CDH
can't ship R itself.


6 REPLIES

Contributor
Thanks, Sean!

Expert Contributor

Hortonworks HDP 2.4 includes it (v1.6.0).

Anyway, SparkR has been part of the Spark project since 1.4 (see the old AMPLab project page), so I don't understand why Cloudera can't just ship it along with the rest of the Spark package. It seems to be a conscious decision to remove the module. What's the reason?

Master Collaborator

See my reply above. You'd be surprised how many people complain about shipping things that aren't supported. It's about as many as complain about not shipping things that aren't supported.

Specific to R: shipping or otherwise arranging to install R is a small barrier, because R is GPL-licensed and can't ship with CDH. This ultimately isn't a big barrier.

Supportability is also a moderate issue. It's not trivial to get the whole support machine able to actually provide support for a new environment and technology, and R is not just another big data tool. Again, that's more a question of effort.

Maturity is a moderate issue. The API continued to change over Spark 1.x. For a while you could dapply code across the cluster, then it was removed, then it was added back. It's more an argument that this sort of thing is hard to support rather than hard to ship, but these things are linked.

Lastly, it's really demand. People do seem interested in "parallelizing R code", but that's not what SparkR does. They also use third-party tools like H2O + R or Revo. It hasn't been something people actually want to pay for support on.

Expert Contributor

Thanks for your detailed reply. That's a valid and understandable concern. We chose Cloudera for our production Hadoop platform precisely for the quality of integration and maturity you offer. We as users simply need some clarity from the vendor about observed feature discrepancies from the official distro, especially for such a critical component as Spark.

Are there any other discrepancies or customizations that we should be aware of? Can Cloudera be more transparent in its release notes whenever it removes or modifies features from the official open-source versions? Searching for "SparkR" in the CDH 5.7 release notes for Spark finds 4 JIRAs, which would give one the impression that SparkR is included.

Thanks again,

Miles

Master Collaborator

It has always been documented in "Known Issues": https://www.cloudera.com/documentation/enterprise/release-notes/topics/cdh_rn_spark_ki.html

Generally speaking, there aren't differences. Not supported != different. However, there are some pieces that aren't shipped, like the Thrift server and SparkR.

Usually differences crop up when upstream introduces a breaking change that can't be followed in a minor release. For example, the default in CDH is for the "legacy" memory config parameters to be active, so that the default memory configuration doesn't change in 1.6. Sometimes it relates to other parts of the platform that can't change; for example, I think the Akka version is (was) different because other components in Hadoop needed a different version.
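
As a rough illustration of how a changed default plays out, the sketch below resolves whether legacy memory management would be in effect for a given set of explicitly configured properties. It assumes the setting in question is upstream Spark 1.6's `spark.memory.useLegacyMode` property (the post does not name it), and the helper `legacy_memory_active` is hypothetical. The distribution's default is passed in as a parameter rather than hard-coded, since that is exactly the part that differs.

```python
# Assumption: the "legacy" switch is upstream Spark 1.6's spark.memory.useLegacyMode.
LEGACY_KEY = "spark.memory.useLegacyMode"


def legacy_memory_active(conf, distro_default):
    """Report whether legacy memory management would be in effect.

    conf: dict of explicitly set Spark properties (string keys and values).
    distro_default: the default applied when the property is not set.
    """
    raw = conf.get(LEGACY_KEY)
    if raw is None:
        # Nothing set explicitly, so the distribution's default decides.
        return distro_default
    return str(raw).strip().lower() == "true"


# Same (empty) user configuration, different outcome depending on the default.
print(legacy_memory_active({}, distro_default=True))
print(legacy_memory_active({}, distro_default=False))
```

An explicit setting wins either way, so pinning the property in your own configuration is one way to get the same behavior regardless of which default a build ships with.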

 

The biggest example of this, IMHO, is Spark Streaming + Kafka. Spark 1.x doesn't support Kafka 0.9+, but CDH 5.7+ had to move to it to get security features. So CDH Spark 1.6 will actually only work with Kafka 0.9+, because the Kafka differences are mutually incompatible. Good in that you can use recent Kafka, but a difference!

Most of it, though, is warnings about incompatibilities between what Spark happens to support and what CDH ships in other components.