The Spark guide mentions that CDH Spark lacks some features, such as Spark SQL for PySpark and the new spark.ml API. Where can I find more information on the changes that Cloudera made to Apache Spark for CDH 5.6? What is the base version of Spark being used? (CDH 5.5.2 uses Spark 1.5.0, afaik.)
Hortonworks HDP 2.4 includes it (v.1.6.0).
Anyway, SparkR has been part of the Spark project since 1.4 (see the old AMPLab project page), so I don't understand why Cloudera can't just ship it along with the rest of the Spark package. It seems a conscious decision to remove the module. What's the reason?
See my reply above. You'd be surprised how many people complain about shipping things that aren't supported. It's about as many as complain about not shipping things that aren't supported.
Specific to R: Shipping or otherwise arranging to install R is a small barrier because it is GPL and can't ship with CDH. This ultimately isn't a big barrier.
Supportability is also a moderate issue. It's not trivial to get the whole support machine able to actually provide support for a new environment and technology, and R is not just another big data tool. Again that's more a question of effort.
Maturity is a moderate issue. The API continued to change over Spark 1.x. For a while you could dapply code across the cluster, then it was removed, then it was added back. That's more an argument that this sort of thing is hard to support than that it's hard to ship, but the two are linked.
Lastly it's really demand. People do seem interested in "parallelizing R code" but it's not what SparkR does. They also use 3rd party tools like H2O + R or Revo. It hasn't been something people actually want to pay for support on.
Thanks for your detailed reply. That's a valid and understandable concern. We chose Cloudera for our production Hadoop platform precisely for the quality of integration and maturity you offer. We as users simply need some clarity from the vendor for observed feature discrepancies from the official distro, especially for such a critical component as Spark.
Are there any other discrepancies or customizations that we should be aware of? Can Cloudera be more transparent in its release notes whenever it removes or modifies features from the official open-source versions? Searching for "SparkR" in the CDH 5.7 release notes for Spark turns up 4 JIRAs, which would give one the impression that SparkR is included.
It has always been documented in "Known Issues": https://www.cloudera.com/documentation/enterprise/release-notes/topics/cdh_rn_spark_ki.html Generally speaking, there aren't differences. Not supported != different. However, there are some pieces that aren't shipped, like the Thrift server and SparkR.
Usually differences crop up when upstream introduces a breaking change and it can't be followed in a minor release. For example, the default in CDH is for the "legacy" memory config parameters to be active, so that the default memory config doesn't change in 1.6. Sometimes it relates to other stuff in the platform that can't change; I think the Akka version is (was) different because other stuff in Hadoop needed a different version.
The biggest example of this IMHO is Spark Streaming + Kafka. Spark 1.x doesn't support Kafka 0.9+, but CDH 5.7+ had to move to it to get security features. So CDH Spark 1.6 will actually only work with Kafka 0.9+, because the Kafka differences are mutually incompatible. Good in that you can use recent Kafka, but it is a difference!
Most of it, though, is warnings about incompatibilities between what Spark happens to support and what CDH ships in other components.
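To make the "legacy" memory config point above concrete, here is a minimal sketch of how the same job properties can resolve differently depending on the default. It assumes the upstream Spark 1.6 property names and defaults (spark.memory.useLegacyMode, spark.memory.fraction and so on); the helper effective_memory_settings and its legacy_default flag are hypothetical, made up for illustration, and are not part of Spark or CDH.

```python
# Sketch only: which memory properties apply to a Spark 1.6 job.
# Property names and default values are the upstream Apache Spark 1.6 ones.
# effective_memory_settings() and legacy_default are hypothetical helpers.

UNIFIED_DEFAULTS = {
    "spark.memory.fraction": "0.75",
    "spark.memory.storageFraction": "0.5",
}
LEGACY_DEFAULTS = {
    "spark.storage.memoryFraction": "0.6",
    "spark.shuffle.memoryFraction": "0.2",
}


def effective_memory_settings(conf, legacy_default=False):
    """Return (mode, settings) for a dict of Spark properties.

    legacy_default is the value assumed for spark.memory.useLegacyMode when
    the job does not set it: False upstream, True for a distribution that
    keeps the legacy parameters active by default.
    """
    raw = conf.get("spark.memory.useLegacyMode")
    legacy = legacy_default if raw is None else raw.strip().lower() == "true"
    defaults = LEGACY_DEFAULTS if legacy else UNIFIED_DEFAULTS
    # Only the properties belonging to the active mode are read.
    settings = {key: conf.get(key, value) for key, value in defaults.items()}
    return ("legacy" if legacy else "unified"), settings


if __name__ == "__main__":
    job_conf = {"spark.storage.memoryFraction": "0.5"}
    print(effective_memory_settings(job_conf))                       # upstream default
    print(effective_memory_settings(job_conf, legacy_default=True))  # legacy kept on
```

With the upstream default the job's spark.storage.memoryFraction setting is not read at all, while with the legacy default it is honored, which is the kind of silent change the default avoids. On a live cluster, something like sc.getConf().get("spark.memory.useLegacyMode", "not set") in PySpark should show what a given job actually sees.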