Member since: 08-11-2014
Posts: 481
Kudos Received: 92
Solutions: 72
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 3043 | 01-26-2018 04:02 AM |
 | 6425 | 12-22-2017 09:18 AM |
 | 3099 | 12-05-2017 06:13 AM |
 | 3351 | 10-16-2017 07:55 AM |
 | 9585 | 10-04-2017 08:08 PM |
08-31-2016
06:47 AM
No, I've never seen such a variable defined by Spark. You can probably look up "spark.master" in the SparkConf. But you don't need to query it in order to make a SparkContext in your app. It looks like you might have modified a standard Spark example, in which case just undo those changes.
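As a minimal sketch (app name and default value are illustrative; it assumes the app is launched through spark-submit, which supplies the master), you can read the master from the SparkConf inside the application rather than expecting an environment variable:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// The master is normally supplied by spark-submit / the shell, not set in code.
val conf = new SparkConf().setAppName("Example")
val sc = new SparkContext(conf)

// If you really need the value, look it up from the conf once the context exists.
println(sc.getConf.get("spark.master", "<not set>"))
```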
08-31-2016
01:22 AM
1 Kudo
There's actually no real notion of 'installing Spark on the cluster'. It's a big JAR file that gets run along with some user code in a YARN container. For example, yesterday I took the vanilla upstream Apache Spark 2.0.0 (+ Hadoop 2.7) binary distribution, unpacked it on one cluster node, set HADOOP_CONF_DIR, and was able to run the Spark 2.0.0 shell on a CDH 5.8 cluster with no further changes. Not everything works out of the box: anything touching the Hive metastore, for instance, would require a little more tweaking / config. But that's about it for 'installation', at heart. Note that this is of course not supported, but it's also something you can try without modifying the existing installation, which you would never want to do anyway.
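As a rough sanity check (illustrative only; it assumes the shell was launched with --master yarn and that HADOOP_CONF_DIR points at the cluster config), the Spark 2.0 shell predefines a SparkSession called spark:

```scala
// Run inside the Spark 2.0.0 spark-shell unpacked from the upstream tarball.
spark.version                                    // should print 2.0.0
spark.sparkContext.master                        // "yarn" if launched against the cluster
spark.range(1000).selectExpr("sum(id)").show()   // runs a small distributed job
```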
08-26-2016
09:18 AM
It has always been documented in "Known Issues": https://www.cloudera.com/documentation/enterprise/release-notes/topics/cdh_rn_spark_ki.html

Generally speaking, there aren't differences. Not supported != different. However, there are some pieces that aren't shipped, like the thrift server and SparkR. Usually differences crop up when upstream introduces a breaking change and it can't be followed in a minor release. For example, the default in CDH is for the "legacy" memory config parameters to be active, so that the default memory config doesn't change in 1.6. Sometimes it relates to other stuff in the platform that can't change; for example, I think the Akka version is (or was) different because other stuff in Hadoop needed a different version.

The biggest example of this IMHO is Spark Streaming + Kafka. Spark 1.x doesn't support Kafka 0.9+, but CDH 5.7+ had to move to it to get security features. So CDH Spark 1.6 will actually only work with 0.9+, because the Kafka differences are mutually incompatible. Good in that you can use recent Kafka, but a difference! Most of it, though, is warnings about incompatibilities between what Spark happens to support and what CDH ships in other components.
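For illustration (a sketch only; the flag is the upstream Spark 1.6 setting, and the app name is made up), the legacy-vs-unified memory manager is chosen per application, so you can opt in or out explicitly instead of relying on the distribution's default:

```scala
import org.apache.spark.SparkConf

// "true" keeps the pre-1.6 ("legacy") memory behaviour;
// "false" opts into the unified memory manager introduced in Spark 1.6.
val conf = new SparkConf()
  .setAppName("MemoryManagerExample")
  .set("spark.memory.useLegacyMode", "false")
```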
08-25-2016
10:00 AM
See my reply above. You'd be surprised how many people complain about shipping things that aren't supported. It's about as many as complain about not shipping things that aren't supported.

Specific to R: shipping, or otherwise arranging to install, R is a small barrier because it is GPL and can't ship with CDH. This ultimately isn't a big barrier. Supportability is also a moderate issue. It's not trivial to get the whole support machine able to actually provide support for a new environment and technology, and R is not just another big data tool. Again, that's more a question of effort. Maturity is a moderate issue: the API continued to change over Spark 1.x. For a while you could dapply() code across the cluster, then it was removed, then it was added back. That's more an argument that this sort of thing is hard to support rather than to ship, but these things are linked. Lastly, it's really about demand. People do seem interested in "parallelizing R code", but that's not what SparkR does. They also use 3rd-party tools like H2O + R or Revo. It hasn't been something people actually want to pay for support on.
08-18-2016
01:59 AM
1 Kudo
max-age-data-hours will cause it to delete data on HDFS that is older than this number of hours. This means that subsequent models will be built on historical data that does not include data older than this time. That's all there is to it.
08-17-2016
09:47 AM
1 Kudo
(Please start a separate thread.) My last response explained why it's 4.
08-17-2016
09:28 AM
Yes, I don't think this is related, but the quick answer is that "cache" just means "cache this thing whenever you get around to computing it", and you are adding 2 records before it is computed. Hence count is 4, not 2.
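A minimal sketch of why that happens (file path and contents are hypothetical; the point is only that textFile() and cache() are lazy):

```scala
// Nothing is read or cached until the first action runs.
val rdd = sc.textFile("/tmp/data.txt")   // file has 2 lines at this point
rdd.cache()                              // lazy: just marks the RDD for caching

// ... two more lines are appended to /tmp/data.txt here ...

rdd.count()                              // first action reads the file now => 4
rdd.count()                              // answered from the cache => still 4
```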
08-16-2016
08:09 AM
5.5 or 5.7? The title and text disagree. 5.5 would have Spark 1.4, and I am not sure whether SQLContext was exposed as sqlContext by the shell by default like that. It should be in Spark 1.6 (= CDH 5.6+).
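If it isn't predefined in your shell, here is a sketch of creating one yourself (the column names and data are just placeholders; sc is the SparkContext the shell already provides):

```scala
import org.apache.spark.sql.SQLContext

// Works in the Spark 1.x shell, where sc is already defined.
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "value")
df.show()
```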
08-10-2016
08:43 AM
1 Kudo
In that case I think it's a version problem. You have a very old version of Spark that may not even have this class. It has nothing to do with CDH per se. Actually, you shouldn't be packaging Spark with your app at all, and you should find that this class is already part of the main assembly in more recent Spark versions. What if you omit this entirely and try the import? I just tried importing this in spark-shell in CDH 5.8 and it was available, without any additional jars.
08-09-2016
03:43 PM
You have quite an old version of Spark there, by the way. You're showing interaction with the shell, but referring to a POM file, which is for a compiled app. In general you need to add the JARs to the spark-shell command line to access them. I think in this old version of Spark the Kafka stuff was actually present in the examples uber jar; maybe just reference that.
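Something like the following, as a rough sketch (the jar path, ZooKeeper quorum, group and topic are placeholders; the class shown is the Spark 1.x Kafka integration entry point):

```scala
// Launch the shell with the extra jar on the classpath, for example:
//   spark-shell --jars /path/to/spark-examples-<version>.jar   (path is a placeholder)
// Then check that the Kafka integration classes resolve and can be used:
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(10))
// Hypothetical ZooKeeper quorum, consumer group and topic, just to show the call.
val stream = KafkaUtils.createStream(ssc, "zkhost:2181", "my-group", Map("my-topic" -> 1))
stream.map(_._2).print()
```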