Member since: 08-11-2014
Posts: 481
Kudos Received: 92
Solutions: 72
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 3043 | 01-26-2018 04:02 AM |
 | 6425 | 12-22-2017 09:18 AM |
 | 3099 | 12-05-2017 06:13 AM |
 | 3351 | 10-16-2017 07:55 AM |
 | 9585 | 10-04-2017 08:08 PM |
08-31-2016
06:47 AM
No, I've never seen such a variable defined by Spark. You can probably look up "spark.master" in the SparkConf. But you don't need to query it in order to make a SparkContext in your app. It looks like you might have modified a standard Spark example, in which case just undo those changes.
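As a minimal sketch (app name and default value are illustrative; it assumes the app is launched through spark-submit, which supplies the master), you can read the master from the SparkConf inside the application rather than expecting an environment variable:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// The master is normally supplied by spark-submit / the shell, not set in code.
val conf = new SparkConf().setAppName("Example")
val sc = new SparkContext(conf)

// If you really need the value, look it up from the conf once the context exists.
println(sc.getConf.get("spark.master", "<not set>"))
```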
08-31-2016
01:22 AM
1 Kudo
There's actually no real notion of 'installing Spark on the cluster'. It's a big JAR file that gets run along with some user code in a YARN container. For example, yesterday I took the vanilla upstream Apache Spark 2.0.0 (+ Hadoop 2.7) binary distribution, unpacked it on one cluster node, set HADOOP_CONF_DIR, and was able to run the Spark 2.0.0 shell on a CDH 5.8 cluster with no further changes. Not everything works out of the box: anything touching the Hive metastore, for instance, would require a little more tweaking / config. But that's about it for 'installation', at heart. Note that this is of course not supported, but it's also something you can try without modifying the existing installation, which you would never want to do anyway.
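As a rough sanity check (illustrative only; it assumes the shell was launched with --master yarn and that HADOOP_CONF_DIR points at the cluster config), the Spark 2.0 shell predefines a SparkSession called spark:

```scala
// Run inside the Spark 2.0.0 spark-shell unpacked from the upstream tarball.
spark.version                                    // should print 2.0.0
spark.sparkContext.master                        // "yarn" if launched against the cluster
spark.range(1000).selectExpr("sum(id)").show()   // runs a small distributed job
```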
08-26-2016
09:18 AM
It has always been documented in "Known Issues": https://www.cloudera.com/documentation/enterprise/release-notes/topics/cdh_rn_spark_ki.html

Generally speaking, there aren't differences. Not supported != different. However, there are some pieces that aren't shipped, like the thrift server and SparkR. Usually differences crop up when upstream introduces a breaking change and it can't be followed in a minor release. For example, the default in CDH is for the "legacy" memory config parameters to be active, so that the default memory config doesn't change in 1.6. Sometimes it relates to other stuff in the platform that can't change; for example, I think the Akka version is (or was) different because other stuff in Hadoop needed a different version.

The biggest example of this IMHO is Spark Streaming + Kafka. Spark 1.x doesn't support Kafka 0.9+, but CDH 5.7+ had to move to it to get security features. So CDH Spark 1.6 will actually only work with 0.9+, because the Kafka differences are mutually incompatible. Good in that you can use recent Kafka, but a difference! Most of it, though, is warnings about incompatibilities between what Spark happens to support and what CDH ships in other components.
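For illustration (a sketch only; the flag is the upstream Spark 1.6 setting, and the app name is made up), the legacy-vs-unified memory manager is chosen per application, so you can opt in or out explicitly instead of relying on the distribution's default:

```scala
import org.apache.spark.SparkConf

// "true" keeps the pre-1.6 ("legacy") memory behaviour;
// "false" opts into the unified memory manager introduced in Spark 1.6.
val conf = new SparkConf()
  .setAppName("MemoryManagerExample")
  .set("spark.memory.useLegacyMode", "false")
```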
08-25-2016
10:00 AM
See my reply above. You'd be surprised how many people complain about shipping things that aren't supported. It's about as many as complain about not shipping things that aren't supported.

Specific to R: shipping, or otherwise arranging to install, R is a small barrier because it is GPL and can't ship with CDH. This ultimately isn't a big barrier. Supportability is also a moderate issue. It's not trivial to get the whole support machine able to actually provide support for a new environment and technology, and R is not just another big data tool. Again, that's more a question of effort. Maturity is a moderate issue: the API continued to change over Spark 1.x. For a while you could dapply() code across the cluster, then it was removed, then it was added back. That's more an argument that this sort of thing is hard to support rather than to ship, but these things are linked. Lastly, it's really about demand. People do seem interested in "parallelizing R code", but that's not what SparkR does. They also use 3rd-party tools like H2O + R or Revo. It hasn't been something people actually want to pay for support on.
08-18-2016
01:59 AM
1 Kudo
max-age-data-hours will cause it to delete data on HDFS that is older than this number of hours. This means that subsequent models will be built on historical data that does not include data older than this time. That's all there is to it.
08-17-2016
09:47 AM
1 Kudo
(Please start a separate thread.) My last response explained why it's 4.
08-17-2016
09:28 AM
Yes, I don't think this is related, but the quick answer is that "cache" just means "cache this thing whenever you get around to computing it", and you are adding 2 records before it is computed. Hence count is 4, not 2.
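A minimal sketch of why that happens (file path and contents are hypothetical; the point is only that textFile() and cache() are lazy):

```scala
// Nothing is read or cached until the first action runs.
val rdd = sc.textFile("/tmp/data.txt")   // file has 2 lines at this point
rdd.cache()                              // lazy: just marks the RDD for caching

// ... two more lines are appended to /tmp/data.txt here ...

rdd.count()                              // first action reads the file now => 4
rdd.count()                              // answered from the cache => still 4
```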
08-16-2016
08:09 AM
5.5 or 5.7? The title and text disagree. 5.5 would have Spark 1.4, and I am not sure whether SQLContext was exposed as sqlContext by the shell by default like that. It should be in Spark 1.6 (= CDH 5.6+).
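If it isn't predefined in your shell, here is a sketch of creating one yourself (the column names and data are just placeholders; sc is the SparkContext the shell already provides):

```scala
import org.apache.spark.sql.SQLContext

// Works in the Spark 1.x shell, where sc is already defined.
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "value")
df.show()
```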
08-10-2016
08:43 AM
1 Kudo
In that case I think it's a version problem. You have a very old version of Spark that may not even have this class. It has nothing to do with CDH per se. Actually, you shouldn't be packaging Spark with your app at all, and you should find that this class is already part of the main assembly in more recent Spark versions. What if you omit this entirely and try the import? I just tried importing this in spark-shell in CDH 5.8 and it was available, without any additional jars.
08-09-2016
03:43 PM
You have quite an old version of Spark there, by the way. You're showing interaction with the shell, but referring to a POM file, which is for a compiled app. In general you need to add the JARs to the spark-shell command line to access them. I think in this old version of Spark the Kafka stuff was actually present in the examples uber jar; maybe just reference that.
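Something like the following, as a rough sketch (the jar path, ZooKeeper quorum, group and topic are placeholders; the class shown is the Spark 1.x Kafka integration entry point):

```scala
// Launch the shell with the extra jar on the classpath, for example:
//   spark-shell --jars /path/to/spark-examples-<version>.jar   (path is a placeholder)
// Then check that the Kafka integration classes resolve and can be used:
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(10))
// Hypothetical ZooKeeper quorum, consumer group and topic, just to show the call.
val stream = KafkaUtils.createStream(ssc, "zkhost:2181", "my-group", Map("my-topic" -> 1))
stream.map(_._2).print()
```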