Member since: 08-11-2014
Posts: 481
Kudos Received: 92
Solutions: 72
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3440 | 01-26-2018 04:02 AM |
| | 7077 | 12-22-2017 09:18 AM |
| | 3536 | 12-05-2017 06:13 AM |
| | 3847 | 10-16-2017 07:55 AM |
| | 11195 | 10-04-2017 08:08 PM |
08-31-2016
12:20 PM
Thanks, that did fix the issue. I got rid of the variable from the code, and I am now able to execute it in cluster mode.
08-26-2016
09:18 AM
It has always been documented in "Known Issues": https://www.cloudera.com/documentation/enterprise/release-notes/topics/cdh_rn_spark_ki.html

Generally speaking, there aren't differences. Not supported != different. However, some pieces aren't shipped, like the Thrift server and SparkR.

Differences usually crop up when upstream introduces a breaking change that can't be followed in a minor release. For example, the default in CDH is for the "legacy" memory config parameters to be active, so that the default memory configuration doesn't change in 1.6. Sometimes it relates to other things in the platform that can't change; I think the Akka version is (or was) different because other parts of Hadoop needed a different version.

The biggest example of this, IMHO, is Spark Streaming + Kafka. Spark 1.x doesn't support Kafka 0.9+, but CDH 5.7+ had to move to it to get security features. So CDH Spark 1.6 will actually only work with Kafka 0.9+, because the Kafka differences are mutually incompatible. Good in that you can use a recent Kafka, but still a difference!

Most of it, though, consists of warnings about incompatibilities between what Spark happens to support and what CDH ships in other components.
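To illustrate the memory-config point, a minimal Scala sketch (not CDH-specific); spark.memory.useLegacyMode is the upstream Spark 1.6 flag that controls whether the pre-1.6 "legacy" memory settings stay in effect:

```scala
// Minimal sketch: pin the memory manager explicitly so behavior doesn't
// depend on a distribution's default.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("memory-config-example")
  // true  = pre-1.6 "legacy" static memory management
  // false = Spark 1.6 unified memory manager
  .set("spark.memory.useLegacyMode", "true")

val sc = new SparkContext(conf)
```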
08-18-2016
01:59 AM
1 Kudo
max-age-data-hours will cause it to delete data on HDFS that is older than this number of hours. This means that subsequent models will be built on historical data that does not include data older than this time. That's all there is to it.
08-16-2016
08:09 AM
5.5 or 5.7? The title and text disagree. 5.5 would have Spark 1.4, and I am not sure whether SQLContext was exposed as sqlContext by the shell by default like that. It should be in Spark 1.6 (= CDH 5.6+).
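If a shell doesn't predefine it, a minimal sketch of creating one yourself, assuming the shell's built-in SparkContext `sc`:

```scala
// Minimal sketch for a Spark 1.x shell that doesn't predefine sqlContext;
// assumes the shell's built-in SparkContext `sc`.
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val df = sqlContext.createDataFrame(Seq((1, "a"), (2, "b"))).toDF("id", "value")
df.show()
```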
08-07-2016
10:02 PM
The first operation makes each value into a set containing that single value. ++ just adds collections together, combining elements of both sets. This is trying to build up a set of all values for each key. It can be written more simply as "groupByKey" really. Even this code could be more compact and efficient.
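A runnable sketch of both forms in the Spark shell; the sample data is made up for illustration:

```scala
// In the Spark shell, where `sc` is predefined; sample data is hypothetical.
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

// Wrap each value in a one-element Set, then merge the sets per key with ++.
val asSets = pairs.mapValues(v => Set(v)).reduceByKey(_ ++ _)

// The same result, written more simply with groupByKey.
val viaGroup = pairs.groupByKey().mapValues(_.toSet)

asSets.collect()   // e.g. Array((a,Set(1, 2)), (b,Set(3)))
```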
07-23-2016
01:41 PM
I typed spark-shell and got the Scala console.
07-06-2016
08:13 AM
I would advise using ipython's internal debugger, ipdb, which lets you run every statement step by step.

* http://quant-econ.net/py/ipython.html#debugging
* https://docs.python.org/3/library/pdb.html

Finally, regarding the other statements above: when using Anaconda's ipython, remember to set the environment variable PYSPARK_PYTHON to the location of ipython (e.g. /usr/bin/ipython) so PySpark knows where to find it. Good luck.
06-27-2016
09:28 AM
1 Kudo
I have a guess: you need to make each of those things a separate arg tag. I don't know Oozie well myself, but something similar is needed in Maven config files. That is, it may be reading this as a single arg, "-xm mapreduce", rather than two.
06-13-2016
03:46 PM
Yes it would be. The execution of transformations/actions is the same, just the source is different.
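Since I don't know the exact sources in question, a generic Scala sketch of the point, with hypothetical paths:

```scala
// Generic sketch (hypothetical paths): the downstream transformations and
// actions are identical; only where the RDD comes from differs.
import org.apache.spark.rdd.RDD

val fromLocalFile = sc.textFile("file:///tmp/input.txt")
val fromHdfs      = sc.textFile("hdfs:///data/input.txt")

def wordCount(lines: RDD[String]): RDD[(String, Int)] =
  lines.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)

wordCount(fromLocalFile).take(5)
wordCount(fromHdfs).take(5)
```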