Member since: 08-11-2014
Posts: 481
Kudos Received: 92
Solutions: 72
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3030 | 01-26-2018 04:02 AM |
| | 6380 | 12-22-2017 09:18 AM |
| | 3063 | 12-05-2017 06:13 AM |
| | 3321 | 10-16-2017 07:55 AM |
| | 9501 | 10-04-2017 08:08 PM |
11-16-2016 01:57 AM
1 Kudo
It's in the same repo, not a different 'beta' repo. For example, see: https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/spark/spark-core_2.11/2.0.0.cloudera.beta2/
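For example, to consume it from sbt, a sketch along these lines (coordinates read straight off the URL above; the "provided" scope is my assumption):

```scala
// build.sbt sketch: the beta artifacts live in Cloudera's main repo.
resolvers += "cloudera" at "https://repository.cloudera.com/artifactory/cloudera-repos/"

// Coordinates taken directly from the repository path above.
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.0.0.cloudera.beta2" % "provided"
```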
11-16-2016 01:56 AM
(Which link are you referring to?)
11-07-2016 07:42 AM
I suspect you'll find your version of Spark's example uses twitter4j 3.x. Just don't bundle it yourself; it ought to be in the examples .jar.
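In build terms, "don't bundle it yourself" would look something like this in sbt (artifact and version here are illustrative assumptions, not taken from your setup):

```scala
// build.sbt sketch: compile against twitter4j, but don't package it;
// the copy inside Spark's examples jar is the one used at runtime.
libraryDependencies += "org.twitter4j" % "twitter4j-core" % "3.0.3" % "provided"
```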
11-07-2016 01:32 AM
It's not the same problem. Here you have put an incompatible version of twitter4j on the classpath.
09-26-2016 01:29 AM
1 Kudo
The essential point here is that you want to avoid a shuffle, and you can avoid one if both RDDs are partitioned in the same way, because then all values for a given key are already on one partition in each RDD. join calls cogroup, so yes, both can accomplish this, as long as both RDDs have the same partitioner. This won't be true, however, if you first flatMap one of the RDDs, since flatMap can't be known to preserve the partitioning.
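A minimal sketch of both cases, assuming an existing SparkContext `sc`:

```scala
import org.apache.spark.HashPartitioner

// Co-partition both RDDs up front: the join then needs no shuffle, because
// all values for a given key already sit in the same partition index.
val part  = new HashPartitioner(8)
val left  = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(part)
val right = sc.parallelize(Seq((1, "x"), (2, "y"))).partitionBy(part)

val fast = left.join(right)   // same partitioner on both sides: no shuffle

// flatMap may change keys, so Spark discards the partitioner; joining the
// result shuffles again. (mapValues, by contrast, preserves partitioning.)
val remapped = left.flatMap { case (k, v) => Seq((k, v.toUpperCase)) }
val slow = remapped.join(right)
```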
09-07-2016 04:05 AM
That one's actually easy. As it says, your codec doesn't have a constructor accepting a SparkConf.
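For illustration only, a bare-bones sketch of the shape Spark expects, assuming the codec implements Spark's CompressionCodec developer API (the pass-through streams are placeholders, not a real codec):

```scala
import java.io.{InputStream, OutputStream}
import org.apache.spark.SparkConf
import org.apache.spark.io.CompressionCodec

// Spark instantiates codecs reflectively via a constructor taking a
// SparkConf; without one you get exactly the error in question.
class MyCodec(conf: SparkConf) extends CompressionCodec {
  override def compressedOutputStream(s: OutputStream): OutputStream = s
  override def compressedInputStream(s: InputStream): InputStream = s
}
```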
09-07-2016 01:00 AM
--jars isn't quite relevant here, as it just puts classes in the same classloader as if you'd packaged them with your app. The key is that there are many classloaders at play, and only some of them can see user classes. IIRC, codec classes are the most problematic, because they're needed within Spark itself. You should post more details about the failure. Also try the "user classpath first" options.
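For reference, those options look like this on spark-submit (both flags are marked experimental, so treat with care):

```
spark-submit \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  ...
```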
09-07-2016 12:54 AM
Well, the data that is stored certainly affects future models. All historical data is used to build models.
09-06-2016 02:59 AM
This affects historical input data and model data stored on HDFS only. Every time the batch layer runs, it checks the data against these settings and deletes data or models older than the given age.

It does not affect the age of data stored in Kafka topics. The input topic's retention doesn't matter much; it just needs to be long enough that the batch process still sees all data since the last batch by the time it runs. The update topic's retention should be long enough that at least one model remains somewhere in the topic, so it too should be at least as long as the batch interval. If it's too long, though, the speed/serving processes will waste time sifting through old data on startup to catch up.

The effect of deleting old input data is that it will no longer be used to build future models. There's really no effect of deleting old models, with one exception: in some cases a model is stored on HDFS but is too large to send via Kafka, in which case a reference to its HDFS location is stored instead. If a model is deleted from HDFS but is still referenced on the Kafka update topic, it will be ignored. That's no big deal, but it does mean you shouldn't delete old models too aggressively.

A good general rule: batch interval < topic retention times < max age settings.
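To make that rule concrete with a hypothetical sizing (topic name and numbers invented for illustration): with, say, a 6-hour batch interval, you might keep 24 hours of input in Kafka and weeks of data/models on HDFS:

```
# Keep 24h on the (hypothetical) input topic: comfortably more than one
# 6-hour batch interval, and far less than the HDFS max-age settings.
kafka-configs.sh --zookeeper localhost:2181 --alter \
  --entity-type topics --entity-name OryxInput \
  --add-config retention.ms=86400000
```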
09-03-2016 01:29 AM
1 Kudo
Do you just mean you want to copy that file to your local machine? `hdfs dfs -get [file]`
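For example (HDFS path is hypothetical):

```
# Copy the file from HDFS into the local working directory
hdfs dfs -get /user/alice/output/part-00000 .
```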