Member since: 08-11-2014
Posts: 481
Kudos Received: 92
Solutions: 72
My Accepted Solutions

Views | Posted
---|---
3444 | 01-26-2018 04:02 AM
7089 | 12-22-2017 09:18 AM
3538 | 12-05-2017 06:13 AM
3856 | 10-16-2017 07:55 AM
11224 | 10-04-2017 08:08 PM
06-05-2016
07:20 AM
1 Kudo
"Features" means the number of latent features in the factored matrix model. If the user-item matrix A is factored as A ~= X Y', then the number of features f is the number of columns of X and Y. Weights are not ratings; no, weights can be any value. One approach is to view any interaction at all as a "1". You might instead treat bad ratings as negative weights and good ratings as positive weights.
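To make that concrete, here's a minimal sketch using MLlib's ALS from spark-shell; the input file and the (user, item, weight) format are hypothetical, and rank is the number of latent features f:

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Hypothetical input: one "user,item,weight" triple per line.
val interactions = sc.textFile("interactions.csv").map { line =>
  val Array(user, item, weight) = line.split(',')
  // The value is a weight, not a rating: any interaction could be
  // mapped to 1.0 here, or bad ratings mapped to negative weights.
  Rating(user.toInt, item.toInt, weight.toDouble)
}

val rank = 20  // f: the number of columns of X and Y
// trainImplicit treats the values as confidence weights, not ratings.
val model = ALS.trainImplicit(interactions, rank, 10, 0.01, 1.0)
```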
05-24-2016
04:13 PM
1 Kudo
Yes, all of that is correct regarding how the CDH build works vs. upstream and the Kafka version. Although I admit I have not tried it directly, my understanding is that all of this is so you can use security with Kafka and Spark Streaming.
05-21-2016
10:42 AM
Yes, you will certainly need to provide access keys for S3 access to work. I don't think (?) that would be a solution to a VerifyError, though, which is a much lower-level error indicating a corrupted build. Yes, it's expected that the AWS SDK dependencies were updated along with the new Spark version in CDH 5.7. I think the current version should depend on jets3t 0.9, which is the one you want.
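For instance, here's a minimal sketch of supplying credentials through the Hadoop configuration; the bucket, path, and key values are placeholders:

```scala
// s3n is the jets3t-backed filesystem; these are its standard
// credential properties in the Hadoop configuration.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")
val lines = sc.textFile("s3n://your-bucket/path/to/data")
```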
05-21-2016
07:11 AM
That looks like an S3 library problem: the JVM is saying the bytecode itself is invalid. It has nothing to do with Spark per se. CDH does not support S3, although there's no particular reason it wouldn't work if you had the right libraries in place.
05-17-2016
01:08 PM
Typically you set this per job, on the command line, as arguments to spark-shell. If a setting really is something to establish as a default, you can update spark-defaults.conf or point your jobs at a new, different one. Advanced configuration snippets are for services, like the Spark history server, at least to my understanding; I'm not sure they would apply here.
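To illustrate both options (the setting, value, and path below are just examples):

```sh
# Per job: pass settings on the command line
spark-shell --conf spark.executor.memory=4g

# As a default: point jobs at a different defaults file
spark-shell --properties-file /path/to/my-spark-defaults.conf
```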
05-16-2016
03:11 PM
1 Kudo
Heh, that is a large part of what dynamic allocation was meant for: a long-running process that consumes resources only when it's active, and a shell sitting open is a prime example of that. To some degree you can manage this via resource pools in YARN, restricting a user, group, or perhaps type of usage to a certain set of resources. That would be a pretty crude limit, though, just a cap on the problem; open shells would still hold resources. Timing out shells is tricky because users lose work and state, which is probably pretty surprising. Really, you want dynamic allocation for this.
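As a sketch, these are the kinds of spark-defaults.conf entries involved; the values are illustrative, and dynamic allocation on YARN also requires the external shuffle service:

```
spark.dynamicAllocation.enabled              true
spark.shuffle.service.enabled                true
spark.dynamicAllocation.minExecutors         0
spark.dynamicAllocation.executorIdleTimeout  60s
```

With these, an idle shell gives its executors back after the timeout and requests them again when work resumes.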
04-27-2016
03:27 AM
1 Kudo
The problem is that you have very few input data points -- 4, I'm guessing. maxBins larger than the size of the input doesn't make sense, so it's capped at the size of the input. But maxBins also can't be less than the number of values of any categorical feature, since that would mean the tree isn't allowed to try all possible values. It's not obvious from the error message (which is better in later versions than the Spark 1.1 you're using), but that's almost certainly the issue.
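A sketch of the constraint, assuming a hypothetical trainingData RDD of 4 LabeledPoints whose feature 0 is categorical with 3 distinct values:

```scala
import org.apache.spark.mllib.tree.DecisionTree

// Feature 0 is categorical with 3 distinct values.
val categoricalFeaturesInfo = Map(0 -> 3)

// maxBins is capped at the number of examples (4 here) but must be at
// least the arity of every categorical feature (3 here), so it must
// satisfy 3 <= maxBins <= 4. The final argument is maxBins.
val model = DecisionTree.trainClassifier(
  trainingData, 2, categoricalFeaturesInfo, "gini", 5, 4)
```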
04-25-2016
04:10 PM
1 Kudo
Calculating a median or other quantile is in general much harder than computing a moment like a mean. You want to look for Spark functions that compute quantiles, rather than look for a median function -- the median is just the 0.5 quantile. There is an efficient approximate implementation for DataFrames in Spark.
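For example, with approxQuantile in Spark 2.0 and later; df and the column name "value" are placeholders:

```scala
// The median is the 0.5 quantile; the last argument is the allowed
// relative error of the approximation.
val Array(median) = df.stat.approxQuantile("value", Array(0.5), 0.01)
```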
04-25-2016
05:02 AM
Try "sc.emptyRDD[(String,Int)]"; currently the type of the RDD is not inferrable and so isn't obviously a match for the ones that it's unioned to later. But, you really want to use SparkContext.union here to union many RDDs. Make a Seq of them and then call once to union them.
04-20-2016
12:36 PM
1 Kudo
In this context, I imagine that in practice it means "longer than the Kerberos ticket expiration".