Member since: 08-11-2014
Posts: 481
Kudos Received: 92
Solutions: 72
My Accepted Solutions

Views | Posted
---|---
3444 | 01-26-2018 04:02 AM
7089 | 12-22-2017 09:18 AM
3538 | 12-05-2017 06:13 AM
3856 | 10-16-2017 07:55 AM
11224 | 10-04-2017 08:08 PM
06-05-2016
07:20 AM
1 Kudo
"Features" means the number of latent features in the factored matrix model. If the user-item matrix A is factored as A ~= X Y', then the number of features f is the number of columns of X and Y. Weights are not ratings; no, weights can be any value. One approach is to view any interaction at all as a "1". You might instead treat bad ratings as negative weights and good ratings as positive weights.
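To make that concrete, here's a minimal sketch using MLlib's ALS from spark-shell; the input file and the (user, item, weight) format are hypothetical, and rank is the number of latent features f:

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Hypothetical input: one "user,item,weight" triple per line.
val interactions = sc.textFile("interactions.csv").map { line =>
  val Array(user, item, weight) = line.split(',')
  // The value is a weight, not a rating: any interaction could be
  // mapped to 1.0 here, or bad ratings mapped to negative weights.
  Rating(user.toInt, item.toInt, weight.toDouble)
}

val rank = 20  // f: the number of columns of X and Y
// trainImplicit treats the values as confidence weights, not ratings.
val model = ALS.trainImplicit(interactions, rank, 10, 0.01, 1.0)
```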
05-24-2016
04:13 PM
1 Kudo
Yes, all of that is correct regarding how the CDH build works vs. upstream and the Kafka version. Although I admit I have not tried it directly, my understanding is that all of this is so you can use security with Kafka and Spark Streaming.
05-21-2016
10:42 AM
Yes, you will certainly need to provide access keys for S3 access to work. I don't think (?) that would be a solution to a VerifyError, though, which is a much lower-level error indicating a corrupted build. Yes, it's expected that the AWS SDK dependencies were updated along with the new Spark version in CDH 5.7. I think the current version should depend on jets3t 0.9, which is the one you want.
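For instance, here's a minimal sketch of supplying credentials through the Hadoop configuration; the bucket, path, and key values are placeholders:

```scala
// s3n is the jets3t-backed filesystem; these are its standard
// credential properties in the Hadoop configuration.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")
val lines = sc.textFile("s3n://your-bucket/path/to/data")
```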
05-21-2016
07:11 AM
That looks like an S3 library problem: the JVM is saying the bytecode itself is invalid. It has nothing to do with Spark per se. CDH does not support S3, although there's no particular reason it wouldn't work if you had the right libraries in place.
05-17-2016
01:08 PM
Typically you set this per job, on the command line, as arguments to spark-shell. If a setting really is something to establish as a default, you can update spark-defaults.conf or point your jobs at a new, different one. Advanced configuration snippets are for services, like the Spark history server, at least to my understanding; I'm not sure they would apply here.
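To illustrate both options (the setting, value, and path below are just examples):

```sh
# Per job: pass settings on the command line
spark-shell --conf spark.executor.memory=4g

# As a default: point jobs at a different defaults file
spark-shell --properties-file /path/to/my-spark-defaults.conf
```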
05-16-2016
03:11 PM
1 Kudo
Heh, that is a large part of what dynamic allocation was meant for: a long-running process that consumes resources only when it's active, and a shell sitting open is a prime example of that. To some degree you can manage this via resource pools in YARN, restricting a user, group, or perhaps type of usage to a certain set of resources. That would be a pretty crude limit, though, just a cap on the problem; open shells would still hold resources. Timing out shells is tricky because users lose work and state, which is probably pretty surprising. Really, you want dynamic allocation for this.
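As a sketch, these are the kinds of spark-defaults.conf entries involved; the values are illustrative, and dynamic allocation on YARN also requires the external shuffle service:

```
spark.dynamicAllocation.enabled              true
spark.shuffle.service.enabled                true
spark.dynamicAllocation.minExecutors         0
spark.dynamicAllocation.executorIdleTimeout  60s
```

With these, an idle shell gives its executors back after the timeout and requests them again when work resumes.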
04-27-2016
03:27 AM
1 Kudo
The problem is that you have very few input data points -- 4, I'm guessing. maxBins larger than the size of the input doesn't make sense, so it's capped at the size of the input. But maxBins also can't be less than the number of values of any categorical feature, since that would mean the tree isn't allowed to try all possible values. It's not obvious from the error message (which is better in later versions than the Spark 1.1 you're using), but that's almost certainly the issue.
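A sketch of the constraint, assuming a hypothetical trainingData RDD of 4 LabeledPoints whose feature 0 is categorical with 3 distinct values:

```scala
import org.apache.spark.mllib.tree.DecisionTree

// Feature 0 is categorical with 3 distinct values.
val categoricalFeaturesInfo = Map(0 -> 3)

// maxBins is capped at the number of examples (4 here) but must be at
// least the arity of every categorical feature (3 here), so it must
// satisfy 3 <= maxBins <= 4. The final argument is maxBins.
val model = DecisionTree.trainClassifier(
  trainingData, 2, categoricalFeaturesInfo, "gini", 5, 4)
```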
04-25-2016
04:10 PM
1 Kudo
Calculating a median or other quantile is in general much harder than computing a moment like a mean. You want to look for Spark functions that compute quantiles, rather than look for a median function -- the median is just the 0.5 quantile. There is an efficient approximate implementation for DataFrames in Spark.
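For example, with approxQuantile in Spark 2.0 and later; df and the column name "value" are placeholders:

```scala
// The median is the 0.5 quantile; the last argument is the allowed
// relative error of the approximation.
val Array(median) = df.stat.approxQuantile("value", Array(0.5), 0.01)
```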
04-25-2016
05:02 AM
Try "sc.emptyRDD[(String,Int)]"; currently the type of the RDD is not inferrable and so isn't obviously a match for the ones that it's unioned to later. But, you really want to use SparkContext.union here to union many RDDs. Make a Seq of them and then call once to union them.
04-20-2016
12:36 PM
1 Kudo
In this context, I imagine that in practice it means "longer than the Kerberos ticket expiration".