Member since: 07-29-2013
Posts: 366
Kudos Received: 69
Solutions: 71
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 6199 | 03-09-2016 01:21 AM |
| | 5034 | 03-07-2016 01:52 AM |
| | 15084 | 02-29-2016 04:40 AM |
| | 4747 | 02-22-2016 03:08 PM |
| | 5750 | 01-19-2016 02:13 PM |
11-10-2014
02:43 PM
1 Kudo
Apologies, I'm mixing up 1.x and 2.x. The default evaluation metric in 1.x is mean average precision, or MAP. This is a measure of the extent to which the top recommendations contain items that were held out for the user. In local mode you can find lines like "Mean average precision: xxx" in the logs. In distributed mode, now that I review the code, I don't see that it is ever logged; it is written to a file called "MAP" under the subdirectory for the iteration. I can at least make the mapper workers output their own local value of MAP.

In 2.x the metric is AUC, which is basically a measure of how likely it is that a 'good' recommendation (from the held-out data set) ranks above a random item. It is a broader, different measure. In 2.x you should definitely find it printed in the logs, along with the hyperparameters that yielded it.
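For illustration, here is a minimal sketch of how average precision can be computed for one user's ranked recommendations against their held-out items; the function and its inputs are hypothetical, not Oryx's actual implementation:

```scala
// Sketch: average precision for one user's ranked recommendations, given
// the set of items held out for that user. Hypothetical, not Oryx's code.
def averagePrecision(recommended: Seq[String], heldOut: Set[String]): Double =
  if (heldOut.isEmpty) 0.0 else {
    var hits = 0
    var precisionSum = 0.0
    for ((item, rank) <- recommended.zipWithIndex if heldOut.contains(item)) {
      hits += 1
      precisionSum += hits.toDouble / (rank + 1) // precision at this rank
    }
    precisionSum / math.min(heldOut.size, recommended.size)
  }
```

MAP is then just the mean of this value over all evaluated users.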
11-10-2014
02:14 PM
It depends a lot on just what you mean by 'analysis'; one machine could be just fine. In general I think you will want to play with Spark, and Spark loves memory, so 8GB of RAM seems a bit small, but 4 cores is OK, and I bet you have plenty of disk space. I do not agree at all that you need 10TB of disk space; that is orders of magnitude overkill for a 10GB data set.
11-05-2014
11:20 PM
Yes, in general you need to include with your app the JARs that it uses; this is true of Java in general. Are you certain you are using ZooKeeper in your app? SPARK_CLASSPATH is an old mechanism; see the Spark 1.0.0 docs, or better, update to CDH 5.2 / Spark 1.1 (although that's not the problem here). I don't think this is the best place for Maven help, and I'm not sure what you are referring to regarding dependencies and a 64MB limit.
11-03-2014
10:32 AM
Yes, they will usually be in [0,1], but not always. They aren't probabilities. They are entries in X*Y', yes. I think it's safe to take values >= 1 as a very strong positive. What's a good cutoff? It really depends on your semantics and use case. The scores are comparable across models, so you can probably determine a value empirically with some testing.
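To make that concrete, a toy sketch of where these values come from: each score is the dot product of a user's row of X and an item's row of Y. The vectors and values below are made up:

```scala
// Sketch: a score is the dot product of a user feature vector (a row of X)
// and an item feature vector (a row of Y). These values are made up.
def score(user: Array[Double], item: Array[Double]): Double =
  user.zip(item).map { case (u, i) => u * i }.sum

val u = Array(0.9, 0.1, 0.4)  // hypothetical row of X
val v = Array(0.8, 0.2, 0.5)  // hypothetical row of Y
val s = score(u, v)           // 0.94: usually in [0,1], but not always
val strongPositive = s >= 1.0 // values >= 1 read as a very strong positive
```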
11-03-2014
10:30 AM
Yes, I meant the server side. Actually, looking at the source, it should allow up to 65536 bytes of header data.
11-02-2014
09:09 AM
Yeah, this is bumping up against the practical limit of this simplistic API. The max URL length is set to be pretty big, like 8K IIRC. Is it possible that whatever item IDs you need can be looked up from an external source? There's always filtering on the caller side too, although that has its own problems.
10-31-2014
11:39 AM
Yeah, because it makes lots of small files? One option is to have a post-processing job that merges the files together with getmerge. The general answer to getting an unserializable object to the workers is to create it on the workers instead: make your writer or connection object once per partition and do something with it, as in the sketch below. Spark SQL is distributed as part of CDH. Lots of stuff can consume from Kafka. You don't need it to write to Parquet files.
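A minimal sketch of that once-per-partition pattern, with a plain file writer standing in for whatever unserializable writer or connection you actually have (the path is just a placeholder):

```scala
import java.io.{FileWriter, PrintWriter}
import org.apache.spark.rdd.RDD

// Sketch: build the unserializable object on the workers, once per
// partition, instead of shipping it from the driver.
def writeOut(rdd: RDD[String]): Unit =
  rdd.foreachPartition { records =>
    // Created on the executor, so it never needs to be serialized
    val writer = new PrintWriter(new FileWriter("/tmp/output-part.txt", true))
    try {
      records.foreach(r => writer.println(r))
    } finally {
      writer.close()
    }
  }
```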
10-30-2014
04:38 AM
The Spark app will run as whatever user submitted it, or at least it should. I would just make the directory writable by that user if at all possible.
10-26-2014
12:46 PM
It is not hard to expose, but it seems like an internal implementation detail, and the implementation already solves the cold-start problem in a different way, with fold-in. One issue with what you're suggesting is that there is no notion of attributes in the model; I assume you mean you have those externally. I understand the logic, but it's a fairly different recommender model that you're building then. I think I'd direct you to just hack the code a bit, though I can keep this in mind in case several other use cases pop up that would make it worth letting the item vectors be set externally. The Oryx 2 design is much more decomposed, so you could put in another process that feeds any item/user updates you want onto a queue of updates, but that is a ways from being ready.
10-22-2014
08:03 AM
That could suggest that the resources available to your Spark jobs are not big enough to accommodate what Talend or your app is requesting. I don't know whether you mean that only 2 cores are available or that 2 are requested, but the question is whether the request exceeds what's available, so I'd check that aspect. For example, if running on YARN, see how much resource YARN can allocate, and look at your logs to see what Spark thinks it's asking for.
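For example, a sketch of where that request is typically configured on the Spark side; the property names are standard Spark settings, but the values here are made up, and the point is just to compare them against what YARN can actually allocate:

```scala
import org.apache.spark.SparkConf

// Sketch: what Spark asks YARN for is driven by settings like these.
// The values are made up; compare them to your cluster's YARN limits.
val conf = new SparkConf()
  .setAppName("my-talend-job")            // hypothetical app name
  .set("spark.executor.instances", "2")   // number of executors requested
  .set("spark.executor.cores", "2")       // cores per executor
  .set("spark.executor.memory", "2g")     // heap per executor
```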