Member since: 07-29-2013
Posts: 366
Kudos Received: 69
Solutions: 71
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 6199 | 03-09-2016 01:21 AM |
| | 5034 | 03-07-2016 01:52 AM |
| | 15084 | 02-29-2016 04:40 AM |
| | 4747 | 02-22-2016 03:08 PM |
| | 5750 | 01-19-2016 02:13 PM |
11-10-2014
02:43 PM
1 Kudo
Apologies, I'm mixing up 1.x and 2.x. The default evaluation metric in 1.x is mean average precision, or MAP. This is a measure of the extent to which the top recommendations contain items that were held out for the user. In local mode you can find lines like "Mean average precision: xxx" in the logs. In distributed mode, now that I review the code, I don't see that it is ever logged; it is written to a file called "MAP" under the subdirectory for the iteration. I can at least make the mapper workers output their own local value of MAP.

In 2.x the metric is AUC, which is basically a measure of how likely it is that a 'good' recommendation (from the held-out data set) ranks above a random item. It is a broader, different measure. In 2.x you should definitely find it printed in the logs, along with the hyperparameters that yielded it.
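For illustration, here is a minimal sketch of how average precision can be computed for one user's ranked recommendations against their held-out items; the function and its inputs are hypothetical, not Oryx's actual implementation:

```scala
// Sketch: average precision for one user's ranked recommendations, given
// the set of items held out for that user. Hypothetical, not Oryx's code.
def averagePrecision(recommended: Seq[String], heldOut: Set[String]): Double =
  if (heldOut.isEmpty) 0.0 else {
    var hits = 0
    var precisionSum = 0.0
    for ((item, rank) <- recommended.zipWithIndex if heldOut.contains(item)) {
      hits += 1
      precisionSum += hits.toDouble / (rank + 1) // precision at this rank
    }
    precisionSum / math.min(heldOut.size, recommended.size)
  }
```

MAP is then just the mean of this value over all evaluated users.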
11-10-2014
02:14 PM
It depends a lot on just what you mean by 'analysis'; one machine could be just fine. In general I think you will want to play with Spark, and Spark loves memory, so 8GB of RAM seems a bit small, but 4 cores is OK, and I bet you have plenty of disk space. I do not agree at all that you need 10TB of disk space; that is orders of magnitude overkill for a 10GB data set.
11-05-2014
11:20 PM
Yes, in general you need to include with your app the JARs that it uses; this is true of Java in general. Are you certain you are using ZooKeeper in your app? SPARK_CLASSPATH is an old mechanism; see the Spark 1.0.0 docs, or better, update to CDH 5.2 / Spark 1.1 (although that's not the problem here). I don't think this is the best place for Maven help, and I'm not sure what you are referring to regarding dependencies and a 64MB limit.
11-03-2014
10:32 AM
Yes, they will usually be in [0,1], but not always. They aren't probabilities. They are entries in X*Y', yes. I think it's safe to take values >= 1 as a very strong positive. What's a good cutoff? It really depends on your semantics and use case. The scores are comparable across models, so you can probably determine a value empirically with some testing.
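To make that concrete, a toy sketch of where these values come from: each score is the dot product of a user's row of X and an item's row of Y. The vectors and values below are made up:

```scala
// Sketch: a score is the dot product of a user feature vector (a row of X)
// and an item feature vector (a row of Y). These values are made up.
def score(user: Array[Double], item: Array[Double]): Double =
  user.zip(item).map { case (u, i) => u * i }.sum

val u = Array(0.9, 0.1, 0.4)  // hypothetical row of X
val v = Array(0.8, 0.2, 0.5)  // hypothetical row of Y
val s = score(u, v)           // 0.94: usually in [0,1], but not always
val strongPositive = s >= 1.0 // values >= 1 read as a very strong positive
```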
11-03-2014
10:30 AM
Yes, I meant the server side. Actually, looking at the source, it should allow up to 65536 bytes of header data.
11-02-2014
09:09 AM
Yeah, this is bumping up against the practical limit of this simplistic API. The max URL length is set to be pretty big, like 8K IIRC. Is it possible that whatever item IDs you need can be looked up from an external source? There's always filtering on the caller side too, although that has its own problems.
10-31-2014
11:39 AM
Yeah, because it makes lots of small files? One option is to have a post-processing job that merges the files together with getmerge. The general answer to getting an unserializable object to the workers is to create it on the workers instead: make your writer or connection object once per partition and do something with it, as in the sketch below. Spark SQL is distributed as part of CDH. Lots of stuff can consume from Kafka. You don't need it to write to Parquet files.
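A minimal sketch of that once-per-partition pattern, with a plain file writer standing in for whatever unserializable writer or connection you actually have (the path is just a placeholder):

```scala
import java.io.{FileWriter, PrintWriter}
import org.apache.spark.rdd.RDD

// Sketch: build the unserializable object on the workers, once per
// partition, instead of shipping it from the driver.
def writeOut(rdd: RDD[String]): Unit =
  rdd.foreachPartition { records =>
    // Created on the executor, so it never needs to be serialized
    val writer = new PrintWriter(new FileWriter("/tmp/output-part.txt", true))
    try {
      records.foreach(r => writer.println(r))
    } finally {
      writer.close()
    }
  }
```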
10-30-2014
04:38 AM
The Spark app will run as whatever user submitted it, or at least it should. I would just make the directory writable by that user if at all possible.
10-26-2014
12:46 PM
It is not hard to expose, but it seems like an internal implementation detail, and the implementation already solves the cold-start problem in a different way, with fold-in. One issue with what you're suggesting is that there is no notion of attributes in the model; I assume you mean you have those externally. I understand the logic, but it's a fairly different recommender model that you're building then. I think I'd direct you to just hack the code a bit, though I can keep this in mind in case several other use cases pop up that would make it worth letting the item vectors be set externally. The Oryx 2 design is much more decomposed, so you could put in another process that feeds any item/user updates you want onto a queue of updates, but that is a ways from being ready.
10-22-2014
08:03 AM
That could suggest that the resources available to your Spark jobs are not big enough to accommodate what Talend or your app is requesting. I don't know whether you mean that only 2 cores are available or that 2 are requested, but the question is whether the request exceeds what's available, so I'd check that aspect. For example, if running on YARN, see how much resource YARN can allocate, and look at your logs to see what Spark thinks it's asking for.
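For example, a sketch of where that request is typically configured on the Spark side; the property names are standard Spark settings, but the values here are made up, and the point is just to compare them against what YARN can actually allocate:

```scala
import org.apache.spark.SparkConf

// Sketch: what Spark asks YARN for is driven by settings like these.
// The values are made up; compare them to your cluster's YARN limits.
val conf = new SparkConf()
  .setAppName("my-talend-job")            // hypothetical app name
  .set("spark.executor.instances", "2")   // number of executors requested
  .set("spark.executor.cores", "2")       // cores per executor
  .set("spark.executor.memory", "2g")     // heap per executor
```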