Member since: 07-29-2013
Posts: 366
Kudos Received: 69
Solutions: 71
My Accepted Solutions

| Title | Views | Posted |
|---|---|---|
| | 4992 | 03-09-2016 01:21 AM |
| | 4255 | 03-07-2016 01:52 AM |
| | 13368 | 02-29-2016 04:40 AM |
| | 3971 | 02-22-2016 03:08 PM |
| | 4962 | 01-19-2016 02:13 PM |
12-14-2014 01:33 PM
Hm, yes, that is not how it should work. If a value hasn't decayed away or been removed, it will stick around forever. If it reaches 0, the user-item pair is removed. Negative values still have meaning, so they are not removed outright; it's a question of how small the absolute value is. And yes, these are removed, including users that don't have any items.
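To make that pruning rule concrete, here is a minimal sketch, assuming a simple map from user-item pair to strength (hypothetical names, not Oryx's actual code): each generation multiplies every value by the decay factor and drops pairs whose absolute value falls below the zero threshold.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Illustrative only: decay all user-item strengths, then prune any pair
// whose absolute value has fallen below the zero threshold. Negative
// values are kept as long as their magnitude stays above the threshold.
public class DecayPruneSketch {

  static void decayAndPrune(Map<String, Double> pairStrengths,
                            double decayFactor,
                            double zeroThreshold) {
    Iterator<Map.Entry<String, Double>> it = pairStrengths.entrySet().iterator();
    while (it.hasNext()) {
      Map.Entry<String, Double> entry = it.next();
      double decayed = entry.getValue() * decayFactor;
      if (Math.abs(decayed) < zeroThreshold) {
        it.remove();  // pair is gone for good once it decays to ~0
      } else {
        entry.setValue(decayed);
      }
    }
  }

  public static void main(String[] args) {
    Map<String, Double> strengths = new HashMap<>();
    strengths.put("user1/itemA", 1.0);
    strengths.put("user2/itemB", -0.01);
    decayAndPrune(strengths, 0.5, 0.02);
    System.out.println(strengths);  // user2/itemB has been pruned
  }
}
```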
12-12-2014 03:11 PM
model.generation.keep is really just a bookkeeping setting: it only affects how many previous generations are kept around for whatever purpose they may serve, such as backup. Each generation has a copy of all data that's ever been seen, in aggregated form and reduced by the decay factor. So no, the default behavior is to keep all data forever. The decay factor is an indirect way to implement a "sliding window", in that it makes old data go away eventually. It's not based on a hard time or generation limit, but I think that's desirable. The closest thing is to set a decay factor and a zero threshold such that roughly the desired number of generations decays a value of "1" to below the threshold.
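As a worked example of that last suggestion: a value of 1 falls below the threshold after the smallest n generations with decay^n < threshold, i.e. n > log(threshold) / log(decay). This little helper (plain arithmetic, not an Oryx API) does the calculation:

```java
// Back-of-the-envelope helper: how many generations until a value of "1"
// decays below the zero threshold? Smallest n with decay^n < threshold.
public class GenerationsToDecay {

  static int generationsToDecay(double decayFactor, double zeroThreshold) {
    return (int) Math.ceil(Math.log(zeroThreshold) / Math.log(decayFactor));
  }

  public static void main(String[] args) {
    // With decay 0.9 per generation and threshold 0.1, a value of 1
    // survives about 22 generations: 0.9^22 ~= 0.098 < 0.1
    System.out.println(generationsToDecay(0.9, 0.1));  // 22
  }
}
```

Choosing the decay factor and zero threshold together this way approximates a sliding window of the desired width.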
12-11-2014 08:57 AM
This may be semantics. If you have no data, you can't make any recommendations, so we must be talking about starting from some data. In the normal case there is only user-item interaction data, but here you have side information you're incorporating before the first interaction. OK. You can modify the code to simply add the new entry to the map containing IDs and feature vectors. What's the issue with that? I assume you're already modifying the code. Yes, you need to be careful about the locks, but there is not much else to know.
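The shape of that modification is roughly the following (a sketch with stand-in names, not Oryx's actual fields): take the write lock, put the new ID-to-vector entry in the shared map, and release the lock.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of adding a feature vector to a shared map guarded by a
// read-write lock; the field names here are illustrative stand-ins.
public class FeatureVectorStore {

  private final Map<String, float[]> idToFeatures = new HashMap<>();
  private final ReadWriteLock lock = new ReentrantReadWriteLock();

  void addVector(String id, float[] features) {
    lock.writeLock().lock();  // writers must exclude readers and writers
    try {
      idToFeatures.put(id, features);
    } finally {
      lock.writeLock().unlock();
    }
  }

  float[] getVector(String id) {
    lock.readLock().lock();  // many readers may proceed concurrently
    try {
      return idToFeatures.get(id);
    } finally {
      lock.readLock().unlock();
    }
  }
}
```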
12-11-2014 02:34 AM
When you add a user-item association, you have at least 1 data point for the user and item! Before that, you have no info at all. You can't make any recommendations no matter what approach you use. You can add feature vectors directly, sure, but how would you know what to add?
12-09-2014 10:24 PM
It's a bit complex due to all the locks (2.x is simpler in this regard), but you should be able to trace the logic from something like PreferenceServlet, which can add new users/items to the data structures.
12-09-2014 12:16 PM
The implementations are entirely separate, although they do the same thing at a high level. Here the sampling process is different enough that it made a difference in only one place, even though both are sampling the same things. The distributed/non-distributed distinction is historical; there are really two codebases here. It won't be carried forward in newer versions.
12-09-2014 09:07 AM
So the problem here is just that only a small number of user-item pairs are sampled to test convergence, and it turns out the sample consistently underestimates the remaining change early on, so convergence appears to have happened too soon. A quick band-aid is to sample more and log better messages about it: https://github.com/cloudera/oryx/commit/1ea63b4e493e1cfcf6d1cdc271c52befcdd12402 Too much sampling slows things down somewhat unnecessarily, and I've struggled to find a good heuristic that balances the two. This probably deserves better logic later, but this change should make it work fine, as will turning down the convergence threshold.
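To illustrate the trade-off (hypothetical code, not the actual Oryx implementation): convergence can be tested by averaging the absolute change across a random sample of user-item estimates between iterations; a bigger sample gives a less biased estimate but costs more per iteration.

```java
import java.util.Random;

// Illustrative convergence test: average the absolute change in estimated
// strength over a random sample of user-item pairs. A larger sample is a
// more reliable estimate of overall change, but costs more to compute.
public class ConvergenceSampleSketch {

  static boolean converged(double[][] previousEstimates,
                           double[][] currentEstimates,
                           int sampleSize,
                           double threshold,
                           Random random) {
    int numUsers = currentEstimates.length;
    int numItems = currentEstimates[0].length;
    double totalAbsChange = 0.0;
    for (int i = 0; i < sampleSize; i++) {
      int user = random.nextInt(numUsers);
      int item = random.nextInt(numItems);
      totalAbsChange +=
          Math.abs(currentEstimates[user][item] - previousEstimates[user][item]);
    }
    return totalAbsChange / sampleSize < threshold;
  }
}
```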
12-08-2014 04:44 AM
1 Kudo
Yes, when you run on YARN, you see the driver and executors as YARN containers; it is no longer a stand-alone service. You need to use master "yarn-client" or "yarn-cluster"; yarn-client may be simpler to start with. Have a look at http://spark.apache.org/docs/latest/cluster-overview.html
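For example, in the Spark 1.x Java API, yarn-client mode can be selected when constructing the context (a minimal sketch, assuming HADOOP_CONF_DIR or YARN_CONF_DIR points at your cluster configuration):

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Minimal yarn-client example: the driver runs in this JVM, and executors
// are requested from YARN as containers.
public class YarnClientExample {

  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
        .setMaster("yarn-client")
        .setAppName("YarnClientExample");
    JavaSparkContext sc = new JavaSparkContext(conf);
    System.out.println(sc.parallelize(Arrays.asList(1, 2, 3)).count());
    sc.stop();
  }
}
```

The same choice can also be made at launch time with spark-submit's --master flag instead of hard-coding it.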
12-07-2014 04:08 PM
One late reply here: this bug fix may be relevant to the original problem: https://github.com/cloudera/oryx/issues/99 I'll put it out soon in 1.0.1.
12-07-2014 03:59 PM
OK, I see that too. That's a different problem, although a slightly more benign one. Convergence is estimated from a sample of all data, and here somehow the sample makes it look like convergence has happened too early. I'll look into why the sample is consistently biased. You can work around it by setting model.iterations.convergence-threshold to something low like 0.00001. Right now it's still running past 9 iterations on Hadoop and MAP is about 0.10, so that's the symptom; now to find the cause. Thanks for the issue reports; this data set has turned up some bugs.