12-09-2014 09:07 AM
So the problem here is just that only a small number of user-item pairs are sampled to test convergence, and it turns out they consistently give a too-low estimate of convergence early on. A quick band-aid is to sample more, and log better messages about it: https://github.com/cloudera/oryx/commit/1ea63b4e493e1cfcf6d1cdc271c52befcdd12402 Too much sampling can slow things down unnecessarily, and I've struggled to find a good heuristic that balances the two. This probably deserves better logic later, but this change will make it work fine, as will turning down the convergence threshold.
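The sampling idea above could be sketched roughly like this (an illustrative Python sketch, not the actual Oryx Java code; all names, the sample size, and the threshold are hypothetical):

```python
import random

def sampled_mean_change(prev_scores, curr_scores, sample_size, rng=None):
    """Estimate convergence by comparing predicted scores for a random
    sample of user-item pairs across two iterations.

    A small sample is noisy and can misjudge the true change early on;
    sampling more pairs reduces that risk at some cost in time.
    (Hypothetical sketch only -- not the actual Oryx implementation.)
    """
    rng = rng or random.Random(0)
    pairs = list(prev_scores.keys())
    sampled = rng.sample(pairs, min(sample_size, len(pairs)))
    # Mean absolute change in the sampled predictions.
    return sum(abs(curr_scores[p] - prev_scores[p]) for p in sampled) / len(sampled)

# Example: decide convergence from the sampled mean change.
prev = {(u, i): 0.0 for u in range(100) for i in range(50)}
curr = {p: 0.1 for p in prev}
delta = sampled_mean_change(prev, curr, sample_size=200)
converged = delta < 0.05  # hypothetical convergence threshold
```

Sampling more pairs (a larger `sample_size`) tightens the estimate, which is the trade-off described above: a more reliable convergence signal versus the extra time spent scoring pairs each iteration.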
12-09-2014 11:09 AM
Great news, thank you. I will restart testing everything tomorrow.
In sort of related news, and if you've got time to explain: I've very briefly looked at the code changes, but I still don't understand why this happens on Hadoop only.
12-09-2014 12:16 PM
The implementations are entirely separate, although they do the same thing at a high level. The sampling process is different enough that the issue showed up in only one of them, even though both are sampling the same things.
This distributed/non-distributed distinction is historical; there are really two codebases here. This won't be carried forward in newer versions.
12-10-2014 01:33 AM
I understand now, thank you. I will be running tests all day today on the Hadoop version and come back if there are any issues.
I am looking forward to seeing how this will scale, as my whole dataset has around 20 million records; at this time I can't try it out, as I do not have a Hadoop cluster (the company I work for will only be able to give me some VMs towards the end of January). (I might be able to run this sooner on a Google compute API, but that's not certain.)
12-11-2014 10:01 AM
Just to let you know the outcome of this: all of my tests yesterday with Hadoop, with various parameters, on the one-month-of-searches dataset, went fine.
I will not continue testing this further on the whole big dataset, as for the moment Hadoop is out of the picture: I managed to get hold of a machine with 512 GB of RAM, which proved up to the challenge of running Oryx in memory. The dataset is 421 MB, with roughly 20 million records, and it took just a few minutes to go through 29 iterations, so well done! It seemed like a big portion of the time was spent writing the model (this is an SSD machine).
(I will continue by looking at recommendation response times, how they're affected when I ingest users, etc.)
Thank you for the help with the bugs and all the explanations along the way.