So the problem here is just that only a small number of user-item pairs are sampled to test convergence, and it turns out they consistently yield an underestimate of convergence early on. A quick band-aid is to sample more, and to log better messages about it: https://github.com/cloudera/oryx/commit/1ea63b4e493e1cfcf6d1cdc271c52befcdd12402 Too much sampling slows things down unnecessarily, and I've struggled to find a good heuristic that balances the two. This probably deserves better logic later, but this change should make it work fine, as will lowering the convergence threshold.
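For context, the idea behind the check can be sketched roughly like this (a hypothetical Python illustration, not the actual Oryx code, which is Java; the function name and signature are made up): convergence is estimated as the mean absolute change in predicted scores over a random sample of user-item pairs, so too small a sample gives a noisy estimate, which here happened to run low early on.

```python
import random

def estimate_convergence(prev_model, curr_model, pairs, sample_size, seed=0):
    """Estimate convergence as the mean absolute change in predicted
    scores between two iterations, over a random sample of user-item
    pairs. Hypothetical helper for illustration only.

    prev_model / curr_model: callables (user, item) -> score
    pairs: list of (user, item) tuples to sample from
    """
    rng = random.Random(seed)
    sample = rng.sample(pairs, min(sample_size, len(pairs)))
    total = sum(abs(curr_model(u, i) - prev_model(u, i)) for u, i in sample)
    return total / len(sample)

# Iteration would stop once the estimate drops below a threshold,
# so an estimate biased low stops the computation too early:
# if estimate_convergence(prev, curr, pairs, n) < threshold: stop
```

The trade-off in the comment above is exactly the one described: a larger `sample_size` makes the estimate more reliable but costs more per iteration.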
Great news, thank you. I will restart testing everything tomorrow.
In sort of related news, and if you've got time to explain: I've very briefly looked at the code changes, but I still do not understand why this happens only on Hadoop.
The implementations are entirely separate, although they do the same thing at a high level. Here the sampling process differs enough that it made a difference in only one place, even though both are sampling the same things.
This distributed/non-distributed distinction is historical; there are really two codebases here. This won't be carried forward in newer versions.
I understand now, thank you. I will be running tests all day today on the Hadoop version and come back if there are any issues.
I am looking forward to seeing how this will scale, as my whole dataset has around 20 million records; at this time I can't try it out, as I do not have a Hadoop cluster (the company I work for will only be able to give me some VMs towards the end of January). (I might be able to run this sooner on Google Compute, but that's not certain.)
Just to let you know the outcome of this: all of my tests yesterday with Hadoop, with various parameters, on the one-month searches dataset, went fine.
I will not continue testing this on the whole big dataset, as for the moment Hadoop looks to be out of the picture: I managed to get hold of a machine with 512GB of RAM, which proved up to the challenge of running Oryx in memory. The dataset is 421MB, with roughly 20 million records, and it took just a few minutes to go through 29 iterations, so well done! A big portion of the time seemed to be spent writing out the model (this is an SSD machine).
(I will continue by looking at recommendation response times, how they're affected when I ingest users, etc.)
Thank you for the help with the bugs and all the explanations along the way.