Support Questions

Find answers, ask questions, and share your expertise

Oryx ALS: Hadoop computation yields MAP 0.00x, but in memory produces 0.11

Explorer

Hi 

 

I have been running Oryx ALS on the same input dataset, both with local (in-memory) computation and with Hadoop. In memory it produces a MAP of around 0.11 and converges after more than 25 iterations; I ran this about 20 times. With Hadoop, same dataset and same parameters, the algorithm converges at iteration 2 and the MAP is 0.00x (I ran it 3 times, wiping out the previous computations each time).
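(For reference, MAP here is mean average precision over held-out user-item pairs. A minimal sketch of how such a score is computed, purely illustrative and not Oryx's actual evaluation code, with made-up class and method names:)

import java.util.List;
import java.util.Set;

// Illustrative sketch of mean average precision (MAP) over held-out items.
// Not Oryx's evaluation code; all names here are hypothetical.
public final class MapSketch {

  // Average precision for one user: recs is the ranked list of recommended
  // item IDs, relevant is the held-out set of items the user actually liked.
  static double averagePrecision(List<String> recs, Set<String> relevant) {
    if (relevant.isEmpty()) {
      return 0.0;
    }
    int hits = 0;
    double sum = 0.0;
    for (int i = 0; i < recs.size(); i++) {
      if (relevant.contains(recs.get(i))) {
        hits++;
        sum += (double) hits / (i + 1);  // precision at this cutoff
      }
    }
    return sum / Math.min(relevant.size(), recs.size());
  }

  // MAP is the mean of per-user average precision.
  static double meanAveragePrecision(List<List<String>> recsPerUser,
                                     List<Set<String>> relevantPerUser) {
    double total = 0.0;
    for (int u = 0; u < recsPerUser.size(); u++) {
      total += averagePrecision(recsPerUser.get(u), relevantPerUser.get(u));
    }
    return total / recsPerUser.size();
  }
}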

 

With Hadoop computations I get this message: 

 

Fri Nov 28 15:43:09 GMT 2014 INFO Loading X and Y to test whether they have sufficient rank
Fri Nov 28 15:43:14 GMT 2014 INFO Matrix is not yet proved to be non-singular, continuing to load...
Fri Nov 28 15:43:14 GMT 2014 WARNING X or Y does not have sufficient rank; deleting this model and its results
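(For context on that warning: per the log, the check loads rows of the X and Y feature matrices and tests whether they can be proved non-singular, i.e. whether the loaded rows span all k latent dimensions. A rough sketch of that kind of test, illustrative only and not Oryx's actual code, assuming Apache Commons Math:)

import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.LUDecomposition;
import org.apache.commons.math3.linear.RealMatrix;

// Illustrative sketch of a "sufficient rank" test on a factor matrix.
// Not Oryx's actual implementation; names are hypothetical.
final class RankCheckSketch {

  // rows: feature vectors (one per user or item), each of length k.
  static boolean hasSufficientRank(double[][] rows, int k) {
    // Accumulate the k x k Gram matrix (row^T * row) over all feature rows.
    RealMatrix gram = new Array2DRowRealMatrix(k, k);
    for (double[] row : rows) {
      RealMatrix r = new Array2DRowRealMatrix(new double[][] { row });
      gram = gram.add(r.transpose().multiply(r));
    }
    // If the Gram matrix is non-singular, the rows span all k dimensions.
    return new LUDecomposition(gram).getSolver().isNonSingular();
  }
}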

 

 

Any hints, please?

 

Thank you.  


34 REPLIES

Master Collaborator

So the problem here is just that only a small number of user-item pairs are sampled to test convergence, and it turns out they consistently give a too-low estimate of how much the model is still changing early on, so the run looks converged when it isn't. A quick band-aid is to sample more pairs, and log better messages about it: https://github.com/cloudera/oryx/commit/1ea63b4e493e1cfcf6d1cdc271c52befcdd12402

Too much sampling slows things down unnecessarily, and I've struggled to find a good heuristic that balances the two. This probably deserves better logic later, but this change will make this case work fine, as will turning down the convergence threshold.
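(To make the failure mode concrete, here is a toy sketch of convergence testing by sampling; it is not the Oryx implementation, and the names and threshold value are hypothetical. With too few sampled pairs the averaged change is noisy and can dip under the threshold in an early iteration, which reads as a premature "converged" signal:)

import java.util.Random;

// Toy sketch of convergence testing by sampling user-item pairs.
// Not the Oryx implementation; names and values are hypothetical.
final class ConvergenceSketch {

  private static final double CONVERGENCE_THRESHOLD = 0.001; // hypothetical value

  // previous/current: estimated user-item strengths from the last two
  // iterations, indexed by [user][item]. sampleSize: how many pairs to check.
  static boolean converged(double[][] previous, double[][] current,
                           int sampleSize, Random random) {
    double totalAbsChange = 0.0;
    for (int i = 0; i < sampleSize; i++) {
      int user = random.nextInt(current.length);
      int item = random.nextInt(current[user].length);
      totalAbsChange += Math.abs(current[user][item] - previous[user][item]);
    }
    // With a small sampleSize this average is noisy; an unluckily low draw
    // can fall under the threshold and end iteration too early. Sampling
    // more pairs (or lowering the threshold) makes the estimate more reliable.
    double meanAbsChange = totalAbsChange / sampleSize;
    return meanAbsChange < CONVERGENCE_THRESHOLD;
  }
}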

Explorer

Great news, thank you. I will restart testing everything tomorrow. 

 

In somewhat related news - and if you've got time to explain - I've very briefly looked at the code changes, but I still don't understand why this happens only on Hadoop.

Master Collaborator

The implementations are entirely separate, although they do the same thing at a high level. Here the sampling process differs enough between the two that the issue showed up in only one of them, even though both are sampling the same things.

 

This distributed/non-distributed distinction is historical; there are really two codebases here. This won't be carried forward in newer versions.

Explorer

I understand now, thank you. I will be running tests all day today on the Hadoop version and come back if there are any issues.

 

 

I am looking forward to seeing how this will scale, as my whole dataset has around 20 million records; at this time I can't try it out, as I do not have a Hadoop cluster (the company I work for will only be able to give me some VMs towards the end of January). (... I might be able to run this sooner on a Google compute API, but that's not certain.)

Explorer

Hi Sean

 

Just to let you know the outcome of this: all of my tests yesterday with Hadoop, with various parameters, on the one-month-of-searches dataset, went fine.

 

I will not continue testing this further on the whole big dataset, as for the moment it looks like Hadoop is out of the picture: I managed to get hold of a machine with 512GB of RAM, which proved up to the challenge of running Oryx in memory. The dataset is 421MB, with roughly 20 million records, and it took just a few minutes to go through 29 iterations, so well done! It seemed like a big portion of the time was spent writing the model (this is an SSD machine).

(I will continue by looking at recommendation response times and how they're affected when I ingest users, etc.)

 

Thank you for the help with the bugs and all the explanations along the way.