Created on 11-28-2014 08:13 AM - edited 09-16-2022 08:39 AM
Hi
I have been running Oryx ALS with the same input dataset, both with local computation and with Hadoop. The in-memory computation produces a MAP around 0.11 and converges after more than 25 iterations; I have run this about 20 times. With Hadoop, same dataset, same parameters, the algorithm converges at iteration 2 and the MAP is 0.00x (I ran it 3 times, wiping out the previous computations each time).
With the Hadoop computation I get this message:
Fri Nov 28 15:43:09 GMT 2014 INFO Loading X and Y to test whether they have sufficient rank
Fri Nov 28 15:43:14 GMT 2014 INFO Matrix is not yet proved to be non-singular, continuing to load...
Fri Nov 28 15:43:14 GMT 2014 WARNING X or Y does not have sufficient rank; deleting this model and its results
Any hints, please?
Thank you.
Created 12-09-2014 12:16 PM
The implementations are entirely separate, although they do the same thing at a high level. In this case the sampling process was different enough between the two that it made a difference, though only in one place, even though both are sampling the same things.
This distributed/non-distributed distinction is historical; there are really two codebases here. This won't be carried forward in newer versions.
Created 12-05-2014 06:12 AM
I reproduced this. Everything else works fine; you can see the model generates a MAP of about 0.15 on Hadoop. It's just the last step, where it seems to incorrectly decide the rank is insufficient. There is always a bit of heuristic here; nothing is ever literally "singular" due to machine precision. So it could be a case of loosening or improving the heuristic. I'll have to debug a little more.
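To give a sense of the kind of check involved, here is a minimal sketch using Apache Commons Math (this is illustrative only, not the actual code in Oryx, and the tolerance is a made-up value):

```java
import org.apache.commons.math3.linear.RealMatrix;
import org.apache.commons.math3.linear.SingularValueDecomposition;

public final class RankHeuristic {

  // Illustrative cutoff only; any such threshold is a heuristic choice.
  private static final double MAX_CONDITION_NUMBER = 1.0e10;

  /**
   * Heuristically decides whether a k x k matrix is "non-singular enough".
   * In floating point nothing is ever exactly singular, so rather than
   * testing determinant == 0, compare the largest and smallest singular
   * values: a huge ratio means the matrix is effectively rank-deficient.
   */
  static boolean looksNonSingular(RealMatrix m) {
    SingularValueDecomposition svd = new SingularValueDecomposition(m);
    return svd.getRank() == m.getColumnDimension()
        && svd.getConditionNumber() < MAX_CONDITION_NUMBER;
  }
}
```

The point is that the outcome depends entirely on where you put the cutoff, which is why a check like this can misfire.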
Created 12-05-2014 07:46 AM
Hi
Thank you so much for looking into this!
If it's of any help with the debugging, I managed to reproduce this with an even smaller subset of data, and I have uploaded the file to the drive. Also, X and Y, before getting deleted, are indeed made up of much smaller numbers when run with Hadoop.
Created 12-05-2014 08:19 AM
Is this data synthetically generated? I'm also wondering if somehow it really does have rank less than about 6. That's possible for large data sets if they were algorithmically generated.
Created 12-05-2014 09:39 AM
It is not; the latest, smallest dataset is a six-month trace of user clicks through the system (well, a quarter of the searches, just to have a smaller and faster test).
It does contain just the users with at least 11 different product searches, though, because when adding users with fewer clicks the datasets get a lot bigger and I hit the JVM's "GC overhead limit exceeded" error (too much time spent garbage collecting).
I have tried a bunch of others without this constraint on the minimum number of searched items, e.g. a one-month trace of all users and all their clicks; that one again does not work with Hadoop (but in memory had a MAP of ~0.12).
Created 12-05-2014 09:44 AM
I have uploaded to Google Drive the one that traces all user clicks, including the one-time visitors, in case it's of any help.
Created 12-05-2014 10:12 AM
@srowen wrote: I'm also wondering if somehow it really does have rank less than about 6. That's possible for large data sets if they were algorithmically generated.
Say that were true: then I should get the insufficient-rank error at least once when running the in-memory computation too, shouldn't I? Given that I've been chasing this for a few days now, I have run the in-memory computation over and over again, and it has not happened.
If the X and Y folders do contain the factors as they really are, then the factor matrices from Hadoop, for some reason, contain far more very small numbers than the in-memory ones (25,000 as opposed to 76, on the 11_quarter.csv dataset). So maybe they are not singular, but the determinant falls under SINGULARITY_THRESHOLD simply because the numbers are smaller? If I may speculate, maybe serializing/deserializing very small floats is somehow broken and they get passed on smaller and smaller from one generation to the next?
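Just to illustrate the scale point: for a k x k matrix, det(c*A) = c^k * det(A), so uniformly smaller entries shrink the determinant geometrically even when the matrix is perfectly well-conditioned. A tiny sketch with Commons Math (the matrix and numbers are made up, purely for illustration):

```java
import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.LUDecomposition;
import org.apache.commons.math3.linear.RealMatrix;

public final class DeterminantScaling {
  public static void main(String[] args) {
    // The 6x6 identity matrix: as non-singular as it gets, det = 1.
    int k = 6;
    RealMatrix a = new Array2DRowRealMatrix(k, k);
    for (int i = 0; i < k; i++) {
      a.setEntry(i, i, 1.0);
    }
    // Scale every entry down by 100x: still perfectly non-singular,
    // but det(0.01 * I) = 0.01^6 = 1e-12.
    RealMatrix scaled = a.scalarMultiply(0.01);
    System.out.println(new LUDecomposition(scaled).getDeterminant());
    // Prints ~1.0E-12, which would fall below many fixed thresholds.
  }
}
```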
Created 12-05-2014 10:42 AM
@christinan wrote:
If the X and Y folders do contain the factors as they really are, then the factor matrices from Hadoop, for some reason, contain far more very small numbers than the in-memory ones (25,000 as opposed to 76, on the 11_quarter.csv dataset). So maybe they are not singular, but the determinant falls under SINGULARITY_THRESHOLD simply because the numbers are smaller? If I may speculate, maybe serializing/deserializing very small floats is somehow broken and they get passed on smaller and smaller from one generation to the next?
Sorry, maybe I spoke too soon. For Hadoop, it looks like the first column is close to 0 for both X and Y, so...
Created 12-05-2014 12:40 PM
You are right that it's unlikely the earlier computations would work if the data were low-rank. OK, synthetic data is ruled out.
It's not quite X or Y that is singular or non-singular; it's X'*X and Y'*Y. Small absolute values in the matrices are normal.
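Concretely, X is a tall n x k matrix (one row per user, k features), so the question is whether the small k x k matrix X'*X has full rank k, and that is insensitive to the overall scale of the entries. A rough sketch (again not the actual Oryx code; the values are made up):

```java
import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.RealMatrix;
import org.apache.commons.math3.linear.SingularValueDecomposition;

public final class GramMatrixDemo {
  public static void main(String[] args) {
    // A tall 4x2 stand-in for a factor matrix X (made-up values).
    RealMatrix x = new Array2DRowRealMatrix(new double[][] {
        {1.0, 2.0}, {3.0, 1.0}, {0.5, 4.0}, {2.0, 2.5}
    });
    RealMatrix gram = x.transpose().multiply(x); // X' * X, k x k

    // Shrink every entry of X by 1000x: the determinant of X' * X
    // collapses, but its condition number is unchanged, so small
    // absolute values alone say nothing about rank.
    RealMatrix tiny = x.scalarMultiply(1.0e-3);
    RealMatrix tinyGram = tiny.transpose().multiply(tiny);

    System.out.println(new SingularValueDecomposition(gram).getConditionNumber());
    System.out.println(new SingularValueDecomposition(tinyGram).getConditionNumber());
    // Both lines print the same condition number.
  }
}
```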
Created 12-05-2014 01:56 PM
Strange, I do indeed get very different answers from the Hadoop version, and they don't look quite right. The first row and column are very small, and there's no good reason for that. I'll keep digging to see where things go funny. The fact that MAP is good suggests that the model is good during iteration, but something funny happens at the end.
Created 12-05-2014 03:20 PM
Yes, exactly, and it happens too deterministically to be random, so to speak. On my end I've tried all sorts of things to corner the cause and take as many variables out of the equation as possible: I disabled compression in mapreduce-site.xml, changed the input data to use quotes, changed the ratings from integers (x) to x.01, switched LF to CRLF, tried UTF-8 with and without a BOM, etc. I even changed the dataset itself by taking different, quite distinct slices of the data; the one that includes just users with more searches looks a lot different from the one with all users and all their searches. I also tried reducing the number of factors. None of this had any effect.
But on the other hand, Movielens and Audio Scrobbler work fine, so it's a weird one... I'm very grateful you're investigating this; I'm certainly running out of things to try.
The value 0.15 that you see in the last iteration is what I usually get with the local computation for the "11" datasets.