Created on 11-28-2014 08:13 AM - edited 09-16-2022 08:39 AM
Hi
I have been running Oryx ALS on the same input dataset, both with local (in-memory) computation and with Hadoop. In memory it produces a MAP of around 0.11 and converges after more than 25 iterations; I ran this about 20 times. With Hadoop, on the same dataset with the same parameters, the algorithm converges at iteration 2 and the MAP is 0.00x (I ran it 3 times, wiping out the previous computations each time).
With Hadoop computations I get this message:
Fri Nov 28 15:43:09 GMT 2014 INFO Loading X and Y to test whether they have sufficient rank
Fri Nov 28 15:43:14 GMT 2014 INFO Matrix is not yet proved to be non-singular, continuing to load...
Fri Nov 28 15:43:14 GMT 2014 WARNING X or Y does not have sufficient rank; deleting this model and its results
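For anyone puzzling over what that warning is testing: a minimal numpy sketch of the idea (this is illustrative only, not Oryx's actual check) is that a factor matrix with k features is only usable if it has full column rank k:

```python
import numpy as np

# Illustration of the "sufficient rank" idea (not Oryx's actual code):
# a users-x-k or items-x-k factor matrix is usable only when its rank
# equals k, i.e. the k feature columns are linearly independent.
def has_sufficient_rank(M, k):
    return np.linalg.matrix_rank(M) == k

rng = np.random.default_rng(0)
X_good = rng.standard_normal((100, 6))                   # random tall matrix: rank 6 almost surely
X_bad = np.outer(rng.standard_normal(100), np.ones(6))   # rank-1 matrix: rank deficient

print(has_sufficient_rank(X_good, 6))  # True
print(has_sufficient_rank(X_bad, 6))   # False
```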
Any hints, please?
Thank you.
Created 12-09-2014 12:16 PM
The implementations are entirely separate, although they do the same thing at a high level. The sampling process differs enough between them that it made a difference in one place, even though both sample the same things.
This distributed/non-distributed distinction is historical; there are really two codebases here. This won't be carried forward in newer versions.
Created 12-05-2014 04:38 PM
I'm certain it's nothing to do with the input itself. It looks fine, and that kind of problem would look different.
Created 12-07-2014 08:31 AM
OK, I'm pretty certain I found the bug and fixed it here:
https://github.com/cloudera/oryx/commit/437df94d0b1c9d27b5c9f3b984b98973237d6f99
The factorization works as expected now. Are you able to test it too?
Created 12-07-2014 09:58 AM
Created 12-07-2014 12:31 PM
I think something is still wrong. When using cleaned_taste_preferences_rated_last_month.csv, Hadoop converges at iteration 2, with a MAP of 0.00x. In memory it converges much later, and MAP is ~0.11. This happened every time.
Do you see the same thing?
Created 12-07-2014 02:10 PM
No, I see it finish at 6 iterations with MAP of about 0.15 on Hadoop. Same data set? Double-check that you have the latest build, and maybe start from scratch with no other intermediate results.
Created 12-07-2014 02:22 PM
No, the other dataset. I uploaded it to Google Drive a few days back; it's called cleaned_taste_preferences_rated_last_month.csv. It's a one-month trace of production searches; when I go live with the recommender, I will train it on a similar dataset, but covering six months.
The ones I used to illustrate the initial rank problem, with names containing _11, contain only users with at least 11 different searches. I used them just for development speed, as they are much smaller and far more likely to converge.
Created 12-07-2014 02:28 PM
So to summarize: earlier, when I checked that the bug was fixed, I wanted to do it quickly, so I ran the in-memory and Hadoop computations on the "11" dataset. Hadoop converged faster, but the MAPs were similar, so I thought it was fine.
Then I started checking the lengthier but closer-to-reality dataset, and that's when the difference became clear.
Some example runs with in-memory:
Converged at | MAP score (at 10)
------------ | -----------------
11           | 0.1058467372
27           | 0.1177788843
32           | 0.1187595734
18           | 0.1202960727
31           | 0.1206682346
26           | 0.1208719179
20           | 0.1209679965
21           | 0.1224116387
Hadoop: tried 3 so far, and they all converged at iteration 2 with 0.00x MAP.
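For reference, "MAP score (at 10)" is mean average precision over the top-10 recommendations per user. A minimal sketch of the metric (Oryx's actual hold-out evaluation is more involved, so treat this as illustrative only):

```python
# Illustrative MAP@10 computation, not Oryx's evaluation code.
def average_precision_at_k(recommended, relevant, k=10):
    """AP@k for one user: average of precision at each rank where a hit occurs."""
    hits, score = 0, 0.0
    for i, item in enumerate(recommended[:k]):
        if item in relevant:
            hits += 1
            score += hits / (i + 1)
    return score / min(len(relevant), k) if relevant else 0.0

def mean_average_precision(recs_by_user, relevant_by_user, k=10):
    """MAP@k: mean of per-user AP@k across all users with held-out items."""
    users = list(relevant_by_user)
    return sum(average_precision_at_k(recs_by_user[u], relevant_by_user[u], k)
               for u in users) / len(users)

# One user with hits at ranks 1 and 3, out of two relevant items:
# AP = (1/1 + 2/3) / 2
print(round(average_precision_at_k(["a", "b", "c"], {"a", "c"}), 4))  # 0.8333
```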
Created 12-07-2014 02:35 PM
(...wish there was an edit post button)
Conf settings for both:
model.test-set-fraction=0.25
model.features=6
model.lambda=1
model.alpha=30
model.iterations.max=60
Latest version of Oryx: I definitely have it, because I wiped out the folder, cloned again, and rebuilt. Also, I am no longer getting the insufficient-rank behavior.
Created 12-07-2014 03:59 PM
OK, I see that too. That's a different problem, although slightly more benign. It uses a sample of all data to estimate whether convergence has happened, and here somehow the sample makes it look like convergence has happened too early. I'll look into why the sample is consistently biased.
You can work around it by setting model.iterations.convergence-threshold to something low like 0.00001. Right now it's still running past 9 iterations on Hadoop and MAP is about 0.10, so that's the symptom; now to find the cause. Thanks for the issue reports; this data set has turned up some bugs.
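For anyone applying the workaround above, the config fragment would look like this (the key and the 0.00001 value are from this post; tune the value for your data):

```
# Workaround: lower the convergence threshold so ALS runs more iterations
# before declaring convergence
model.iterations.convergence-threshold=0.00001
```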
Created 12-08-2014 02:09 AM
You're welcome, and I'm the one who should be thankful: you're looking into this, the project exists in the first place, it's open source, and I don't have to write my own code for real-time model updates.
If it's of any help, the dataset is weirdly shaped (peaked, if I can say that), because there is aggressive marketing around certain products and the majority of searches center on those. Users don't stay on the website long, and plenty of them click on just one thing and then leave.
Another thing I noticed last week, when trying to generate identical data with in-memory and Hadoop by fixing the random generators' seeds (to see where things diverge): I couldn't :). I presumed it was because of in-proc vs. multiple-process execution of the Hadoop jobs, and that the random generators get 'reset' when a new process launches, if something like "give me the next random" is used. I didn't dig too deeply into it since you fixed the rank problems, but again, in case it's of any help.
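The seeding effect described above can be illustrated with a small sketch (this is a generic illustration of seeded generators across process boundaries, not Oryx code): a single seeded generator shared in-process keeps advancing its stream, while separately launched processes each reconstruct a fresh generator from the same seed and so restart the stream.

```python
import random

# Illustration (not Oryx code): compare one shared seeded generator
# (in-process) against two generators each freshly seeded, as happens
# when separate processes are launched with the same seed.
def in_process(seed):
    rng = random.Random(seed)
    # both phases draw from the same, continuously advancing state
    return [rng.random() for _ in range(2)], [rng.random() for _ in range(2)]

def separate_processes(seed):
    # each "process" builds its own generator from the same seed,
    # so the second phase restarts the stream instead of continuing it
    a, b = random.Random(seed), random.Random(seed)
    return [a.random() for _ in range(2)], [b.random() for _ in range(2)]

phase1, phase2 = in_process(42)
p1, p2 = separate_processes(42)
print(phase1 == p1)  # True: the first phase matches
print(phase2 == p2)  # False: the streams diverge after the first phase
```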