- Subscribe to RSS Feed
- Mark Question as New
- Mark Question as Read
- Float this Question for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
Oryx ALS: Hadoop computation yields MAP 0.00x, but in memory produces 0.11
- Labels:
-
Apache Hadoop
Created on 11-28-2014 08:13 AM - edited 09-16-2022 08:39 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi
I have been running Oryx ALS with the same entry dataset, both with local computation and with Hadoop. In memory produces MAP around 0.11, and converges after more than 25 iterations. Ran this about 20 times. With Hadoop, same dataset, same parameters, the algorithm converges at iteration 2 and MAP is 0.00x (ran it 3 times and wiped out the previous computations).
With Hadoop computations I get this message:
Fri Nov 28 15:43:09 GMT 2014 INFO Loading X and Y to test whether they have sufficient rank
Fri Nov 28 15:43:14 GMT 2014 INFO Matrix is not yet proved to be non-singular, continuing to load...
Fri Nov 28 15:43:14 GMT 2014 WARNING X or Y does not have sufficient rank; deleting this model and its results
Any hints, please?
Thank you.
Created 12-09-2014 12:16 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
The implementations are entirely separate although they do the same thing at a high level. Here the sampling process is different enough that it made a difference only in one place, even though both are sampling the same things.
This distributed/non-distributed distinction is historical; there are really two codebases here. This won't be carried forward in newer versions.
Created 12-05-2014 04:38 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I'm certain it's nothing to do with the input itself. It looks fine and those types of problem would be different.
Created 12-07-2014 08:31 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
OK, I'm pretty certain I found the bug and fixed it here:
https://github.com/cloudera/oryx/commit/437df94d0b1c9d27b5c9f3b984b98973237d6f99
The factorization works as expected now. Are you able to test it too?
Created 12-07-2014 09:58 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Ran a Hadoop and local computation and they were both fine this time, the
model is preserved, and the MAP estimate for the last iteration is
comparable to the local computation one (slightly smaller, presume due to
randomness).
I will continue to run a few more tests. I think there is still a slight
mismatch, as with Hadoop none of the datasets converged later than
iteration 8, but in memory takes a lot more to converge; for the bigger
datasets it needs close to 50. But I will repeat all my tests with the new
version and report back.
Created 12-07-2014 12:31 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I think something is still wrong. When using cleaned_taste_preferences_rated_last_month.csv, Hadoop will converge at 2, with a MAP of 0.00x. In memory converges a lot later, and MAP is ~0.11. This happened every time.
Do you see the same thing?
Created 12-07-2014 02:10 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
No, I see it finish at 6 iterations with MAP about 0.15 on Hadoop. Same data set? double-check that you have the latest build and maybe start from scratch with no other intermediate results.
Created 12-07-2014 02:22 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
No, the other dataset - I uploaded it to Google drive a few days back, it's called cleaned_taste_preferences_rated_last_month.csv. It's a one month trace of production searches; when I'll go live with the recommender, I will train it on a similar dataset, but covering six months.
The ones I used to illustrate the initial problem with the rank, with names containing _11, contain just users with at least 11 different searches, and I used it just for development speed, as it is way smaller and will most likely converge.
Created 12-07-2014 02:28 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
So to summarize, earlier, when I checked the bug was fixed, I wanted to do it fast so ran the in memory and Hadoop computations on the "11" dataset, Hadoop converged faster but MAPs were similar so I though it's alright.
Then I started checking the lenghtier but closer to reality dataset, and that's when the difference became clear.
Just some example of runs with in memory:
| Converged at | MAP score (at 10) |
| 11 | 0.1058467372 |
| 27 | 0.1177788843 |
| 32 | 0.1187595734 |
| 18 | 0.1202960727 |
| 31 | 0.1206682346 |
| 26 | 0.1208719179 |
| 20 | 0.1209679965 |
| 21 | 0.1224116387 |
Hadoop: tried 3 so far and they all converged at 2 with 0.00x MAPE.
Created 12-07-2014 02:35 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
(...wish there was an edit post button)
Conf settings for both:
model.test-set-fraction=0.25
model.features=6
model.lambda=1
model.alpha=30
model.iterations.max=60
Latest version of Oryx: I definitely have it because I wiped out the folder, cloned again and built. Also I am not getting the insufficient rank behavior anymore.
Created 12-07-2014 03:59 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
OK, I see that too. That's a different problem, although slightly more benign. It uses a sample of all data to estimate whether convergence has happened and here somehow the sample makes it looks like convergence has happened too early. I'll look into why the sample is regularly biased. You can work around by setting model.iterations.convergence-threshold to something low like 0.00001. Right now it's still running past 9 iterations on Hadoop and MAP is about 0.10, so that's the symptom, now to find the cause. Thanks for the issue reports, this data set has turned up some bugs.
Created 12-08-2014 02:09 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
You're welcome, and I should be thankful you're looking into this and that the project exists in the first place, and it's open source, and I don't have to write my own code for real time model updates.
If it's of any help, the dataset is weirdly shaped, pointed, if I can say that, because there is aggressive marketing around certain products and the majority of searches are centered around those. Users don't stay on the website too long and plenty of them have clicked on just one thing then left.
Another thing I have noticed, last week when trying to generate identical data with in memory and Hadoop, to see where things go different, by fixing the random generators seeds, was that I couldn't :). I presumed it was because of the in-proc vs. multiple processes execution - the Hadoop jobs (and presumed the random generators get 'reset' when launching a new process, if things like give me next random are used). Didn't dwell too much into it as you fixed the rank problems, but again, if it's of any help.