I'm certain it's nothing to do with the input itself. It looks fine and those types of problem would be different.
OK, I'm pretty certain I found the bug and fixed it here:
The factorization works as expected now. Are you able to test it too?
I think something is still wrong. When using cleaned_taste_preferences_rated_last_month.csv, Hadoop will converge at 2, with a MAP of 0.00x. In memory converges a lot later, and MAP is ~0.11. This happened every time.
Do you see the same thing?
No, I see it finish at 6 iterations with MAP about 0.15 on Hadoop. Same data set? double-check that you have the latest build and maybe start from scratch with no other intermediate results.
No, the other dataset - I uploaded it to Google drive a few days back, it's called cleaned_taste_preferences_rated_last_month.csv. It's a one month trace of production searches; when I'll go live with the recommender, I will train it on a similar dataset, but covering six months.
The ones I used to illustrate the initial problem with the rank, with names containing _11, contain just users with at least 11 different searches, and I used it just for development speed, as it is way smaller and will most likely converge.
So to summarize, earlier, when I checked the bug was fixed, I wanted to do it fast so ran the in memory and Hadoop computations on the "11" dataset, Hadoop converged faster but MAPs were similar so I though it's alright.
Then I started checking the lenghtier but closer to reality dataset, and that's when the difference became clear.
Just some example of runs with in memory:
|Converged at||MAP score (at 10)|
Hadoop: tried 3 so far and they all converged at 2 with 0.00x MAPE.
(...wish there was an edit post button)
Conf settings for both:
Latest version of Oryx: I definitely have it because I wiped out the folder, cloned again and built. Also I am not getting the insufficient rank behavior anymore.
OK, I see that too. That's a different problem, although slightly more benign. It uses a sample of all data to estimate whether convergence has happened and here somehow the sample makes it looks like convergence has happened too early. I'll look into why the sample is regularly biased. You can work around by setting model.iterations.convergence-threshold to something low like 0.00001. Right now it's still running past 9 iterations on Hadoop and MAP is about 0.10, so that's the symptom, now to find the cause. Thanks for the issue reports, this data set has turned up some bugs.
You're welcome, and I should be thankful you're looking into this and that the project exists in the first place, and it's open source, and I don't have to write my own code for real time model updates.
If it's of any help, the dataset is weirdly shaped, pointed, if I can say that, because there is aggressive marketing around certain products and the majority of searches are centered around those. Users don't stay on the website too long and plenty of them have clicked on just one thing then left.
Another thing I have noticed, last week when trying to generate identical data with in memory and Hadoop, to see where things go different, by fixing the random generators seeds, was that I couldn't :). I presumed it was because of the in-proc vs. multiple processes execution - the Hadoop jobs (and presumed the random generators get 'reset' when launching a new process, if things like give me next random are used). Didn't dwell too much into it as you fixed the rank problems, but again, if it's of any help.