
Oryx ALS: Hadoop computation yields MAP 0.00x, but in memory produces 0.11

Explorer

Hi 

 

I have been running Oryx ALS on the same input dataset, both with local (in-memory) computation and with Hadoop. In memory, it produces a MAP of around 0.11 and converges after more than 25 iterations; I have run this about 20 times. With Hadoop, on the same dataset with the same parameters, the algorithm converges at iteration 2 and the MAP is 0.00x (I ran it 3 times, wiping out the previous computations each time).

 

With the Hadoop computation I get this message:

 

Fri Nov 28 15:43:09 GMT 2014 INFO Loading X and Y to test whether they have sufficient rank
Fri Nov 28 15:43:14 GMT 2014 INFO Matrix is not yet proved to be non-singular, continuing to load...
Fri Nov 28 15:43:14 GMT 2014 WARNING X or Y does not have sufficient rank; deleting this model and its results
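
(For context, the "sufficient rank" test in the log presumably checks that the loaded feature matrices X and Y have full column rank, i.e. rank at least model.features; a rank-deficient factorization carries less information than the requested number of features. Below is a minimal sketch of what such a test amounts to, using Apache Commons Math purely for illustration - this is not Oryx's actual implementation:)

import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.SingularValueDecomposition;

public class RankCheck {
    // Illustration only: true when the factor matrix spans all 'features'
    // dimensions, i.e. enough singular values are numerically non-zero.
    public static boolean hasSufficientRank(double[][] factors, int features) {
        SingularValueDecomposition svd =
            new SingularValueDecomposition(new Array2DRowRealMatrix(factors, false));
        // getRank() counts singular values above a numerical tolerance
        return svd.getRank() >= features;
    }
}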

 

 

Any hints, please?

 

Thank you.  

1 ACCEPTED SOLUTION

Master Collaborator

The implementations are entirely separate, although they do the same thing at a high level. Here, the sampling process is different enough that it made a difference in one place, even though both are sampling the same things.

 

This distributed/non-distributed distinction is historical; there are really two codebases here. This won't be carried forward in newer versions.


34 REPLIES

Master Collaborator

I'm certain it has nothing to do with the input itself. It looks fine, and that type of problem would manifest differently.

Master Collaborator

OK, I'm pretty certain I found the bug and fixed it here:

 

https://github.com/cloudera/oryx/commit/437df94d0b1c9d27b5c9f3b984b98973237d6f99

 

The factorization works as expected now. Are you able to test it too?

Explorer
Thank you so much for fixing this.

I ran a Hadoop and a local computation, and they were both fine this time: the model is preserved, and the MAP estimate for the last iteration is comparable to the local computation's (slightly smaller, presumably due to randomness).

I will continue to run a few more tests. I think there is still a slight mismatch: with Hadoop, none of the datasets converged later than iteration 8, but in memory it takes a lot longer to converge; for the bigger datasets it needs close to 50 iterations. But I will repeat all my tests with the new version and report back.

Explorer

I think something is still wrong. When using cleaned_taste_preferences_rated_last_month.csv, Hadoop converges at iteration 2, with a MAP of 0.00x. In memory it converges a lot later, and the MAP is ~0.11. This happens every time.

 

Do you see the same thing?

Master Collaborator

No, I see it finish at 6 iterations with a MAP of about 0.15 on Hadoop. Same data set? Double-check that you have the latest build, and maybe start from scratch with no other intermediate results.

Explorer

No, the other dataset - I uploaded it to Google Drive a few days back; it's called cleaned_taste_preferences_rated_last_month.csv. It's a one-month trace of production searches; when I go live with the recommender, I will train it on a similar dataset, but covering six months.

 

The ones I used to illustrate the initial problem with the rank, with names containing _11, contain just the users with at least 11 different searches. I used them only for development speed, as they are way smaller and will most likely converge.

Explorer

So, to summarize: earlier, when I checked that the bug was fixed, I wanted to do it fast, so I ran the in-memory and Hadoop computations on the "11" dataset; Hadoop converged faster, but the MAPs were similar, so I thought it was all right.

 

Then I started checking the lengthier but closer-to-reality dataset, and that's when the difference became clear.

 

Just some examples of in-memory runs:

Converged at    MAP score (at 10)
11              0.1058467372
27              0.1177788843
32              0.1187595734
18              0.1202960727
31              0.1206682346
26              0.1208719179
20              0.1209679965
21              0.1224116387

 

Hadoop: I've tried 3 runs so far, and they all converged at iteration 2 with a 0.00x MAP.
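
(For anyone reading along: "MAP at 10" is the mean, over users, of average precision over the top 10 recommendations against held-out data. A rough, self-contained sketch of one common definition of the metric - hypothetical helper code, not Oryx's evaluator:)

import java.util.*;

public class MapAtK {

    // Average precision: rewards ranking held-out ("relevant") items near the top.
    static double averagePrecision(List<String> recs, Set<String> relevant, int k) {
        double hits = 0.0;
        double sum = 0.0;
        for (int i = 0; i < Math.min(k, recs.size()); i++) {
            if (relevant.contains(recs.get(i))) {
                hits++;
                sum += hits / (i + 1); // precision at this cut-off position
            }
        }
        return relevant.isEmpty() ? 0.0 : sum / Math.min(k, relevant.size());
    }

    // MAP@k is just the mean of per-user average precision.
    static double meanAveragePrecision(Map<String, List<String>> recsByUser,
                                       Map<String, Set<String>> heldOutByUser,
                                       int k) {
        double total = 0.0;
        for (Map.Entry<String, List<String>> e : recsByUser.entrySet()) {
            Set<String> relevant =
                heldOutByUser.getOrDefault(e.getKey(), Collections.emptySet());
            total += averagePrecision(e.getValue(), relevant, k);
        }
        return recsByUser.isEmpty() ? 0.0 : total / recsByUser.size();
    }
}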

Explorer

(...wish there was an edit post button)

 

Conf settings for both:

 

model.test-set-fraction=0.25

model.features=6
model.lambda=1
model.alpha=30
model.iterations.max=60

 

Latest version of Oryx: I definitely have it, because I wiped out the folder, cloned again, and rebuilt. Also, I am no longer seeing the insufficient-rank behavior.

 

Master Collaborator

OK, I see that too. That's a different problem, although slightly more benign. It uses a sample of all the data to estimate whether convergence has happened, and here, somehow, the sample makes it look like convergence has happened too early. I'll look into why the sample is consistently biased. You can work around it by setting model.iterations.convergence-threshold to something low like 0.00001. Right now it's still running past 9 iterations on Hadoop with a MAP of about 0.10, so that's the symptom; now to find the cause. Thanks for the issue reports - this data set has turned up some bugs.
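
(For reference, the workaround is a one-line addition to the conf settings quoted earlier in the thread. The key name is as given above; 0.00001 is just an illustratively strict value, and the # comment assumes the conf format accepts comments - drop it if unsure:)

model.test-set-fraction=0.25
model.features=6
model.lambda=1
model.alpha=30
model.iterations.max=60
# Workaround: make the convergence test far stricter, so the biased sample
# cannot trigger an early stop; iterations then run until max or real convergence.
model.iterations.convergence-threshold=0.00001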

Explorer

You're welcome - I'm the one who should be thankful that you're looking into this, that the project exists in the first place, that it's open source, and that I don't have to write my own code for real-time model updates.

 

If it's of any help, the dataset is weirdly shaped - "pointy", if I can say that - because there is aggressive marketing around certain products and the majority of searches are centered on those. Users don't stay on the website long, and plenty of them have clicked on just one thing and then left.

 

Another thing I noticed last week, when trying to generate identical data in memory and on Hadoop by fixing the random generators' seeds (to see where things diverge), was that I couldn't :). I presumed it was because of the in-process vs. multiple-process execution of the Hadoop jobs - the random generators presumably get 'reset' when a new process launches, if something like "give me the next random number" is used. I didn't dig too deeply into it since you fixed the rank problems, but again, in case it's of any help.
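
(A toy illustration of the effect I mean, using java.util.Random - not Oryx's actual RNG handling. A single process draws one continuous sequence from a fixed seed, while every new JVM re-creates its generator and restarts the sequence, so per-record random values line up differently even with identical seeds:)

import java.util.Random;

public class SeedDemo {
    public static void main(String[] args) {
        // In-memory computation: one process, one continuous stream of draws.
        Random inProc = new Random(42L);
        for (int i = 0; i < 4; i++) {
            System.out.println("in-proc draw " + i + ": " + inProc.nextDouble());
        }

        // Hadoop-style: each task JVM builds its own Random(42L), so every
        // task replays the sequence from the start instead of continuing it.
        for (int task = 0; task < 2; task++) {
            Random perTask = new Random(42L);
            System.out.println("task " + task + " first draw: " + perTask.nextDouble());
        }
    }
}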