
Oryx ALS: Hadoop computation yields MAP 0.00x, but in memory produces 0.11

Re: Oryx ALS: Hadoop computation yields MAP 0.00x, but in memory produces 0.11

Master Collaborator

I'm certain it has nothing to do with the input itself. It looks fine, and those types of problems would show up differently.

Re: Oryx ALS: Hadoop computation yields MAP 0.00x, but in memory produces 0.11

Master Collaborator

OK, I'm pretty certain I found the bug and fixed it here:

 

https://github.com/cloudera/oryx/commit/437df94d0b1c9d27b5c9f3b984b98973237d6f99

 

The factorization works as expected now. Are you able to test it too?

Re: Oryx ALS: Hadoop computation yields MAP 0.00x, but in memory produces 0.11

Explorer
Thank you so much for fixing this.

I ran a Hadoop and a local computation and they were both fine this time:
the model is preserved, and the MAP estimate for the last iteration is
comparable to the local computation one (slightly smaller, presumably due
to randomness).

I will continue to run a few more tests. I think there is still a slight
mismatch: with Hadoop none of the datasets converged later than
iteration 8, but in memory takes a lot more iterations to converge; for the
bigger datasets it needs close to 50. But I will repeat all my tests with
the new version and report back.

Re: Oryx ALS: Hadoop computation yields MAP 0.00x, but in memory produces 0.11

Explorer

I think something is still wrong. When using cleaned_taste_preferences_rated_last_month.csv, Hadoop converges at iteration 2, with a MAP of 0.00x. In memory converges a lot later, and its MAP is ~0.11. This happened every time.

 

Do you see the same thing?

Re: Oryx ALS: Hadoop computation yields MAP 0.00x, but in memory produces 0.11

Master Collaborator

No, I see it finish at 6 iterations with a MAP of about 0.15 on Hadoop. Same data set? Double-check that you have the latest build, and maybe start from scratch with no other intermediate results.

Re: Oryx ALS: Hadoop computation yields MAP 0.00x, but in memory produces 0.11

Explorer

No, the other dataset. I uploaded it to Google Drive a few days back; it's called cleaned_taste_preferences_rated_last_month.csv. It's a one-month trace of production searches. When I go live with the recommender, I will train it on a similar dataset, but covering six months.

 

The ones I used to illustrate the initial problem with the rank, with names containing _11, contain only users with at least 11 different searches. I used them just for development speed, as they are way smaller and will most likely converge.

 

 

 

 

Re: Oryx ALS: Hadoop computation yields MAP 0.00x, but in memory produces 0.11

Explorer

So to summarize: earlier, when I checked that the bug was fixed, I wanted to do it fast, so I ran the in-memory and Hadoop computations on the "11" dataset. Hadoop converged faster but the MAPs were similar, so I thought it was alright.

 

Then I started checking the lengthier but closer-to-reality dataset, and that's when the difference became clear.

 

Some example runs in memory:

Converged at    MAP score (at 10)
11              0.1058467372
27              0.1177788843
32              0.1187595734
18              0.1202960727
31              0.1206682346
26              0.1208719179
20              0.1209679965
21              0.1224116387

 

Hadoop: I tried 3 runs so far and they all converged at iteration 2 with a MAP of 0.00x.
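For readers unfamiliar with the metric being compared, here is a minimal sketch of one common definition of mean average precision at k (MAP at 10 in the runs above). This is an illustration only: the function names and the data layout are hypothetical, and Oryx's own evaluation code may differ in details such as normalization.

```python
from typing import Dict, List, Set


def average_precision_at_k(recommended: List[str], relevant: Set[str], k: int = 10) -> float:
    """Average precision over the top-k recommendations for one user."""
    if not relevant:
        return 0.0
    hits = 0
    score = 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            # Precision at this rank, counted only where a held-out item appears.
            score += hits / rank
    # Normalize by the best achievable number of hits within the top k.
    return score / min(len(relevant), k)


def mean_average_precision_at_k(
    recs: Dict[str, List[str]], held_out: Dict[str, Set[str]], k: int = 10
) -> float:
    """Mean of per-user average precision over users that have held-out items."""
    users = [u for u, items in held_out.items() if items]
    if not users:
        return 0.0
    total = sum(average_precision_at_k(recs.get(u, []), held_out[u], k) for u in users)
    return total / len(users)
```

Under this definition, a MAP of 0.00x means held-out test items almost never appear near the top of the recommendation lists, which is why it stands out against ~0.11 from the in-memory runs.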

Re: Oryx ALS: Hadoop computation yields MAP 0.00x, but in memory produces 0.11

Explorer

(...wish there was an edit post button)

 

Conf settings for both:

 

model.test-set-fraction=0.25

model.features=6
model.lambda=1
model.alpha=30
model.iterations.max=60

 

Latest version of Oryx: I definitely have it because I wiped out the folder, cloned again and built. Also I am not getting the insufficient rank behavior anymore. 

 

Re: Oryx ALS: Hadoop computation yields MAP 0.00x, but in memory produces 0.11

Master Collaborator

OK, I see that too. That's a different problem, although a slightly more benign one. It uses a sample of all data to estimate whether convergence has happened, and here the sample somehow makes it look like convergence has happened too early. I'll look into why the sample is regularly biased. You can work around it by setting model.iterations.convergence-threshold to something low like 0.00001. With that setting it's currently still running past 9 iterations on Hadoop and MAP is about 0.10, so that's the symptom; now to find the cause. Thanks for the issue reports, this data set has turned up some bugs.
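To make the idea of a sample-based convergence test concrete, here is a rough sketch. It is not Oryx's implementation: the function names, the choice of mean absolute change in predicted strength as the statistic, and the way pairs are sampled are all assumptions made for illustration. Only the threshold value 0.00001 comes from the workaround above.

```python
import random
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def sample_pairs(user_ids: List[str], item_ids: List[str], n: int, seed: int) -> List[Tuple[str, str]]:
    """Draw n random (user, item) pairs; hypothetical sampling scheme."""
    rng = random.Random(seed)
    return [(rng.choice(user_ids), rng.choice(item_ids)) for _ in range(n)]


def mean_abs_change(
    pairs: List[Tuple[str, str]],
    prev_x: Dict[str, Vector], prev_y: Dict[str, Vector],
    cur_x: Dict[str, Vector], cur_y: Dict[str, Vector],
) -> float:
    """Average change in predicted strength on the sampled pairs between two iterations."""
    total = 0.0
    for u, i in pairs:
        total += abs(dot(cur_x[u], cur_y[i]) - dot(prev_x[u], prev_y[i]))
    return total / len(pairs)


def has_converged(pairs, prev_x, prev_y, cur_x, cur_y, threshold: float = 0.00001) -> bool:
    """Stop iterating once the sampled predictions barely move."""
    return mean_abs_change(pairs, prev_x, prev_y, cur_x, cur_y) < threshold
```

In a test of this shape, the decision depends entirely on the sampled pairs, so a sample that is not representative of the full data can trigger an early stop. Lowering the threshold simply demands a smaller change before stopping.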

Re: Oryx ALS: Hadoop computation yields MAP 0.00x, but in memory produces 0.11

Explorer

You're welcome, and I should be thankful you're looking into this and that the project exists in the first place, and it's open source, and I don't have to write my own code for real time model updates.

 

If it's of any help, the dataset is weirdly shaped, pointed, if I can say that, because there is aggressive marketing around certain products and the majority of searches are centered around those. Users don't stay on the website too long and plenty of them have clicked on just one thing then left.

 

Another thing I noticed last week: when I tried to generate identical data with in memory and Hadoop by fixing the random generator seeds, to see where things diverge, I couldn't :). I presumed it was because of in-process vs. multi-process execution of the Hadoop jobs (the random generators presumably get 'reset' when a new process is launched, if calls like "give me the next random" are used). I didn't delve into it much since you fixed the rank problems, but again, in case it's of any help.
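The hypothesis about seeded generators can be illustrated with a small sketch. This only demonstrates the general effect described in the post, under the assumption that each new process creates its own generator from the same fixed seed; it says nothing about how Oryx actually seeds its generators. The function names are hypothetical.

```python
import random
from typing import List


def draws_single_process(seed: int, tasks: int, per_task: int) -> List[List[float]]:
    """One generator shared by all tasks: each task continues the same stream."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(per_task)] for _ in range(tasks)]


def draws_per_process(seed: int, tasks: int, per_task: int) -> List[List[float]]:
    """Each task simulates a fresh process that re-creates the generator from the seed."""
    results = []
    for _ in range(tasks):
        rng = random.Random(seed)  # the stream restarts from the beginning
        results.append([rng.random() for _ in range(per_task)])
    return results


single = draws_single_process(42, tasks=3, per_task=2)
multi = draws_per_process(42, tasks=3, per_task=2)
print(single[0] == multi[0])  # the first task sees the same numbers either way
print(single[1] == multi[1])  # later tasks do not
```

With the same seed, only the first task matches between the two modes; every later task in the per-process variant repeats the first task's numbers instead of continuing the stream, so the two executions cannot produce identical data.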

 
