
Performance evaluation in Oryx

SOLVED


New Contributor

Hi guys,

 

I've just tried out the collaborative filtering example on the Oryx GitHub page. Overall, the implementation was easy and straightforward.

 

I wonder if there's a way in Oryx to evaluate the recommender system's performance, e.g. how relevant the recommended items are?

 

Many thanks,

Sally

1 ACCEPTED SOLUTION

Accepted Solutions

Re: Performance evaluation in Oryx

Master Collaborator

Evaluating recommender systems is always a bit hard to do in the lab, since there's no real way to know which of all of the other items are actually the best recommendations. The best you can do is hold out some items the user has interacted with and see whether they rank highly in the recommendations later, after building a model on the rest of the data.
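The hold-out scheme described above can be sketched in a few lines. This is an illustrative Python sketch of the general idea only, not Oryx's actual implementation; `split_holdout` and the data layout are assumptions for the example:

```python
import random

def split_holdout(user_items, fraction=0.1, seed=42):
    """For each user, hide a fraction of their items for later evaluation.

    Illustrative only: Oryx performs an equivalent split internally.
    """
    rng = random.Random(seed)
    train, held_out = {}, {}
    for user, items in user_items.items():
        shuffled = list(items)
        rng.shuffle(shuffled)
        # Hold out at least one item per user (when they have more than one).
        k = max(1, int(len(shuffled) * fraction)) if len(shuffled) > 1 else 0
        held_out[user] = set(shuffled[:k])
        train[user] = set(shuffled[k:])
    return train, held_out

interactions = {"alice": {"a", "b", "c", "d"}, "bob": {"b", "c"}}
train, test = split_holdout(interactions)
```

A model built on `train` can then be scored by how highly the items in `test` rank among its recommendations.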

 

The ALS implementation does this for you automatically. You will see an evaluation score printed in the logs as an "AUC" (area under the curve) score. 1.0 is perfect; 0.5 is what random guessing would get. So you can get some sense there. Above about 0.7 is probably "OK", but it depends.

 

That's about all the current version offers. In theory this helps you tune the model parameters since you can try a bunch of values of lambda, features, etc. and pick the best. But that's very manual.
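The manual tuning loop described here amounts to a simple grid search. A rough Python sketch, where `train_and_evaluate` is a hypothetical stand-in for rebuilding the model with given parameters and reading its evaluation score from the logs (it is not a real Oryx API):

```python
import itertools

def grid_search(train_and_evaluate, lambdas, feature_counts):
    """Try every (lambda, features) combination and keep the best score."""
    best_score, best_params = float("-inf"), None
    for lam, k in itertools.product(lambdas, feature_counts):
        score = train_and_evaluate(lam, k)  # e.g. the logged evaluation score
        if score > best_score:
            best_score, best_params = score, (lam, k)
    return best_params, best_score
```

With a fake evaluator this just picks the combination with the highest score, which is exactly the "try a bunch of values and pick the best" process, automated.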

 

The next version, 2.0, which is being architected now, will go much further: it will build many models at once and pick the best one, so it will be able to choose hyperparameters automatically.

4 REPLIES


Re: Performance evaluation in Oryx

New Contributor
Thanks Sean. That's very helpful. I'll give it a go.

Oryx 2.0 looks very exciting! Looking forward to hearing more about it soon!

Re: Performance evaluation in Oryx

New Contributor

Hi Sean,

 

I've come back to this post, and I'm trying to figure out what AUC is actually measuring here.

 

Also, I don't see the score printed in the logs on screen. Where should I look for it?

 

Regards,

Sally

Re: Performance evaluation in Oryx

Master Collaborator

Apologies, I'm mixing up 1.x and 2.x. The default evaluation metric in 1.x is mean average precision, or MAP. This is a measure of how much the top recommendations contained some items that were held out for the user. In local mode you can find lines like "Mean average precision: xxx" in the logs.
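For reference, mean average precision over held-out items can be computed along these lines. This is an illustrative Python sketch of the standard metric, not Oryx's code; the function names and data shapes are assumptions:

```python
def average_precision(recommended, held_out):
    """AP: average of precision@k at each rank where a held-out item appears."""
    hits, score = 0, 0.0
    for rank, item in enumerate(recommended, start=1):
        if item in held_out:
            hits += 1
            score += hits / rank  # precision at this rank
    return score / len(held_out) if held_out else 0.0

def mean_average_precision(recs_by_user, held_out_by_user):
    """MAP: mean of the per-user average precisions."""
    aps = [average_precision(recs_by_user[u], held_out_by_user[u])
           for u in held_out_by_user]
    return sum(aps) / len(aps)
```

A MAP near 1.0 means held-out items consistently appear at the top of each user's recommendations; near 0.0 means they rarely appear at all.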

 

In distributed mode, now that I review the code, I don't see that it is ever logged; it is written to a file called "MAP" under the subdirectory for the iteration. I can at least make the mapper workers output their own local value of MAP.

 

In 2.x the metric is AUC, which is basically a measure of how likely it is that a 'good' recommendation (one from the held-out data set) ranks above a randomly chosen item. It is a broader, different measure. You should definitely find it printed in the logs if you're using 2.x, along with the hyperparameters that yielded it.
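That pairwise reading of AUC can be sketched directly: estimate the probability that a held-out item outranks a randomly chosen other item. Illustrative Python only; the `scores` dict stands in for a model's ranking scores and is not an Oryx structure:

```python
import random

def pairwise_auc(scores, held_out, samples=10000, seed=0):
    """Estimate AUC as P(held-out item scores above a random other item)."""
    rng = random.Random(seed)
    good = [item for item in scores if item in held_out]
    rest = [item for item in scores if item not in held_out]
    wins = 0.0
    for _ in range(samples):
        g, r = rng.choice(good), rng.choice(rest)
        if scores[g] > scores[r]:
            wins += 1.0
        elif scores[g] == scores[r]:
            wins += 0.5  # ties count as half a win
    return wins / samples

scores = {"a": 0.9, "b": 0.8, "c": 0.2, "d": 0.1}
auc = pairwise_auc(scores, held_out={"a", "b"})
```

Here every held-out item outranks every other item, so the estimate is exactly 1.0; a model scoring at random would hover around 0.5, matching the interpretation above.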