Created on 08-28-2014 09:05 AM - edited 09-16-2022 08:15 AM
Hi guys,
I've just tried out the collaborative filtering example on the Oryx GitHub page. Overall, the implementation was easy and straightforward.
I wonder if there's a way in Oryx to evaluate the recommender's performance, e.g. how relevant the recommended items actually are?
Many thanks,
Sally
Created 08-28-2014 09:12 AM
Evaluating recommender systems is always a bit hard to do in the lab, since there's no real way to know which of all of the other items are actually the best recommendations. The best you can do is hold out some items the user has interacted with and see whether they rank highly in the recommendations later, after building a model on the rest of the data.
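To make the hold-out idea concrete, here is a minimal sketch in plain Python (this is an illustration, not Oryx's actual code); `build_model` and `recommend` are hypothetical stand-ins for whatever recommender you are evaluating:

```python
# Minimal sketch of hold-out evaluation for a recommender (not Oryx code).
import random
from collections import defaultdict

def split_holdout(interactions, holdout_fraction=0.2, seed=42):
    """Split each user's interacted items into a training set and a held-out set."""
    rng = random.Random(seed)
    train, held_out = defaultdict(set), defaultdict(set)
    for user, items in interactions.items():
        items = list(items)
        rng.shuffle(items)
        k = max(1, int(len(items) * holdout_fraction))
        held_out[user].update(items[:k])
        train[user].update(items[k:])
    return train, held_out

# Hypothetical usage:
# train, held_out = split_holdout(all_interactions)
# model = build_model(train)                            # train only on the remaining data
# recs = {u: recommend(model, u, n=10) for u in held_out}
# ...then check how highly each user's held-out items rank in recs[u].
```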
The ALS implementation does do this for you automatically. You will see an evaluation score printed in the logs as an "AUC" (area under the curve) score. 1.0 is perfect; 0.5 is what random guessing would get, so that gives you some sense of where you stand. Above 0.7 or so is probably "OK", but it depends.
That's about all the current version offers. In theory this helps you tune the model parameters since you can try a bunch of values of lambda, features, etc. and pick the best. But that's very manual.
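For illustration, that manual tuning loop looks something like the sketch below; `build_and_score` is a hypothetical placeholder for rebuilding the model with a given setting and reading its held-out score back out:

```python
# Sketch of a manual hyperparameter sweep (not an Oryx feature).
from itertools import product

def build_and_score(lam, features):
    # Hypothetical placeholder: in practice, rebuild the model with these
    # hyperparameters and return its held-out score (e.g. MAP). The fake
    # formula below only exists to make the example runnable.
    return 0.3 - abs(lam - 0.1) / 10 - abs(features - 30) / 1000

lambdas = [0.01, 0.1, 1.0]
feature_counts = [10, 30, 50]

best_score, best_params = float("-inf"), None
for lam, features in product(lambdas, feature_counts):
    score = build_and_score(lam, features)   # higher is better
    if score > best_score:
        best_score, best_params = score, (lam, features)

print("Best (lambda, features):", best_params, "with score", best_score)
```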
The next 2.0 version being architected now is going to go much further, and build many models at once and pick the best one. So it will be able to pick hyperparameters automatically.
Created 11-10-2014 09:57 AM
Hi Sean,
I've come back to this post, and I'm trying to figure out what exactly AUC is measuring here.
Also, I don't see the score being printed in the logs on screen. Where should I look for it?
Regards,
Sally
Created 11-10-2014 02:43 PM
Apologies, I'm mixing up 1.x and 2.x. The default evaluation metric in 1.x is mean average precision, or MAP. This measures how well the top recommendations recover items that were held out for each user. In local mode you can find lines like "Mean average precision: xxx" in the logs.
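For reference, MAP in this sense can be computed with a generic sketch like the one below (an illustration of the metric, not the exact code Oryx 1.x runs):

```python
# Generic mean average precision (MAP) over held-out items.

def average_precision(recommended, held_out):
    """AP for one user: `recommended` is a ranked list, `held_out` is the set of
    items the user interacted with but that were hidden from training."""
    if not held_out:
        return None
    hits, score = 0, 0.0
    for rank, item in enumerate(recommended, start=1):
        if item in held_out:
            hits += 1
            score += hits / rank          # precision at this rank
    return score / min(len(held_out), len(recommended))

def mean_average_precision(recs_by_user, held_out_by_user):
    aps = [average_precision(recs_by_user[u], held_out_by_user[u])
           for u in held_out_by_user if u in recs_by_user]
    aps = [ap for ap in aps if ap is not None]
    return sum(aps) / len(aps) if aps else 0.0

# Example: the single held-out item appears at rank 2 of the top-3 list,
# so AP = 1/2 and MAP over this one user is 0.5.
print(mean_average_precision({"sally": ["a", "b", "c"]}, {"sally": {"b"}}))
```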
In distributed mode, now that I review the code, I don't see that it is ever logged. It is written to a file called "MAP" under the subdirectory for the iteration. I can make the mapper workers output their own local value of MAP at least.
In 2.x the metric is AUC, which is basically a measure of how likely it is that a 'good' recommendation (from the held-out data set) ranks above a random item. It's a broader, different measure. This one you should definitely find printed in the logs if you're using 2.x, along with the hyperparameters that yielded it.
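If it helps, here is a generic sketch of AUC in that pairwise sense (again an illustration of the metric, not Oryx 2.x's actual implementation); `score` is a hypothetical function giving the model's predicted preference for an item for this user:

```python
# AUC as the fraction of (held-out item, random other item) pairs where the
# model scores the held-out item higher.
import random

def auc_for_user(score, held_out, candidate_items, n_pairs=1000, seed=0):
    """`score(item)` is the model's predicted preference for this user."""
    rng = random.Random(seed)
    negatives = [i for i in candidate_items if i not in held_out]
    if not held_out or not negatives:
        return None
    positives = list(held_out)
    wins = 0.0
    for _ in range(n_pairs):
        pos, neg = rng.choice(positives), rng.choice(negatives)
        if score(pos) > score(neg):
            wins += 1
        elif score(pos) == score(neg):
            wins += 0.5                   # ties count as half
    return wins / n_pairs

# With random scores this hovers around 0.5; a good model pushes it toward 1.0.
print(auc_for_user(lambda item: random.random(),
                   held_out={"x"}, candidate_items=["x", "y", "z"]))
```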