Member since: 11-27-2014 · Posts: 32 · Kudos Received: 0 · Solutions: 0
03-31-2015 01:19 AM
Hi. In the system I am working on, every now and again a few user identifiers collapse into one, and the items associated with all of those identifiers end up belonging to a single user id. Can I reflect these changes in Oryx in real time, for example by feeding the item associations from the collapsed ids to the remaining id and deleting the old ids from the model? I am looking at knownItems, but that does not bring back the ratings for the associations. Is there a simple way to achieve this in Oryx? I do have the option of retrieving the weights from the system that feeds in the user associations to begin with, but that is quite complicated, so I'm hoping Oryx's own bookkeeping is easier to access!
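To make the idea concrete, here is a rough Python sketch of what I am picturing. It assumes the serving layer's REST endpoints (GET /knownItems/{user}, POST /pref/{user}/{item} with the strength as the request body, DELETE /pref/{user}/{item}) and a hypothetical weight_lookup callback into our upstream system, since knownItems does not return the strengths:

```python
import requests

ORYX = "http://localhost:8080"  # assumed serving-layer address

def merge_users(old_ids, target_id, weight_lookup):
    """Fold the associations of collapsed user ids into the surviving id.

    weight_lookup(user_id, item_id) is a hypothetical callback into the
    upstream system, needed because knownItems returns item ids only.
    """
    for old_id in old_ids:
        # assumed response format: one item id per line
        items = requests.get(f"{ORYX}/knownItems/{old_id}").text.split()
        for item_id in items:
            weight = weight_lookup(old_id, item_id)
            # re-attach the association to the surviving user id
            requests.post(f"{ORYX}/pref/{target_id}/{item_id}", data=str(weight))
            # and detach it from the collapsed id
            requests.delete(f"{ORYX}/pref/{old_id}/{item_id}")
```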
12-14-2014 01:32 PM
I am sorry, in the first paragraph I meant that I call addPreference with -1, not removePreference (I just kick out whatever searches happened before the window).
12-14-2014 01:29 PM
I am sorry, I think I was unclear. For decaying old data I just call removePreference myself, so there is no need for that to be done automatically by Oryx; that part is settled.

But in some cases that I was not able to reproduce consistently, I found that subsequent generations do not contain data from a generation earlier than model.generations.keep, so the generations are not really cumulative. (I did read the thread you had with Jason Chen about generations a while back, but I posted because I noticed this different behavior.) It only happens once in a while and I don't know how to trigger it; I only got it a few times in the past few days (e.g. today I was not able to reproduce it at all). I will come back if I manage to produce some reliable steps that lead to this.

I do have a question on addPreference with negative values: what happens when we have decreased enough to reach 0? I've tried that, and knownItems still remembers these "nullified" items. recommend still returns results, but they all have a strength of 0, which I think is expected, as at this point we're supposed to know nothing about this user. For my particular use case, it would be best if items reaching 0 moved out of knownItems (I can do that myself by calling removePreference) and, once all the items reached 0, the user became unknown. Is there a method for this last part? Thank you.
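For reference, the workaround I am applying looks roughly like this in Python. It assumes the POST/DELETE /pref/{user}/{item} endpoints, and that the accumulated value per association is tracked on my side, since Oryx does not report the input strength back:

```python
import requests

ORYX = "http://localhost:8080"  # assumed serving-layer address

def decay_preference(user_id, item_id, current_value, decrement):
    """One decay step; current_value is tracked outside Oryx."""
    remaining = current_value - decrement
    if remaining > 0:
        # nudge the association down with a negative preference
        requests.post(f"{ORYX}/pref/{user_id}/{item_id}", data=str(-decrement))
    else:
        # at zero (or below): remove it so it also leaves knownItems
        requests.delete(f"{ORYX}/pref/{user_id}/{item_id}")
    return max(remaining, 0.0)
```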
12-12-2014 02:22 PM
Hi. I would like to implement a sliding-window type of data decay with Oryx. This is because I have two streams of searches coming in:
- real-time searches on the website
- data that decays and goes in via removePreference

The window is of, say, 6 months, and every day I remove anything that's older than that, essentially ensuring that everything in the model is from the past 6 months. I will be triggering one model rebuild per day.

Now, my understanding of the generation model is that the latest generation is built out of the aggregation of all available generations up to this last one. This is because addPreference and removePreference generate patches of the generation in use and save them in current+1; triggering a model rebuild then generates current+1, which contains all data we've got so far. All of this works fine up until we reach the generation numbered model.generations.keep, as this wipes out the initial set, which contains the majority of the data, and we're left with just the updates.

If what I've explained above is what actually happens, how do I go about configuring Oryx to do the sliding window, other than aggregating the files myself and restarting from 00000 every once in a while? Thank you.
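For clarity, the daily expiry pass I have in mind is roughly this Python sketch (per the correction above, using addPreference with a negative value rather than removePreference; event_store.older_than is a hypothetical query against our system of record, and the /pref endpoint path is assumed):

```python
import datetime
import requests

ORYX = "http://localhost:8080"          # assumed serving-layer address
WINDOW = datetime.timedelta(days=183)   # roughly 6 months

def expire_old_events(event_store):
    """Drop every event that has slid out of the window."""
    cutoff = datetime.datetime.now() - WINDOW
    for user_id, item_id, value in event_store.older_than(cutoff):
        # cancel the old event's contribution with an equal negative preference
        requests.post(f"{ORYX}/pref/{user_id}/{item_id}", data=str(-value))
```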
12-11-2014 10:01 AM
Hi Sean. Just to let you know the outcome of this: all of my tests yesterday with Hadoop, with various parameters, on the one-month-of-searches dataset, went fine. I will not continue testing further on the whole big dataset, as for the moment it looks like Hadoop is out of the picture, since I managed to get hold of a machine with 512 GB of RAM, which proved up to the challenge of running Oryx in memory. The dataset is 421 MB, with roughly 20 million records, and it took just a few minutes to go through 29 iterations, so well done! It seemed like a big portion of the time was spent writing the model (this is an SSD machine). (I will continue by looking at recommendation response times, how they're affected when I ingest users, etc.) Thank you for the help with the bugs and all the explanations along the way.
12-10-2014 01:33 AM
I understand now, thank you. I will be running tests all day today on the Hadoop version and will come back if there are any issues. I am looking forward to seeing how this scales, as my whole dataset has around 20 million records; at this time I can't try it out, as I do not have a Hadoop cluster (the company I work for will only be able to give me some VMs towards the end of January). (I might be able to run this sooner via the Google compute API, but that's not certain.)
12-09-2014 11:09 AM
Great news, thank you. I will restart testing everything tomorrow. In somewhat related news, and if you've got time to explain: I've very briefly looked at the code changes, but I still do not understand why this happened on Hadoop only?
12-08-2014 02:09 AM
You're welcome, and I should be the thankful one: you're looking into this, the project exists in the first place, it's open source, and I don't have to write my own code for real-time model updates.

If it's of any help, the dataset is weirdly shaped, pointed if I can say that, because there is aggressive marketing around certain products and the majority of searches are centered on those. Users don't stay on the website long, and plenty of them have clicked on just one thing and then left.

Another thing I noticed last week, when trying to generate identical data in-memory and on Hadoop (to see where things diverge) by fixing the random generators' seeds, was that I couldn't :). I presumed it was because of the in-proc vs. multiple-processes execution of the Hadoop jobs (the random generators presumably get 'reset' when launching a new process, if things like "give me the next random" are used). I didn't dwell too much on it since you fixed the rank problems, but again, in case it's of any help.
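To illustrate what I mean about the seeds, a tiny Python analogue: seeding the RNG in the launching process does not carry over to worker processes started fresh (much like separate JVMs for Hadoop tasks), so the run is not reproducible:

```python
import random
from multiprocessing import get_context

def draw(_):
    # each spawned worker re-initializes the random module from OS entropy,
    # so the parent's seed has no effect here
    return random.random()

if __name__ == "__main__":
    random.seed(42)              # seeds only the parent process
    ctx = get_context("spawn")   # fresh interpreter per worker, like a new JVM
    with ctx.Pool(2) as pool:
        print(pool.map(draw, range(2)))  # different values on every run
```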
12-07-2014 02:35 PM
(...wish there was an edit-post button) Conf settings for both:
model.test-set-fraction=0.25
model.features=6
model.lambda=1
model.alpha=30
model.iterations.max=60
Latest version of Oryx: I definitely have it, because I wiped out the folder, cloned again, and rebuilt. Also, I am not getting the insufficient-rank behavior anymore.
12-07-2014 02:28 PM
So, to summarize: earlier, when I checked that the bug was fixed, I wanted to do it fast, so I ran the in-memory and Hadoop computations on the "11" dataset; Hadoop converged faster, but the MAPs were similar, so I thought it was alright. Then I started checking the lengthier but closer-to-reality dataset, and that's when the difference became clear. Some example runs with in-memory:

Converged at | MAP score (at 10)
11 | 0.1058467372
27 | 0.1177788843
32 | 0.1187595734
18 | 0.1202960727
31 | 0.1206682346
26 | 0.1208719179
20 | 0.1209679965
21 | 0.1224116387

Hadoop: I have tried 3 runs so far, and they all converged at 2 with a MAP of 0.00x.