Support Questions
Find answers, ask questions, and share your expertise

Handling aging

New Contributor

My stab at a movie recommendation system is not going as well as planned because I have not found a good way to age off older data.  For example, the recommender was populated with 5 years' worth of sales and behavior activities that expressed either a positive or negative interest in movies.  Because older movies simply have more data points, they flood the recommendation sets, particularly for users whose data goes back 5 years.  But users are generally more interested in recommendations of newer movies.


So, I created a series of decay jobs that run through all of the historic data and, based on the age of each datapoint, calculate a deduction. As an extreme simplification: if user A had purchased one movie each 4, 3, 2, and 1 years ago as well as one movie this year (0), and each movie purchase was worth 4 points, then after running my initial decay job the user's data would go from (movie/year:score):


4:4, 3:4, 2:4, 1:4, 0:4  to 4:0.5, 3:0.5, 2:1, 1:2, 0:4
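A minimal sketch of that kind of decay function (the one-year half-life and the 0.5 floor are assumptions chosen only to reproduce the example numbers above, not the poster's actual parameters):

```python
def decay_score(score, age_years, half_life=1.0, floor=0.5):
    """Halve a datapoint's score for every `half_life` years of age,
    never letting it fall below `floor`."""
    return max(score * 0.5 ** (age_years / half_life), floor)

# Reproduces the example table: purchases worth 4 points, aged 4..0 years.
print([decay_score(4, age) for age in (4, 3, 2, 1, 0)])
```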


I can further decay the values over time with incremental deductions.


This improves the quality of the recommendations in terms of newness, and it helps ensure that a person who bought a 4-year-old movie this year is more closely associated with the behavior of others who did the same, rather than with those who bought the movie when it first came out.  But it means that I need to store every datapoint sent to the server along with a datestamp, so that I can decay the value according to the timeline I've chosen.  This becomes a scaling problem as I try to increase the touch points that represent positive or negative associations between users and movies.
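For what it's worth, storing every timestamped datapoint can be sidestepped if the decay is exponential, because exponential decay composes multiplicatively: decaying by one day and then another day is the same as decaying by two days. It is then enough to keep one running score per user/movie pair plus the time it was last touched. A hypothetical sketch (the class, the key scheme, and the half-life are illustrative, not the poster's schema):

```python
class DecayingScore:
    """One exponentially decayed aggregate per key instead of every
    raw event.  Decaying the aggregate lazily on each access gives the
    same result as storing and re-decaying every timestamped datapoint."""

    def __init__(self, half_life_days=365.0):
        self.half_life = half_life_days
        self.store = {}  # key -> (score, last_update_day)

    def _decayed(self, score, elapsed_days):
        return score * 0.5 ** (elapsed_days / self.half_life)

    def add(self, key, points, today):
        score, last = self.store.get(key, (0.0, today))
        self.store[key] = (self._decayed(score, today - last) + points, today)

    def get(self, key, today):
        score, last = self.store.get(key, (0.0, today))
        return self._decayed(score, today - last)
```

For example, 4 points added on day 0 read back as 2.0 on day 365 (one half-life later), with no per-event history kept.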


It would seem to me that this is a fairly typical problem for a recommender to have, and I wonder if there are more elegant ways of achieving it.




Master Collaborator

Yes indeed, aging is built in as the 'decay factor' setting.


This controls how much each generation decays, and at what point it is considered equal to 0.


It is driven by generation rather than timestamp. This is essentially the same thing if generations run at regular intervals, but do note the difference. It also only works when running on Hadoop.
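Generation-driven decay can be sketched like this (the factor of 0.5 and the zero cutoff of 0.01 are illustrative values here, not the library's defaults):

```python
def generation_decay(score, generations_old, factor=0.5, zero_below=0.01):
    """Decay a score by `factor` for each model generation that has
    passed; once the decayed value falls below `zero_below`, treat it
    as exactly 0 so it drops out of the model."""
    decayed = score * factor ** generations_old
    return decayed if decayed >= zero_below else 0.0
```

So a 4-point event is worth 1.0 two generations later, and after enough generations it is discarded entirely rather than lingering as a tiny weight.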