Updated: Aug 8
Collaborative Filtering Overview
Collaborative filtering is one of the most traditional and simplest approaches to recommendation. Its basic principle relies on capturing local patterns. There are two ways to do CF: user-to-user or item-to-item. The latter is more successful and widely used in industry, as it usually outperforms the former. The reasons are as follows:
a) Items are simpler than users and belong to a small set of genres, whereas a user can have varied tastes.
b) Item similarity is inherently more meaningful than user similarity.
Item-based CF contains two main components:
1. Item similarity computation
Item similarity computation builds a similar-items table by computing a similarity score for each pair of items. Assume we have n users and m items in the recommender system; the users' ratings for items can be expressed as an n × m matrix R, where r(p,q) is user u(p)'s rating for item i(q). An item can then be represented as the vector of ratings that the n users give to it. There are many ways to measure the similarity between items; here we use cosine similarity due to its popularity and simplicity. It is defined as follows:

sim(ip, iq) = Σ_{u∈U} r(u,p)·r(u,q) / ( √(Σ_{u∈U} r(u,p)²) · √(Σ_{u∈U} r(u,q)²) )

where U denotes the set of users. In particular, if user u has not rated item i(p), the rating r(u,p) is taken as zero.
2. Prediction computation
To recommend items to a user, we need to predict the ratings the user would give to unseen items. The predicted rating for a target item is the weighted average of the user's ratings for the target's neighbors:

pred(u, ip) = Σ_{iq ∈ N(k)(ip)} sim(ip, iq)·r(u,q) / Σ_{iq ∈ N(k)(ip)} sim(ip, iq)

where N(k)(ip) denotes item ip's k most similar items.
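As a concrete sketch, the two components above can be put together on a toy utility matrix. The data and the dictionary layout here are made up for illustration:

```python
import math

# Toy utility matrix: ratings[user][item]; 0 means "not rated".
ratings = {
    "u1": {"i1": 5, "i2": 3, "i3": 0},
    "u2": {"i1": 4, "i2": 0, "i3": 4},
    "u3": {"i1": 1, "i2": 1, "i3": 5},
}

def cosine_sim(p, q):
    """Cosine similarity between two item rating vectors (unrated = 0)."""
    dot = sum(r[p] * r[q] for r in ratings.values())
    norm_p = math.sqrt(sum(r[p] ** 2 for r in ratings.values()))
    norm_q = math.sqrt(sum(r[q] ** 2 for r in ratings.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

def predict(user, target, k=2):
    """Weighted average of the user's ratings over the target's k nearest rated items."""
    neighbours = sorted(
        (i for i in ratings[user] if i != target and ratings[user][i] > 0),
        key=lambda i: cosine_sim(target, i), reverse=True)[:k]
    num = sum(cosine_sim(target, i) * ratings[user][i] for i in neighbours)
    den = sum(cosine_sim(target, i) for i in neighbours)
    return num / den if den else 0.0

print(round(predict("u1", "i3"), 3))
```

The prediction is a convex combination of the user's own ratings, so it always lands between the smallest and largest rating the user has given to the neighbors.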
Now that we understand this simple, naive solution for recommendation, let's discuss the challenges of implementing it at scale, i.e., on a huge (n × m) utility matrix.
To use item-based CF in practice, the design of the algorithm should satisfy three requirements: "big", "real-time" and "accurate". We will go through these via the problems listed below.
Problem 1: Implicit Feedback
Explicit feedback alone is not always enough for a recommendation algorithm to achieve good accuracy, due to its sparse nature.
The majority of the abundant data is implicit feedback, i.e., user behaviors that indirectly reflect user opinions. Since we can only guess a user's preferences from these behaviors, implicit feedback may reduce the recommender system's performance if not handled appropriately.
There are various types of user behaviors in our scenario, including click, browse, purchase, share, comment, etc. Different actions represent different degrees of interest and should have different influences on the recommendation algorithm, so we assign different weights to different action types. For example, a browse may correspond to a one-star rating while a purchase corresponds to a five-star rating. For an item on which the user demonstrates multiple interests, i.e., multiple behaviors, we take the weight of the action with the maximum weight as the user's rating for that item, which reduces the noise introduced by messy implicit feedback.
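A minimal sketch of this weighting scheme; the specific action weights here are hypothetical and would be tuned per product:

```python
# Hypothetical action-to-weight mapping (one star .. five stars).
ACTION_WEIGHTS = {"browse": 1, "click": 2, "share": 3, "comment": 4, "purchase": 5}

def implicit_rating(actions):
    """Collapse a user's multiple behaviors on one item into a single rating
    by taking the maximum action weight, which suppresses noisy low-value events."""
    return max(ACTION_WEIGHTS[a] for a in actions)

print(implicit_rating(["browse", "click", "purchase"]))  # purchase dominates -> 5
```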
In short, we say a user rates (or likes) an item when the user demonstrates implicit interest in it.
Using co-ratings helps us leverage implicit feedback to get more accurate results.
The co-rating that a user gives to the item pair (i(p), i(q)) is defined as the minimum of the two user-item ratings: co-rating_u(ip, iq) = min(r(u,p), r(u,q)).
When using co-ratings, the similarity function defined above needs to be modified so that the similarity score still falls in the range [0, 1].
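The co-rating itself is just a min over the pair's two implicit ratings; a one-liner makes the definition concrete:

```python
def co_rating(r_up, r_uq):
    """Co-rating a user gives to an item pair: the min of the two implicit ratings."""
    return min(r_up, r_uq)

# A user who purchased item p (weight 5) but only browsed item q (weight 1)
# contributes a co-rating of 1 to the pair (p, q).
print(co_rating(5, 1))  # -> 1
```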
Another common approach to the data sparsity problem is based on demographic clustering. Users in a demographic group generally share similar interests and preferences, so we cluster users into demographic groups according to properties such as gender, age and education, and then run the recommendation algorithm within each group. This yields a more refined model and produces more accurate results.
The image below is for illustration: the shaded regions are demography-based clusters, which are far less sparse than the complete user × item matrix.
Problem 2: Scalable Incremental Update
The real-time item-based CF algorithm faces the challenge of frequent updates to the similar-items table, caused by changes in user ratings. Traditional recommender systems adopt a static periodical update strategy, but periodical updates cannot meet the requirements of real-time recommendation in a fast-changing scenario. To capture users' real-time interests, an incremental update strategy is preferred for the item-based CF algorithm.
Inspired by that incremental update mechanism, the similarity score of an item pair can be split into three parts in a practical CF algorithm:

sim(ip, iq) = pairCount(ip, iq) / ( √itemCount(ip) · √itemCount(iq) )

where pairCount(ip, iq) = Σ_{u∈U} co-rating_u(ip, iq) and itemCount(ip) = Σ_{u∈U} r(u,p). The pairCount and itemCount can be naturally updated incrementally, and we can calculate their new values independently. As a result, the similarity score can be updated incrementally as follows:

sim(ip, iq)′ = ( pairCount(ip, iq) + ΔpairCount(ip, iq) ) / ( √(itemCount(ip) + ΔitemCount(ip)) · √(itemCount(iq) + ΔitemCount(iq)) )

where sim(ip, iq)′ represents the new similarity score after a new (user, item, rating) observation is received.
Three layers can be used to compute the real-time incremental item-to-item similarities in parallel:
Layer 1: Captures <User, Item, Action> tuples, grouped by user id.
Layer 2: Stores historical user behavior, i.e., the old ratings and co-ratings. It identifies the ratings and co-ratings that need to be recalculated by comparing the new values with the old ones, and emits the affected items and item pairs together with their changed weights, i.e., Δr(u,p) and Δco-rating(ip, iq), to the next layer.
Layer 3: Recalculates the new similarity scores based on the deltas from Layer 2.
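Under the pairCount/itemCount decomposition, Layers 2 and 3 can be sketched in a few lines; the in-memory dictionaries here are a hypothetical stand-in for the real historical store, and the similarity formula assumes sim = pairCount / (√itemCount · √itemCount):

```python
import math
from collections import defaultdict

item_count = defaultdict(float)   # itemCount(ip): sum of ratings on ip
pair_count = defaultdict(float)   # pairCount(ip, iq): sum of co-ratings
old_rating = defaultdict(dict)    # Layer 2's history: old_rating[user][item]

def observe(user, item, rating):
    """Layers 2-3: diff the new rating against history, then apply the deltas."""
    prev = old_rating[user].get(item, 0.0)
    item_count[item] += rating - prev                 # delta-r(u,p)
    for other, r_other in old_rating[user].items():   # affected pairs only
        if other == item:
            continue
        delta_co = min(rating, r_other) - min(prev, r_other)
        pair_count[tuple(sorted((item, other)))] += delta_co
    old_rating[user][item] = rating

def similarity(ip, iq):
    denom = math.sqrt(item_count[ip]) * math.sqrt(item_count[iq])
    return pair_count[tuple(sorted((ip, iq)))] / denom if denom else 0.0

observe("u1", "i1", 5)
observe("u1", "i2", 3)   # creates pair (i1, i2) with co-rating min(3, 5) = 3
print(round(similarity("i1", "i2"), 3))
```

Only the pairs that actually share a user with the updated item are touched, which is what makes the update incremental rather than a full recomputation.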
Problem 3: Computational Complexity
As an example, suppose each user rates more than ten items on average every day. Ten or more item pairs are then generated and updated for each user action, and each pair update needs several computations, which results in massive computation under a large volume of user actions.
Two items are considered related only if users rate them together within a certain period, which we call the linked time. On e-commerce sites the linked time typically spans days, e.g., three days or a week. Under such a window a single user triggers hundreds of computations, most of which are unnecessary: a large portion of the generated item pairs are not similar enough to matter, since only the items in N(k)(ip) in the prediction equation are used for prediction. This wastes a lot of resources.
To conquer this, we utilize the Hoeffding bound and develop a real-time pruning technique. The theory states the following: let x be a real-valued random variable whose range is R (for a similarity score the range is one). Suppose we have made n independent observations of this variable and computed their mean x̂. The Hoeffding bound states that, with probability 1 − δ, the true mean of the variable is at most x̂ + ϵ, where

ϵ = √( R² ln(1/δ) / (2n) )
We take the similarity scores of an item pair at different time points as random variables. Let t be the threshold of an item's similar-items list, i.e., the minimum similarity score among the similar items N(k)(ip). Given a desired δ, the Hoeffding bound guarantees that the correct similarity score of the item pair is no more than t with probability 1 − δ, if n updates have been seen so far and ϵ < t − x̂. In other words, after observing some of an item pair's similarity update operations, we can prune the dissimilar item pairs, i.e., the items that cannot enter N(k)(ip). Since the similarity computation involves two items, the pruning is bidirectional: we take the smaller of the two items' thresholds as t when pruning.
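A minimal sketch of the pruning test, assuming similarity scores in [0, 1] (so R = 1) and made-up numbers for the observed mean and the list threshold:

```python
import math

def hoeffding_epsilon(n, delta, value_range=1.0):
    """Hoeffding bound: with probability 1 - delta, the true mean is within
    epsilon of the observed mean after n independent observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def can_prune(mean_sim, n, threshold, delta=0.05):
    """Prune an item pair when even the optimistic estimate mean + epsilon
    cannot reach the similar-items-list threshold t, i.e. epsilon < t - mean."""
    return hoeffding_epsilon(n, delta) < threshold - mean_sim

# Hypothetical numbers: after 200 updates a pair averages similarity 0.05,
# while the weaker of the two items' top-k lists requires at least 0.30.
print(can_prune(0.05, 200, 0.30))  # -> True
```

Note that with few observations epsilon is large, so a pair is never pruned early; it takes enough evidence before the bound lets us drop it.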
Problem 4: Cold-start
Cold start is a well-known problem that occurs when a new user enters the system. To handle it, we can use the clustered demographic groups explained in Problem 1, which provide group information to further understand users. Specifically, for users with little historical data, a demographic-based (DB) algorithm is used to complement the recommendation results: we compute each demographic group's hot items, i.e., the most popular items in the group. In situations where the commonly used algorithms such as item-based CF and CB cannot effectively generate good recommendations, e.g., for a new or especially inactive user, we can fall back on the DB algorithm and recommend the group's hot items to the user.
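A sketch of the DB fallback; the interaction log, group labels and items are made up for illustration:

```python
from collections import Counter, defaultdict

# Hypothetical interaction log: (user, demographic_group, item).
events = [
    ("u1", "18-25", "sneakers"), ("u2", "18-25", "sneakers"),
    ("u3", "18-25", "headphones"), ("u4", "40-60", "kettle"),
]

# Per-group popularity counts, built offline (or incrementally).
group_counts = defaultdict(Counter)
for user, group, item in events:
    group_counts[group][item] += 1

def recommend_cold_start(group, k=2):
    """DB fallback: for a new or inactive user, return the group's hottest items."""
    return [item for item, _ in group_counts[group].most_common(k)]

print(recommend_cold_start("18-25"))  # sneakers first (2 visits)
```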
Problem 5: Hot Item Problem
This is a visit-inequality problem between hot items and cold ones. For example, demand for items like masks and sanitizer was at an all-time high during the COVID period, so these were hot items on e-commerce sites during that phase. Due to this visit inequality, the massive amount of transformation and computation overburdens the worker responsible for processing a hot item, leading to extreme load imbalance among the workers or even a worker crash. We use the combiner technique to solve this problem. A combiner is simply a map that buffers incoming <User, Item, Action> tuples and partially merges tuples with the same key; the combine operation may be an increment, an addition or a maximization.
Take the itemCount statistics as an example: we observe numerous read actions where many users visit the same hot item. We first sum these users' changed rating weights in the combiner, and then write the aggregated changed weight to the store in one operation. This significantly reduces the number of reads and writes to the store, since the combine operations are much lighter than the reads and writes.
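A minimal in-memory combiner sketch; the class and its merge strategy are illustrative, not the production design:

```python
class Combiner:
    """Buffer incoming <key, weight> tuples and partially merge same-key updates
    in memory, flushing one aggregated write per key instead of one per event."""

    def __init__(self, merge=lambda a, b: a + b):
        self.buffer = {}
        self.merge = merge          # addition here; could be max or increment
        self.events_seen = 0

    def add(self, key, weight):
        self.events_seen += 1
        if key in self.buffer:
            self.buffer[key] = self.merge(self.buffer[key], weight)
        else:
            self.buffer[key] = weight

    def flush(self):
        """One write to the backing store per distinct key."""
        out, self.buffer = self.buffer, {}
        return out

c = Combiner()                      # addition merge, e.g. for itemCount deltas
for _ in range(1000):
    c.add("hot_item", 1)            # 1000 visits to the same hot item
c.add("cold_item", 1)
writes = c.flush()
print(c.events_seen, "events collapsed into", len(writes), "writes")
```

The hot item's thousand events collapse into a single aggregated write, which is exactly what keeps the worker holding that key from being overwhelmed.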
We briefly discussed collaborative filtering and then focused on five problems that commonly occur while building a real-time, large-scale CF-based system, along with their solutions.