What makes a good recommender system?

“I think you should move to Australia. You will be a lot happier there!”.

How do you measure the quality of such a recommendation? In our tongue-in-cheek example, the basic approach would be to let a recommender system choose a large number of people, say 1,000, who, from the recommender system’s perspective, will be happier in Australia. Then randomly split them in half, relocate the first half to Australia, ask all of them “Are you happier now?”, and compare the responses of the two groups. Essentially, use a randomized experiment such as an A/B test (assuming measures of statistical significance are met, of course).

In more realistic situations, a recommender system will recommend something a bit less life-changing, such as products to buy (hopefully, you know where you want to be living!). But the question remains the same: how do you measure the quality of a product recommender? Ideally, we would run an A/B split test for each change to a recommender system, which is how many high-volume web-based systems work. However, not all uses of recommenders have such high sample rates. For example, a traditional brick-and-mortar CRM program may send out recommendation campaigns on a monthly basis. In these situations, it is not feasible to test every small change to the system; rather, live tests are reserved for bigger, more significant changes. So the question remains, but modified: how do you measure the quality of a product recommender when you can’t A/B test?

Let’s suppose our recommender did a perfect job: a great new product was recommended to a user. The user was surprised, since she had wanted such a product for a long time and had no idea that it was available. But, for some reason, she did not purchase the product after the recommendation campaign. Maybe she did not have the budget at the moment, or decided to buy it later, or purchased it from another account, or even from another merchant. Does that really mean that our recommender is bad?

When designing a product recommender system, an essential precursor to a live test is to tune the parameters of the model before we send the recommendations to users. The classical machine learning approach is to use cross-validation: split the dataset into training and test portions, train a model on the training dataset, and measure the model’s performance on the test dataset. Such an approach works well if you train a classifier, when you have a label in front of every row in the dataset. But it is far from perfect in the case of a recommender system. Ideally, what we need is a true score (how strongly a user likes the item) in front of every user-item pair. Instead, we have just the purchase history. However, the output of the model can still be viewed as a prediction of sorts. One way to measure the prediction error is to compare predicted values with the purchases. We can assume that every purchase means that the user likes the item with the maximum score (1.0). The error is the difference between the prediction and the value of 1.0 for each purchased item. The trivial error statistic would be the Root Mean Square Error (RMSE) metric:

\[ \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{k=1}^{n} \left(1 - y_k\right)^2} \]

where n is the number of purchases in the dataset and y_k is the prediction score of the k-th purchased product/user pair. Note, though, that this metric has a flaw: it does not encourage a model to recommend novel products, only ones that users have purchased in the past.
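As a minimal sketch of this purchase-based RMSE in Python: the function name `purchase_rmse` and the input format (a flat list of model scores, one per purchased user/item pair, each assumed to lie in [0, 1]) are illustrative assumptions, not part of the system described here.

```python
import math

def purchase_rmse(predicted_scores):
    """RMSE against a target of 1.0 for every purchased user/item pair.

    predicted_scores: hypothetical list of model scores, one per purchase.
    """
    n = len(predicted_scores)
    if n == 0:
        raise ValueError("need at least one purchase")
    # Every purchase is treated as a "true" score of 1.0.
    squared_errors = [(1.0 - y) ** 2 for y in predicted_scores]
    return math.sqrt(sum(squared_errors) / n)

# Two purchases predicted perfectly, two predicted at 0.5.
print(purchase_rmse([1.0, 0.5, 0.5, 1.0]))
```

Because only purchased pairs enter the sum, a model can score well here while never ranking an unseen product highly, which is the flaw described above.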

As we mentioned, the fact that recommended products were not purchased does not necessarily mean that the recommender model is bad. It may just mean that users don’t know about those products yet. But this is exactly the job of the recommender: to find those items!

In fact, splitting the dataset does not work well for recommender systems. It turns out that the data we need to measure the performance is not in the dataset at all! To address this problem, we can try to introduce more metrics beyond RMSE. Here are a few of the additional high-level metrics that we use internally to understand how the recommender is behaving, and a little bit about them.

Diversity is defined as the inverse of the similarity between the recommendation lists of all users [1]. It’s preferable to recommend a product set that is unique to each user and reflects their preferences. For two users i and j, the distance between their lists can be calculated as:

\[ d_{ij} = 1 - \frac{C_{ij}}{N} \]

where C_{ij} is the number of common items recommended to both users, and N is the size of the recommendation lists. Inter-list diversity is the average distance between all pairs of test users:

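\[ \mathrm{Diversity} = \frac{2}{L(L-1)} \sum_{i<j} d_{ij} \]

where L is the number of users and d_{ij} is the distance between the recommendation lists of users i and j.
Diversity is zero if we recommend the same items to everybody. Diversity is one if no two recommendation lists have a single item in common.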
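The pairwise distance and its average over all user pairs can be sketched in Python as follows. This is an illustration under stated assumptions: the function names and the `recommendations` dictionary (user id mapped to a list of exactly N item ids) are hypothetical, and every list is assumed to have the same length.

```python
from itertools import combinations

def list_distance(list_i, list_j, n):
    """1 minus the share of items common to two lists of size n."""
    common = len(set(list_i) & set(list_j))
    return 1.0 - common / n

def inter_list_diversity(recommendations, n):
    """Average list distance over all unordered pairs of users.

    recommendations: hypothetical dict mapping user id -> list of n item ids.
    """
    pairs = list(combinations(recommendations.values(), 2))
    if not pairs:
        raise ValueError("need at least two users")
    return sum(list_distance(a, b, n) for a, b in pairs) / len(pairs)

recs = {
    "u1": ["milk", "bananas"],
    "u2": ["milk", "bananas"],
    "u3": ["tea", "honey"],
}
# u1/u2 share everything (distance 0); u3 shares nothing with either (distance 1).
print(inter_list_diversity(recs, n=2))
```

Note that the number of pairs grows quadratically with the number of users, so on a large user base one would likely compute this on a sample of test users.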

A recommendation list is more novel the less likely the user is to know of the existence of the items in it. We define self-information-based novelty as a measure of novelty relative to the popularity of the items [2]. The assumption is that popular items provide less novelty. High novelty values correspond to long-tail items that few users have interacted with, and low novelty values correspond to popular head items.

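\[ \mathrm{Novelty} = \frac{1}{K} \sum_{i=1}^{K} -\log_2 P_i \]

where K is the total number of recommended items across all users, and P_i is the popularity of recommended item i, calculated as the portion of users who purchased it. The logarithm is taken in base 2 here, as is conventional for self-information; another base only rescales the metric.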
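A small Python sketch of a popularity-based novelty score along these lines is shown below. The function name, both input dictionaries, and the base-2 logarithm are assumptions made for illustration; the sketch also assumes that every recommended item has been purchased by at least one user, since the self-information of an item with zero popularity is unbounded.

```python
import math

def self_information_novelty(recommendations, purchases):
    """Mean of -log2(popularity) over every recommended item slot.

    recommendations: hypothetical dict user id -> list of recommended item ids.
    purchases: hypothetical dict user id -> set of purchased item ids.
    """
    num_users = len(purchases)
    # Count how many distinct users bought each item.
    buyers = {}
    for items in purchases.values():
        for item in set(items):
            buyers[item] = buyers.get(item, 0) + 1

    total, k = 0.0, 0
    for items in recommendations.values():
        for item in items:
            if item not in buyers:
                raise ValueError(f"item {item!r} has zero popularity")
            total += -math.log2(buyers[item] / num_users)
            k += 1
    if k == 0:
        raise ValueError("no recommended items")
    return total / k

purchases = {"u1": {"milk", "tea"}, "u2": {"milk"}, "u3": {"milk"}, "u4": {"milk"}}
recs = {"u2": ["milk", "tea"]}
# milk: popularity 1.0 -> 0 bits; tea: popularity 0.25 -> 2 bits; mean is 1 bit.
print(self_information_novelty(recs, purchases))  # 1.0
```

In practice a smoothing term for never-purchased items might be preferable to raising an error; that choice is left open here.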

Precision at N
The average portion of items from the recommendation lists purchased by users. N denotes the length of the recommendation list.

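\[ \mathrm{Precision@}N = \frac{1}{L} \sum_{i=1}^{L} \frac{\mathrm{Purchased}_i}{N} \]

where L is the number of users, Purchased_i is the number of items from the recommendation list purchased by user i, and N is the size of the recommendation list.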
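Precision at N can be sketched in Python in a few lines. The function name and the two dictionaries are hypothetical; `purchases` is assumed to hold what each user bought after receiving the recommendations, and each list is assumed to contain at least N items.

```python
def precision_at_n(recommendations, purchases, n):
    """Average share of the top-n recommended items that each user purchased.

    recommendations: hypothetical dict user id -> ranked list of item ids.
    purchases: hypothetical dict user id -> set of items bought afterwards.
    """
    if not recommendations:
        raise ValueError("need at least one user")
    total = 0.0
    for user, items in recommendations.items():
        bought = purchases.get(user, set())
        hits = sum(1 for item in items[:n] if item in bought)
        total += hits / n
    return total / len(recommendations)

recs = {"u1": ["a", "b"], "u2": ["c", "d"]}
bought = {"u1": {"a"}, "u2": {"c", "d"}}
# u1 bought 1 of 2 recommended items, u2 bought 2 of 2.
print(precision_at_n(recs, bought, n=2))  # 0.75
```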

It’s more complicated to use these additional metrics, since there is an inherent tradeoff between them. For example, if we try to maximize “Precision at N” alone, we may end up recommending the most popular items. This is a textbook example of recommending milk and bananas only because 99% of people buy them. “Diversity” and “Novelty” metrics will help to avoid these types of situations. In fact, part of the product roadmap for our LifeCycle Manager product aims to expose some level of control across these measures back to business users – as they are the ones most equipped to tailor Diversity to Novelty ratios. Moreover, as a merchant looking to cross-sell customers into new categories, or up-sell them into premium brands, creative applications of these ratios can go a long way to accomplishing that.

The following graph demonstrates how we use Diversity and Novelty when we optimize model parameters. Two clear extrema in Novelty and Diversity give us an idea of the optimum number of features in our model:

Let’s say we are confident enough and ready to deploy our model. Now we are able to measure the real performance. The simple way is to measure the increase in sales for different models. To a retailer, sales (or other related business metrics) are the most important thing. But sales don’t capture everything. As we already showed, an increase in sales is just a side effect of a good recommender. A better way to evaluate a recommender would be to solicit feedback from customers directly, and perhaps incentivize them for that feedback (in the form of discounts or gifts). Another option is to present two recommendation lists and ask customers to rate them against each other in exchange for a reward, since a relative difference is often a better indicator of value than absolute ratings.

In any case, whatever you do to measure the performance of a recommender system, you should always keep in mind that the data you need is not in your dataset. Strictly speaking, even if you are lucky and you got the feedback from the user, even this is not perfect. Users themselves might not know what product they would enjoy.

In other words, the fact that you did not immigrate to Australia does not necessarily mean that the advice was not good.

[1] Amin Javari, Mahdi Jalili. “A probabilistic model to resolve diversity-accuracy challenge of recommendation systems” in Knowledge and Information Systems, 2015, Volume 44, Issue 3, pp. 609–627.

[2] Saúl Vargas, Pablo Castells. “Rank and Relevance in Novelty and Diversity Metrics for Recommender Systems” in Proceedings of the Fifth ACM Conference on Recommender Systems (RecSys ’11), Chicago, Illinois, USA, 2011, pp. 109–116.