TechnoCalifornia, by Xavier Amatriain<br /> <br /> <h2> Where am I? (2016-02-02)</h2> There have recently been some articles (e.g. <a href="https://www.poweradmin.com/blog/51-devops-influencers-to-start-following-today/">this</a> list of influencers) that have pointed to this blog and lamented that I no longer update it regularly. It is true. I now realize I should have at least posted something here to direct readers to the places where I do keep posting, in case they think I might have something interesting to say.<br /> <br /> First and foremost, given that I joined <a href="http://www.quora.com/">Quora</a> about a year ago, I have been using the Quora product itself to post most of my writing. You can find my profile <a href="https://www.quora.com/profile/Xavier-Amatriain">here</a>. I have found that I can reformulate almost anything I want to say in the form of an answer to a Quora question. Besides, my posts there get a ton of views (I am about to reach 2 million views in about a year) and good interactions. I have also written some posts in the <a href="https://engineering.quora.com/">Quora Engineering Blog</a> describing some of our work.<br /> <br /> I also keep very active on <a href="https://twitter.com/xamat">Twitter</a>, and every now and then I update my <a href="https://www.linkedin.com/in/xamatriain">LinkedIn</a> with professional posts.<br /> <br /> Recently, I gave <a href="https://medium.com/@xamat">Medium</a> a try. I am not really sure how often I will update my blog there, but I am almost certain that my Medium blog will take precedence over this one.
Medium is indeed a much better blogging platform than Blogger.<br /> <br /> So, yes, I guess this is a farewell to Technocalifornia, unless every now and then I decide to simply post a collection of posts elsewhere just to make sure that people visiting this blog don't get the sense that I am no longer active. Let me know if you feel that would be interesting for you.<br /> <br /> <h2> Ten Lessons Learned from Building (real-life impactful) Machine Learning Systems (2014-12-18)</h2> (This is a blogpost version of a talk I gave at <a href="http://mlconf.com/mlconf-sf/">MLConf SF</a> 11/14/2014. See below for the original video and slides)<br /> <br /> There are many good textbooks and courses where you can be introduced to machine learning and maybe even learn some of the most intricate details about a particular approach or algorithm (see <a href="http://www.quora.com/What-are-the-best-talks-lectures-related-to-big-data-algorithms-machine-learning/answer/Xavier-Amatriain">my answer on Quora</a> on good resources for this). While understanding that theory is a very important base and starting point, there are many other practical issues related to building real-life ML systems that you don't usually hear about. In this post I will share some of the most important lessons learned over years of building large-scale ML solutions that power products such as Netflix and scale to millions of users across many countries.<br /> <br /> And just in case it doesn't come across clearly enough, let me insist on this once again: it does pay off to be knowledgeable and have a deep understanding of the techniques and theory behind classic and modern machine learning approaches.
Understanding how Logistic Regression works or the difference between Factorization Machines and Tensor Factorization, for example, is a necessary starting point. However, this in itself might not be enough unless you couple it with the real-life experience of how these models interact with systems, data, and users in order to obtain a really valuable impact. The next ten lessons are my attempt to capture some of that practical knowledge.<br /> <br /> <h2> 1. More Data <strike>vs.</strike> and Better Models&nbsp;</h2> A lot has been written about whether the key to better results lies in improving your algorithms or simply in throwing more data at your problem (see <a href="http://technocalifornia.blogspot.com/2012/07/more-data-or-better-models.html">my post</a> from 2012 discussing this same topic, for example).<br /> <br /> In the context of the <a href="http://www.netflixprize.com/">Netflix Prize</a>, <a href="http://anand.typepad.com/datawocky/anand-rajaraman.html">Anand Rajaraman</a> took an early stand on the issue by claiming that "more data usually beats better algorithms".
In <a href="http://anand.typepad.com/datawocky/2008/03/more-data-usual.html">his post</a> he explained how some of his students had improved some of the existing results on the Netflix ratings dataset by adding metadata from <a href="http://www.imdb.com/">IMDB</a>.<br /> <br /> <br /> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody> <tr><td style="text-align: center;"><img height="260" id="docs-internal-guid-a151bd0c-0d4f-1450-e347-fce69844d204" src="https://lh5.googleusercontent.com/cLDPJ8z2wGz_cvgtbJBACyNh4T-f77_yh_Y82RWHdvQJqNKiSKzx3qxgkBIFsPYGtH7TA5pqdq8KKkY8qU36TvkzdUWZNTWHWKs70g98hrj_dhwqlsrwcxIfzTEX8iz1A34E" style="margin-left: auto; margin-right: auto;" width="400" /></td></tr> <tr><td class="tr-caption" style="text-align: center;"><b>Fig 1.</b> More data usually beats better algorithms</td></tr> </tbody></table> <br /> Although many teams in the competition tried to follow that lead and add extra features to improve results, there was little progress in that direction. As a matter of fact, just a year later some of the leaders of what would become the runner-up team published <a href="http://dl.acm.org/citation.cfm?id=1639731">a paper</a> in which they showed that adding metadata had very little impact on improving the prediction accuracy of a well-tuned algorithm.
Take this as a first example of why adding more data is not always the solution.<br /> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody> <tr><td style="text-align: center;"><img height="258" id="docs-internal-guid-a151bd2c-0d4f-7605-22fb-f3c291a9cf2c" src="https://lh6.googleusercontent.com/UdGUvRU2Hmux2zAiOzkmI9h4ZyXz4r5XW0A4_uthtk6XgcNpXttV6XVZGzcIqnXQqUIytY9BraFQqhL-px8aVpjT6HlihJVxaI2RWcJHoUhUSd7EBWw_ZE_ueCn6ojdKPDkb" style="margin-left: auto; margin-right: auto;" width="400" /></td></tr> <tr><td class="tr-caption" style="text-align: center;"><b>Fig 2.</b> Even a Few Ratings Are More Valuable than Metadata</td></tr> </tbody></table> <br /> Of course, there are different ways to "add more data". In the example above we were adding data by increasing the number and types of features, therefore increasing the dimensionality of our problem space. We can think about adding data in a completely different way by fixing the space dimensionality and simply throwing more training examples at it. 
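Whether extra training examples will still buy you anything is something you can check empirically rather than guess: train on growing subsets of your data and watch the held-out metric. Here is a minimal sketch of that learning-curve check; the 1-D synthetic task, the trivial threshold "model", and the subset sizes are all made up for illustration:

```python
import random

random.seed(42)

def make_data(n):
    # Synthetic 1-D binary task: label is 1 when x > 0, with 10% label noise
    # (the noise caps the accuracy any model can reach).
    data = []
    for _ in range(n):
        x = random.gauss(0.0, 1.0)
        y = int(x > 0.0)
        if random.random() < 0.1:
            y = 1 - y
        data.append((x, y))
    return data

def fit_threshold(train):
    # Trivial "model": a threshold halfway between the two class means.
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0

def accuracy(threshold, data):
    return sum(1 for x, y in data if int(x > threshold) == y) / len(data)

# Learning curve: held-out accuracy as a function of training set size.
heldout = make_data(5000)
for n in (50, 500, 5000, 50000):
    model = fit_threshold(make_data(n))
    print(n, round(accuracy(model, heldout), 3))
```

If the curve has already flattened by the time you use all your data, more examples will not buy you much; if it is still climbing, they might.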
Banko and Brill <a href="http://dl.acm.org/citation.cfm?id=1073017">showed</a> in 2001 that, in some cases, very different algorithms improved in much the same way as they were given more training data (see figure below).<br /> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody> <tr><td style="text-align: center;"><img height="315" id="docs-internal-guid-a151bd33-0d4f-d7b1-f28a-c49ef172952d" src="https://lh3.googleusercontent.com/OfMe_wqQCONpoNbBrOzJNARrhj5TDF5cqWNMy5w65O6Nb5SKodAcP3DdIGtAH_CqoBnTn023laHgQ6VZmWpak5bPvmzczef06vJ4HWPBMWZ1eAw-APLrJFqLP-2YK3D5DEjN" style="margin-left: auto; margin-right: auto;" width="289" /></td></tr> <tr><td class="tr-caption" style="text-align: center;"><b>Fig 3.</b> Banko and Brill's "famous" model performance curves&nbsp;</td></tr> </tbody></table> <div> Google's Research Director and renowned AI figure Peter Norvig is quoted as saying that "Google does not have better algorithms, just more data".
In fact, Norvig is one of the co-authors of "<a href="http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/35179.pdf">The Unreasonable Effectiveness of Data</a>", where, for a problem similar to Banko and Brill's (language understanding), they also show how important it is to have "more data".</div> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody> <tr><td style="text-align: center;"><img height="148" id="docs-internal-guid-a151bd33-0d4f-ff37-acb4-193a48b711f1" src="https://lh4.googleusercontent.com/3K9GF2qX97K0hAeu61Es0O3j6t-oTZgvfeMEQUu94g3UQUTQXH3Ed_SS_oqMHTkGH-Z0vLW8RfFboB9gYDGEm2ghjecI4qm_r-pXLVfOw7-pNP8s8K-gP84Jc6bwfWWZgD_l" style="margin-left: auto; margin-right: auto;" width="371" /></td></tr> <tr><td class="tr-caption" style="text-align: center;"><b>Fig 4.</b> The Unreasonable Effectiveness of Data&nbsp;</td></tr> </tbody></table> <div> So, is it true that more data in the form of more training examples will always help? Well, not really. The problems above involve complex models with a huge number of features, which leads to situations of "<a href="http://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff">high variance</a>". But in many other cases this might not be true. See below, for example, a real-life scenario of an algorithm in production at Netflix.
In this case, adding more than 2 million training examples has very little to no effect.</div> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody> <tr><td style="text-align: center;"><img height="202" id="docs-internal-guid-a151bd6a-0d50-23e0-fede-27877995569a" src="https://lh6.googleusercontent.com/iIRqH3yADGbKXS94ZIoy7P_ArXjc6BldQb3uFsfBGPxx84O4t9KYTl_6QGlfwJnyurSbCEn_E2J1ozDmJfggDOfmr-CroI_i_xF9TnSS1AZVFT1Nw74tt6n8ImCvBdYpQIKh" style="margin-left: auto; margin-right: auto;" width="400" /></td></tr> <tr><td class="tr-caption" style="text-align: center;"><b>Fig 5.</b> Testing Accuracy of a real-life production model</td></tr> </tbody></table> <div> So, this leads to our first lesson learned, which will in fact expand over several of the following ones: it is not about more data versus better algorithms. That is a false dichotomy. Sometimes you need more data, and sometimes you don't. Sometimes you might need to improve your algorithm, and in other cases it will make no difference. Focusing exclusively on one or the other will lead to far-from-optimal results.</div> <h2> 2. You might not need all your "Big Data"</h2> <span style="font-family: inherit;"><span style="line-height: 1; white-space: pre-wrap;">This second lesson is in fact a corollary of the previous one, but I feel it is worth mentioning explicitly on its own. It seems like nowadays everyone needs to make use of all their "Big Data". Big Data is so hyped that it seems like if you are not using huge quantities of data you must be doing something wrong. 
The truth, though, </span></span><span style="font-family: inherit;"><span style="line-height: 1; white-space: pre-wrap;">as discussed in lesson 1, is that for many problems you might be able to get similar results using much less data than what you have available.</span></span><br /> <span style="font-family: inherit;"><span style="line-height: 1; white-space: pre-wrap;"><br /></span></span> <span style="font-family: inherit;"><span style="line-height: 1; white-space: pre-wrap;">Think, for example, of the Netflix Prize, where you had 0.5 million users in the dataset. In the most favored approach, the data was used to compute a matrix of 50 factors. Would the result change much if instead of the 0.5 million users you used, say, 50 million? Probably not.</span></span><br /> <span style="font-family: inherit;"><span style="line-height: 1; white-space: pre-wrap;"><br /></span></span> <span style="font-family: inherit;"><span style="line-height: 1; white-space: pre-wrap;">A related, and important, question is how you determine what subset of your data to use. A good initial approach would be to randomly sample your original data to obtain as many samples as you need for your model training. That might not be good enough, though. Staying with the Netflix Prize example, users might be very different and not homogeneously distributed in our original population. New users, for example, will have many fewer ratings and increase sparsity in the dataset. On the other hand, they might behave differently from more tenured users, and we might want our model to capture that. The solution is to use some form of <a href="http://en.wikipedia.org/wiki/Stratified_sampling">stratified sampling</a>. Setting up a good stratified sampling scheme is not easy, since it requires us to define the different strata and decide what is the right combination of samples for the model to learn. 
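As a concrete (toy) illustration, here is a minimal stratified-sampling sketch; the user log, the tenure thresholds, and the three strata below are all made up, and in practice you would define strata from your own population:

```python
import random

random.seed(7)

# Hypothetical user log: (user_id, num_ratings) pairs.
users = [("u%d" % i, random.randint(1, 500)) for i in range(10000)]

def stratified_sample(users, fraction):
    # Sample the same fraction from each stratum so that rare strata
    # (e.g. new users with few ratings) are not drowned out by tenured ones.
    strata = {"new": [], "mid": [], "tenured": []}
    for user in users:
        n = user[1]
        key = "new" if n < 20 else ("mid" if n < 100 else "tenured")
        strata[key].append(user)
    sample = []
    for members in strata.values():
        k = max(1, int(len(members) * fraction))
        sample.extend(random.sample(members, k))  # without replacement
    return sample

sample = stratified_sample(users, 0.1)
print(len(sample))  # roughly 10% of users, with every stratum represented
```

A plain random 10% sample would under-represent the small "new" stratum; sampling per stratum guarantees the model sees examples of every user type.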
However, as surprising as it might sound, a well-defined stratified sample might accomplish even better results than the original complete dataset.</span></span><br /> <span style="font-family: inherit;"><span style="line-height: 1; white-space: pre-wrap;"><br /></span></span> <span style="line-height: 16px; white-space: pre-wrap;">Just to be clear, I am not saying that having lots of data is a bad thing; of course it is not. The more data you have, the more choices you will be able to make on how to use it. All I am saying is that focusing on the "size" of your data versus the quality of the information in the data is a mistake. Garner the ability to use as much data as you can in your systems, and then use only as much as you need to solve your problems.</span><br /> <span style="font-family: inherit;"><span style="line-height: 1; white-space: pre-wrap;"><br /></span></span> <br /> <h2> 3. The fact that a more complex Model does not improve things does not mean you don't need one</h2> <div> <span style="font-family: inherit;"><span style="line-height: 1; white-space: pre-wrap;">Imagine the following scenario: </span><span style="line-height: 1; white-space: pre-wrap;">You have a linear model and for some time you have been selecting and optimizing features for that model. One day you decide to </span><span style="vertical-align: baseline; white-space: pre-wrap;">try a more complex (e.g. non-linear) model with the same features you have been engineering. 
Most likely, you will </span><span style="vertical-align: baseline; white-space: pre-wrap;">not</span><span style="vertical-align: baseline; white-space: pre-wrap;"> see any </span><span style="vertical-align: baseline; white-space: pre-wrap;">improvement.</span><br /><span style="vertical-align: baseline; white-space: pre-wrap;"><br /></span></span><br /> <span style="font-family: inherit;"><span style="vertical-align: baseline; white-space: pre-wrap;">After that failure, you change your strategy and try to do the opposite: you keep the old model, but </span></span><span style="font-family: inherit; vertical-align: baseline; white-space: pre-wrap;">add more expressive features that try to capture more complex interactions. Most likely the result will be the same and you will again see little to no improvement.</span><br /> <span style="font-family: inherit; vertical-align: baseline; white-space: pre-wrap;"><br /></span> <span style="font-family: inherit; vertical-align: baseline; white-space: pre-wrap;">So, what is going on? The issue here is that, simply put, more complex features require a more complex model, and, vice versa, a more complex model may require more complex features before showing any significant improvement.</span><br /> <span style="font-family: inherit; vertical-align: baseline; white-space: pre-wrap;"><br /></span> So, the lesson learned is that you must improve both your model and your feature set in parallel. Doing only one of them at a time might lead to wrong conclusions.<br /> <br /></div> <h2> 4. 
Be thoughtful about how you define your training/testing data sets</h2> <div> <span style="line-height: 16px; white-space: pre-wrap;">If you are training a simple binary classifier, one of the first tasks is to define your positive and negative examples. Defining positive and negative labels for samples, though, may not be such a trivial task. Think about a use case where you need to define a classifier to distinguish between shows that users watch (positives) and do not watch (negatives). In that context, would the following be positives or negatives?</span><br /> <ul> <li><span style="font-family: inherit; line-height: 1; white-space: pre-wrap;">User watches a movie to completion and rates it 1 star</span></li> <li><span style="font-family: inherit; line-height: 1; white-space: pre-wrap;">User watches the same movie again (maybe because she can’t find anything else)</span></li> <li><span style="font-family: inherit; line-height: 1; white-space: pre-wrap;">User abandons movie after 5 minutes, or 15 minutes… or 1 hour</span></li> <li><span style="font-family: inherit; line-height: 1; white-space: pre-wrap;">User abandons TV show after 2 episodes, or 10 episodes… or 1 season</span></li> <li><span style="font-family: inherit; white-space: pre-wrap;">User adds something to her list but never watches it</span></li> </ul> <span style="font-family: inherit;"><span style="line-height: 1; white-space: pre-wrap;">As you can see, determining whether a given example is a positive or a negative is not so easy.</span></span><br /> <span style="font-family: inherit;"><span style="line-height: 1; white-space: pre-wrap;"> <span style="font-family: inherit;"><span style="line-height: 1; white-space: pre-wrap;">Besides paying attention to your positive and negative definition, there are many other things you need to make sure to get right when defining your training and testing datasets. One such issue is what we call <i>Time Travelling</i>. 
</span></span>Time traveling is defined as the usage of features that originated after the event you are trying to predict. <span style="vertical-align: baseline;">E.g. </span><span style="font-style: italic; vertical-align: baseline;">Your rating a movie is a pretty good predictor of you watching that movie, especially because most ratings happen AFTER you watch the movie.</span><span style="font-style: italic; vertical-align: baseline;"><br /></span> In simple cases such as the example above this effect might seem obvious. However, things can get very tricky when you have many features that come from different sources and pipelines and relate to each other in non-obvious ways. </span></span><span style="font-family: inherit;"><span style="white-space: pre-wrap;"><i>Time traveling</i> has the effect of increasing model performance beyond what would seem reasonable. That is why, whenever you see an offline experiment with huge wins, the first question you might want to ask yourself is: “Am I time traveling?”.</span></span><br /> <span style="font-family: inherit;"><span style="white-space: pre-wrap;"><br /></span></span> <span style="font-family: inherit;"><span style="white-space: pre-wrap;">And remember, time traveling and positive/negative selection are just two examples of issues you might encounter when defining your training and testing datasets. Just make sure you are thoughtful about how you define all the details of your datasets.</span></span><br /> <span style="font-family: inherit;"><span style="white-space: pre-wrap;"><br /></span></span> <br /> </div> <h2> 5. 
Learn to deal with (the curse of) the Presentation Bias</h2> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody> <tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgL4VJBnC6uEF9vWroMDP7LIvZXfpRxBfShkwMAh7WV8rk4PBc4DNP9ZxagpekYyJ0lc6o1yoPo8KW59oUs1JTyjdA_WK31ox3AHXSNImyT2agYYTit_qQ08BzH8x6K77e8dQx0pA/s1600/Screenshot+from+2014-12-11+22:26:27.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="307" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgL4VJBnC6uEF9vWroMDP7LIvZXfpRxBfShkwMAh7WV8rk4PBc4DNP9ZxagpekYyJ0lc6o1yoPo8KW59oUs1JTyjdA_WK31ox3AHXSNImyT2agYYTit_qQ08BzH8x6K77e8dQx0pA/s1600/Screenshot+from+2014-12-11+22:26:27.png" width="400" /></a></td></tr> <tr><td class="tr-caption" style="text-align: center;"><b>Fig 6.</b> Example of an Attention Model on a page</td></tr> </tbody></table> <div> <br /></div> <br /> <span style="font-family: inherit;"><span style="line-height: 1; white-space: pre-wrap;">Let's face it: users can only click and act on whatever your algorithm (and other parts of your system) has decided to show them. Of course, what your algorithm decided to show is what it predicted was good for the user.&nbsp;Let's suppose that a new user comes in and we decide to show the user only popular items. The fact that a week later the user has only consumed popular items does not mean that's what the user likes. 
That's the *only* thing she had a chance to consume!</span></span><br /> <span style="font-family: inherit;"><span style="line-height: 1; white-space: pre-wrap;"><br /></span></span> <span style="font-family: inherit;"><span style="line-height: 1; white-space: pre-wrap;">As many (<a href="http://technocalifornia.blogspot.com/2011/09/recommender-problem-presentation.html">including myself</a>) have mentioned in the past, it is important to take that into account in your algorithms and try to somehow break this "Curse of the Presentation Bias". Most approaches to addressing this issue are based on the idea that you should "punish" items that were shown to the user but not clicked on.&nbsp;One way to do so is by implementing some presentation discounting mechanism (see <a href="http://www.cs.ubc.ca/~peil/papers/kdd2014.pdf">this KDD 2014 paper</a> by the LinkedIn folks).&nbsp;</span></span><br /> <span style="font-family: inherit;"><span style="line-height: 1; white-space: pre-wrap;"><br /></span></span> <span style="font-family: inherit;"><span style="line-height: 1; white-space: pre-wrap;">Another way to address the issue is to use viewed but not clicked items as negatives in your training process. This, in principle, makes sense: if a user searched for a query and ended up clicking on result number three, it means the first two results were bad and should be treated as negatives... or not? The problem with this is that although the first two items were likely worse than the third one (at least in that particular context), this does not mean they were any worse than the item in position 4, let alone the item in position 5000, which your original model decided was no good at all. 
Yes, you want to remove the presentation bias, but not all of it, since it responds to some hopefully well-informed decisions your model made in the first place.</span></span><br /> <span style="font-family: inherit;"><span style="line-height: 1; white-space: pre-wrap;"><br /></span></span> <span style="font-family: inherit;"><span style="line-height: 1; white-space: pre-wrap;">So, what can we do? The first thing that comes to mind is to introduce some sort of randomization in the original results. This randomization should allow you to collect unbiased user feedback as to whether those items are good or not (see some of the early publications by Thorsten Joachims, such as <a href="http://www.cs.cornell.edu/People/tj/publications/radlinski_joachims_06a.pdf">this one</a>, or take a look at the idea of <a href="http://www.slideshare.net/tdunning/which-algorithms-really-matter">result dithering</a> proposed by Ted Dunning).</span></span><br /> <span style="font-family: inherit;"><span style="line-height: 1; white-space: pre-wrap;"><br /></span></span> <span style="font-family: inherit;"><span style="line-height: 1; white-space: pre-wrap;">A better approach is to develop some sort of "attention model" of the user. In this case both clicked and non-clicked items will be weighted by the probability that the user noticed them in the first place, depending on their location on the page (see <a href="http://scholar.google.com/citations?user=95PaR-QAAAAJ&amp;hl=en">some of the recent work by Dmitry Lagun</a> for interesting ideas in this area).</span></span><br /> <span style="font-family: inherit;"><span style="line-height: 1; white-space: pre-wrap;"><br /></span></span> <span style="font-family: inherit;"><span style="line-height: 1; white-space: pre-wrap;">Finally, yet another well-established way to address presentation bias is to use some sort of explore/exploit approach, in particular multi-armed bandits. 
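To make the bandit idea concrete, here is a minimal Thompson Sampling sketch for a Bernoulli (click/no-click) setting; the three items and their "true" click-through rates are, of course, made up for illustration:

```python
import random

random.seed(1)

true_ctr = [0.04, 0.06, 0.10]  # hidden from the algorithm

# One Beta(wins + 1, losses + 1) posterior per item.
wins = [0, 0, 0]
losses = [0, 0, 0]
shows = [0, 0, 0]

for _ in range(20000):
    # Sample a plausible CTR from each posterior and show the best sample:
    # uncertain items still get explored, known-good items get exploited.
    sampled = [random.betavariate(wins[i] + 1, losses[i] + 1) for i in range(3)]
    item = sampled.index(max(sampled))
    shows[item] += 1
    if random.random() < true_ctr[item]:
        wins[item] += 1
    else:
        losses[item] += 1

print(shows)  # the item with the highest true CTR ends up shown the most
```

Note how the randomization is not uniform: it concentrates on items whose posteriors are still wide, which is exactly the "unsure" inventory you want feedback on.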
By using a method such as <a href="http://en.wikipedia.org/wiki/Thompson_sampling">Thompson Sampling</a>, you can introduce some form of "randomization" on the items that you are still not sure about, while still exploiting as much as you can from what you already know for sure (see Deepak Agarwal's <a href="http://www.ueo-workshop.com/wp-content/uploads/2013/10/UEO-Deepak.pdf">Explore/Exploit approach to recommendations</a> or one of the many publications by <a href="http://www.cs.cornell.edu/People/tj/">Thorsten Joachims</a> for more details on this).</span></span><br /> <br /> <h2> 6. The UI is the only communication channel between the Algorithm and what matters most: the Users</h2> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody> <tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhxt9tlHysiy4h5uLvAnJoDKiRwYl08_l_SNTTE1FXA2wWG2_-5zlr5tbHdZg-nFeXCg-q0-5uknuk2ElOdqAwVT2wKLLaY2wUdsr2X6EJfXNS3Ot4wP08pey1KDcNmt8p3mcsWSw/s1600/Screenshot+from+2014-12-11+22:29:37.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="246" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhxt9tlHysiy4h5uLvAnJoDKiRwYl08_l_SNTTE1FXA2wWG2_-5zlr5tbHdZg-nFeXCg-q0-5uknuk2ElOdqAwVT2wKLLaY2wUdsr2X6EJfXNS3Ot4wP08pey1KDcNmt8p3mcsWSw/s1600/Screenshot+from+2014-12-11+22:29:37.png" width="320" /></a></td></tr> <tr><td class="tr-caption" style="text-align: center;"><b>Fig 7. </b>The UI is the algorithm's connection point with the user</td></tr> </tbody></table> <div> <span style="font-family: inherit;"><span style="line-height: 1; white-space: pre-wrap;">From the discussion in the previous lesson it should be clear by now how important it is to think about the presentation layer and the user interface in our machine learning algorithmic design. 
On the one hand, the UI generates all the user feedback that we will use as input to our algorithms. On the other hand, the UI is the only place where the results of our algorithms will be shown. It doesn't matter how smart our ML algorithm is: if the UI hides its results or does not give the user the ability to provide some form of feedback, all our efforts on the modeling side will have been in vain.</span></span><br /> <span style="font-family: inherit;"><span style="line-height: 1; white-space: pre-wrap;"><br /></span></span> <span style="font-family: inherit;"><span style="line-height: 1; white-space: pre-wrap;">Also, it is important to understand that a change in the user interface might require a change in the algorithms, and vice versa. Just as we learned before that there is an intimate connection between features and models, there is another one to be aware of between the algorithms and the presentation layer.</span></span><br /> <br /></div> <h2> 7. Data and Models are great. You know what is even better? The right evaluation approach.</h2> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody> <tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg3PM90dRhhm5k_k_GcOS-CunRAH380rlGhYQa_uc-SInDmorio2wzTTRV4ROiP3W-qOQ61Q5UJVPWEBBEi471n3D1_vIxCfo1ItE_2HmdO9-nUABXJUDmGR7rNCkZlXSD0Fx-HLA/s1600/Screenshot+from+2014-12-11+22:30:42.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="384" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg3PM90dRhhm5k_k_GcOS-CunRAH380rlGhYQa_uc-SInDmorio2wzTTRV4ROiP3W-qOQ61Q5UJVPWEBBEi471n3D1_vIxCfo1ItE_2HmdO9-nUABXJUDmGR7rNCkZlXSD0Fx-HLA/s1600/Screenshot+from+2014-12-11+22:30:42.png" width="640" /></a></td></tr> <tr><td class="tr-caption" style="text-align: center;"><b>Fig 8. 
</b>Offline/Online Innovation Approach</td></tr> </tbody></table> <div> <br /></div> <div> <span style="font-family: inherit; vertical-align: baseline; white-space: pre-wrap;">This is probably one of the most important lessons in this post. Actually, as I write this, I feel it is a bit unfortunate that it might come across as "just another lesson" hidden in position 7. This is a good place to stress that the lessons in this post are not sorted from most to least important; they are simply grouped by topic or theme.</span><br /> <span style="font-family: inherit; vertical-align: baseline; white-space: pre-wrap;"><br /></span> <span style="font-family: inherit; vertical-align: baseline; white-space: pre-wrap;">So, yes, as important as all the other discussions about data, models, and infrastructure may be, they are all rather useless if you don't have the right evaluation approach in place. If you don't know how to measure an improvement, you might be endlessly spinning your wheels without really getting anywhere. Some of the biggest gains I have seen in practice have indeed come from tuning the metrics to which models were being optimized.</span><br /> <span style="font-family: inherit; vertical-align: baseline; white-space: pre-wrap;"><br /></span> <span style="font-family: inherit; vertical-align: baseline; white-space: pre-wrap;">Ok, then what is the "right evaluation approach"? Figure 8 illustrates an offline/online approach to innovation that should be a good starting point. 
Whatever the final goal of your machine learning algorithm is in your product, you should think of driving your innovation in two distinct ways: offline and online.</span><br /> <span style="font-family: inherit; vertical-align: baseline; white-space: pre-wrap;"><br /></span> <br /> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody> <tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjXCkTSAJldktpw7-fJZaAH4zlSrHmfkCSj-uo2ORplc-RkSFfjDmzIXG_Qvk8vXeAx3wfa7IR7BJsPUYA9TOmE5ci4ldSrcMxqywKRr-_bCf8s8P4LlotXNFUFk50kKjo1qei-vw/s1600/Screenshot+from+2014-12-11+22:32:32.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="308" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjXCkTSAJldktpw7-fJZaAH4zlSrHmfkCSj-uo2ORplc-RkSFfjDmzIXG_Qvk8vXeAx3wfa7IR7BJsPUYA9TOmE5ci4ldSrcMxqywKRr-_bCf8s8P4LlotXNFUFk50kKjo1qei-vw/s1600/Screenshot+from+2014-12-11+22:32:32.png" width="320" /></a></td></tr> <tr><td class="tr-caption" style="text-align: center;"><b>Fig 9.</b> Offline Evaluation</td></tr> </tbody></table> <span style="font-family: inherit; vertical-align: baseline; white-space: pre-wrap;">First, you should generate datasets that allow you to try different models and features in an offline fashion, following a traditional ML experimentation approach (see Figure 9): you train your model on a training set, you probably optimize some (hyper)parameters on a validation set, and finally you measure some evaluation metrics on a test set. The evaluation metrics in our context are likely to be IR metrics such as precision and recall, ROC curves, or ranking metrics such as <a href="http://en.wikipedia.org/wiki/Discounted_cumulative_gain">NDCG</a>, MRR, or FPC (Fraction of <a href="http://en.wikipedia.org/wiki/Concordant_pair">Concordant Pairs</a>). 
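As a reference, here is a minimal sketch of two of these ranking metrics for binary relevance judgments; the example ranked list at the end is made up:

```python
import math

def dcg(relevances):
    # Discounted cumulative gain: the log discount concentrates credit
    # at the top of the ranking (rank is 0-based here).
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances):
    # Normalize by the DCG of the ideal (best possible) ordering.
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

def mrr(relevances):
    # Reciprocal rank of the first relevant item.
    for rank, rel in enumerate(relevances):
        if rel > 0:
            return 1.0 / (rank + 1)
    return 0.0

# Binary relevance of a hypothetical ranked list, top position first.
ranking = [0, 1, 0, 1, 1]
print(round(ndcg(ranking), 3), mrr(ranking))
```

Computing both on the same list makes their different emphasis visible: MRR only cares where the first relevant item landed, while NDCG gives partial credit all the way down, discounted by rank.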
Note, though, that the selection of the metric itself has consequences. Take a look at Figure 10 for an example of how the different ranking metrics weight the different ranks being evaluated. In that sense, metrics such as MRR or (especially) NDCG will give much more importance to the head of the ranking, while FPC gives more weight to the middle of the ranks. The key is to choose the metric that is right for your application.</span><br /> <span style="font-family: inherit; vertical-align: baseline; white-space: pre-wrap;"><br /></span> <br /> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody> <tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjtDEktg_UFqibvgXnGn0RYlwOgWfMxHccPNWta8EUukBZxGhxj_tN4bptiAsRZlRGlTvRg8vJP48Ymh0_kWXJrkQ-5r3BXmEBVo7C1PfcPYTq9SjgQ8ogc0OftIvJjaU0bvxfNXg/s1600/Screenshot+from+2014-12-11+22:32:48.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjtDEktg_UFqibvgXnGn0RYlwOgWfMxHccPNWta8EUukBZxGhxj_tN4bptiAsRZlRGlTvRg8vJP48Ymh0_kWXJrkQ-5r3BXmEBVo7C1PfcPYTq9SjgQ8ogc0OftIvJjaU0bvxfNXg/s1600/Screenshot+from+2014-12-11+22:32:48.png" /></a></td></tr> <tr><td class="tr-caption" style="text-align: center;"><b>Fig. 10. </b>Importance given to different ranks by typical ranking metrics</td></tr> </tbody></table> <br /> <span style="font-family: inherit; vertical-align: baseline; white-space: pre-wrap;">Offline experimentation is great because, once you have the right data and the right metric, it is fairly cheap to run many experiments with very few resources. Unfortunately, a successful offline experiment can generally only be used as an indication of a promising approach worth testing online. 
While most companies are investing in finding better correlation between offline and online results, this is still, generally speaking, an unsolved issue that deserves more research (see <a href="http://dl.acm.org/citation.cfm?id=2488215">this KDD 2013 paper</a>, for example). </span><br /> <span style="font-family: inherit; vertical-align: baseline; white-space: pre-wrap;"><br /></span> <span style="white-space: pre-wrap;">In online experimentation the most usual approach is to do A/B testing (other approaches such as <a href="https://support.google.com/analytics/answer/2844870?hl=en">Multiarmed Bandit Testing</a> or <a href="http://www.cs.cornell.edu/people/tj/publications/chapelle_etal_12a.pdf">Interleaved Testing</a> have recently become more popular but are beyond the scope of this post). The goal of an A/B test is to measure the difference in metrics across statistically identical populations that each experience a different algorithm. As with the offline evaluation process, and perhaps even more here, it is very important to choose the appropriate evaluation metric to make sure that most, if not all, product decisions are data-driven.&nbsp;</span><br /> <span style="white-space: pre-wrap;"><br /></span> <span style="white-space: pre-wrap;">Most people will have a number of different metrics they are tracking in any A/B test, but it is important to clearly identify the so-called Overall Evaluation Criteria (OEC). This should be the ultimate metric used for product decisions. In order to avoid noise and make sure the OEC maps well to business success, it is better to use a long-term metric (e.g. customer retention). Of course, the issue with that is that you need time, and therefore resources, to evaluate a long-term metric. 
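As a small illustration of reading such a test, a two-proportion z-test is one common way to check whether a retention-style difference between two populations is larger than noise. This is a generic statistical sketch, not any particular company's methodology, and all counts below are made up:

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """z-statistic for the difference between two conversion/retention rates."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Made-up counts: control retains 10,000 of 50,000 users, treatment 10,400 of 50,000
z = two_proportion_z(10000, 50000, 10400, 50000)
print(z)  # |z| > 1.96 would be significant at the usual 5% level
```

In practice you would also correct for multiple metrics and repeated looks at the data, which is part of why a clearly designated OEC matters.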
That is why it is very useful to have short-term metrics that can be used as early reads on the tests in order to narrow down worthwhile hypotheses that need to wait until the OEC read is complete.</span><br /> <br /> If you want more details on the online experimentation piece there are many good reads, starting with the articles by Bing's Ronny Kohavi (see <a href="http://ai.stanford.edu/~ronnyk/2013%20controlledExperimentsAtScale.pdf">this</a>, for example).</div> <div> <span id="docs-internal-guid-91e5a539-3d32-4035-9513-f42bed9adb25"><span style="font-family: inherit;"><span style="line-height: 16px; white-space: pre-wrap;"><br /></span></span></span></div> <div class="separator" style="clear: both; text-align: center;"> <br /></div> <h2> 8. Distributing algorithms? Yes, but at what level?</h2> <div> <span style="font-family: inherit;"><span style="line-height: 1; white-space: pre-wrap;">There always comes a time in the life of a Machine Learning practitioner when you feel the need to distribute your algorithm. Distributing algorithms that require many resources is a natural thing to do. The issue to consider is at what *level* it makes sense to distribute.</span></span><br /> <span style="font-family: inherit;"><span style="line-height: 1; white-space: pre-wrap;"><br /></span></span> <span style="font-family: inherit;"><span style="line-height: 1; white-space: pre-wrap;">We distinguish three levels of distribution:</span></span><br /> <ul> <li><span style="font-family: inherit; line-height: 1; white-space: pre-wrap;">Level 1. For each independent subset of the overall data </span></li> <li><span style="font-family: inherit; line-height: 1; white-space: pre-wrap;">Level 2. For every combination of the hyperparameters</span></li> <li><span style="font-family: inherit; line-height: 1; white-space: pre-wrap;">Level 3. 
For all partitions in each training dataset</span></li> </ul> <br /> In the first level we may have subsets of the overall data for which we need to (or simply can) train an independently optimized model. A typical example of this situation is when we opt for training completely independent ML models for different regions in the world, different kinds of users, or different languages. In this case, all we need to do is define completely independent training datasets. Training can then be fully distributed, requiring no coordination or data communication.<br /> <br /> In the second level, we address the issue of how to train several models with different hyperparameter values in order to find the optimal model. Although there are smarter ways to do it, let's for now think of the worst-case grid search scenario. We can definitely train models with different values of the hyperparameters in a completely distributed fashion, but the process does require coordination. Some central location needs to gather results and decide on the next "step" to take. Level 2 requires data distribution, but not sharing, since each node will use a complete replica of the original dataset and the communication will happen at the level of the parameters.<br /> <br /> Finally, in level 3 we address the issue of how to distribute or parallelize model training for a single combination of the hyperparameters. This is a hard problem, but there has been a lot of research put into it. There are different solutions with different pros and cons. You can distribute computation over different machines, splitting examples or parameters using, for example, <a href="http://stanford.edu/~boyd/admm.html">ADMM</a>. Recent solutions such as the <a href="http://parameterserver.org/">Parameter Server</a>&nbsp;promise to offer a generic solution to this problem. 
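As a toy, single-process sketch of the Level 3 idea (synchronous data-parallel training with central aggregation; this is an illustrative scheme, not the implementation of any of the systems linked above): examples are split across workers, each computes a local gradient, and a central step averages them:

```python
import numpy as np

def worker_gradient(w, X_shard, y_shard):
    """Local gradient of squared loss for a linear model on one data shard."""
    pred = X_shard @ w
    return X_shard.T @ (pred - y_shard) / len(y_shard)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_w = np.arange(1.0, 6.0)   # made-up ground-truth parameters
y = X @ true_w
w = np.zeros(5)

# Split examples across 4 "workers"; a central node averages their gradients
shards = list(zip(np.array_split(X, 4), np.array_split(y, 4)))
for step in range(200):
    grads = [worker_gradient(w, Xs, ys) for Xs, ys in shards]  # parallel in a real system
    w -= 0.1 * np.mean(grads, axis=0)   # synchronous central update

print(np.round(w, 2))  # recovers approximately [1. 2. 3. 4. 5.]
```

Asynchronous variants (as in Hogwild or parameter-server designs) relax the synchronous barrier in exchange for some staleness in the updates.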
Another option is to parallelize on a single multicore machine using algorithms such as <a href="http://www.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf">Hogwild</a>. Or, you can use the massive array of cores available in GPU cards.<br /> <br /> As an example of the different approaches you can take to distribute each of the levels, take a look at <a href="http://techblog.netflix.com/2014/02/distributed-neural-networks-with-gpus.html">what we did</a> in our distribution of Artificial Neural Networks over the AWS cloud (see Figure 11 below for an illustration). For Level 1 distribution, we simply used different machine instances over different AWS regions. For Level 2 we used different machines in the same region and a central node for coordination. We used Condor for cluster coordination (although other options such as StarCluster, Mesos, or even Spark are possible). Finally, for Level 3, we used highly optimized CUDA code on GPUs.<br /> <div> <span style="font-family: &quot;arial&quot;;"><span style="white-space: pre-wrap;"><br /></span></span></div> <span id="docs-internal-guid-2e000fb4-3d33-e866-ae0f-47dcdd6ee409"> </span> <br /> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody> <tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjRc3IBt0iXh2bzQYf9nQfiGL8cyaR11KWjBEpvR4RF2iZsXYK9CIFDR7BSDkDNXuRv6ldTKbduG6i13V5pW4D7rb5GA2izhtebfJ_Hcin5rV9oBHM3_Al7PsBr6C3vNaUaCZTCWA/s1600/DistributedANN-Final.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="267" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjRc3IBt0iXh2bzQYf9nQfiGL8cyaR11KWjBEpvR4RF2iZsXYK9CIFDR7BSDkDNXuRv6ldTKbduG6i13V5pW4D7rb5GA2izhtebfJ_Hcin5rV9oBHM3_Al7PsBr6C3vNaUaCZTCWA/s1600/DistributedANN-Final.png" width="320" /></a></td></tr> <tr><td class="tr-caption" style="text-align: 
center;"><b>Fig 11. </b>Distributing ANN over the AWS cloud</td></tr> </tbody></table> <div class="separator" style="clear: both; text-align: center;"> <span id="docs-internal-guid-2e000fb4-3d33-e866-ae0f-47dcdd6ee409"></span></div> <span id="docs-internal-guid-2e000fb4-3d33-e866-ae0f-47dcdd6ee409"> </span> <br /> <div> <span id="docs-internal-guid-2e000fb4-3d33-e866-ae0f-47dcdd6ee409"><span style="color: #666666; font-family: &quot;arial&quot;; font-size: 32px; vertical-align: baseline; white-space: pre-wrap;"><br /></span></span></div> <span id="docs-internal-guid-2e000fb4-3d33-e866-ae0f-47dcdd6ee409"> </span></div> <h2> 9. It pays off to be smart about your Hyperparameters</h2> <div> <span id="docs-internal-guid-b7ec9469-3d35-8cb0-546e-5738b3abc26a"><span style="font-family: inherit;"><span style="line-height: 1; white-space: pre-wrap;">As already mentioned in the previous lesson, one of the important things you have to do when building your ML system is to tune your hyperparameters. Most, if not all, algorithms will have some hyperparameters that need to be tuned: learning rate in matrix factorization, regularization lambda in logistic regression, number of hidden layers in a neural network, shrinkage in gradient boosted decision trees... These are all parameters that need to be tuned on validation data.</span></span></span><br /> <span style="font-family: inherit;"><span style="line-height: 1; white-space: pre-wrap;"><br /></span></span> <span style="line-height: 16px; white-space: pre-wrap;">Many times you will face situations in which models need to be periodically retrained, and therefore hyperparameters need to be at least fine-tuned. This is a clear situation where you need to figure out a way to automatically select the best hyperparameters without requiring a manual check. As a matter of fact, having an automatic hyperparameter selection approach is worthwhile even if all you are doing is the initial experimentation. 
A straightforward approach is to try all possible combinations of hyperparameters and pick the one that maximizes a given accuracy metric on the validation set. While this is, generally speaking, a good idea, it might be problematic if implemented directly. The issue is that blindly taking the point that optimizes whatever metric does not take into account the possible noisiness in the process and the metric. In other words, we can't be sure that if point A has an accuracy that is only 1% better than that of point B, point A is a better operating point than B.&nbsp;</span><br /> <span style="line-height: 16px; white-space: pre-wrap;"><br /></span> <span style="line-height: 16px; white-space: pre-wrap;">Take a look at Figure 12 below, which illustrates this issue by showing (made-up) accuracy results for a model given different values of the regularization parameter. In this particular example the highest accuracy is for no regularization, and there is a relatively flat plateau region for values of lambda between 0.1 and 100. Blindly taking a value of lambda of zero is generally a bad idea since it points to overfitting (yes, this could be checked by using the test dataset). But, beyond that, once in the "flat region", is it better to stick with the 0.1 value? By looking at the plot I would be inclined to take 100 as the operating point: this point is (a) non-zero, and (b) different in accuracy from the other non-zero values only at the noise level. 
So, one possible rule of thumb is to keep the highest non-zero value whose optimizing metric differs from that of the optimal point only at the noise level.</span><br /> <br /></div> <div> <span style="font-family: inherit;"><span style="line-height: 1; white-space: pre-wrap;"><br /></span></span></div> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody> <tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg5wRmdlkdCEPxtUCprXt4gP0FcE3mozArZ-J5EtKoRpTr9TWmhty-7cl1EVncyzc_5Jdx5bhhFtzlXvV1Jb1j4cacK9S0WP8m9nywrBIos6pK0zFzY9Wqi7UmHam4LH_0LTJZw6A/s1600/Screenshot+from+2014-12-11+22:36:34.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="401" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg5wRmdlkdCEPxtUCprXt4gP0FcE3mozArZ-J5EtKoRpTr9TWmhty-7cl1EVncyzc_5Jdx5bhhFtzlXvV1Jb1j4cacK9S0WP8m9nywrBIos6pK0zFzY9Wqi7UmHam4LH_0LTJZw6A/s1600/Screenshot+from+2014-12-11+22:36:34.png" width="640" /></a></td></tr> <tr><td class="tr-caption" style="text-align: center;"><b>Fig 12. </b>Example of model accuracy vs. regularization lambda</td></tr> </tbody></table> <div> <span style="font-family: inherit;"><span style="line-height: 1; white-space: pre-wrap;"><br /></span></span> <span style="font-family: inherit;"><span style="line-height: 1; white-space: pre-wrap;">I should also add that even though in this lesson I have talked about using a brute-force grid search approach to hyperparameter optimization, there are much better approaches, which are again beyond the scope of this post. 
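That rule of thumb can be sketched in a few lines (the accuracies and the noise level below are made up for illustration):

```python
# Made-up validation accuracies for different regularization strengths
results = {0.0: 0.812, 0.01: 0.808, 0.1: 0.806, 1.0: 0.805,
           10.0: 0.806, 100.0: 0.804, 1000.0: 0.765}
NOISE = 0.005  # estimated noise level of the metric (an assumption)

# Best accuracy among non-zero regularization values
best_acc = max(acc for lam, acc in results.items() if lam > 0)

# Highest non-zero lambda whose accuracy differs from the best only at noise level
chosen = max(lam for lam, acc in results.items()
             if lam > 0 and best_acc - acc <= NOISE)
print(chosen)  # 100.0
```

Estimating the noise level itself (e.g. from repeated runs or cross-validation folds) is of course part of the work.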
If you are not familiar with Bayesian Optimization, start with <a href="http://papers.nips.cc/paper/4522-practical-bayesian-optimization-of-machine-learning-algorithms.pdf">this paper</a> or take a look at <a href="https://github.com/JasperSnoek/spearmint">Spearmint</a> or <a href="http://engineeringblog.yelp.com/2014/07/introducing-moe-metric-optimization-engine-a-new-open-source-machine-learning-service-for-optimal-ex.html">MOE</a>.</span></span><br /> <span style="font-family: inherit;"><span style="line-height: 1; white-space: pre-wrap;"><br /></span></span></div> <h2> 10. There are things you can do Offline and there are things you can't... and there is Nearline for everything in between</h2> <div> <span style="line-height: 16px; white-space: pre-wrap;">In the lessons so far we have talked about the importance of data, models, UI, metrics... In this last lesson I thought it was worth focusing on systems and architecture. When the final goal of your ML model is to have impact on a product, you are necessarily going to have to think about the right system architecture. </span><br /> <span style="line-height: 16px; white-space: pre-wrap;"><br /></span> <span style="line-height: 16px; white-space: pre-wrap;">Figure 13 depicts a three-level architecture that can be used as a </span><span style="line-height: 16px; white-space: pre-wrap;">blueprint for any machine learning system that is designed to have a customer impact. The basic idea is that it is important to have different layers in which to trade off latency vs. complexity. Some computations need to be as real-time as possible to quickly respond to user feedback and context. Those are better off in an online setting. On the other extreme, complex ML models that require large amounts of data and lengthy computations are better done in an offline fashion. 
Finally, there is a Nearline world where operations are not guaranteed to happen in real-time, but a best effort is made to do them "as soon as possible".</span><br /> <br /></div> <div> <span style="font-family: inherit;"><span style="color: #666666; line-height: 1; white-space: pre-wrap;"><br /></span></span></div> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody> <tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhWw7TGRBEay1SAcm5-7vii4YgETQoo3htdvKyjaly2ZDGg1fcVIxS0NywB2HcfR3FruKctPY4js0wqZuCoc8kKoIh7fE3csAPHU26yE_12CtITXHLOUWHxnopUnCEOCpTHbUoSYw/s1600/MachineLearningArchitecture-v3.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhWw7TGRBEay1SAcm5-7vii4YgETQoo3htdvKyjaly2ZDGg1fcVIxS0NywB2HcfR3FruKctPY4js0wqZuCoc8kKoIh7fE3csAPHU26yE_12CtITXHLOUWHxnopUnCEOCpTHbUoSYw/s1600/MachineLearningArchitecture-v3.jpg" width="579" /></a></td></tr> <tr><td class="tr-caption" style="text-align: center;"><b>Fig 13.</b> This three-level architecture can be used as a blueprint for machine learning systems that drive customer impact.</td></tr> </tbody></table> <div class="separator" style="clear: both; text-align: center;"> <br /></div> <div class="separator" style="clear: both; text-align: left;"> Interestingly, thinking about these three "<a href="https://gigaom.com/2013/03/28/3-shades-of-latency-how-netflix-built-a-data-architecture-around-timeliness/">shades of latency</a>" also helps break down traditional machine learning algorithms into different components that can be executed in different layers. Take matrix factorization as an example. As illustrated in Figure 14, you can decide to do the more time-consuming item factor computation in an offline fashion. 
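As a sketch of the online half of this matrix factorization split (all dimensions, item ids, and the regularization value are made up for illustration): with item factors precomputed offline, a single user's factors reduce to a small ridge least squares solve that is cheap enough to run at request time:

```python
import numpy as np

rng = np.random.default_rng(1)
k = 10                            # latent dimension (made up)
V = rng.normal(size=(5000, k))    # item factors, precomputed offline

def user_factors_online(item_ids, ratings, V, lam=0.1):
    """Closed-form ridge solve for one user's factors, cheap enough to run online."""
    V_r = V[item_ids]                                 # factors of items the user rated
    A = V_r.T @ V_r + lam * np.eye(V.shape[1])        # k x k system
    b = V_r.T @ np.asarray(ratings, dtype=float)
    return np.linalg.solve(A, b)

# Toy request: a user who rated three items
u = user_factors_online([3, 42, 977], [5.0, 3.0, 4.0], V)
scores = V @ u    # rank all items for this user in milliseconds
```

The expensive part (learning V over all users and items) stays offline; only the small k x k solve happens per request.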
Once those item factors are computed, you can compute user factors online (e.g. by solving a closed-form least squares formulation) in a matter of milliseconds.</div> <div class="separator" style="clear: both; text-align: left;"> <br /></div> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody> <tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjRs4NdOTWOFg8EvF81KRzyahjzcPDtSGuF325IaiCCEMu_ZCsINMzRz1_-iPplr8Ggzl7YSs34lXSnqmVGHYIVyMh10Uuxu_hy9qc7TlOnkCppcb0t665HteuMqcjjZ5ZARXjX5Q/s1600/Screenshot+from+2014-12-11+22:39:21.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="287" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjRs4NdOTWOFg8EvF81KRzyahjzcPDtSGuF325IaiCCEMu_ZCsINMzRz1_-iPplr8Ggzl7YSs34lXSnqmVGHYIVyMh10Uuxu_hy9qc7TlOnkCppcb0t665HteuMqcjjZ5ZARXjX5Q/s1600/Screenshot+from+2014-12-11+22:39:21.png" width="400" /></a></td></tr> <tr><td class="tr-caption" style="text-align: center;"><b>Fig 14. </b>Decomposing matrix factorization into offline and online computation</td></tr> </tbody></table> <div class="separator" style="clear: both; text-align: center;"> <br /></div> <div> <span style="font-family: inherit;"><span style="line-height: 1; white-space: pre-wrap;">If you are interested in this topic, take a look at our original <a href="http://techblog.netflix.com/2013/03/system-architectures-for.html">blog post </a>in the Netflix tech blog.</span></span></div> <h2> Conclusions</h2> <div> The ten lessons in this post illustrate knowledge gathered from building impactful machine learning and general algorithmic solutions. 
If I had to summarize them in four short takeaway messages, those would probably be:</div> <div> <br /></div> <ol id="docs-internal-guid-4adf011f-0d4b-3cf3-4e30-1bc4c93ad704" style="margin-bottom: 0pt; margin-top: 0pt;"> <li dir="ltr" style="background-color: transparent; font-family: &quot;Gloria Hallelujah&quot;; font-style: normal; font-variant: normal; font-weight: normal; list-style-type: decimal; text-decoration: none; vertical-align: baseline;"><div dir="ltr" style="line-height: 1; margin-bottom: 0pt; margin-top: 0pt;"> <span style="font-size: small;"><span style="background-color: transparent; font-family: &quot;gloria hallelujah&quot;; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline;">Be thoughtful about your data</span></span></div> </li> <li dir="ltr" style="background-color: transparent; font-family: &quot;Gloria Hallelujah&quot;; font-style: normal; font-variant: normal; font-weight: normal; list-style-type: decimal; text-decoration: none; vertical-align: baseline;"><div dir="ltr" style="line-height: 1; margin-bottom: 0pt; margin-top: 0pt;"> <span style="font-size: small;"><span style="background-color: transparent; font-family: &quot;gloria hallelujah&quot;; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline;">Understand dependencies between data and models</span></span></div> </li> <li dir="ltr" style="background-color: transparent; font-family: &quot;Gloria Hallelujah&quot;; font-style: normal; font-variant: normal; font-weight: normal; list-style-type: decimal; text-decoration: none; vertical-align: baseline;"><div dir="ltr" style="line-height: 1; margin-bottom: 0pt; margin-top: 0pt;"> <span style="font-size: small;"><span style="background-color: transparent; font-family: &quot;gloria hallelujah&quot;; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline;">Choose the right metric</span></span></div> </li> <li dir="ltr" 
style="background-color: transparent; font-family: &quot;Gloria Hallelujah&quot;; font-style: normal; font-variant: normal; font-weight: normal; list-style-type: decimal; text-decoration: none; vertical-align: baseline;"><div dir="ltr" style="line-height: 1; margin-bottom: 0pt; margin-top: 0pt;"> <span style="font-size: small;"><span style="background-color: transparent; font-family: &quot;gloria hallelujah&quot;; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline;">Optimize only what matters</span></span></div> </li> </ol> <span style="font-size: small;"><br />I hope they are useful to other researchers and practitioners. And I would love to hear about similar or different experiences in building real-life machine learning solutions in the comments. Looking forward to the feedback.</span><br /> <h2> Acknowledgments</h2> Most of the above lessons have been learned in close collaboration with my former Algorithms Engineering team at Netflix. 
In particular I would like to thank <a href="https://twitter.com/JustinBasilico">Justin Basilico</a> for many fruitful conversations, feedback on the original drafts of the slides, and for providing some of the figures in this post.<br /> <h2> Original video and slides</h2> <br /> <br /> <div class="separator" style="clear: both; text-align: center;"> <iframe allowfullscreen='allowfullscreen' webkitallowfullscreen='webkitallowfullscreen' mozallowfullscreen='mozallowfullscreen' width='320' height='266' src='https://www.youtube.com/embed/WdzWPuazLA8?feature=player_embedded' frameborder='0'></iframe>&nbsp;</div> <div class="separator" style="clear: both; text-align: center;"> <br /></div> <div class="separator" style="clear: both; text-align: center;"> <br /></div> <div style="text-align: center;"> <br /> <iframe allowfullscreen="" frameborder="0" height="355" marginheight="0" marginwidth="0" scrolling="no" src="//www.slideshare.net/slideshow/embed_code/41571741" style="border-width: 1px; border: 1px solid #CCC; margin-bottom: 5px; max-width: 100%;" width="425; text-align: center;"> </iframe> </div> <div style="margin-bottom: 5px; text-align: center;"> <b> <a href="https://www.slideshare.net/xamat/10-lessons-learned-from-building-machine-learning-systems" target="_blank" title="10 Lessons Learned from Building Machine Learning Systems">10 Lessons Learned from Building Machine Learning Systems</a> </b> from <b><a href="https://www.slideshare.net/xamat" target="_blank">Xavier Amatriain</a></b> </div> Xavier Amatriainhttp://www.blogger.com/profile/14166119485952054870noreply@blogger.com0tag:blogger.com,1999:blog-17171206.post-55308691958832201822014-08-04T22:19:00.001-07:002014-08-04T22:23:34.875-07:00Introduction to Recommender Systems: A 4-hour lecture<br /> <div class="separator" style="clear: both; text-align: center;"> <a href="http://mlss2014.com/index.html"><img border="0" src="http://mlss2014.com/images/mlsslogo.png" height="94" width="320" /></a></div> <br /> 
<br /> <div style="text-align: justify;"> A couple of weeks ago, I gave a 4 hour lecture on Recommender Systems at the <a href="http://mlss2014.com/index.html">2014 Machine Learning Summer School at CMU</a>. The school was organized by Alex Smola and Zico Kolter and, judging by the attendance and the quality of the speakers, it was a big success.&nbsp;</div> <div style="text-align: justify;"> <br /></div> <div style="text-align: justify;"> This is the outline of my lecture:</div> <div style="text-align: justify;"> <br /></div> <ol style="margin-bottom: 0pt; margin-top: 0pt;"> <li><span style="font-family: Arial; line-height: 1.04; white-space: pre-wrap;"><i><span style="color: #999999;">Introduction: What is a Recommender System</span></i></span></li> <li><span style="font-family: Arial; line-height: 1.04; white-space: pre-wrap;"><i><span style="color: #999999;">“Traditional” Methods</span></i></span></li> <ol style="margin-bottom: 0pt; margin-top: 0pt;"> <li><span style="font-family: Arial; line-height: 1.04; white-space: pre-wrap;"><i><span style="color: #999999;">Collaborative Filtering</span></i></span></li> <li><span style="font-family: Arial; line-height: 1.04; white-space: pre-wrap;"><i><span style="color: #999999;">Content-based Recommendations</span></i></span></li> </ol> <li><span style="font-family: Arial; line-height: 1.04; white-space: pre-wrap;"><i><span style="color: #999999;">"Novel" Methods</span></i></span></li> <ol style="margin-bottom: 0pt; margin-top: 0pt;"> <li><span style="font-family: Arial; line-height: 1.04; white-space: pre-wrap;"><i><span style="color: #999999;">Learning to Rank</span></i></span></li> <li><span style="font-family: Arial; line-height: 1.04; white-space: pre-wrap;"><i><span style="color: #999999;">Context-aware Recommendations</span></i></span></li> <ol style="margin-bottom: 0pt; margin-top: 0pt;"> <li><span style="font-family: Arial; line-height: 1.04; white-space: pre-wrap;"><i><span style="color: #999999;">Tensor 
Factorization</span></i></span></li> <li><span style="font-family: Arial; line-height: 1.04; white-space: pre-wrap;"><i><span style="color: #999999;">Factorization Machines</span></i></span></li> </ol> <li><span style="font-family: Arial; line-height: 1.04; white-space: pre-wrap;"><i><span style="color: #999999;">Deep Learning</span></i></span></li> <li dir="ltr" style="background-color: transparent; font-family: Arial; font-variant: normal; font-weight: normal; list-style-type: decimal; text-decoration: none; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.04; margin-bottom: 0pt; margin-top: 0pt;"> <span style="background-color: transparent; font-family: Arial; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><i><span style="color: #999999;">Similarity</span></i></span></div> </li> <li dir="ltr" style="background-color: transparent; font-family: Arial; font-variant: normal; font-weight: normal; list-style-type: decimal; text-decoration: none; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.04; margin-bottom: 0pt; margin-top: 0pt;"> <span style="background-color: transparent; font-family: Arial; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><i><span style="color: #999999;">Social Recommendations</span></i></span></div> </li> </ol> <li><span style="font-family: Arial; line-height: 1.04; white-space: pre-wrap;"><i><span style="color: #999999;">Hybrid Approaches</span></i></span></li> <li dir="ltr" style="background-color: transparent; font-family: Arial; font-variant: normal; font-weight: normal; list-style-type: decimal; text-decoration: none; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.04; margin-bottom: 0pt; margin-top: 0pt;"> <span style="background-color: transparent; font-family: Arial; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: 
baseline; white-space: pre-wrap;"><i><span style="color: #999999;">A practical example: Netflix&nbsp;</span></i></span></div> </li> <li dir="ltr" style="background-color: transparent; font-family: Arial; font-variant: normal; font-weight: normal; list-style-type: decimal; text-decoration: none; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.04; margin-bottom: 0pt; margin-top: 0pt;"> <span style="line-height: 1.04; white-space: pre-wrap;"><i><span style="color: #999999;">Conclusions</span></i></span></div> </li> <li dir="ltr" style="background-color: transparent; font-family: Arial; font-variant: normal; font-weight: normal; list-style-type: decimal; text-decoration: none; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.04; margin-bottom: 0pt; margin-top: 0pt;"> <span style="line-height: 1.04; white-space: pre-wrap;"><i><span style="color: #999999;">References</span></i></span></div> </li> </ol> <div style="text-align: justify;"> <span id="docs-internal-guid-75f3313b-a494-d759-5bfd-3fe0636d864e"><br /></span></div> You can access the slides on SlideShare and the videos on YouTube, but I thought it would make sense to gather both here and link them together.<br /> <br /> Here are the slides:<br /> <br /> <div style="text-align: center;"> <iframe allowfullscreen="" frameborder="0" height="356" marginheight="0" marginwidth="0" scrolling="no" src="//www.slideshare.net/slideshow/embed_code/37206312?rel=0" style="border-width: 1px; border: 1px solid #CCC; margin-bottom: 5px; max-width: 100%;" width="427"> </iframe> </div> <div style="margin-bottom: 5px;"> <strong> <a href="https://www.slideshare.net/xamat/recommender-systems-machine-learning-summer-school-2014-cmu" target="_blank" title="Recommender Systems (Machine Learning Summer School 2014 @ CMU)">Recommender Systems (Machine Learning Summer School 2014 @ CMU)</a> </strong> from <strong><a href="http://www.slideshare.net/xamat" target="_blank">Xavier Amatriain</a></strong> <br /> <br 
/> Here is the first session (2 hours):</div> <div class="separator" style="clear: both; text-align: center;"> <iframe allowfullscreen='allowfullscreen' webkitallowfullscreen='webkitallowfullscreen' mozallowfullscreen='mozallowfullscreen' width='320' height='266' src='https://www.youtube.com/embed/bLhq63ygoU8?feature=player_embedded' frameborder='0'></iframe></div> <br /> <br /> Here is the second session (2 hours):<br /> <br /> <div class="separator" style="clear: both; text-align: center;"> <iframe allowfullscreen='allowfullscreen' webkitallowfullscreen='webkitallowfullscreen' mozallowfullscreen='mozallowfullscreen' width='320' height='266' src='https://www.youtube.com/embed/mRToFXlNBpQ?feature=player_embedded' frameborder='0'></iframe></div> <br />Xavier Amatriainhttp://www.blogger.com/profile/14166119485952054870noreply@blogger.com1tag:blogger.com,1999:blog-17171206.post-63784813533833982492014-06-20T14:03:00.002-07:002014-06-20T14:03:33.028-07:00Blog posts and Summer gigsI have recently heard complaints that this blog is <a href="http://www.stackdriver.com/top-devops-influencers-blogs-follow/">rather quiet lately</a>. I agree. I have definitely been focused on publishing through other sources and have found little time to write interesting things here. On the one hand, I find twitter ideal for communicating quick and short ideas, thoughts, or pointers. You should definitely <a href="http://www.twitter.com/xamat">follow me</a> there if you want to keep up to date. On the other hand,&nbsp; I have published a couple of posts on the <a href="http://techblog.netflix.com/">Netflix Techblog</a>. A few months ago we published a post describing <a href="http://techblog.netflix.com/2013/03/system-architectures-for.html">our three-tier system architecture</a> for personalization and recommendations. 
More recently we described our implementation of <a href="http://techblog.netflix.com/2014/02/distributed-neural-networks-with-gpus.html">distributed Neural Networks</a> using GPUs and the AWS cloud.<br /> <br /> The other thing I continue to do often is give talks about our work at different events and venues. In the last few months, for instance, I have given talks at LinkedIn, Facebook, and <a href="http://i.stanford.edu/infoseminar/">Stanford</a>.<br /> <br /> This week I gave a talk and attended the Workshop on Algorithms for Modern Massive Datasets (<a href="http://mmds-data.org/">MMDS</a>). This is a very interesting workshop organized by <a href="http://www.stat.berkeley.edu/~mmahoney/">Michael Mahoney</a> every two years. It brings together a diverse crowd of people, from theoretical physicists and statisticians to industry practitioners. All of them are united by their work on large-scale data-driven algorithms. You can find the slides of my presentation <a href="http://www.slideshare.net/xamat/mmds-2014-talk-distributing-ml-algorithms-from-gpus-to-the-cloud">here</a>.<br /> <br /> So, what is next? If you want to catch some of my future talks, I will be giving a couple of public ones in the next few months.<br /> <br /> First, I will be lecturing at the Machine Learning Summer School (<a href="http://www.mlss2014.com/">MLSS</a>) at CMU in early July. I am really looking forward to joining such a great list of speakers and visiting Pittsburgh for the first time. I will be lecturing on Recommendation Systems and Machine Learning Algorithms for Collaborative Filtering.<br /> <br /> In late August I will be giving a 3-hour <a href="http://www.kdd.org/kdd2014/tutorials.html">Tutorial</a> at KDD in New York.
The tutorial is entitled "The Recommender Problem Revisited" and I will be sharing the stage with <a href="http://www.cdm.depaul.edu/people/pages/facultyinfo.aspx?fid=653">Bamshad Mobasher</a>.<br /> <br /> Finally, I was recently notified that a shorter version of the same tutorial has been accepted at <a href="http://recsys.acm.org/recsys14/">Recsys</a>, which this year is held in Silicon Valley.<br /> <br />I look forward to meeting many of you at any of these events. Don't hesitate to ping me if you will be attending.<br /> <br />Xavier Amatriainhttp://www.blogger.com/profile/14166119485952054870noreply@blogger.com0tag:blogger.com,1999:blog-17171206.post-45575110102874191462013-07-23T11:50:00.000-07:002013-07-23T11:51:41.973-07:00Recommendations as Personalized Learning to RankAs I have explained in other publications such as the<a href="http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html"> Netflix Techblog</a>, ranking is a very important part of a Recommender System. Although the Netflix Prize focused on rating prediction, ranking is in most cases a much better formulation for the recommendation problem. In this post I give some more motivation, and an introduction to the problem of personalized learning to rank, with pointers to some solutions. The post is motivated, among other things, by a proposal I sent for a tutorial at this year's <a href="http://recsys.acm.org/recsys13/">Recsys</a>. Coincidentally, my former colleagues in Telefonica, who have been working on learning to rank for some time, proposed a very similar one. I encourage you to use this post as an introduction to <a href="http://recsys.acm.org/recsys13/tutorials/#content-tab-1-1-tab">their tutorial</a>, which you should definitely attend. The goal of a ranking system is to find the best possible ordering of a set of items for a user, within a specific context, in real-time.
We optimize ranking algorithms to give the highest scores to titles that a member is most likely to play and enjoy. <br /> <br /> If you are looking for a ranking function that optimizes consumption, an obvious baseline is item popularity. The reason is clear: on average, a user is most likely to like what most others like. Think of the following situation: You walk into a room full of people you know nothing about, and you are asked to prepare a list of ten books each person likes. You will get $10 for each book you guess right. Of course, your best bet in this case would be to prepare identical lists with the "10 most liked books in recent times". Chances are the people in the room are a fair sample of the overall population, and you end up making some money. However, popularity is the opposite of personalization. As the previous example illustrates, it will produce the same ordering of items for every member. The goal, then, is to find a personalized ranking function that is better than item popularity, so we can better satisfy users with varying tastes. Our goal is to recommend the items that each user is most likely to enjoy. One way to approach this is to ask users to rate a few titles they have watched in the past in order to build a rating prediction component. Then, we can use the user's predicted rating of each item as an adjunct to item popularity. Using predicted ratings on their own as a ranking function can lead to items that are too niche or unfamiliar, and can exclude items that the user would want to watch even though they may not rate them highly. To compensate for this, rather than using either popularity or predicted rating on their own, we would like to produce rankings that balance both of these aspects. At this point, we are ready to build a ranking prediction model using these two features.
<br /> <br /> Let us start with a very simple scoring approach by choosing our ranking function to be a linear combination of popularity and predicted rating. This gives an equation of the form <i>score(u,v) = w1 p(v) + w2 r(u,v) + b</i>, where <i>u</i>=user, <i>v</i>=video item, <i>p</i>=popularity and <i>r</i>=predicted rating. This equation defines a two-dimensional space as the one depicted in the following figure.<br /> <br /> <div class="separator" style="clear: both; text-align: center;"> <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj2uPraqcDfdg5yllhKjtEnVtpwkXjjEPLa6xyK7rEkg_U-GMuZrRnnA3uwlRLLxiVgl_c9Jmi8UjyWpayiF5wTaSC7Q1hSJV8xUN9bAU-PktN_dNxx1oTW4_70swb0yTLpYq-UJA/s1600/TwoDimensionalRanking.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj2uPraqcDfdg5yllhKjtEnVtpwkXjjEPLa6xyK7rEkg_U-GMuZrRnnA3uwlRLLxiVgl_c9Jmi8UjyWpayiF5wTaSC7Q1hSJV8xUN9bAU-PktN_dNxx1oTW4_70swb0yTLpYq-UJA/s1600/TwoDimensionalRanking.png" height="253" width="320" /></a></div> <br /> Once we have such a function, we can pass a set of videos through our function and sort them in descending order according to the score. First, though, we need to determine the weights <b><i>w1</i></b> and <b><i>w2</i></b> in our model (the bias <b><i>b</i></b> is constant and thus ends up not affecting the final ordering). We can formulate this as a machine learning problem: select positive and negative examples from your historical data and let a machine learning algorithm learn the weights that optimize our goal. This family of machine learning problems is known as "Learning to Rank" and is central to application scenarios such as search engines or ad targeting. A crucial difference in the case of ranked recommendations is the importance of personalization: we do not expect a global notion of relevance, but rather look for ways of optimizing a personalized model. 
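To make this concrete, here is a toy sketch of such a linear scoring function. The weights below are made up for illustration; in a real system they would be learned from historical data, as discussed next. Note also that, as stated above, the bias <i>b</i> does not affect the final ordering.

```python
# Toy sketch of the linear ranking function score(u,v) = w1*p(v) + w2*r(u,v) + b.
# The weights w1, w2 are illustrative only; in practice they are learned from data.

def score(popularity, predicted_rating, w1=0.4, w2=0.6, b=0.0):
    return w1 * popularity + w2 * predicted_rating + b

def rank_videos(videos, popularity, predicted_rating):
    """Sort candidate videos for one user by descending score."""
    return sorted(videos,
                  key=lambda v: score(popularity[v], predicted_rating[v]),
                  reverse=True)

popularity = {"A": 0.9, "B": 0.5, "C": 0.2}        # p(v), normalized
predicted_rating = {"A": 0.1, "B": 0.8, "C": 0.9}  # r(u,v) for one user u
print(rank_videos(["A", "B", "C"], popularity, predicted_rating))  # ['B', 'C', 'A']
```

Note how the very popular but poorly rated item "A" drops to the bottom once predicted rating gets a significant weight.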
<br /> <br /> As you might guess, the previous two-dimensional model is a very basic baseline. Apart from popularity and rating prediction, you can think of adding all kinds of features related to the user, the item, or the user-item pair. Below you can see a graph showing the improvement we have seen at Netflix after adding many different features and optimizing the models. <br /> <br /> <div class="separator" style="clear: both; text-align: center;"> <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgtexc0VvMpCkhqDawIZl9GqIoU-Myj9tlY8fy44SYTX3sTtS_djXcSAy5xHxr4e6VVS0aTFgCuBpjDSubx3NyKLpOFS3hriMOCLjjI9VUIrh4oPd-yyMO_jmuROk82lVyOguhTSQ/s1600/Ranking-FeaturesPerformance.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgtexc0VvMpCkhqDawIZl9GqIoU-Myj9tlY8fy44SYTX3sTtS_djXcSAy5xHxr4e6VVS0aTFgCuBpjDSubx3NyKLpOFS3hriMOCLjjI9VUIrh4oPd-yyMO_jmuROk82lVyOguhTSQ/s1600/Ranking-FeaturesPerformance.png" height="200" width="400" /></a></div> <br /> The traditional pointwise approach to learning to rank described above treats ranking as a simple binary classification problem where the only inputs are positive and negative examples. Typical models used in this context include Logistic Regression, Support Vector Machines, Random Forests, or Gradient Boosted Decision Trees. <br /> <br /> There is a growing research effort in finding better approaches to ranking. The pairwise approach to ranking, for instance, optimizes a loss function defined on pairwise preferences from the user. The goal is to minimize the number of inversions in the resulting ranking. Once we have reformulated the problem this way, we can transform it back into the previous binary classification problem.
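As a toy illustration of that reduction (my own sketch, not any production implementation), pairwise preferences can be turned back into a binary classification dataset over feature differences, in the spirit of RankSVM-style methods:

```python
import numpy as np

# Toy sketch: reduce pairwise preferences to binary classification on
# feature differences, as RankSVM-style methods do.
def pairs_to_binary(features, preferences):
    """features: {item: feature vector}, e.g. [popularity, predicted rating].
    preferences: (preferred, other) pairs observed for one user.
    A linear classifier trained on (X, y) then scores items so that
    preferred items come out above the others."""
    X, y = [], []
    for preferred, other in preferences:
        diff = np.asarray(features[preferred]) - np.asarray(features[other])
        X.append(diff)
        y.append(1)    # "preferred ranked above other" is the positive class
        X.append(-diff)
        y.append(0)    # mirrored pair keeps the classes balanced
    return np.vstack(X), np.array(y)

features = {"A": [0.9, 0.1], "B": [0.5, 0.8]}   # [popularity, predicted rating]
X, y = pairs_to_binary(features, [("B", "A")])  # the user preferred B over A
print(X.shape, y.tolist())
```

A linear model trained on these differences directly penalizes inversions between the items in each pair.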
Examples of such an approach are RankSVM [Chapelle and Keerthi, 2010, <a href="http://olivier.chapelle.cc/pub/ordinal.pdf">Efficient algorithms for ranking with SVMs</a>], RankBoost [Freund et al., 2003, <a href="http://machinelearning.wustl.edu/mlpapers/paper_files/FreundISS03.pdf">An efficient boosting algorithm for combining preferences</a>], or RankNet [Burges et al., 2005, <a href="http://research.microsoft.com/en-us/um/people/cburges/papers/icml_ranking.pdf">Learning to rank using gradient descent</a>]. <br /> <br /> We can also try to directly optimize the ranking of the whole list by using a listwise approach. RankCosine [Xia et al., 2008. <a href="http://research.microsoft.com/en-us/people/tyliu/icml-listmle.pdf">Listwise approach to learning to rank: theory and algorithm</a>], for example, uses similarity between the ranking list and the ground truth as a loss function. ListNet [Cao et al., 2007. <a href="ftp://ftp.research.microsoft.com/pub/tr/TR-2007-40.pdf">Learning to rank: From pairwise approach to listwise approach</a>] uses KL-divergence as loss function by defining a probability distribution. RankALS [Takacs and Tikk. 2012. <a href="http://wanlab.poly.edu/recsys12/recsys/p83.pdf">Alternating least squares for personalized ranking</a>] is a recent approach that defines an objective function that directly includes the ranking optimization and then uses Alternating Least Squares (ALS) for optimizing. <br /> <br /> Whatever ranking approach we use, we need to use rank-specific information retrieval metrics to measure the performance of the model. Some of those metrics include <a href="http://en.wikipedia.org/wiki/Information_retrieval#Mean_average_precision">Mean Average Precision </a>(MAP), <a href="https://en.wikipedia.org/wiki/Discounted_cumulative_gain">Normalized Discounted Cumulative Gain</a> (NDCG), <a href="http://en.wikipedia.org/wiki/Mean_reciprocal_rank">Mean Reciprocal Rank</a> (MRR), or Fraction of Concordant Pairs (FCP). 
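For quick reference, here are straightforward (unoptimized) implementations of two of these metrics. Note that I use the common rel/log2(rank+1) discount for DCG; other variants exist.

```python
import math

# Simple reference implementations of two rank metrics mentioned above.

def mean_reciprocal_rank(results):
    """results: list of ranked lists of 0/1 relevance labels, one per query/user."""
    total = 0.0
    for labels in results:
        first_hit = next((i + 1 for i, rel in enumerate(labels) if rel), None)
        total += 1.0 / first_hit if first_hit else 0.0
    return total / len(results)

def ndcg(labels, k=None):
    """NDCG of one ranked list of graded relevance labels (higher is better)."""
    labels = labels[:k] if k else labels
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(labels))
    idcg = sum(rel / math.log2(i + 2)
               for i, rel in enumerate(sorted(labels, reverse=True)))
    return dcg / idcg if idcg else 0.0

print(mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]]))  # (1/2 + 1) / 2 = 0.75
print(ndcg([3, 2, 1]))  # a perfectly ordered list scores 1.0
```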
What we would ideally like to do is to directly optimize those same metrics. However, it is hard to optimize machine-learned models directly on these measures since they are not differentiable and standard methods such as gradient descent or ALS cannot be directly applied. In order to optimize those metrics, some methods find a smoothed version of the objective function on which to run Gradient Descent. CLiMF optimizes MRR [Shi et al. 2012. <a href="http://www.ci.tuwien.ac.at/~alexis/Publications_files/climf-recsys12.pdf">CLiMF: learning to maximize reciprocal rank with collaborative less-is-more filtering</a>], and TFMAP [Shi et al. 2012. <a href="http://www.ci.tuwien.ac.at/~alexis/Publications_files/tfmap-sigir12.pdf">TFMAP: optimizing MAP for top-n context-aware recommendation</a>] optimizes MAP in a similar way. The same authors have very recently added a third variation in which they use a similar approach to optimize "graded relevance" domains such as ratings [Shi et al., "<a href="http://arxiv.org/pdf/1307.3855.pdf">Gapfm: Optimal Top-N Recommendations for Graded Relevance Domains</a>"]. AdaRank [Xu and Li. 2007. <a href="http://research.microsoft.com/en-us/people/hangli/xu-sigir07.pdf">AdaRank: a boosting algorithm for information retrieval</a>] uses boosting to optimize NDCG. Another method to optimize NDCG is NDCG-Boost [Valizadegan et al. 2009. <a href="http://books.nips.cc/nips22/spotlight_show/Monday1.pdf">Learning to Rank by Optimizing NDCG Measure</a>], which optimizes the expectation of NDCG over all possible permutations. SVM-MAP [Xu et al. 2008. <a href="http://research.microsoft.com/en-us/people/tyliu/sigir08-directoptimizeevalmeasure.pdf">Directly optimizing evaluation measures in learning to rank</a>] relaxes the MAP metric by adding it to the SVM constraints. It is even possible to directly optimize the non-differentiable IR metrics by using techniques such as Genetic Programming, Simulated Annealing [Karimzadehgan et al. 2011.
<a href="http://labs.yahoo.com/files/www2011.pdf%E2%80%8E">A stochastic learning-to-rank algorithm and its application to contextual advertising</a>], or even Particle Swarming [Diaz-Aviles et al. 2012. <a href="http://dl.acm.org/citation.cfm?id=2366001">Swarming to rank for recommender systems</a>]. <br /> <br /> As I mentioned at the beginning of the post, the traditional formulation for the recommender problem was that of rating prediction. However, learning to rank offers a much better formal framework in most contexts. There is a lot of interesting research happening in this area, but it is definitely worthwhile for more researchers to focus their efforts on what is a very real and practical problem where one can have a great impact.Xavier Amatriainhttp://www.blogger.com/profile/14166119485952054870noreply@blogger.com1tag:blogger.com,1999:blog-17171206.post-56146905367806077262013-06-09T11:19:00.002-07:002013-06-10T08:55:58.473-07:00Tools for Data Processing @ Webscale<br /> <div style="border: 0px; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;"> A couple of days ago, I attended the <a href="http://analyticswebscale.splashthat.com/">Analytics @Webscale workshop</a> at Facebook. I found this workshop to be very interesting from a technical perspective. The workshop was mostly organized by Facebook Engineering, but they invited LinkedIn and Twitter to present, and the result was pretty balanced. I think the presentations, though biased towards what the 3 "Social Giants" do, were a good summary of many of the problems webscale companies face when dealing with Big Data. It is interesting to see how similar problems can be solved in different ways. I recently described how we address many of these issues at Netflix in our <a href="http://techblog.netflix.com/2013/03/system-architectures-for.html">Netflix Techblog</a>.
It is also interesting to see how much sharing and interaction there is nowadays in the infrastructure space, with companies releasing most of what they do as open source, and using - and even building upon - what their main business competitors have created.<br /> <br /> These are my barely edited notes:</div> <div style="border: 0px; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;"> <br /></div> <div style="border: 0px; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;"> Twitter presented several components in their infrastructure. They use Thrift on HDFS to store their logs. They have now built&nbsp;<a href="http://parquet.io/">Twitter Parquet</a>, a columnar storage format that improves storage efficiency by allowing individual columns to be read on their own.<br /> <br /> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody> <tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhIoYDQKBgQzd9ERXKMlEC9fitrWFnjB56o5epOIaE4dwFayqGUf98xXL4oiQq1XLXEWFu7YwmxqvFOfOWmAwWo21zUX-klJfX6LSqCQAz7llSvZ9HUEcHAUu_pJhqXwbkdxI1b6w/s1600/ParquetTwitter-Poster.JPG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhIoYDQKBgQzd9ERXKMlEC9fitrWFnjB56o5epOIaE4dwFayqGUf98xXL4oiQq1XLXEWFu7YwmxqvFOfOWmAwWo21zUX-klJfX6LSqCQAz7llSvZ9HUEcHAUu_pJhqXwbkdxI1b6w/s320/ParquetTwitter-Poster.JPG" width="240" /></a></td></tr> <tr><td class="tr-caption" style="text-align: center;">@squarecog talking about Parquet</td></tr> </tbody></table> <br /> They also presented their DAL (Data Access Layer), built on top of <a href="http://incubator.apache.org/hcatalog/">HCatalog</a>.<br /> <br /> <div class="separator"
style="clear: both; text-align: center;"> <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjycRanuNeXSiPAcP2Sob84tEa4k0XhRHaetn3-XIJcSyy28kYVYhvZt4P5dncddRT7ezdtfEso052aYaCsdztLuaf361vvYMpT7rueNdoa_-HQwpY1HVkZizwgVTbw8jMz5GdxdA/s1600/TwitterDAL-HCatalog-Slide.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjycRanuNeXSiPAcP2Sob84tEa4k0XhRHaetn3-XIJcSyy28kYVYhvZt4P5dncddRT7ezdtfEso052aYaCsdztLuaf361vvYMpT7rueNdoa_-HQwpY1HVkZizwgVTbw8jMz5GdxdA/s320/TwitterDAL-HCatalog-Slide.JPG" width="320" /></a></div> <br /> <div class="separator" style="clear: both; text-align: center;"> <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgexp3_GYhxU_hmAnoJzqrX_PXN3j8Sy-OzUwdaoVnwR_Wy7ULE55yEGNtEQhpGqZjZwYIBFKkX5pRqsm1N57867qIpXTxe_MVamoUdaR8FkuuEaeBlqs3CJjX6skAII9C8sqlJzQ/s1600/DALTwitter-Poster.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgexp3_GYhxU_hmAnoJzqrX_PXN3j8Sy-OzUwdaoVnwR_Wy7ULE55yEGNtEQhpGqZjZwYIBFKkX5pRqsm1N57867qIpXTxe_MVamoUdaR8FkuuEaeBlqs3CJjX6skAII9C8sqlJzQ/s320/DALTwitter-Poster.JPG" width="240" /></a></div> <div class="separator" style="clear: both; text-align: center;"> </div> <br /> <br /> Of course, they also talked about <a href="http://storm-project.net/">Twitter Storm</a>, which is their approach to distributed/nearline computation. Every time I hear about Storm it sounds better. Storm now supports different parts of their production algorithms. 
For example, the ranking and scoring of tweets for real-time search is based on a Storm topology.<br /> <br /> <div class="separator" style="clear: both; text-align: center;"> <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiMrkaBVAYebgkRtAOOSgILmXaqu4aSp-FRUZdpwG13CEA_9vOfcOqs5sHz0Gk8WZNS_CrKy3dUsD0gaBYLETkDem5uiOW6zE_Gr8Uo4KECs5OOmipZxcPWh1AY3Zyb7PMPyZYb8A/s1600/TwitterRankingStorm-Slide.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiMrkaBVAYebgkRtAOOSgILmXaqu4aSp-FRUZdpwG13CEA_9vOfcOqs5sHz0Gk8WZNS_CrKy3dUsD0gaBYLETkDem5uiOW6zE_Gr8Uo4KECs5OOmipZxcPWh1AY3Zyb7PMPyZYb8A/s320/TwitterRankingStorm-Slide.JPG" width="320" /></a></div> <br /> <br /> Finally, they also presented a new tool called <a href="http://lanyrd.com/2013/lambda-jam/scghxb/">Summingbird</a>. This is still not open sourced, but they are planning on doing so soon. Summingbird is a DSL on top of <a href="https://github.com/twitter/scalding">Scalding</a> that allows developers to define workflows that integrate offline batch processing on Hadoop with near-line processing on Storm.<br /> <br /></div> <div style="border: 0px; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;"> <div class="separator" style="clear: both; text-align: center;"> <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhO-XlsOO4XL0iwrepfwsj-v-c-RCD4jhYB3_U1RFDgDWW5MeT3a4SDdSZx7c8U1LBBRcU9r0fo3xxFKqx2qu4VjUV5xGw3Ou0ICF9rEN_vdNvFK_zO8BgxdZG-tHputuI3hbvirA/s1600/TwitterSummingbird-Slide.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhO-XlsOO4XL0iwrepfwsj-v-c-RCD4jhYB3_U1RFDgDWW5MeT3a4SDdSZx7c8U1LBBRcU9r0fo3xxFKqx2qu4VjUV5xGw3Ou0ICF9rEN_vdNvFK_zO8BgxdZG-tHputuI3hbvirA/s320/TwitterSummingbird-Slide.JPG" width="320"
/></a></div> <div class="separator" style="clear: both; text-align: center;"> <br /></div> <div class="separator" style="clear: both; text-align: center;"> <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhcxKfqVWVFDV-J_7tZnSEjdWCpocRDG3n6Ms2Buy9wMNxT94zr1-5QGvgVnBplzSW_1aXdsgtfsVv1CZIEnhsMxKmUdYARYlxQyB-s_fqMSAZhItuS69wNNsGdF2Ye6AYf8bcy4Q/s1600/Summingbird-Poster.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhcxKfqVWVFDV-J_7tZnSEjdWCpocRDG3n6Ms2Buy9wMNxT94zr1-5QGvgVnBplzSW_1aXdsgtfsVv1CZIEnhsMxKmUdYARYlxQyB-s_fqMSAZhItuS69wNNsGdF2Ye6AYf8bcy4Q/s320/Summingbird-Poster.JPG" width="240" /></a></div> <div class="separator" style="clear: both; text-align: center;"> <br /></div> <br /></div> <div style="border: 0px; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;"> LinkedIn also talked about their approach to combining offline/near-line/real-time computation, although I always get the sense that they lean much more towards the former.
They talked about three main tools: <a href="http://kafka.apache.org/">Kafka</a>, their publish-subscribe system; <a href="http://data.linkedin.com/opensource/azkaban">Azkaban</a>, a batch job scheduler we have talked about using in the past; and <a href="http://data.linkedin.com/projects/espresso">Espresso</a>, a timeline-consistent NoSQL database.<br /> <br /></div> <div style="border: 0px; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;"> <div class="separator" style="clear: both; text-align: center;"> <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEirKn3N-5EtXaHihYsLoCXNbVndszykhYBXQD-nk1koA4zTqwfuIDAfnZbLeIEOGWpWiEdhqN8DDRTt5n7kN2E1zFuscl8OEi4v2wCtvIiEpLCP0qrTul2GLjV-j65Oh4T0uIgPZQ/s1600/Exa-scaleSystemsFacebook-Slide.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEirKn3N-5EtXaHihYsLoCXNbVndszykhYBXQD-nk1koA4zTqwfuIDAfnZbLeIEOGWpWiEdhqN8DDRTt5n7kN2E1zFuscl8OEi4v2wCtvIiEpLCP0qrTul2GLjV-j65Oh4T0uIgPZQ/s320/Exa-scaleSystemsFacebook-Slide.JPG" width="320" /></a></div> <br /></div> <div style="border: 0px; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;"> Facebook also presented their whole stack. Some known tools, some not so much. <a href="https://www.facebook.com/notes/facebook-engineering/under-the-hood-data-diving-with-scuba/10150599692628920">Facebook Scuba</a>&nbsp;is a distributed in-memory stats store that allows them to read distributed logs and query them fast. <a href="http://gigaom.com/2013/06/06/facebook-unveils-presto-engine-for-querying-250-pb-data-warehouse/">Facebook Presto</a> was a new tool presented as their solution for getting fast queries out of Exabyte-scale data stores.
The sentence "A good day for me is when I can run 6 Hive queries", supposedly attributed to a FB data scientist, stuck in my mind ;-).&nbsp;<span style="line-height: 1.428571em;">Morse is a different distributed approach to fast in-memory data loading. And,</span><span style="line-height: 1.428571em;">&nbsp;</span><a href="http://www.quora.com/Why-did-Facebook-develop-Puma-pTail-instead-of-using-existing-ones-like-Flume" style="line-height: 1.428571em;">Puma/ptail</a><span style="line-height: 1.428571em;">&nbsp;</span><span style="line-height: 1.428571em;">is a different approach to "tailing" logs, in this case into HBase.&nbsp;</span><br /> <br /> <div class="separator" style="clear: both; text-align: center;"> <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh8QyRz_VVTNoiof6bLUpdup2tkuRThzEFSZgRkd1XX5WfrqScTCefh6rIsz2ENlMtkwYm5uDyX7kG3ZX8ppjW0lJ4bJ4gng8pl72i56oIKJP74GZ5o4m_icJ567hdsxnawAXnMlA/s1600/FacebookScubaPoster.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh8QyRz_VVTNoiof6bLUpdup2tkuRThzEFSZgRkd1XX5WfrqScTCefh6rIsz2ENlMtkwYm5uDyX7kG3ZX8ppjW0lJ4bJ4gng8pl72i56oIKJP74GZ5o4m_icJ567hdsxnawAXnMlA/s320/FacebookScubaPoster.JPG" width="240" /></a></div> <br /> <br /></div> <div style="border: 0px; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;"> Another Facebook tool that was mentioned by all three companies is <a href="https://github.com/apache/giraph">Giraph</a>. (To be fair, Giraph was started at Yahoo, but Facebook hired its creator, Avery Ching). Giraph is a graph-based distributed computation framework that works on top of Hadoop. Facebook claims they ran PageRank on a graph with a trillion edges on 200 machines in less than 6 minutes/iteration. Giraph is another alternative to <a href="http://graphlab.org/">Graphlab</a>.
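As a side note for readers unfamiliar with the computation being distributed here, this is a single-machine toy sketch of the PageRank iteration (my own version, assuming every node has at least one outgoing edge); Giraph's contribution is scaling this message-passing loop to trillion-edge graphs:

```python
# Single-machine sketch of the PageRank iteration that Giraph distributes.
# Assumes every node has at least one outgoing edge (no dangling nodes).
def pagerank(graph, damping=0.85, iterations=50):
    """graph: {node: [outgoing neighbors]} -> {node: rank}; ranks sum to 1."""
    n = len(graph)
    rank = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        new_rank = {node: (1 - damping) / n for node in graph}
        for node, out_links in graph.items():
            share = damping * rank[node] / len(out_links)
            for neighbor in out_links:
                new_rank[neighbor] += share
        rank = new_rank
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # "c": it receives links from both "a" and "b"
```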
Both LinkedIn and Twitter are using Giraph. In the case of Twitter, it is interesting to hear that they now prefer it to their own in-house (although single-node)&nbsp;<a href="https://github.com/twitter/cassovary">Cassovary</a>. It will be interesting to see all these graph processing tools side by side in this year's <a href="http://graphlab.org/graphlab-workshop-2013/">Graphlab workshop</a>.</div> <div style="border: 0px; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;"> <br /></div> <div style="border: 0px; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;"> Another interesting thread I heard from different speakers, as well as in coffee-break discussions, was the use of <a href="http://www.quora.com/How-does-YARN-compare-to-Mesos">Mesos vs. Yarn</a> or even <a href="http://spark-project.org/">Spark</a>. It is clear that many of us are looking forward to the NextGen MapReduce tools to reach some level of maturity.<br /> <br /> <br /></div> Xavier Amatriainhttp://www.blogger.com/profile/14166119485952054870noreply@blogger.com0tag:blogger.com,1999:blog-17171206.post-16308401253706682862013-01-15T22:50:00.000-08:002013-01-23T23:31:02.668-08:0010 "Little" lessons for life that I learned from running(Sorry for allowing myself to depart from the usual geeky computer science algorithmic talk in this blog. I owed it to myself and my biggest hobby to write a post like this. I hope you bear with me.) <br /> <br /> Around 3 years ago, I smoked, I was overweight, and only exercised occasionally. Being a fan of radical turns in my life, I decided one day to go on a week-long liquid diet, I stopped smoking, and I took up running, with the only goal of some day running the half marathon in my home town. Little did I know that the decision to run would change my life in so many ways.
This last year, 2012, I ran 3 marathons, 4 half marathons, and a 199-mile relay with a team of 12. But, beyond that, I am convinced that I owe part of my personal and professional success these past years to the fact that I am a runner.<br /> <br /> This post is my little homage to running and to the many lessons I have found in my journey.<br /> <br /> <div class="separator" style="clear: both; text-align: center;"> <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiwKwb9kHGWFfm6oPhzmpVwo3eptqNrlwbGwKKDa_af8KtlObOHRyvac5Avtc_hB07EqRo3wYyh7T6jZyclpuMmY8Ivag-bzO_C3pUT4aLZAC_RrAkQcJCzBW03vq4wnb3poJF1JA/s1600/ChicagoFinisher.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiwKwb9kHGWFfm6oPhzmpVwo3eptqNrlwbGwKKDa_af8KtlObOHRyvac5Avtc_hB07EqRo3wYyh7T6jZyclpuMmY8Ivag-bzO_C3pUT4aLZAC_RrAkQcJCzBW03vq4wnb3poJF1JA/s320/ChicagoFinisher.jpg" width="209" /></a></div> <br /> <h2> </h2> <br /> When I started running I had lots of problems. The main one was an old knee injury that came back to haunt me. I had ACL surgery when I was 16, and ever since my right knee has not been the same. When my knee started hurting this time, I visited several doctors, some of them specialized in sports. All of them recommended I should give up running. Some told me straight out that I would never be able to run a marathon. It took lots of visits to the chiropractor, and months of quad exercises, to get back to running. But I overcame these initial hurdles, and went on to run not one but several marathons.<br /> <br /> <i><b>Lesson 1. Beginnings are hard:</b> Starting anything new in life will be hard. You will need to invest lots of energy, and at times you will want to give up. The more important and significant the change is, the more it will take from you.
</i><br /> <br /> <div class="separator" style="clear: both; text-align: center;"> <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEged1aCBEL19vAHxu73aTkrhp-9IMe_qQTW-NFgDopmmvfklQKrSKpxfZ7FN5obHMJVcMbU7ur3rezkrVsZnv_9OQ2DNjrA-sbNMsn_jGioZ59Tu4IMzw1ntDyc6zYbLgR1fAezAA/s1600/97632-082-020f.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="212" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEged1aCBEL19vAHxu73aTkrhp-9IMe_qQTW-NFgDopmmvfklQKrSKpxfZ7FN5obHMJVcMbU7ur3rezkrVsZnv_9OQ2DNjrA-sbNMsn_jGioZ59Tu4IMzw1ntDyc6zYbLgR1fAezAA/s320/97632-082-020f.jpg" width="320" /></a></div> <br /> After I finished my first marathon in Santa Cruz, and when I thought all my knee problems were long gone, my knee started hurting again. This was nothing like what I had experienced when starting. Still, it could have been enough to stop me from trying again. But it didn't. I focused on recovering. Soon I was back on the road.<br /> <br /> <b> <i>Lesson 2. There will be ups and downs:</i></b><i> Once you have overcome the initial difficulties in starting something new, you will be tempted to think that everything else should be easy. But life, like most running courses, will have hills with ups and downs.</i><br /> <br /> <div class="separator" style="clear: both; text-align: center;"> <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjq2P8lAOhoCOJuMWLmUnaHj9hi-LElM2QBXnTV5mveNExovso4rL1zdaG6BGXZmx2ckpZxP5hfUkFGsjnOSh0qC3-xoMLkGcWtRLZ1m2Ib3JFDdFk_kiLk481gfoGzU2WjWsGFUA/s1600/Los3Mosqueteros.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjq2P8lAOhoCOJuMWLmUnaHj9hi-LElM2QBXnTV5mveNExovso4rL1zdaG6BGXZmx2ckpZxP5hfUkFGsjnOSh0qC3-xoMLkGcWtRLZ1m2Ib3JFDdFk_kiLk481gfoGzU2WjWsGFUA/s320/Los3Mosqueteros.jpg" width="303" /></a></div> <br /> It is hard to wake up at 6 am for the morning run.
It is easy to stay in bed when your legs are still sore from yesterday's training. It is tough to go out running when it is raining or freezing outside. It is even harder to decide not to stop when you hit the wall on mile 20 of a marathon. All these small day-to-day decisions add up, and end up making the difference in whether you improve and accomplish your running goals.<br /> <br /> <i><b>Lesson 3. The importance of those small decisions:</b> The small day-to-day decisions play a huge role in building your character. They will end up determining your long-term success and the direction your life takes.</i><br /> <br /> <div class="separator" style="clear: both; text-align: center;"> <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg2sB-Hl9AlHL1kMRBkHVnHJKzYGHiCt4-35beUfT-hcCL1J_KKasK3UL2t31HLyb2viAKfyiZntXgqZU1QdNoXACXNvwt8TvefapFxcqo6NH4SrRNxTFZ813tl0xEfP-iLGCLH6Q/s1600/Sacramento.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg2sB-Hl9AlHL1kMRBkHVnHJKzYGHiCt4-35beUfT-hcCL1J_KKasK3UL2t31HLyb2viAKfyiZntXgqZU1QdNoXACXNvwt8TvefapFxcqo6NH4SrRNxTFZ813tl0xEfP-iLGCLH6Q/s320/Sacramento.jpg" width="320" /></a></div> <br /> When you are not at your best, it is even harder to face all these small decisions I mentioned. If you are down for some time because of an injury, it is tough to start again on your own. Having a group of friends that share your passion for running is extremely important. I am fortunate to have a large group of friends that push me to become better, and help me get up when I fall.<br /> <br /> <i><b>Lesson 4. You are not alone - the power of social influence... and friends:</b> Whatever new adventure you start in life, it is important to have people around you that understand and support it.
People that share your passion can make a difference when you need it.</i><br /> <br /> <div class="separator" style="clear: both; text-align: center;"> <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjcG3rZLpJEewG5rQiG6aymClCHzyv-m5Wl-UZHVT9Pt6g66-CH312Tjo3GZnYlAUVNzgUQlgZXqIMg87iYCy88zf6AzUK2bDI9Nt00FRY-lK0Ow18kAekdzapjy-9YSYjpNvaNYA/s1600/723255-1001-0002s.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="213" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjcG3rZLpJEewG5rQiG6aymClCHzyv-m5Wl-UZHVT9Pt6g66-CH312Tjo3GZnYlAUVNzgUQlgZXqIMg87iYCy88zf6AzUK2bDI9Nt00FRY-lK0Ow18kAekdzapjy-9YSYjpNvaNYA/s320/723255-1001-0002s.jpg" width="320" /></a></div> <br /> As much as I have appreciated having that extra support from friends and other fellow runners, there are many times I have felt the pressure of having to make a decision on my own. Many of those small decisions, such as getting out of bed on a rainy day, are yours alone. Nobody is going to make them for you. I have also felt alone in many of my training runs. And, of course, in mile 20 of a marathon, when everyone is giving their best but you can only see strangers around you. In all those moments it is important to be strong and be ready to carry on, on your own.<br /> <br /> <i><b>Lesson 5. 
But, you will be alone: </b>No matter how many friends support you, you will have to face important decisions on your own, and carry your own weight.</i><br /> <br /> <div class="separator" style="clear: both; text-align: center;"> <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjiZAX0M-MaH1OP2ULO5gTQ8k8iXv3Rsyl2OWHWMdz-QdNCIAXoyUtNeaRN4NhjQ1lvZIQGIhESBt9imyH-f38D1NkiCvplY20g_4SYp4w_Eo-QuzVqnq9ruXQCkijqRP-oaxE7Vw/s1600/Relay.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="239" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjiZAX0M-MaH1OP2ULO5gTQ8k8iXv3Rsyl2OWHWMdz-QdNCIAXoyUtNeaRN4NhjQ1lvZIQGIhESBt9imyH-f38D1NkiCvplY20g_4SYp4w_Eo-QuzVqnq9ruXQCkijqRP-oaxE7Vw/s320/Relay.jpg" width="320" /></a></div> <br /> It is well known that "repetition leads to mastery". This is even more so for activities that require developing physical strength and endurance. There is no other secret to becoming a better runner than to run, and run often. Putting in more miles is the goal. Everything else will come.<br /> <br /> <i><b>Lesson 6. Repeat, repeat, repeat, repeat: </b>Repetition is the key to mastering most things in life. 
If you want to become good at doing something, ask yourself how you can invest thousands of hours in it (read about <a href="http://www.gladwell.com/outliers/outliers_excerpt1.html">the 10k hour rule</a> in Malcolm Gladwell's Outliers).</i><br /> <br /> <div class="separator" style="clear: both; text-align: center;"> <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjYAdPB9YTZUXv7n6RAVa8jCmXkpDaHSvUnY5lwlIf_bpCHthRJH-DFjv42pf-WjhYjqWyBR3kV7l-HZQcq9o1Hc0Tcim4GlMdOs1Ez_IcF1U0ls04YZtwb0bJ7SNNVOYgGBRQDxg/s1600/WithAitor.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjYAdPB9YTZUXv7n6RAVa8jCmXkpDaHSvUnY5lwlIf_bpCHthRJH-DFjv42pf-WjhYjqWyBR3kV7l-HZQcq9o1Hc0Tcim4GlMdOs1Ez_IcF1U0ls04YZtwb0bJ7SNNVOYgGBRQDxg/s320/WithAitor.jpg" width="191" /></a></div> <br /> <br /> As much as repetition is needed to improve, it is hard to do so without a goal in mind. During my time running I have learned the power of having concrete goals: goals that are achievable in the long run, but not too easy to reach. As I have progressed, I have learned to be more demanding. My current goals are to run a 3:30 marathon, and a 1:30 half. The first one is achievable; the second one will need much more work. But these goals will keep me going and focused for some time.<br /> <br /> <i><b>Lesson 6. 
Set your goals:</b> Setting ambitious but achievable goals in life will help you push harder and will keep you focused and looking forward.</i><br /> <br /> <div class="separator" style="clear: both; text-align: center;"> <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhHpwegnty_HEPBnkWfgLo2-OO8jJeCYTf2p4JINahas1xPU6pYwuCMhEarCxlEgx7iezFXFaf7sE4i8fWFMUCt2Pq5udbFDWOARubgoLfLzNJT6D82EaJc2qRPCPvSYNbL71ZGXg/s1600/IMAG0506.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="191" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhHpwegnty_HEPBnkWfgLo2-OO8jJeCYTf2p4JINahas1xPU6pYwuCMhEarCxlEgx7iezFXFaf7sE4i8fWFMUCt2Pq5udbFDWOARubgoLfLzNJT6D82EaJc2qRPCPvSYNbL71ZGXg/s320/IMAG0506.jpg" width="320" /></a></div> <br /> When I look back at the way I started running, I realize how many things I did wrong. I have learned so much since then. I have read books, watched movies and online videos, and talked to people that know much more than I do. I have also learned from looking at the data that I generate from each of my training runs. And I have learned to listen to and understand my body. I am fortunate enough that I love learning, and I have enjoyed every bit of this learning experience.<br /> <br /> <b><i>Lesson 7. Data and knowledge:</i></b><i> Use all the information around you to improve your life. Data about you can give you insights into how to become better. 
And any knowledge you gain from external sources can make a difference when making a decision.</i><br /> <br /> <div class="separator" style="clear: both; text-align: center;"> <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiCbuci00Ic6F7jx58bGdXeI2tCea15A6seNtSzxtUpZG7OQUxLKbYRfNOq5nj0_HX1vXHlrUKrr8-IwZGw2MKGdMZNvhlUy88xdXPfSjlJasLQKoaBwVTNAB0mV9bRN4CC8kecWA/s1600/ClothesReady.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="239" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiCbuci00Ic6F7jx58bGdXeI2tCea15A6seNtSzxtUpZG7OQUxLKbYRfNOq5nj0_HX1vXHlrUKrr8-IwZGw2MKGdMZNvhlUy88xdXPfSjlJasLQKoaBwVTNAB0mV9bRN4CC8kecWA/s320/ClothesReady.jpg" width="320" /></a></div> <br /> One of the reasons why beginnings are hard (Lesson 1) is that people that start running tend to overdo it by, for example, increasing distance and pace at the same time. This typically leads to injury and frustration. One of the most important things to learn when starting to run is to understand your own limitations. Even when you do, you will be tempted to push too hard by continuing to run when your leg hurts, or by doing one too many races in a short period. I have done all of the above. But it is important to remember that everyone has their limits and pushing beyond them can result in long-term problems.<br /> <br /> <b><i>Lesson 8. Everyone has their limits: </i></b><i>Pushing yourself hard is good. However, there is such a thing as pushing *too* hard. 
You need to understand where your limits are to push them further, but only little by little.</i><br /> <br /> <div class="separator" style="clear: both; text-align: center;"> <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZIbn2qgqe2Runu8eUpKhAN0qO60P9oAkYtfyqb6-hV5BSUW90zBcnQJ6UkQJTZRDWRn4yWcc9VClQjKGxT90lC5aTMR4cXIng2co_lsIHQrjaYYm9iN0_uQ_0aFPL0IxNrmw1Xg/s1600/HMB.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="239" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZIbn2qgqe2Runu8eUpKhAN0qO60P9oAkYtfyqb6-hV5BSUW90zBcnQJ6UkQJTZRDWRn4yWcc9VClQjKGxT90lC5aTMR4cXIng2co_lsIHQrjaYYm9iN0_uQ_0aFPL0IxNrmw1Xg/s320/HMB.jpg" width="320" /></a></div> <br /> <br /> No matter how hard it can get at some points, no matter how long it can take you, there is no doubt that you can do whatever you set your mind to. I don't have any special gift for running, and I never have. I don't think I will ever be a "great" runner. However, now I look back and laugh when I remember that my unreachable goal a little over 3 years ago was "only" to run a half marathon. If someone like me, with no particular predisposition, family and work obligations, and very little time, can do it, so can you.<br /> <br /> <i><b>Lesson 9. But, yes you can: </b>No matter how low you fall or how far away your goal is, you can do it. Just think of the many people like you who have done it before (estimates are that around 0.1 to 0.5% of the US population has completed a marathon). Why should you be any less? 
</i><br /> <br /> <div class="separator" style="clear: both; text-align: center;"> <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjEbyMwa4XE7uDmkDoskkVY0RN3yaZjPrhx_vM6sa6EXN4pwXJlW6xPWWnGbNu-kdMLJPBJj6eV9QTibaLcSaQ4Y1K0SEZ9J_MASWLbqVuuWjxw15mNnkfOGj-zHhQpegPPdo24nQ/s1600/VasonaLakePedroAlbertoXavi.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjEbyMwa4XE7uDmkDoskkVY0RN3yaZjPrhx_vM6sa6EXN4pwXJlW6xPWWnGbNu-kdMLJPBJj6eV9QTibaLcSaQ4Y1K0SEZ9J_MASWLbqVuuWjxw15mNnkfOGj-zHhQpegPPdo24nQ/s320/VasonaLakePedroAlbertoXavi.JPG" width="239" /></a></div> <br /> As a conclusion, let me stress that the fact that anyone can run does not mean that running is easy or requires no effort. It is precisely the fact that it is hard and requires effort over a long period of time that makes it worthwhile. Like most good things in life.<br /> <br /> <i><b>Lesson 10. All good things come hard:</b> Think about it: all worthy things in life require effort and dedication. Being healthy, fit, happy, having a career, or a family: they all require your energy and long-term investment. Just go with it, and enjoy every bit of the journey.</i><br /> <br />Xavier Amatriainhttp://www.blogger.com/profile/14166119485952054870noreply@blogger.com3tag:blogger.com,1999:blog-17171206.post-31368086052455457792012-09-17T10:47:00.000-07:002012-09-17T16:41:14.500-07:00Recsys 2012: A long (and likely biased) summaryAfter a great week in beautiful and sunny Dublin (yes, sunny), it is time to look back and recap on the most interesting things that happened in the 2012 Recsys Conference. 
I have been attending the conference since its first edition in Minnesota. And, it has been great to see the conference mature to become the premier event for recommendation technologies. I can't hide that this is my favorite conference for several reasons: perfect size, great community, good involvement from industry, and a good side program of tutorials, workshops, and demos.<br /> <div class="separator" style="clear: both; text-align: center;"> <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiBs1syiHzsU4JOWOM-dXkZgm-4W319Jqn7IcRldsn5nhGUQgDdGoUbjv-0TXgWtR4q5XlLHfkEct-ZcUZcBH0jsJa4PTQWyubXRY_aX7mWIHncsjO7_KeEy5CAaHzq9GldT_8CCA/s1600/RecsysPhoto.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiBs1syiHzsU4JOWOM-dXkZgm-4W319Jqn7IcRldsn5nhGUQgDdGoUbjv-0TXgWtR4q5XlLHfkEct-ZcUZcBH0jsJa4PTQWyubXRY_aX7mWIHncsjO7_KeEy5CAaHzq9GldT_8CCA/s320/RecsysPhoto.JPG" width="239" /></a></div> <br /> This year I arrived a bit late and missed the first day of tutorials, and the first day of the conference. 
But, I was able to catch up after jumping right in with my 90-minute tutorial on "Building Industrial-scale Real-world Recommender Systems".<br /> <br /> In my tutorial (<a href="http://www.slideshare.net/xamat/building-largescale-realworld-recommender-systems-recsys2012-tutorial">see slides here</a>), I talked about the importance of four different issues in real-world recommender systems:<br /> <ul> <li>Paying attention to user interaction models that support things like explanations, diversity, or novelty.</li> <li>Coming up with algorithms that, beyond rating prediction, focus on other aspects of recommendation such as similarity, or, in particular, ranking.</li> <li>Using results of online A/B tests, and coming up with offline model metrics that correlate with the former.</li> <li>Understanding the software architectures where your recommender system will be deployed.</li> </ul> I was happy to see that some of these issues were not only mentioned, but almost became recurring threads throughout the conference. Of course, this might be in the eye of the beholder, and others might have come back with the impression that the main topics were different (I recommend you read these two other Recsys 2012 summaries by <a href="http://thenoisychannel.com/2012/09/14/recsys-2012-beyond-five-stars/">Daniel Tunkelang</a> and <a href="http://www.syslog.cl.cam.ac.uk/2012/09/14/recsys-2012-few-things-i-remember/">Daniele Quercia</a>). In any case, grouping things by topic will help me summarize the many things I found interesting.<br /> <h3> Online A/B Testing and offline metrics</h3> I am glad to see that this has become a relevant topic for the conference, because many of us believe this is one of the most important topics that need to be addressed by both industry and academia. 
One of these people is Ron Kohavi, who delivered a great keynote on "<a href="http://www.exp-platform.com/Pages/2012RecSys.aspx">Online Controlled Experiments: Introduction, Learnings, and Humbling Statistics</a>", where he described the lessons from many years of A/B testing at Amazon and Microsoft. It is funny that I cited his KDD 2012 paper in two slides in my tutorial, not knowing that he was in the audience. I recommend you go through his slides; it was one of the best talks of the conference for sure.<br /> <br /> The importance of finding relevant metrics was, as a matter of fact, the focus of a workshop we organized with Harald Steck (Netflix), Pablo Castells (UAM), Arjen de Vries, and Christian Posse (LinkedIn). The title of the workshop was "Recommendation Utility Evaluation: Beyond RMSE". Unfortunately, I was not able to attend. But, I do know the keynote by Carlos Gomez-Uribe, also from Netflix, was very well received. And, the workshop as a whole went very well, with several interesting papers and even more interesting discussions. You can access the papers on the <a href="http://ir.ii.uam.es/rue2012/">website</a>.<br /> <br /> A couple of papers in the main track of the conference also touched upon the importance of optimizing several objectives at the same time. In "<a href="http://www.slideshare.net/mechanistician/recsys-2012-slides?ref=http://thenoisychannel.com/">Multiple Objective Optimization in Recommender Systems</a>", Mario Rodriguez and others explain how they design LinkedIn recommendations by optimizing for several objectives at once (e.g. a candidate that is good for the job and open to new opportunities). They report results from an A/B test run on LinkedIn. 
In "<a href="http://www.slideshare.net/marcotulioribeiro54/presentation-recsys12">Pareto-Efficient Hybridization for Multi-Objective Recommender Systems</a>", Marco Tulio Ribeiro and others from Universidade Federal de Minas Gerais &amp; Zunnit Technologies take the multi-objective approach a step further. In their case, they optimize the system not only to be accurate, but also to present novel or diverse items.<br /> <br /> Some other papers went beyond the academic experimental procedure and implemented real systems that were tested with users. A good example is "Finding a Needle in a Haystack of Reviews: Cold Start Context-Based Hotel Recommender System" by researchers from the Tel Aviv Yaffo College and Technicolor.<br /> <h3> Learning to Rank</h3> Another hot topic in this year's Recsys was ranking (or Top-n Recommendations, as some prefer to call it). It is good to see that, after some time publicly speaking about the importance of ranking approaches, the community now seems to be much more focused on ranking than on rating prediction. Not only was there a whole session devoted to ranking, but many other papers in the conference dealt with the topic in some way or another.<br /> <br /> I will start by mentioning the very good work by my former colleagues from Telefonica. Their paper "CLiMF: Learning to Maximize Reciprocal Rank with Collaborative Less-is-More Filtering" won the best-paper award. And, I think most of us thought that it was very well-deserved. It is a very good piece of work: well motivated, well evaluated, and it addresses a very practical issue. It is great to see the Recsys team I started at Telefonica acknowledged with this award. 
You can access the paper <a href="http://baltrunas.info/papers/Shi12-climf.pdf">here</a> and the slides <a href="http://www.slideshare.net/kerveros99/climf-collaborative-lessismore-filtering">here</a>.<br /> <br /> In that same session, researchers from the Université Paris 6 presented "Ranking with Non-Random Missing Ratings: Influence of Popularity and Positivity on Evaluation Metrics", an interesting study on the very important issues of negative sampling and popularity bias in learning to rank. The paper discusses these effects on the AUC (Area Under the Curve) measure, a measure that is neither very well-behaved nor much used in evaluating ranking algorithms. Still, it is a valuable first step in a very interesting line of work. It is interesting to point out that the CLiMF paper addressed the issue of negative sampling in a radically different way: by only considering positive samples. Yet another interesting paper in that session was "<a href="http://www-users.cs.umn.edu/%7Exning/papers/Ning2012.pdf">Sparse Linear Methods with Side Information for Top-N Recommendations</a>", a model for multidimensional context-aware learning to rank.<br /> <br /> Another ranking paper, "Alternating Least Squares for Personalized Ranking" by Gábor Takács from Széchenyi István University and Domonkos Tikk from Gravity R&amp;D, received an honorable mention. The main author delivered an (un)popular line during his presentation when he invited anyone not interested in mathematics to leave the room. An unnecessary invitation in a conference that prides itself on being inclusively multidisciplinary. In Recsys, psychologists sit through systems presentations as much as mathematicians sit through user-centric sessions, and that is what makes the conference appealing. 
In any case, the paper presents an interesting way to combine a ranking-based objective function introduced at last year's KDD with the use of ALS, instead of SGD, to come up with another approach to learning to rank.<br /> <br /> Two papers dealing with recommendations in social networks also focused on ranking: "<a href="http://eeweb.poly.edu/faculty/yongliu/docs/topk_tr.pdf">On Top-k Recommendation Using Social Networks</a>" by researchers from NYU and Bell Labs, and "<a href="http://www.l3s.de/web/upload/documents/1/diaz_recsys2012.pdf">Real-Time Top-N Recommendation in Social Streams</a>" by Ernesto Diaz-Aviles and other researchers from the University of Hannover. The same first author had an interesting short paper in the poster session: "<a href="http://www.l3s.de/web/upload/documents/1/sp197-diaz.pdf">Swarming to Rank for Recommender System</a>". In that poster, he proposes the use of a Particle Swarm Optimization algorithm to directly optimize ranking metrics such as MAP. The method is an interesting alternative to the use of Genetic Algorithms or Simulated Annealing for this purpose.<br /> <br /> Finally, the industry keynote by Ralf Herbrich from Facebook also introduced the world of Bayesian Factor Models for large-scale distributed ranking. This method, introduced by the same author and others from MSR as "<a href="http://www2009.eprints.org/12/1/p111.pdf">Matchbox</a>", is now used in different settings. For example, the poster "<a href="http://www.eng.tau.ac.il/%7Enoamk/papers/KNPS12.pdf">The Xbox Recommendation System</a>" presented its applicability for recommending movies and games for the Xbox. And, in "<a href="http://research.microsoft.com/pubs/166623/RecSys-2012.pd">Collaborative Learning of Preference Rankings</a>" the authors apply it to... 
sushi recommendation!<br /> <h3> User-centric, interfaces &amp; explanations</h3> This was probably the third big area of focus of the conference, with many contributions in papers, tutorials, and workshops. On the first day, there were actually two tutorials that fall into this category. In "Personality-based Recommender Systems: An Overview", the authors presented the idea of using personality traits for modeling user profiles. Among other things, they introduced their proposal to use PersonalityML, an XML-based language for personality description. Interestingly, in the industry session, we saw that this is actually quite a practical thing to do. Thore Graepel from Microsoft explained their experiments in using the Big Five personality traits for personalization. In the other tutorial, "<a href="http://www.slideshare.net/usabart/tutorial-on-conducting-user-experiments-in-recommender-systems">Conducting User Experiments in Recommender Systems</a>", Bart Knijnenburg gave a thorough overview of how to conduct user studies for recommender systems. He also introduced his model of using structural equations to model the effects to be evaluated. Again, I missed this tutorial, but I was fortunate to hear a very similar presentation by him at Netflix.<br /> <br /> In "<a href="http://bostandjiev.com/Content/publications/SocialRecStudy.pdf">Inspectability and Control in Social Recommenders</a>", Bart himself (and researchers from UCSB) analyze the effect of giving more information and control to users in the context of social recommendations. A similar idea is explored in the short paper "The Influence of Knowledgeable Explanations on Users' Perception of a Recommender System" by Markus Zanker.<br /> <br /> Two papers addressed the issue of how much information we should require from users. In "User Effort vs. 
Accuracy in Rating-Based Elicitation", Paolo Cremonesi and others analyze how many ratings "are enough" for producing satisfying recommendations in a cold-start setting. And, in "How Many Bits Per Rating?", the Movielens crew try to quantify the amount of information and noise in user ratings from an information-theoretical perspective. An interesting continuation of <a href="http://technocalifornia.blogspot.com/2009/04/i-like-it-i-like-it-not-or-how-miss.html">my work on user ratings noise</a>. However, as the first author himself admitted, this is just initial work.<br /> <br /> Other highlights of user-centric work that fell more on the UI side were the paper "TasteWeights: A Visual Interactive Hybrid Recommender System" by my friends at UCSB, as well as the many papers presented in the <a href="https://homepages.abdn.ac.uk/csc284/pages/InterfaceRS/">Workshop on Interfaces for Recommender Systems</a>. <br /> <h3> Data &amp; Machine Learning Challenges</h3> If somebody thought that data and machine learning challenges would fade away after the Netflix Prize, this year's Recsys was a clear example that this is far from being the case. Many challenges have taken over after that: the yearly KDD Cups, Kaggle, Overstock, last year's MoviePilot challenge, the Mendeley Challenge... Recsys this year had a <a href="http://recsys.acm.org/2012/tutorials.html#best">Tutorial/Panel</a> and a <a href="http://2012.recsyschallenge.com/">Workshop</a> on Recommender Systems Challenges, both organized by Alan Said, Domonkos Tikk, and others. I could not attend the Tutorial since it was happening at the same time as mine. But, I was able to catch some interesting presentations in the Workshop. 
Domonkos Tikk from Gravity R&amp;D gave <a href="http://www.slideshare.net/domonkostikk/from-a-toolkit-of-recommendation-algorithms-into-a-real-business-the-gravity-rd-experience">a very interesting presentation</a> on how they evolved from being a team in the Netflix Prize to a real-world company with very interesting projects. Kris Jack from Mendeley also gave two interesting talks on the Mendeley recommender systems. In <a href="http://www.slideshare.net/KrisJack/mendeley-suggest-engineering-a-personalised-article-recommender-system">one of them</a>, he explained how they make use of AWS and Mahout in a system that can generate personalized recommendations for about $60 a month. In <a href="http://www.slideshare.net/KrisJack/rec-sys12mendeleydatachallenges">the other</a>, he talked about their perspective on data challenges.<br /> <h3> Context-aware and location-based recommendations</h3> This has become a traditional area of interest in Recsys. It has now matured to the point that it has its own session, and two workshops: "<a href="http://loca.mobilelifecentre.org/">Personalizing the Local Mobile Experience</a>" and the "<a href="http://cars-workshop.org/">Workshop on Context-Aware Recommender Systems</a>". Besides having its own session in the conference, several papers in other sessions also dealt with context-aware recommendations. I have already mentioned "Sparse Linear Methods with Side Information for Top-N Recommendations", for example. Other interesting papers in this area were "Context-Aware Music Recommendation Based on Latent Topic Sequential Patterns", on the issue of playlist generation, and "<a href="http://www.cl.cam.ac.uk/%7Edq209/publications/trumper12ads.pdf">Ads and the City: Considering Geographic Distance Goes a Long Way</a>" for location-aware recommendations.<br /> <h3> Social</h3> A similar area that has already matured over several Recsys editions is Social. 
It has its own session and workshop, the "<a href="http://ls13-www.cs.uni-dortmund.de/homepage/rsweb2012/index.shtml">Workshop on Recommender Systems and the Social Web</a>", and transcends many other papers. In this area, the paper that I have not mentioned in other categories and found interesting was "<a href="http://www.slideshare.net/daniele.quercia/recsys-14297597">Spotting Trends: The Wisdom of the Few</a>". One of the reasons I found the paper interesting is that it builds on our idea of using a reduced set of experts for recommendations, what we called "<a href="http://technocalifornia.blogspot.com/2009/05/wisdom-of-few.html">The Wisdom of the Few</a>". <br /> <h3> Others</h3> And yes, I still have some interesting stuff from the poster session that I could not fit into any of the above categories.<br /> <br /> First, the short paper "<a href="http://ir.ii.uam.es/%7Ealejandro/2012/recsys.pdf">Using Graph Partitioning Techniques for Neighbour Selection in User-Based Collaborative Filtering</a>" by Alejandro Bellogin. Alejandro won the Best Short Paper Award for a great piece of work and presentation. He described an approach that uses Normalized Cut graph clustering to group similar users and improve neighborhood formation in standard kNN Collaborative Filtering.<br /> <br /> I also liked the poster "<a href="http://cse.iitkgp.ac.in/%7Epabitra/paper/recsys12.pdf">Local Learning of Item Dissimilarity Using Content and Link Structure</a>", another graph-based approach, in this case to learn a similarity function.<br /> <br /> Finally, in "When Recommenders Fail: Predicting Recommender Failure for Algorithm Selection and Combination", Michael Ekstrand starts to tap into an extremely important question: when and why do some recommendation algorithms fail? This question has been informally discussed in the context of hybrid recommenders and ensembles. 
But, there is clearly much more work to do, and many things to understand.<br /> <br /> <br /> ----------------------<br /> <br /> <br /> Well, if you made it all the way here, it means that you are really interested in Recommender Systems. So, chances are that I will be seeing you at next year's Recsys. Hope to see you in Hong Kong!<br /> <br />Xavier Amatriainhttp://www.blogger.com/profile/14166119485952054870noreply@blogger.com1tag:blogger.com,1999:blog-17171206.post-21165056286107699392012-09-05T10:18:00.001-07:002012-09-05T10:28:40.312-07:00Netflix @ Recsys 2012<div class="separator" style="clear: both; text-align: center;"> <a href="http://recsys.acm.org/2012/images/header.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="104" src="http://recsys.acm.org/2012/images/header.png" width="320" /></a></div> <br /> We are just a few days away from the <a href="http://recsys.acm.org/2012/">2012 ACM Recommender Systems Conference</a> (#Recsys2012), which this year will take place in Dublin, Ireland. Over the years, Recsys has become my favorite conference because of its unique blend of academic research and industrial applications. If you are not familiar with the conference, you might get a flavor by reading <a href="http://technocalifornia.blogspot.com/2011/11/recsys-2011-notes-and-pointers.html">my report from last year</a>. Needless to say, it is also dear to my heart because of my involvement as General Cochair in its <a href="http://recsys.acm.org/2010/">2010 edition in Barcelona</a>.<br /> <br /> If you had to mention a single company that is identified with recommender systems and technologies, that would probably be Netflix. The Netflix Prize started a year before the first Recsys conference in Minneapolis, and it impacted Recommender Systems researchers and practitioners in many ways. So, it comes as no surprise that the relation between the conference and Netflix also goes a long way. 
Netflix has been involved in the conference through the years. And, this time in Dublin is not going to be any different. Not only is Netflix a proud sponsor of the conference, but you will have the chance to listen to presentations and meet some of the people that make the wheels of Netflix recommendations turn. Here are some of the highlights of Netflix's participation:<br /> <ul> <li>Both Harald Steck and I are involved in organizing the workshop on "<b><a href="http://ir.ii.uam.es/rue2012/">Recommender Utility Evaluation: Beyond RMSE</a></b>". We believe that finding the right evaluation metrics is one of the key issues for recommender systems. This workshop will be a great event to not only discover the latest research in the area, but also brainstorm and discuss the issue of recsys evaluation. Unfortunately, I will miss the workshop because of my traveling schedule. But, Harald will be representing Netflix on the organization side.</li> <li>At that same workshop, you should not miss the keynote by our Director of Innovation Carlos Gomez-Uribe. The talk is entitled "<a href="http://ir.ii.uam.es/rue2012/keynote.html"><b>Challenges and Limitations in the Offline and Online Evaluation of Recommender Systems: A Netflix Case Study</b></a>". Carlos will give some insights into how we deal with online A/B and offline experimental metrics.</li> <li>On Tuesday, I will be giving a 90-minute tutorial on "<b><a href="http://recsys.acm.org/2012/tutorials.html#building">Building industrial-scale real-world Recommender Systems</a></b>". In this tutorial, I will talk about all those things that matter in a recommender system but are usually outside of the academic focus. I will describe different ways that recommendations can be presented to users, evaluation through A/B testing, data, and software architectures. 
I look forward to seeing you all there.</li> </ul> Besides Harald, Carlos and myself, you should also look forward to meeting other members of the personalization team at Netflix. Rex Lam, Justin Basilico, and Kelvin Jiang will also be attending the conference. We are all looking forward to meeting old and new friends and interacting with the Recsys community during the conference. If you want to make sure we meet, feel free to send me an email (first name initial+last name at netflix.com) or contact me through <a href="https://twitter.com/xamat">Twitter</a>. Xavier Amatriainhttp://www.blogger.com/profile/14166119485952054870noreply@blogger.com0tag:blogger.com,1999:blog-17171206.post-8638386195375956982012-07-02T23:22:00.000-07:002012-07-03T23:58:08.762-07:00More data or better models?The discussion of whether it is better to focus on building better algorithms or getting more data is by no means new. But, it is really catching on lately. This was one of the preferred discussion topics at <a href="http://news.cnet.com/8301-13556_3-57389685-61/data-vs-models-at-the-strata-conference/">this year's Strata Conference</a>, for instance. And, I do have the feeling that, because of the Big Data "hype", the common opinion very much favors those claiming that it is "all about the data". The truth is that data by itself does not necessarily help in making our predictive models better. In the rest of this post I will try to debunk some of the myths surrounding the "more data beats algorithms" fallacy.<br /> <h3> The Unreasonable Effectiveness of a Misquote</h3> Probably one of the most famous quotes defending the power of data is that of Google's Research Director <a href="http://en.wikipedia.org/wiki/Peter_Norvig">Peter Norvig</a> claiming that "We don’t have better algorithms. We just have more data.". 
This quote is usually linked to the article "The Unreasonable Effectiveness of Data", co-authored by Norvig himself (you should probably be able to find the pdf on the web although <a href="http://googleresearch.blogspot.com/2009/03/unreasonable-effectiveness-of-data.html">the original</a> is behind the IEEE paywall). The final nail in the coffin of models comes when Norvig is misquoted as saying that "All models are wrong, and you don't need them anyway" (read <a href="http://norvig.com/fact-check.html">here</a> for the author's own clarifications on how he was misquoted).<br /> <br /> <div class="separator" style="clear: both; text-align: center;"> <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi0XHRIAOpKzBtohbdqZFmayHl2971rwSMzfzLXs0Yvi0_j-tpUD19OwnYM1CR92bM3ZLaFjZ7bkNgONOFqViIE6H0by2GqOzqzWXFh8Bf2lzwi-an4QOL3nV2xiZgYU7ARR7LJ5g/s1600/TheUnreasonableEffectivenessOfData.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="127" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi0XHRIAOpKzBtohbdqZFmayHl2971rwSMzfzLXs0Yvi0_j-tpUD19OwnYM1CR92bM3ZLaFjZ7bkNgONOFqViIE6H0by2GqOzqzWXFh8Bf2lzwi-an4QOL3nV2xiZgYU7ARR7LJ5g/s320/TheUnreasonableEffectivenessOfData.png" width="320" /></a></div> The effect that Norvig et al. were referring to in this article had already been captured years before in the famous paper by Microsoft researchers Banko and Brill [2001], "<a href="http://acl.ldc.upenn.edu/P/P01/P01-1005.pdf">Scaling to Very Very Large Corpora for Natural Language Disambiguation</a>". 
In that paper, the authors included the plot below.<br /> <br /> <div class="separator" style="clear: both; text-align: center;"> <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhntB-7_eZ2cCYkUMX2g5A66El8Y1_6G0d2FGvMSTUwDhbVNNreVduwxAc1mRtyGEgcbOQH8PKcTO2TvbNMZ4c8YE6-KTr3vsyLBJyDZO6QXj39qfUTL_lvG_95oCYpdPM1-_0zXA/s1600/BankoAndBrill.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhntB-7_eZ2cCYkUMX2g5A66El8Y1_6G0d2FGvMSTUwDhbVNNreVduwxAc1mRtyGEgcbOQH8PKcTO2TvbNMZ4c8YE6-KTr3vsyLBJyDZO6QXj39qfUTL_lvG_95oCYpdPM1-_0zXA/s320/BankoAndBrill.png" width="299" /></a></div> That figure shows that, for the given problem, very different algorithms perform virtually the same. However, adding more examples (words) to the training set monotonically increases the accuracy of the model.<br /> <br /> So, case closed, you might think. Well... not so fast. The reality is that both Norvig's assertions and Banko and Brill's paper are right... in a context. But, they are now and again misquoted in contexts that are completely different from the original ones. In order to understand why, though, we need to get slightly technical. I don't plan on giving a full machine learning tutorial in this post. If you don't understand what I explain below, read Andrew Ng's <a href="http://cs229.stanford.edu/materials/ML-advice.pdf">Practical Advice for Machine Learning</a>. Or, better still, enroll in his <a href="https://www.coursera.org/course/ml">Machine Learning course</a>.<br /> <h3> Variance or Bias? </h3> The basic idea is that there are two possible (and almost opposite) reasons a model might not perform well.<br /> <br /> In the first case, we might have a model that is too complicated for the amount of data we have. This situation, known as <i>high variance</i>, leads to model overfitting. 
We know that we are facing a high variance issue when the training error is much lower than the test error. High variance problems can be addressed by reducing the number of features, and... yes, by increasing the number of data points. So, what kind of models were Banko &amp; Brill, and Norvig, dealing with? Yes, you got it right: high variance. In both cases, the authors were working on language models in which roughly every word in the vocabulary constitutes a feature. These are models with many features relative to the number of training examples. Therefore, they are likely to overfit. And, yes, in this case adding more examples will help.<br /> <br /> But, in the opposite case, we might have a model that is too simple to explain the data we have. In that case, known as <i>high bias</i>, adding more data will not help. See below a plot of a real production system at Netflix and its performance as we add more training examples.<br /> <br /> <div class="separator" style="clear: both; text-align: center;"> <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiuhnnBRqBjI_HkZ_fhLTVny9Yklt-hNIxYAxCPllx3f9jKSq7V8eyFazedcSuY2nzrWlyNeOXL8ZvUYIhcpwlZORFyGyiAvcPfCxocoOPjvhsN-EeseYZn5JuWUn3Oz0T2dAATdQ/s1600/PerformanceVsTrainingSize.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="209" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiuhnnBRqBjI_HkZ_fhLTVny9Yklt-hNIxYAxCPllx3f9jKSq7V8eyFazedcSuY2nzrWlyNeOXL8ZvUYIhcpwlZORFyGyiAvcPfCxocoOPjvhsN-EeseYZn5JuWUn3Oz0T2dAATdQ/s320/PerformanceVsTrainingSize.png" width="320" /></a></div> <br /> <br /> So, no, <b>more data does not always help</b>. 
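The high-variance vs. high-bias distinction can also be checked numerically. Below is a minimal sketch (plain NumPy; the cubic target function, noise level, and polynomial degrees are made-up illustrations, not any production model): an overly flexible model shows the telltale train/test gap that shrinks as training data grows, while an overly rigid model keeps a high test error no matter how much data it sees.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_test_error(degree, n_train, n_test=1000):
    """Fit a polynomial of the given degree to noisy samples of a cubic
    target function and return (train MSE, test MSE)."""
    x_tr = rng.uniform(-1, 1, n_train)
    y_tr = x_tr ** 3 + rng.normal(0, 0.1, n_train)
    x_te = rng.uniform(-1, 1, n_test)
    y_te = x_te ** 3 + rng.normal(0, 0.1, n_test)
    coeffs = np.polyfit(x_tr, y_tr, degree)

    def mse(x, y):
        return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

    return mse(x_tr, y_tr), mse(x_te, y_te)

# High variance: a degree-12 polynomial on 20 points overfits (train error
# far below test error), but the gap shrinks once it sees many more examples.
tr_s, te_s = train_test_error(12, 20)
tr_b, te_b = train_test_error(12, 2000)

# High bias: a constant (degree-0) model cannot explain the data, so its
# test error stays high no matter how many examples we add.
_, te_bias_s = train_test_error(0, 20)
_, te_bias_b = train_test_error(0, 2000)
```

Running this should reproduce both regimes from the discussion above: more data rescues the high-variance model, while the high-bias model's test error barely moves.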
As we have just seen, there can be many cases in which adding more examples to our training set will not improve the model's performance.<br /> <h3> More features to the rescue</h3> If you are with me so far, and you have done your homework in understanding high variance and high bias problems, you might be thinking that I have deliberately left something out of the discussion. Yes, high bias models will not benefit from more training examples, but they might very well benefit from more features. So, in the end, it is all about adding "more" data, right? Well, again, it depends.<br /> <br /> Let's take the Netflix Prize, for example. Pretty early on in the game, there was <a href="http://anand.typepad.com/datawocky/2008/03/more-data-usual.html">a blog post</a> by serial entrepreneur and Stanford professor <a href="http://en.wikipedia.org/wiki/Anand_Rajaraman">Anand Rajaraman</a> commenting on the use of extra features to solve the problem. The post explains how a team of students improved their prediction accuracy by adding content features from IMDB. <br /> <br /> <div class="separator" style="clear: both; text-align: center;"> <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjRgLcxY8Ow2qC9UiTQ3hQ4AMXKsUk6E6hhnRznrW2amUuunJrxOJT0F9LnOssQK95sIQy-bu-zlYH8uwZfuecQBHeiyqNjTz-xacDCHbNcdwKJlB6prP8jCH7JOVFIRcTrCtpccg/s1600/Datawocky.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="324" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjRgLcxY8Ow2qC9UiTQ3hQ4AMXKsUk6E6hhnRznrW2amUuunJrxOJT0F9LnOssQK95sIQy-bu-zlYH8uwZfuecQBHeiyqNjTz-xacDCHbNcdwKJlB6prP8jCH7JOVFIRcTrCtpccg/s640/Datawocky.png" width="640" /></a></div> <br /> <br /> In retrospect, it is easy to criticize the post for making a gross over-generalization from a single data point. 
Even more, the <a href="http://anand.typepad.com/datawocky/2008/04/data-versus-alg.html">follow-up post</a> references SVD as one of the "complex" algorithms not worth trying because it limits the ability to scale up to a larger number of features. Clearly, Anand's students did not win the Netflix Prize, and they probably now realize that SVD did have a major role in the winning entry.<br /> <br /> As a matter of fact, many teams later showed that adding content features from IMDB or the like to an optimized algorithm yielded little to no improvement. Some of the members of the <a href="http://www.gravityrd.com/references/netflix-prize?lang=en">Gravity team</a>, one of the top contenders for the Prize, published a detailed paper in which they showed how those content-based features would add no improvement to the highly optimized collaborative filtering matrix factorization approach. The paper was entitled "<a href="http://dl.acm.org/citation.cfm?id=1639731&amp;dl=ACM&amp;coll=DL&amp;CFID=122239967&amp;CFTOKEN=16331362">Recommending New Movies: Even a Few Ratings Are More Valuable Than Metadata</a>".<br /> <br /> <div class="separator" style="clear: both; text-align: center;"> <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgS55Rj-pE7oPyBVrJ46k1GPOOL03BLhguyqJ7sJiL8ZeXzY5s82IZC6OZk44gk_ksgfAyHeuv8RFdXhMfdXu-NeyYYu2ingAzZ1CvpixuhTh-UaoK1WSJMvB_Seajr0SK2oKFiZQ/s1600/AFewRatings.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="425" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgS55Rj-pE7oPyBVrJ46k1GPOOL03BLhguyqJ7sJiL8ZeXzY5s82IZC6OZk44gk_ksgfAyHeuv8RFdXhMfdXu-NeyYYu2ingAzZ1CvpixuhTh-UaoK1WSJMvB_Seajr0SK2oKFiZQ/s640/AFewRatings.png" width="640" /></a></div> <br /> <br /> To be fair, the title of the paper is also an over-generalization. Content-based features (or different features in general) might be able to improve accuracy in many cases. 
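For readers who have not seen it before, the kind of collaborative filtering matrix factorization mentioned above can be sketched in a few lines. This is a toy SGD version on made-up ratings, with arbitrary hyperparameters; it only illustrates the general technique, not the actual Prize-winning implementation:

```python
import numpy as np

def factorize(ratings, n_users, n_items, n_factors=2,
              lr=0.01, reg=0.02, epochs=1000, seed=0):
    """Learn user/item latent factors by stochastic gradient descent on the
    observed (user, item, rating) triples only; the prediction for any
    user-item pair is the dot product of the corresponding factor vectors."""
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.standard_normal((n_users, n_factors))  # user factors
    Q = 0.1 * rng.standard_normal((n_items, n_factors))  # item factors
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - P[u] @ Q[i]                    # residual on this rating
            P[u] += lr * (err * Q[i] - reg * P[u])   # gradient steps with
            Q[i] += lr * (err * P[u] - reg * Q[i])   # L2 regularization
    return P, Q

# Made-up 3x3 rating matrix with one missing cell (user 2, item 0)
ratings = [(0, 0, 5), (0, 1, 1), (0, 2, 2),
           (1, 0, 4), (1, 1, 1), (1, 2, 2),
           (2, 1, 1), (2, 2, 2)]
P, Q = factorize(ratings, n_users=3, n_items=3)
prediction = P[2] @ Q[0]   # filled-in guess for the missing cell
```

Because the third user's observed ratings mirror those of the first two, the learned factors should place them close together, and the missing cell should get a high predicted rating, like the ones the other users gave that item. Note that nothing here uses content metadata: the factors are learned purely from the rating triples.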
But, you get my point again: <b>More data does not always help</b>.<br /> <h3> The End of the Scientific Method?</h3> Of course, whenever there is a heated debate about a possible paradigm change, there are people like Malcolm Gladwell or Chris Anderson who make a living out of heating it even more (don't get me wrong, I am a fan of both, and have read most of their books). In this case, Anderson picked up on some of Norvig's comments and misquoted them in an article entitled: "<a href="http://www.wired.com/science/discoveries/magazine/16-07/pb_theory/">The End of Theory: The Data Deluge Makes the Scientific Method Obsolete</a>".<br /> <br /> <div class="separator" style="clear: both; text-align: center;"> <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh_F2J1QyE-vucDZX6lMM0MK7hyphenhyphenJe8fcRoXg6QJb7XHG3wxfltJwiReKH3TF2_WNomJ4ndANVK9ripM8JgJu64USOCdEbi9rbZDRwPMsfymaZm4e44tRZbuhaOuwZqdc325zYdGPg/s1600/EndOfScientificApproach.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="245" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh_F2J1QyE-vucDZX6lMM0MK7hyphenhyphenJe8fcRoXg6QJb7XHG3wxfltJwiReKH3TF2_WNomJ4ndANVK9ripM8JgJu64USOCdEbi9rbZDRwPMsfymaZm4e44tRZbuhaOuwZqdc325zYdGPg/s320/EndOfScientificApproach.png" width="320" /></a></div> <br /> The article gives several examples of how the abundance of data helps people and companies make decisions without even having to understand the meaning of the data itself. As Norvig himself points out in <a href="http://norvig.com/fact-check.html">his rebuttal</a>, Anderson has a few points right, but goes above and beyond to try to make them. And the result is a set of false statements, starting from the title: the data deluge does not make the scientific method obsolete. I would argue it is rather the other way around.<br /> <h3> Data Without a Sound Approach = Noise </h3> So, am I trying to make the point that the Big Data revolution is only hype? 
No way. Having more data, in terms of both more examples and more features, is a blessing. The availability of data enables more and better insights and applications. More data indeed enables better approaches. More than that, it <b>requires</b> better approaches.<br /> <br /> <div class="separator" style="clear: both; text-align: center;"> <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiR1F33y_uIwwrV6z9LCNFD502j90jmayjA9CGAEVeskmYGH58hVBa2bpxdtPLxRByMyW01Uob0k2z-IUPc-DLhc6C3RQZhKoEtBuBFLI7e_rHwevY-pxu8jAG4zXJ94p6qCcWGGg/s1600/NoToAnderson.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="312" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiR1F33y_uIwwrV6z9LCNFD502j90jmayjA9CGAEVeskmYGH58hVBa2bpxdtPLxRByMyW01Uob0k2z-IUPc-DLhc6C3RQZhKoEtBuBFLI7e_rHwevY-pxu8jAG4zXJ94p6qCcWGGg/s320/NoToAnderson.jpg" width="320" /></a></div> <br /> In summary, we should dismiss simplistic voices that proclaim the uselessness of theory or models, or the triumph of data over them. As much as data is needed, so are good models and the theory that explains them. But, overall, what we need is good approaches that help us understand how to interpret data, models, and the limitations of both in order to produce the best possible output.<br /> <br /> In other words, data is important. But data without a sound approach becomes noise.<br /> <br /> (<b>Note</b>: this post is based on a few slides included in my <a href="http://strataconf.com/strata2012/public/schedule/detail/22364">Strata talk</a> a few months back. Those slides sparked an interesting debate, and follow-up emails that prompted me to write these lines)Xavier Amatriainhttp://www.blogger.com/profile/14166119485952054870noreply@blogger.com3tag:blogger.com,1999:blog-17171206.post-72172787419317541542012-04-16T22:56:00.005-07:002012-04-17T09:33:56.654-07:00Beyond the 5 Stars Round upLast week, I published a post on the Netflix tech blog. 
The post, entitled "<a href="http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html">Netflix Recommendations: Beyond the 5 stars</a>", describes how recommendations have evolved at Netflix since the Netflix Prize. If you haven't read the post, do so now.<br /><br />The post stirred up quite a few interesting reactions and many comments. Here is a list of some of the ones I picked up, grouped by "category":<br /><ul><li>Our post got picked up by the usual content aggregators:<br /></li><ul><li>Hacker News posted a link to the post, and as always attracted <a href="http://news.ycombinator.com/item?id=3810058">a great number of comments</a></li><li>Likewise, <a href="http://www.reddit.com/r/programming/comments/rxp2e/netflix_recommendations_beyond_the_5_stars_part_1/">Reddit</a> also attracted some interesting comments.<br /></li></ul><li>Some online media gave a summary of our post:<br /></li><ul><li>Engadget was the first one to have a brief piece with several interesting comments: <a href="http://www.engadget.com/2012/04/08/netflix-explains-its-recommendation-system-cant-find-a-reason/#disqus_thread">Netflix explains its recommendation system, can't find a reason for Adam Sandler's last movie</a></li><li>CNN also had a longer article, with a detailed summary of our post: <a href="http://whatsnext.blogs.cnn.com/2012/04/09/inside-netflixs-popular-recommendation-algorithm/">Inside Netflix's popular 'recommendation' algorithm</a></li><li>Econsultancy reflected on our post, calling us <a href="http://econsultancy.com/us/blog/9554-netflix-the-algorithm-company">Netflix: the algorithm company</a><br /></li></ul><li>Others focused on the figure we gave in the post related to the percentage of views coming from recommendations.<br /></li><ul><li>The Verge picked up on this for their piece: <a href="http://www.theverge.com/2012/4/8/2934375/netflix-recommendation-system-explained">Netflix offers details on its recommendation engine, says it guides 75 percent of viewership</a></li><li>PC Magazine also focused on this issue, but gave a more detailed summary of the post: <a href="http://www.pcmag.com/article2/0,2817,2402739,00.asp">75 Percent of Netflix Viewing Based on Recommendations</a></li><li>Finally, Business Insider also focused on this issue in their <a href="http://www.businessinsider.com/netflixs-recommendation-engine-drives-75-of-viewership-2012-4">Netflix's Recommendation Engine Drives 75% Of Viewership</a><br /></li></ul><li>The last few reactions I picked up focused on the reasons why we are not using the winning entry to the Netflix Prize.</li><ul><li>The Next Web entitled their piece <a href="http://thenextweb.com/media/2012/04/13/remember-netflixs-1m-algorithm-contest-well-heres-why-it-didnt-use-the-winning-entry/?awesm=tnw.to_1E23j">Remember Netflix’s $1m algorithm contest? Well, here’s why it didn’t use the winning entry</a></li><li>Techdirt focused exclusively on this issue in their <a href="http://www.techdirt.com/blog/innovation/articles/20120409/03412518422/why-netflix-never-implemented-algorithm-that-won-netflix-1-million-challenge.shtml">Why Netflix Never Implemented The Algorithm That Won The Netflix $1 Million Challenge</a></li><li>Both of the previous articles are OK, but a bit misleading in that they seem to imply that not implementing the final solution to the prize might have been a loss for Netflix. But the prize for the most uninformed and ill-intentioned reaction to our post has to go to Forbes' <a href="http://www.forbes.com/sites/ryanholiday/2012/04/16/what-the-failed-1m-netflix-prize-tells-us-about-business-advice/">What The Failed 1m Netflix Prize Tells Us About Business Advice</a>. It should be clear to anyone that the Netflix Prize was a huge success by all metrics. 
And, Netflix had already recovered the $1m long before the competition ended.<br /></li></ul></ul>Xavier Amatriainhttp://www.blogger.com/profile/14166119485952054870noreply@blogger.com0tag:blogger.com,1999:blog-17171206.post-85996251161474527902011-11-02T21:29:00.000-07:002011-11-02T22:31:45.054-07:00Recsys 2011 - Notes and Pointers<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://recsys.acm.org/2011/images/Chicago_night1.jpg"><img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 518px; height: 201px;" src="http://recsys.acm.org/2011/images/Chicago_night1.jpg" alt="" border="0" /></a><br />I found <a href="http://recsys.acm.org/2011/index.shtml">Recsys</a> this year to be of very high quality in general. There were many good papers and presentations. The <a href="http://recsys.acm.org/2011/industry_track.shtml">Industry track</a> was also very high quality, with very interesting talks from companies such as Twitter, Facebook, and eBay. Jon Sanders and I also gave two presentations explaining how recommendations have evolved since the Netflix Prize (more on this soon).<br /><br />Here are my rough notes with pointers to some papers I considered especially interesting. I have grouped them into 5 categories that I think summarize the main topics in the conference: (1) Transparency and explanations, (2) Implicit feedback, (3) Context, (4) Metrics and evaluation, and (5) Others. 
Note that the selection is completely biased towards my personal interests.<br /><br /><span style="font-weight: bold;">(1) TRANSPARENCY &amp; EXPLANATIONS.</span> One of the recurring themes was the fact that user trust and the perceived quality of the recommendations were very much influenced not by accuracy alone, but by how transparent the system was and by the amount of "explanations" added.<br /><ul><li>Daniel Tunkelang (LinkedIn) did a very interesting tutorial on "Recommendations as a Conversation with the User", where he focused on these kinds of issues. See his slides on <a href="http://thenoisychannel.com/2011/10/31/recsys-2011-tutorial-recommendations-as-a-conversation-with-the-user/">his blog</a>.<br /></li><li>Neel Sundaresan (eBay) also stressed in his keynote that adding explanations can sometimes be more important than getting the recommendation right.</li><li>In the paper "<a href="http://www.usabart.nl/portfolio/KnijnenburgReijmerWillemsen-recsys2011.pdf">Each to His Own: How Different Users Call for Different Interaction Methods in Recommender Systems</a>", the authors found that depending on how expert users are in the domain, they prefer different kinds of recommendations and interaction models. For example, at one extreme, novices prefer non-personalized top-10 recommendations to personalized ones. In general, a hybrid model of interaction is better than either implicit-only or explicit-only.</li></ul><span style="font-weight: bold;">(2) IMPLICIT FEEDBACK.</span> A lot of papers this year on using implicit consumption data instead of (or in combination with) ratings.<br /><ul><li>The best paper, by Yehuda Koren and Joe Sill, addressed the issue of non-linearity in ratings. "<a href="http://labs.yahoo.com/node/640">OrdRec: An Ordinal Model for Predicting Personalized Item Rating Distributions</a>" modifies the standard Matrix Factorization approach to adapt to the fact that user ratings are ordinal, but not numerical. 
The way they model ratings, with a set of thresholds, can be used in combination with any model, not only SVD-like approaches. This paper effectively addresses most of the issues I raised in my previous post "<a href="http://technocalifornia.blogspot.com/2011/04/recommender-systems-were-doing-it-all.html">We are doing everything wrong...</a>"</li><li>In "<a href="http://unical.academia.edu/NicolaBarbieri/Papers/803078/Modeling_Item_Selection_and_Relevance_for_Accurate_Recommendations">Modeling Item Selection and Relevance for Accurate Recommendations: A Bayesian Approach</a>" they define the concept of a "free probabilistic model" in which they try to predict the probability of play and the rating independently. </li><li>In "Multi-Value Probabilistic Matrix Factorization for IP-TV Recommendations", the authors present a Matrix Factorization model that allows for multiple observations of the same item. In particular, it is applied to IPTV recommendations, where the fact that the user watched only part of an episode is interpreted as negative feedback.</li><li>"<a href="http://www.cs.purdue.edu/homes/fangy/hetrec11-fang.pdf">Matrix Co-factorization for Recommendation with Rich Side Information and Implicit Feedback</a>" presents a combined Matrix Factorization model that includes ratings, content features, and implicit feedback. 
They use cosine item similarity for weighting negative examples.</li><li>In "<a href="http://www.slideshare.net/alansaid/personalizing-tags-a-folksonomylike-approach-for-recommending-movies/download">Personalizing Tags: A Folksonomy-like Approach for Recommending Movies</a>", they use tags (or categories) as a very simple method of recommending movies: for each user, compute the average rating given to movies with a certain tag.<br /></li></ul><span style="font-weight: bold;">(3) CONTEXT.</span> There were 2 workshops (<a href="http://cars-workshop.org/">CARS</a> and <a href="http://2011.camrachallenge.com/">CAMRA</a>), and several papers in the main conference, on how to add contextual information to recommendations:<br /><ul><li>"The Effect of Context-Aware Recommendations on Customer Purchasing Behavior and Trust" is an interesting paper, focusing on the evaluation side. They include an A/B test for measuring the effect of context-aware recommendations. Using context increased overall sales in $ but not in number. 
Therefore, users tend to spend more $ per item.</li><li>In the <a href="http://2011.camrachallenge.com/">CAMRA</a> workshop, many papers (such as "Temporal Rating Habits: A Valuable Tool for Rater Differentiation" or "Identifying Users From Their Rating Patterns") were related to identifying the author of a rating within a household, since this was one of the tasks for the contest.<br /></li><li>Also related to group recommendations, "Group Recommendation using Feature Space Representing Behavioral Tendency and Power Balance among Members" tries to model what makes a good recommendation for a group in which not all individuals have the same influence.</li></ul><span style="font-weight: bold;">(4) METRICS AND EVALUATION.</span> There were several papers that offered different ways to measure accuracy for top-N ranked recommendations.<br /><ul><li>"<a href="http://www.slideshare.net/pcastells/acm-recsys-2011-rank-and-relevance-in-novelty-and-diversity-metrics-for-recommender-systems">Rank and Relevance in Novelty and Diversity Metrics for Recommender Systems</a>" presents an interesting framework that includes metrics for measuring not only accuracy, but also novelty, diversity....</li><li>"Item Popularity and Recommendation Accuracy" is an interesting work on how to remove popularity bias from accuracy metrics. A user study validates that the recall measure is correlated with the user-perceived quality of recommendations. 
Besides proposing a recall metric that removes popularity bias, the author also proposes a popularity-stratified training method that weights negative examples according to how popular they are.</li><li>"<a href="http://ucersti.ieis.tue.nl/files/papers/3.pdf">Evaluating Rank Accuracy based on Incomplete Pairwise Preferences</a>" proposes a measure called expected discounted rank correlation for the specific case of implicit feedback.</li></ul><span style="font-weight: bold;">(5) OTHERS</span><br /><ul><li>eBay and UCSC presented "<a href="http://users.soe.ucsc.edu/%7Ejwang30/index.files/recsys175-wang.pdf">Utilizing Related Products for Post-Purchase Recommendation in E-commerce</a>". The paper won the best poster award.</li><li>There were many papers on Social Recommendations. Just to name one, in "Power to the People: Exploring Neighbourhood Formations in Social Recommender Systems", they did a user study to figure out how much users would like and trust recommendations coming from different user groups (groups they chose themselves, friends, everyone...). Interestingly, the method of choice did not make much difference... until you told the users what it was.</li><li>In "Wisdom of the Better Few: Cold Start Recommendation via Representative based Rating Elicitation" they discussed how to select the most informative users and items for cold start. I was surprised to see that our "Wisdom of the Few" approach got paraphrased in a paper title.</li><li>There were a couple of very interesting workshops on <a href="http://womrad.org/2011/">Music Recommendations</a> and <a href="http://pema2011.cs.ucl.ac.uk/">Mobile Recommendations</a> that I had to miss since I was attending others. 
But, they are definitely worth looking into if you are into music or mobile.<br /></li></ul>Xavier Amatriainhttp://www.blogger.com/profile/14166119485952054870noreply@blogger.com1tag:blogger.com,1999:blog-17171206.post-26750186381640469972011-09-25T23:41:00.001-07:002011-12-23T11:01:38.929-08:00The Recommender Problem &amp; the Presentation Context<!--[if gte mso 9]><xml> <w:latentstyles 
Accent 3"> <w:lsdexception locked="false" priority="66" semihidden="false" unhidewhenused="false" name="Medium List 2 Accent 3"> <w:lsdexception locked="false" priority="67" semihidden="false" unhidewhenused="false" name="Medium Grid 1 Accent 3"> <w:lsdexception locked="false" priority="68" semihidden="false" unhidewhenused="false" name="Medium Grid 2 Accent 3"> <w:lsdexception locked="false" priority="69" semihidden="false" unhidewhenused="false" name="Medium Grid 3 Accent 3"> <w:lsdexception locked="false" priority="70" semihidden="false" unhidewhenused="false" name="Dark List Accent 3"> <w:lsdexception locked="false" priority="71" semihidden="false" unhidewhenused="false" name="Colorful Shading Accent 3"> <w:lsdexception locked="false" priority="72" semihidden="false" unhidewhenused="false" name="Colorful List Accent 3"> <w:lsdexception locked="false" priority="73" semihidden="false" unhidewhenused="false" name="Colorful Grid Accent 3"> <w:lsdexception locked="false" priority="60" semihidden="false" unhidewhenused="false" name="Light Shading Accent 4"> <w:lsdexception locked="false" priority="61" semihidden="false" unhidewhenused="false" name="Light List Accent 4"> <w:lsdexception locked="false" priority="62" semihidden="false" unhidewhenused="false" name="Light Grid Accent 4"> <w:lsdexception locked="false" priority="63" semihidden="false" unhidewhenused="false" name="Medium Shading 1 Accent 4"> <w:lsdexception locked="false" priority="64" semihidden="false" unhidewhenused="false" name="Medium Shading 2 Accent 4"> <w:lsdexception locked="false" priority="65" semihidden="false" unhidewhenused="false" name="Medium List 1 Accent 4"> <w:lsdexception locked="false" priority="66" semihidden="false" unhidewhenused="false" name="Medium List 2 Accent 4"> <w:lsdexception locked="false" priority="67" semihidden="false" unhidewhenused="false" name="Medium Grid 1 Accent 4"> <w:lsdexception locked="false" priority="68" semihidden="false" unhidewhenused="false" name="Medium 
Grid 2 Accent 4"> <w:lsdexception locked="false" priority="69" semihidden="false" unhidewhenused="false" name="Medium Grid 3 Accent 4"> <w:lsdexception locked="false" priority="70" semihidden="false" unhidewhenused="false" name="Dark List Accent 4"> <w:lsdexception locked="false" priority="71" semihidden="false" unhidewhenused="false" name="Colorful Shading Accent 4"> <w:lsdexception locked="false" priority="72" semihidden="false" unhidewhenused="false" name="Colorful List Accent 4"> <w:lsdexception locked="false" priority="73" semihidden="false" unhidewhenused="false" name="Colorful Grid Accent 4"> <w:lsdexception locked="false" priority="60" semihidden="false" unhidewhenused="false" name="Light Shading Accent 5"> <w:lsdexception locked="false" priority="61" semihidden="false" unhidewhenused="false" name="Light List Accent 5"> <w:lsdexception locked="false" priority="62" semihidden="false" unhidewhenused="false" name="Light Grid Accent 5"> <w:lsdexception locked="false" priority="63" semihidden="false" unhidewhenused="false" name="Medium Shading 1 Accent 5"> <w:lsdexception locked="false" priority="64" semihidden="false" unhidewhenused="false" name="Medium Shading 2 Accent 5"> <w:lsdexception locked="false" priority="65" semihidden="false" unhidewhenused="false" name="Medium List 1 Accent 5"> <w:lsdexception locked="false" priority="66" semihidden="false" unhidewhenused="false" name="Medium List 2 Accent 5"> <w:lsdexception locked="false" priority="67" semihidden="false" unhidewhenused="false" name="Medium Grid 1 Accent 5"> <w:lsdexception locked="false" priority="68" semihidden="false" unhidewhenused="false" name="Medium Grid 2 Accent 5"> <w:lsdexception locked="false" priority="69" semihidden="false" unhidewhenused="false" name="Medium Grid 3 Accent 5"> <w:lsdexception locked="false" priority="70" semihidden="false" unhidewhenused="false" name="Dark List Accent 5"> <w:lsdexception locked="false" priority="71" semihidden="false" unhidewhenused="false" 
name="Colorful Shading Accent 5"> <w:lsdexception locked="false" priority="72" semihidden="false" unhidewhenused="false" name="Colorful List Accent 5"> <w:lsdexception locked="false" priority="73" semihidden="false" unhidewhenused="false" name="Colorful Grid Accent 5"> <w:lsdexception locked="false" priority="60" semihidden="false" unhidewhenused="false" name="Light Shading Accent 6"> <w:lsdexception locked="false" priority="61" semihidden="false" unhidewhenused="false" name="Light List Accent 6"> <w:lsdexception locked="false" priority="62" semihidden="false" unhidewhenused="false" name="Light Grid Accent 6"> <w:lsdexception locked="false" priority="63" semihidden="false" unhidewhenused="false" name="Medium Shading 1 Accent 6"> <w:lsdexception locked="false" priority="64" semihidden="false" unhidewhenused="false" name="Medium Shading 2 Accent 6"> <w:lsdexception locked="false" priority="65" semihidden="false" unhidewhenused="false" name="Medium List 1 Accent 6"> <w:lsdexception locked="false" priority="66" semihidden="false" unhidewhenused="false" name="Medium List 2 Accent 6"> <w:lsdexception locked="false" priority="67" semihidden="false" unhidewhenused="false" name="Medium Grid 1 Accent 6"> <w:lsdexception locked="false" priority="68" semihidden="false" unhidewhenused="false" name="Medium Grid 2 Accent 6"> <w:lsdexception locked="false" priority="69" semihidden="false" unhidewhenused="false" name="Medium Grid 3 Accent 6"> <w:lsdexception locked="false" priority="70" semihidden="false" unhidewhenused="false" name="Dark List Accent 6"> <w:lsdexception locked="false" priority="71" semihidden="false" unhidewhenused="false" name="Colorful Shading Accent 6"> <w:lsdexception locked="false" priority="72" semihidden="false" unhidewhenused="false" name="Colorful List Accent 6"> <w:lsdexception locked="false" priority="73" semihidden="false" unhidewhenused="false" name="Colorful Grid Accent 6"> <w:lsdexception locked="false" priority="19" semihidden="false" 
</w:LatentStyles> </xml><![endif]--> <p class="MsoNormal" style="margin-bottom: 12pt; line-height: normal; font-family: arial;"><span style="font-size: 12pt;">In the traditional formulation of the "Recommender Problem", we have pairs of users and items, and user feedback values for only a very small fraction of those pairs. The problem is formulated as finding a utility function or model that estimates the missing values.<br /><br />In many real-world situations, feedback will be implicit<span style="font-weight: bold;">**</span> and binary in nature. 
For instance, on a web page you will have users visiting a URL or clicking on an ad as positive feedback. In a music service, a user will decide to listen to a song. Or in a movie service, like Netflix, you will have users deciding to watch a title as an indication that they liked the movie. In these cases, the recommendation problem becomes predicting the probability that a user will interact with a given item. There is a big shortcoming in using the standard recommendation formulation in such a setting: we don't have negative feedback. All the data we have is either positive or missing. And the missing data includes both items that the user explicitly chose to ignore because they were not appealing and items that would have been perfect recommendations but were never presented to the user.<br /><br />A similar issue has been dealt with in traditional data mining research, where classifiers sometimes need to be trained using only positive examples. In the "</span><a href="http://www.cse.ucsd.edu/users/elkan/posonly.pdf"><span style="font-size: 12pt; color: blue;">Learning Classifiers from Only Positive and Unlabeled Examples</span></a><span style="font-size: 12pt;">" SIGKDD '08 paper, the authors present a method to convert each unlabeled example into both a positive and a negative example, each with a weight related to the probability that a random exemplar is positive or negative. Another solution is presented in the "</span><a href="http://research.yahoo.com/files/HuKorenVolinsky-ICDM08.pdf"><span style="font-size: 12pt; color: blue;">Collaborative Filtering for Implicit Feedback Datasets</span></a><span style="font-size: 12pt;">" paper by Hu, Koren and Volinsky. In this work, the authors binarize the implicit feedback values: any feedback value greater than zero means positive preference, while any value equal to zero is converted to no preference. 
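That binarization, and the confidence weighting the paper builds on top of it, can be sketched in a few lines. This is only an illustrative sketch: the function names and the example play counts are mine; `alpha` is the confidence-scaling hyperparameter from the paper (40.0 is the value the authors suggest).

```python
# Sketch of the implicit-feedback transformation in Hu, Koren & Volinsky (ICDM '08).
# r is a raw implicit feedback value, e.g. a play count.

def to_preference(r: float) -> int:
    """Binarize: any positive feedback means preference, zero means no preference."""
    return 1 if r > 0 else 0

def to_confidence(r: float, alpha: float = 40.0) -> float:
    """c = 1 + alpha * r: larger raw values mean more confidence that the
    (binary) preference is real, not a stronger degree of liking."""
    return 1.0 + alpha * r

raw = [0, 1, 5]                                # play counts for three items
prefs = [to_preference(r) for r in raw]        # [0, 1, 1]
confs = [to_confidence(r) for r in raw]        # [1.0, 41.0, 201.0]
```

A matrix factorization model then fits the binary preferences, weighting each observation by its confidence.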
A greater implicit feedback value is then used to measure the "confidence" that the user liked the item, not "how much" the user liked it. Yet another approach to inferring positive and negative feedback from implicit data is presented in the paper I co-authored with Dennis Parra, which I presented in a </span><a href="http://technocalifornia.blogspot.com/2011/07/walk-talk-on-combination-of-implicit.html"><span style="font-size: 12pt; color: blue;">previous post</span></a><span style="font-size: 12pt;">. There, we argue that implicit data can be transformed into positive and negative feedback if aggregated at the right level. For example, the fact that somebody listened only once to a single track in an album can be interpreted as the user not liking that album.<br /><br />In many practical situations, though, we have more information than simple binary implicit feedback from the user. For unlabeled examples that the user did not directly interact with, we can expect to have other information. In particular, we might be able to know whether they were shown to the user or not. This adds very valuable information, but slightly complicates the formulation of our recommendation problem. We now have three different kinds of values for items: positive, presented but not chosen, and not presented. And this is only if we simplify the model. 
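This simplified three-way view of the data can be encoded directly. A minimal sketch, with label names of my own choosing:

```python
from enum import Enum

class Feedback(Enum):
    POSITIVE = "chosen"                # the user interacted with the item
    PRESENTED_NOT_CHOSEN = "skipped"   # shown to the user, but not chosen
    NOT_PRESENTED = "unknown"          # never shown: missing, not negative

def label(item: str, chosen: set, presented: set) -> Feedback:
    """Assign one of the three feedback states to an item."""
    if item in chosen:
        return Feedback.POSITIVE
    if item in presented:
        return Feedback.PRESENTED_NOT_CHOSEN
    return Feedback.NOT_PRESENTED

# A row of recommendations where the user was shown a, b, c and picked b;
# item d was never presented at all.
presented, chosen = {"a", "b", "c"}, {"b"}
labels = {i: label(i, chosen, presented) for i in ["a", "b", "c", "d"]}
```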
In reality, information related to the presentation can be much richer than this, and we might be able to derive data such as the probability that the user actually saw the item, or weigh in different interaction events such as mouse-overs and scrolls.</span></p><span style="font-family: arial;"> </span><br /><div style="text-align: center; font-family: arial;"><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhtpmEpkUSciIL0ooFBw3SauQtTg7meeax-_Weq5XbRW03TZQ3IMWXsnSC6Ejj25BkTJKjiq7auGuMjs76WPo-f_e-_FVM_Hr_WK3WrP2yREvXs4xJTVFStP_Z8_BzIPCy8FAtLSw/s1600/NetflixInterface.jpg"><img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 335px; height: 177px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhtpmEpkUSciIL0ooFBw3SauQtTg7meeax-_Weq5XbRW03TZQ3IMWXsnSC6Ejj25BkTJKjiq7auGuMjs76WPo-f_e-_FVM_Hr_WK3WrP2yREvXs4xJTVFStP_Z8_BzIPCy8FAtLSw/s200/NetflixInterface.jpg" alt="" id="BLOGGER_PHOTO_ID_5659880779386987186" border="0" /></a><br /></div><span style="font-family: arial;"> </span><p style="font-family: arial;" class="MsoNormal"><span style="font-size: 12pt; line-height: 115%;">At Netflix, we are working on different ways to add this rich information about presentations and user interaction to the recommender problem. That is why I was especially interested to find out that this year's SIGIR best student paper award went to a paper that addresses exactly this issue. In "</span><a href="http://www.cc.gatech.edu/%7Esyang46/papers/SIGIR11CCF.pdf"><span style="font-size: 12pt; line-height: 115%;">Collaborative Competitive Filtering: Learning Recommender Using Context of User Choice</span></a><span style="font-size: 12pt; line-height: 115%;">", the authors present an extension to traditional Collaborative Filtering that encodes into the model not only the <b>collaboration</b> between similar users and items, but also the <b>competition</b> of items for user attention. They derive the model as an extension of standard latent factor models by taking into account the context in which the user makes the decision. That is, the probability that a user selects a given item depends on which other items are offered as alternatives. Results are preliminary but promising, and this work is definitely an interesting starting point for an area with many practical applications.</span></p><span style="font-family: arial;"> </span><p style="font-family: arial;" class="MsoNormal"><span style="font-size: 12pt; line-height: 115%;">However, there are many possible improvements to the model. One of them, mentioned by the authors, is the need to take into account the so-called <b style="mso-bidi-font-weight:normal">position bias</b>. 
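The "competition" idea above amounts to normalizing an item's latent-factor score against the scores of the other items in the same offer set, i.e. a softmax over the presented alternatives. The sketch below is my own simplification of that idea, not the authors' exact model:

```python
import math

def choice_probabilities(user_vec, offered_item_vecs):
    """P(user chooses item i | the offer set), via a softmax over
    latent-factor dot-product scores of the competing items."""
    scores = [sum(u * v for u, v in zip(user_vec, item))
              for item in offered_item_vecs]
    m = max(scores)                            # stabilize the exponentials
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# A toy user and three competing items in one presented row:
user = [0.5, 1.0]
offer_set = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
probs = choice_probabilities(user, offer_set)  # sums to 1 over the offer set
```

Training such a model only on presented rows naturally treats "presented but not chosen" items as the competing negatives, which is precisely the information the standard formulation throws away.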
An item that is presented in the first position of a list is much more likely to be chosen than one that appears farther down. This effect is well-known in the search community and has been studied from several angles. I would recommend, for instance, reading some of the very interesting papers on this topic by Thorsten Joachims and his students. In the paper “</span><a href="http://www.cs.cornell.edu/People/tj/publications/radlinski_etal_08b.pdf"><span style="font-size: 12pt; line-height: 115%;">How Does Clickthrough Data Reflect Retrieval Quality?</span></a><span style="font-size: 12pt; line-height: 115%;">”, they show how arbitrarily swapping items in a search result list has almost no effect. This shows that the position of the element can be a more important factor than how relevant the item is.<br /><br />I would love to hear of other ideas or approaches to deal with this new version of the recommender problem, and would encourage researchers in the area to address an issue of huge potential impact.</span></p><span style="font-family: arial;"> </span><span style="font-family: arial;"> </span><p style="font-family: arial;" class="MsoNormal"><span style="font-size: 12pt; line-height: 115%;">**<b style="mso-bidi-font-weight:normal">Note</b>: I am using the word implicit here in the traditional sense in the recommendation literature. The truth is that a user selecting an item is in fact <b style="mso-bidi-font-weight:normal">explicit</b> information. However, it can be considered implicit in that the user is expressing preferences indirectly by comparing the item to others in a context.</span></p>Xavier Amatriainhttp://www.blogger.com/profile/14166119485952054870noreply@blogger.com0tag:blogger.com,1999:blog-17171206.post-2118817623426679002011-07-28T21:49:00.000-07:002011-07-31T10:36:02.940-07:00Joining NetflixThree weeks ago, I started to work for Netflix.
Everything has moved so fast, with so many things to do and learn, that it seems like I have already been here for a much longer time!<br /><br />I am now the manager of a small team working on recommendations &amp; personalization at the company that brought recommender systems research to major headlines thanks to the <a href="http://www.netflixprize.com/">Netflix Prize</a>. It also feels great to join the company at an exciting time, when it has just reached its 25 millionth customer and is starting its international expansion to <a href="http://blog.netflix.com/2011/07/netflix-is-coming-to-latin-america.html">Latin America</a>.<br /><br />All the fuss created around the Netflix Prize <a href="http://news.cnet.com/8301-17852_3-20078504-71/mit-prof-netflix-has-its-recommendations-wrong/">might lead some</a> to believe that rating prediction is all there is to Netflix suggesting a given movie. However, I was happy to find out that rating prediction is only one of the many signals that my team uses in creating the final suggestions.<br /><br />Awesome place, awesome people, and an awesome time to be around. And, btw, <a href="http://www.netflix.com/Jobs?id=7563">we are hiring</a>, so let me know if you are interested in joining. (<span style="font-weight: bold;">Update</span>: it seems that the jobs link is currently not active outside US/CA...
I'm working on getting this fixed)<br /><br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfw2rP41DIk31bWB74_p_HUAg2EBzCtJH8JL-yzo3LVuFw2cP0qJHAAINXGcfomgrg_EfyV-32Y5Iv2WZM7Woa9zEGDseHokkEWWGNkI7SIuczom0e-Yiv46cpzZg4XsPEBG_GzQ/s1600/2011-07-20+08.19.53.jpg"><img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 189px; height: 252px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfw2rP41DIk31bWB74_p_HUAg2EBzCtJH8JL-yzo3LVuFw2cP0qJHAAINXGcfomgrg_EfyV-32Y5Iv2WZM7Woa9zEGDseHokkEWWGNkI7SIuczom0e-Yiv46cpzZg4XsPEBG_GzQ/s320/2011-07-20+08.19.53.jpg" alt="" id="BLOGGER_PHOTO_ID_5634633491833460322" border="0" /></a>Xavier Amatriainhttp://www.blogger.com/profile/14166119485952054870noreply@blogger.com4tag:blogger.com,1999:blog-17171206.post-44494934300809553792011-07-18T22:32:00.000-07:002011-07-20T23:45:07.868-07:00Walk the Talk: On the Combination of Implicit and Explicit FeedbackLast week, <a href="http://www.sis.pitt.edu/%7Edparra/">Denis Parra</a> presented our paper entitled "Walk the Talk: Analyzing the Relation between Implicit and Explicit Feedback for Preference Elicitation" at the <a href="http://www.umap2011.org/">UMAP conference</a>. The paper won Denis the best student paper award (Congratulations!).<br /><br />The paper presents our initial work in analyzing the relation between implicit and explicit feedback. In short, the main question we wanted to answer is how the self-reported preferences users give in a typical 5-star interface relate to what they actually do, as reflected in their consumption patterns. Our hypothesis was that there should exist simple models that relate both kinds of feedback. Finding a way to robustly convert implicit feedback into explicit ratings would open up the door to applying well-known methods to implicit feedback.
But, much more importantly, we could then combine both kinds of input in a single model.<br /><br />In order to test our hypothesis, we prepared an experiment in the music domain. We asked last.fm users to take a <a href="http://technocalifornia.blogspot.com/2010/08/study-on-online-music-taste-call-for.html">survey</a> in which we queried them about how much they liked albums that were already in their listening history. With this data in hand, we could analyze the relation between implicit and explicit feedback and try to fit a simple model.<br /><br />I recommend you read the <a href="http://bit.ly/r1mvkK">full paper</a> if you want to get the longer story of our findings, but here is a brief summary:<ul><li>There is a strong correlation between implicit feedback and self-reported preference (see figure below)<br /></li><li>Variables such as recency of interaction or overall popularity do not have a significant effect. Note that in <a href="http://www.princeton.edu/%7Emjs3/salganik_watts08.pdf">a previous study</a> by Salganik &amp; Watts, global popularity was found to affect users' perceived quality. However, in that case and as opposed to ours, users were made aware of the popularity.<br /></li><li>Interaction effect: When listening to music, some people prefer to listen to isolated songs or albums.
The way they interact with music affects the way they report their taste.<br /></li></ul><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiLMyaiZkVaa4kDZLXWtood0p1Vy3SS97keJD-PtWMxs0jCzzjzL4HW6uWavRVcExfrPsVS9S6TJycQK7WPHORmI8YLDVTv-OcSYnLZVn8D5jT0SU5vgUhKITAcXUf8Tj5dTH6BEA/s1600/up-box.png"><img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 514px; height: 357px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiLMyaiZkVaa4kDZLXWtood0p1Vy3SS97keJD-PtWMxs0jCzzjzL4HW6uWavRVcExfrPsVS9S6TJycQK7WPHORmI8YLDVTv-OcSYnLZVn8D5jT0SU5vgUhKITAcXUf8Tj5dTH6BEA/s320/up-box.png" alt="" id="BLOGGER_PHOTO_ID_5631685818591640962" border="0" /></a><br />After this analysis, we construct a linear model that takes these variables into account by performing a linear regression. Once we have built these models, we can evaluate their performance in a regular recommendation scenario by measuring the error in predicting ratings in a hold-out dataset.<br /><br />This paper represents an initial but very promising line of work that we have already improved in several ways, such as using logistic instead of linear regression to account for the non-linearity of the rating scale, or using the regression model as a way to combine both implicit and explicit feedback. But I will leave those findings for a future post.Xavier Amatriainhttp://www.blogger.com/profile/14166119485952054870noreply@blogger.com0tag:blogger.com,1999:blog-17171206.post-79779460101000198462011-04-07T13:59:00.000-07:002011-04-12T06:32:27.856-07:00Recommender Systems: We're doing it (all) wrongA few days back, there was an interesting post by Judy Robertson in the Communications of the ACM blog.
The post, entitled "<a href="http://cacm.acm.org/blogs/blog-cacm/107125-stats-were-doing-it-wrong/fulltext">Stats: We're doing it wrong</a>", builds upon a paper from last year's CHI conference in which they report that more than 90% of the HCI researchers used the wrong statistical tools when analyzing and reporting on Likert-scale data. A Likert scale is a unidimensional scale on which the respondent expresses their level of agreement with a statement - typically on a 1 to 5 scale in which 1 is strongly disagree and 5 is strongly agree.<br /><br />Here is an excerpt from the post that I think is worth highlighting:<br /><blockquote style="font-style: italic; color: rgb(153, 153, 0);">Likert scales give ordinal data. That it (sic), the data is ranked "strongly agree" is usually better than "agree." However, it's not interval data. You can't say the distances between "strongly agree" and "agree" would be the same as "neutral" and "disagree," for example. People tend to think there is a bigger difference between items at the extremes of the scale than in the middle (there is some evidence cited in Kaptein's paper that this is the case). <strong>For ordinal data, one should use non-parametric statistical tests</strong> which do not assume a normal distribution of the data. <strong>Furthermore, because of this it makes no sense to report means of likert </strong><strong>scale data--you should report the mode</strong>.</blockquote>Like Judy, I have to admit that I am not a stats expert either. But in the general case I would agree with the above: Likert-scale data is ordinal and cannot be treated as interval. However, whether treating it as interval is <span style="font-weight: bold;">always</span> a mistake or can be accepted under some circumstances is something that I am not sure about, and it relates to the rest of this post.<br /><br />So for instance, it is not uncommon to find references where they clearly state that Likert data can be treated as interval.
For example, look at what they say in <a href="http://www.fao.org/docrep/W3241E/w3241e04.htm">this handbook</a> edited by the FAO.<blockquote style="font-style: italic; color: rgb(153, 153, 0);">Likert scales are treated as yielding Interval data by the majority of marketing researchers. </blockquote>Or look at <a href="http://stats.stackexchange.com/questions/10/under-what-conditions-should-likert-scales-be-used-as-ordinal-or-interval-data">the answer</a> to the question of whether Likert data can be treated as interval on Stack Exchange.<br /><br />So there might be some circumstances in which, depending on the analysis, Likert data could be treated as interval... I guess. But not in the general case.<br /><br /><span style="font-weight: bold;">Implications for Recommender Systems</span><br /><br />Now onto the big question: What does this have to do with Recommender Systems, and how does it affect them? To start with, let me ask you the question: Does the Likert (1 to 5) scale relate to anything we use in recommender systems? You got it: <span style="font-weight: bold;">ratings</span>!<br /><br />So our concern now turns to understanding whether ratings can be treated as interval data or should instead be treated as ordinal data, just as in the general case of the Likert scale. In order to defend that ratings can be treated as interval, we should have some validation that the distance between different ratings is approximately equal.
However, just as in the case of Likert scales, we know this is not the case.<br /><br />Look at this figure from <a href="http://technocalifornia.blogspot.com/2009/04/i-like-it-i-like-it-not-or-how-miss.html">our previous work</a> on measuring noise in ratings.<br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjPDOYQc6Ck8tCe6tGCBsOb0mjdJOBPRr6K4jhE_P8hpo8mASOTn8oQBkGBPeNQ9eGMWHnTsNs0CMD3ZU5Gh8_YZuJNXdD0GEviqzXoDqTDgi82vpxzAmdtrKu_6_DTNWLVDicjiQ/s1600/fig2b.jpg"><img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 320px; height: 241px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjPDOYQc6Ck8tCe6tGCBsOb0mjdJOBPRr6K4jhE_P8hpo8mASOTn8oQBkGBPeNQ9eGMWHnTsNs0CMD3ZU5Gh8_YZuJNXdD0GEviqzXoDqTDgi82vpxzAmdtrKu_6_DTNWLVDicjiQ/s320/fig2b.jpg" alt="" id="BLOGGER_PHOTO_ID_5592961741347480642" border="0" /></a>Here we are plotting the probability of finding different kinds of inconsistencies between pairs of ratings. The probability that a user changes her rating between 2 and 3 is almost 0.35, while the probability she changes between 4 and 5 goes down to almost 0.1. This is a clear indication that users perceive that the distance between a 2 and a 3 is much lower than between a 4 and a 5.<br /><br /><span style="font-weight: bold;">Consequences</span><br /><br />At this point, we can safely say that ratings are ordinal but not interval data. However, they are treated as a continuous interval scale in most recommender systems research! Let us stop to think about a few of the consequences of ratings not being interval data.<br /><br /><span style="font-weight: bold;">Distance Measures:</span> All the neighbor-based methods in collaborative filtering are based on the use of some sort of distance measure.
The most commonly used are Cosine distance and Pearson Correlation. However, both these distances assume a linear interval scale in their computations! We should conclude that using these distance measures with rating data is wrong. Other measures, such as <a href="http://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient">Spearman's rank correlation</a>, do not assume this. But to be honest, I don't remember having read many papers using Spearman.<br /><br /><span style="font-weight: bold;">Error Measures:</span> This is my favorite one... The most commonly accepted measure of success for recommender systems is the Root Mean Squared Error (RMSE). But wait, this measure is explicitly assuming that ratings are also interval data! Similar error measures such as MAE also fall into the same trap... banned! So what could we use? Standard Information Retrieval measures such as Precision and Recall do not necessarily assume an interval scale on the ratings, although their mapping to recommendation efficiency may also be questioned. Rank-based measures such as <a href="http://en.wikipedia.org/wiki/Discounted_cumulative_gain">Discounted Cumulative Gain</a> (nDCG) seem like our best bet for now.<br /><br /><span style="font-weight: bold;">Matrix factorization</span>: Most MF techniques in Recommender Systems are in fact optimizing for RMSE. Therefore, we should discard them as statistically incorrect for the same reasons stated above. There are interesting alternatives to this though, like the <a href="http://research.yahoo.com/files/recsys2010_submission_150.pdf">PureSVD</a> method presented at RecSys last year, which does not optimize for RMSE but rather for ranking.<br /><br /><span style="font-weight: bold;">Conclusion</span><br /><br />It is clear that explicit ratings, just like Likert-scale data, have to be treated as ordinal (and not interval) data.
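To make the distinction concrete, here is a minimal pure-Python sketch (my own illustration, not from the original post) contrasting Pearson correlation, which assumes an interval scale, with Spearman's rank correlation, which only uses the ordering of the ratings:

```python
from math import sqrt

def pearson(x, y):
    # Pearson correlation: treats the values as interval data,
    # so the spacing between rating levels matters.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(x):
    # Convert values to 1-based ranks, averaging the ranks of ties.
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    i = 0
    while i < len(x):
        j = i
        while j + 1 < len(x) and x[order[j + 1]] == x[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    # Spearman's rank correlation: Pearson on the ranks,
    # so only the ordering of the ratings matters.
    return pearson(ranks(x), ranks(y))

ratings_a = [1, 2, 3, 4, 5]
ratings_b = [1, 2, 3, 4, 10]  # monotonic but non-linear "stretch" of the scale
print(pearson(ratings_a, ratings_b))   # < 1.0: the stretch changes the result
print(spearman(ratings_a, ratings_b))  # 1.0: the ordering is identical
```

A monotonic but non-linear stretch of the rating scale leaves Spearman's correlation at 1.0 while Pearson's drops, which is exactly the robustness you want when the spacing between rating levels cannot be trusted.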
However, most of the methods and measures currently in use in recommender systems assume in some sense that there is a continuous linear scale in the ratings. Of course I am not advocating for throwing all of this research in the trash (among other things, it would include much of mine), but I would advise a drastic change in the way we approach these issues.<br /><br />I am writing this post especially in the hope of getting feedback and reactions from you. So I am looking forward to the comments.<br /><br /><span style="color: rgb(204, 0, 0); font-weight: bold;">Update:</span> This post was featured on Ycombinator's Hacker News. So far it has received over 6K views and there is a somewhat interesting <a href="http://news.ycombinator.com/item?id=2423313">comment thread</a> on Ycombinator.<br /><br />(I'd like to thank and acknowledge the contributions of <a href="http://www.sis.pitt.edu/%7Edparra/">Denis Parra</a>, <a href="http://www.ci.tuwien.ac.at/%7Ealexis/Welcome.html">Alexandros Karatzoglou</a>, and <a href="http://www.ic.unicamp.br/%7Eoliveira/">Rodrigo Oliveira</a> to this post through previous very fruitful discussions)Xavier Amatriainhttp://www.blogger.com/profile/14166119485952054870noreply@blogger.com9tag:blogger.com,1999:blog-17171206.post-35381839373910453952011-03-18T08:25:00.000-07:002011-04-28T15:28:19.445-07:00The Science and the Magic of User FeedbackThat was the main title of a series of talks I gave in different labs and companies during my recent California tour. In this presentation, I talked about many of our recent projects related to how to interpret user feedback, in general, and in the particular case of recommender systems.
I talked about our work on <a href="http://technocalifornia.blogspot.com/2009/04/i-like-it-i-like-it-not-or-how-miss.html">measuring user rating noise</a>, our follow-up in devising <a href="http://technocalifornia.blogspot.com/2009/08/rate-it-again.html">algorithms to reduce this natural noise</a>, and on how you can use<a href="http://technocalifornia.blogspot.com/2009/05/wisdom-of-few.html"> experts instead of crowds</a> to not only minimize this noise but also address other issues in collaborative filtering.<br /><br />I also gave a sneak preview of our results from the<a href="http://technocalifornia.blogspot.com/2010/08/study-on-online-music-taste-call-for.html"> music survey</a> I announced some time ago. <a href="http://www.sis.pitt.edu/%7Edparra/">Denis Parra</a> and I have submitted this work recently and are hoping to get it accepted so we can tell you a bit more about how to map implicit to explicit feedback.<br /><br /><div style="width: 425px;" id="__ss_7256546"><div style="text-align: center;"> <strong style="display: block; margin: 12px 0pt 4px;"><a href="http://www.slideshare.net/xamat/the-science-and-the-magic-of-user-feedback-for-recommender-systems" title="The Science and the Magic of User Feedback for Recommender Systems">The Science and the Magic of User Feedback for Recommender Systems</a></strong> <object id="__sse7256546" height="355" width="425"> <param name="movie" value="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=sienceandmagicinuserfeedback-110314043749-phpapp02&amp;stripped_title=the-science-and-the-magic-of-user-feedback-for-recommender-systems&amp;userName=xamat"> <param name="allowFullScreen" value="true"> <param name="allowScriptAccess" value="always"> <embed name="__sse7256546" src="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=sienceandmagicinuserfeedback-110314043749-phpapp02&amp;stripped_title=the-science-and-the-magic-of-user-feedback-for-recommender-systems&amp;userName=xamat" type="application/x-shockwave-flash"
allowscriptaccess="always" allowfullscreen="true" height="355" width="425"></embed> </object></div><div> </div> </div><br /><span style="font-weight: bold;"><br />Update</span>: Thanks to the guys at <a href="http://sna-projects.com/blog/2011/04/improving-recommendations/">LinkedIn's SNA group</a>, I have now added below the video of my presentation at LinkedIn... enjoy!<br /><br /><div style="text-align: center;"><iframe src="http://player.vimeo.com/video/22353044?title=0&amp;byline=0&amp;portrait=0" frameborder="0" height="225" width="400"></iframe></div><p style="text-align: center;"><a href="http://vimeo.com/22353044">Tech Talk: Xavier Amatriain (Telefonica) -- "The Science and Magic of User and Expert Feedback for Improving Recommendations"</a> from <a href="http://vimeo.com/talksatlinkedin">Talks at LinkedIn</a> on <a href="http://vimeo.com/">Vimeo</a>.</p>Xavier Amatriainhttp://www.blogger.com/profile/14166119485952054870noreply@blogger.com2tag:blogger.com,1999:blog-17171206.post-86152174705502031052011-03-15T03:56:00.000-07:002011-03-15T10:41:00.828-07:00Managing Research the Agile WayI have discussed previously on this blog how well the <a href="http://technocalifornia.blogspot.com/2008/06/agile-research.html">Scientific Method adapts to Agile approaches</a>. These ideas also led me to an unfinished effort to draft an <a href="http://technocalifornia.blogspot.com/2009/06/very-draft-agile-research-manifesto.html">Agile Research Manifesto</a>. However, by talking to several people with similar ideas, I realized that these attempts were largely interpreted as an intellectual exercise with little practical application. It is clearly my fault for not having explained that all of this in reality comes from many practical experiences. Some of these experiences go back to my PhD years when managing the <a href="http://clam-project.org/">CLAM framework</a>, as well as many undergrad student projects.
As a matter of fact, during those days I published a practical guide for students on how to do their final project the "agile way" (I still keep the <a href="http://xavier.amatriain.net/PFC/">webpage</a>, in Catalan, for historical reasons).<br /><br />In any case, in this post I wanted to address the practical side of agile research management by giving you a flavor of how I try to manage projects.<br /><br /><div style="text-align: center;"><a href="http://www.flickr.com/photos/tonymangan/754511201/" title="The Plug-Hole by ~~Tone~~, on Flickr"><img style="width: 411px; height: 276px;" src="http://farm2.static.flickr.com/1077/754511201_3067a868d7.jpg" alt="The Plug-Hole" /></a><br /><span style="font-size:78%;">(Picture by <a href="http://www.flickr.com/photos/tonymangan/">~Tone</a>)</span><br /></div><br /><br /><span style="font-weight: bold;">The anatomy of a research project</span><br /><br />What am I talking about when I say a "research project"? Although they might be completely different in theme and even scope, all of the projects that I have in mind when explaining the agile management approach should share at least some of the following properties:<br /><ul><li><span style="font-weight: bold;">Small-sized team</span>: It is very likely that we are dealing with a one- or two-researcher team. A 3-4 person research team can already be considered large, in my experience.</li><li><span style="font-weight: bold;">Very open and imprecise requirements</span>: Especially at the beginning, we might have a coarse idea or hypothesis to validate. However, the approach, method, and scope are likely to be undecided until very late in the game.</li><li><span style="font-weight: bold;">High risk</span>: By definition, a research project has to be highly innovative and therefore... risky.
Our goal is to minimize the cost of a failure and realize it early on, not to remove failure altogether, since failure is an intrinsic feature of risk.</li><li><span style="font-weight: bold;">Imprecise resources</span>: The fact that requirements are not clear and risk is high is usually accompanied by imprecision in the resources that can be allocated to the project. If the project is highly successful and proves its interest in the first iterations, it can grow into something larger with more resources added to it. On the other hand, it is also very likely to be killed quickly if it does not yield promising initial results.</li></ul><br /><span style="font-weight: bold;">The planning game</span><br /><br />I will usually start off by devoting a couple of weeks to a <span style="font-style: italic;">Sprint 0</span> during which the main tasks will be:<br /><ul><br /><li><span style="font-weight: bold;">Understand what has been done before</span>: Obviously, this requires lots of reading. However, it is good practice to also start writing at this same time, maybe in an informal wiki or the like.</li><br /><li><span style="font-weight: bold;">Define the tools</span>: Unless you are in a very specific environment, tools are likely to change for every project. Sometimes it is not only about what is the best tool, but also about what the team is most familiar with. This is usually an important thing in most projects, but it is more so in a project that is high-risk in nature and should avoid spending lots of time/resources adapting to new tools.</li><br /><li><span style="font-weight: bold;">Define the initial scope</span>: There is no way you can have a complete picture of what is going to be the output of the project by this time. However, you should be able to list what you think will be the main steps and even some findings you anticipate.
This list should be written like an ever-changing Product Backlog (prioritized list of high-level features).</li></ul><br /><div style="text-align: center;"><a href="http://www.flickr.com/photos/babyowls/2329783873/" title="Fifteen accounts of life, death, and everything that interferes. by Jenna Carver, on Flickr"><img style="width: 414px; height: 311px;" src="http://farm4.static.flickr.com/3169/2329783873_3dc3c6a550.jpg" alt="Fifteen accounts of life, death, and everything that interferes." /></a><br /></div><br /><div style="text-align: center;"><span style="font-size:78%;"><span><span>(Picture by <a href="http://www.flickr.com/photos/babyowls/">Jenna Carver</a>)</span></span><br /></span></div><br /><br /><span style="font-weight: bold;">Prioritizing</span><br /><br /><span>One of the most important activities that you end up doing when planning any project, be it at the initial phase or at any of its iterations, is prioritizing the different requirements, stories... Doing this in a group meeting is a great way to gain insights into the project and to be strategic. Prioritizing tasks is not much different from any cost/benefit analysis: you measure cost, you measure benefit, and then sort items according to benefit/cost ratio.<br /><br />In the case of project planning, I usually like to assign cost to "complexity", and benefit to "interest". In other words, the cost of a feature or story will be how difficult or complex we anticipate it will be to implement. And the benefit is how interesting or important it is for our final goal. Once you sort items using the interest/complexity ratio, you will find that easy-to-do yet interesting features float to the top, while complex and not-so-important ones sink to the bottom.<br /><br />Of course, the interesting discussions happen right in the middle. And especially when we have something that seems to be very important, but also very complex to achieve.
In these cases, we feel tempted to jump at the problem right away and devote 100% of our energy to it. However, one of the agile principles is that things seem more complex when you don't have enough understanding. If you put them off to later iterations, they will eventually become clearer and clearer and end up surfacing to the first positions on your priority list. I have found this sort of <span style="font-style: italic;">smart procrastination</span> to be extremely useful for agile research management.<br /><br /><span style="font-weight: bold;">Iterate, Iterate, Iterate</span><br /><br />Once you have come up with your initial product backlog, it is all a matter of breaking the process down into short iterations - I usually plan for one week. At the beginning of each iteration (or <span style="font-style: italic;">Sprint</span>), you look at your product backlog, pick some of the top stories and break them down into finer-grained tasks. You do the prioritization game on this new list and come up with your next week's scrum/iteration backlog.<br /><br />When doing this finer-grain prioritization, I have found it very useful to use the estimated number of hours as the measure of "complexity". Therefore, when picking the top tasks of our list, we will also have an estimate of how feasible it is to complete them during this iteration and how much relative effort will be put into each of them. And, if any task is estimated to be more than a day long, do yourself a favor and break it into several tasks.<br /><br />Also, it is important that, especially during the first iterations, you realize that the continuation of the project might be at stake at each iteration (or at least the current approach).
Therefore, when measuring the "importance" of tasks to prioritize, ask yourself how relevant that task will be in convincing you and others that you are onto something or that you need to change course.</span><span style="font-weight: bold;"><br /><br /><span style="font-weight: bold;">Test-driven research</span><br /><br /></span><span>If you are familiar with agile methods, you will probably know how important testing is in an agile project. Tests not only guarantee the stability of the project but are actually a way to specify requirements in a more verifiable form. In a similar way, you can think of specifying many of your research hypotheses as tests. For example, you can turn the hypothesis that the effect of a given procedure on your data or population is significant into a verifiable assertion: t-test(D_original, D_after_procedure) yields p below 0.05.</span> There are many hypotheses and research tasks that can - and should - be written in this form before making it to your prioritized todo list. At the very least, you should worry about how any of your results will be validated and how you can trust them to be consistent and significant.<br /><br /><div style="text-align: center;"> <a href="http://www.flickr.com/photos/dunechaser/3385957841/" title="*splooch!* Gordon Freeman vs. Master Chief by Dunechaser, on Flickr"><img style="width: 289px; height: 218px;" src="http://farm4.static.flickr.com/3540/3385957841_85bf7fcca6.jpg" alt="*splooch!* Gordon Freeman vs. Master Chief" /></a><br /><span style="font-size:78%;">(Picture by <a href="http://www.flickr.com/photos/dunechaser/">Dunechaser</a>)</span><br /></div><br /><span style="font-weight: bold;">Related approaches</span><br /><br />If you are interested in this kind of approach, I recommend you read the article on the <a href="http://cacm.acm.org/magazines/2010/10/99484-score-agile-research-group-management/fulltext">SCORE method</a>, which is somewhat related to many of the things I am mentioning here.
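As a concrete illustration of this hypothesis-as-test idea, here is a small sketch of my own (not from the original post). It uses a permutation test as a non-parametric stand-in for the t-test so it needs no external statistics library, and the function and variable names are hypothetical:

```python
import random

def permutation_test(sample_a, sample_b, n_iter=10000, seed=42):
    """Two-sided permutation test on the difference of means.

    A non-parametric stand-in for the t-test mentioned in the post:
    it makes no normality assumption about the data.
    """
    rng = random.Random(seed)
    observed = abs(sum(sample_a) / len(sample_a) - sum(sample_b) / len(sample_b))
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)
    hits = 0
    for _ in range(n_iter):
        # Randomly reassign the pooled observations to the two groups
        # and check how often the shuffled difference is as extreme.
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / (len(pooled) - n_a))
        if diff >= observed:
            hits += 1
    return hits / n_iter

# The research hypothesis, written down as an executable test:
def test_procedure_has_significant_effect():
    d_original = [3.1, 2.9, 3.0, 3.2, 2.8, 3.1, 3.0, 2.9]   # hypothetical data
    d_after = [4.0, 4.2, 3.9, 4.1, 4.3, 4.0, 3.8, 4.1]
    assert permutation_test(d_original, d_after) < 0.05
```

A test runner such as pytest can then re-check the hypothesis every time the data or the procedure changes, exactly like a regression test in an agile software project.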
<a href="http://agile2003.agilealliance.org/files/P6Paper.pdf">Here</a> you can read an interesting paper on doing test-driven research. Finally, I find the <a href="http://www.infoq.com/news/2009/09/Pomodoro">Pomodoro</a> method a very interesting approach to individual time management. Since many research projects end up being quasi-individual, Pomodoro fits them pretty well.Xavier Amatriainhttp://www.blogger.com/profile/14166119485952054870noreply@blogger.com2tag:blogger.com,1999:blog-17171206.post-782722704562580712011-01-31T13:26:00.000-08:002011-01-31T14:56:02.312-08:00On Trust Networks and Gamification. Or How Quora can overcome its Hype and embrace long-term SuccessIf you are reading this blog I am pretty sure that you know quite a lot about <a href="http://www.quora.com/">Quora</a> by now. If not, you should sign on and try it a bit before you continue reading the post.<br /><br />I have to admit it, the first time I saw Quora I thought it looked like a watered-down version of <a href="http://stackoverflow.com/">stackoverflow</a>, only with a much broader scope. The ability to follow was nice but... "big deal", I thought. However, I was missing the important point of the seamless integration between Quora and existing OSNs, namely Twitter and Facebook. I always say that for an OSN to succeed it needs to ride on all the previous successful ones (including email if you allow me to stretch the definition of OSN that far), but I missed that part in Quora until its hype began. Having a quick connection to Twitter and Facebook allowed Quora to overcome the always-feared cold-start problem. You sign on to Quora and in no time you are "connected" to all your "friends" and can start following their questions, their answers, votes... cool!<br /><br />Well, so it seemed. But in no time, just as quickly as people had started hyping the service, they were complaining about it and predicting its failure.
This <a href="http://techcrunch.com/2011/01/31/quora-quora-quora-quora-quora-quora-quora/">recent post at Techcrunch</a> does a pretty good job of summarizing and linking to the main Quora bitchmemes. Don't miss the <a href="http://techcrunch.com/2011/01/23/why-i-don%E2%80%99t-buy-the-quora-hype/">original post by Vivek Wadhwa</a> or some of the <a href="http://www.quora.com/What-can-be-said-to-Vivek-Wadhwas-criticism-on-TechCrunch-Why-I-Don%E2%80%99t-Buy-the-Quora-Hype">threads</a> at Quora itself. You should also read <a href="http://scobleizer.com/2011/01/30/why-i-was-wrong-about-quora-as-a-blogging-service/">this very illustrative piece</a> on how Scoble went from love to hate in a matter of weeks.<br /><br />To summarize, the two biggest complaints are the following: (1) Quora will inevitably be overtaken by spam and there will be no way to find good content anymore; and (2) producers of good content (answers) will become tired of the system and progressively leave, making problem (1) even more inevitable.<br /><br />While I do agree that these (and many other) issues are very important, I don't see them as inevitable and, in the following paragraphs, I would like to describe two ways to address them. But to start with, let me just state that believing that Quora can survive as an inside-moderated network is not the answer. So what can be done?<br /><br /><span style="font-weight: bold;font-size:130%;" >Trust networks<br /></span><br />In a trust network, nodes (users) have an associated trust value that is somehow used to decide how their contributions will be taken into account by the rest of the users. For instance, in a recommender system, I can push content from neighboring nodes I trust while filtering out content coming from nodes with a lower trust value. In more sophisticated versions, trust is not a single value but can be topic-specific. That is, my trust value can be very high for independent music but very low for classic literature.
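To make the idea of trust-weighted contributions a bit more concrete, here is a toy Python sketch. All numbers and update rules are made-up illustrative choices, not a proposal for Quora's actual implementation: a vote's weight, and the trust it transfers to an answer's author, both scale with the voter's own trust level.

```python
# Toy trust network: votes count in proportion to the voter's trust, and
# an author's trust drifts up/down with each vote, again weighted by the
# voter's trust. LEARN_RATE and initial trust values are arbitrary.

class User:
    def __init__(self, name, trust=1.0):
        self.name = name
        self.trust = trust  # could be seeded from an external reputation score

class Answer:
    def __init__(self, author):
        self.author = author
        self.score = 0.0

LEARN_RATE = 0.01  # how strongly a vote moves the author's trust

def vote(voter, answer, up=True):
    sign = 1.0 if up else -1.0
    # The vote counts proportionally to the voter's trust: one vote by a
    # "level 100" user is worth 100 votes by "level 1" users.
    answer.score += sign * voter.trust
    # The author's trust is promoted/demoted by users themselves.
    answer.author.trust = max(
        0.0, answer.author.trust + sign * LEARN_RATE * voter.trust
    )

expert = User("expert", trust=100.0)
novices = [User("novice%d" % i) for i in range(100)]

a1 = Answer(User("alice"))
a2 = Answer(User("bob"))
vote(expert, a1)            # one expert upvote...
for n in novices:
    vote(n, a2)             # ...matches a hundred novice upvotes
assert a1.score == a2.score == 100.0
```

A topic-specific variant would simply keep a dictionary of trust values per topic instead of a single scalar.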
(If you are interested in the general topic of Trust and Social Networks you can read <a href="http://www.amazon.com/exec/obidos/ASIN/1848003552/j16t3i5j15-20">Golbeck's book</a> or any of her many publications or presentations available online)<br /><br />So let's go back to Quora now: why should they implement a trust network overlay, and how could they implement a useful one? There are several reasons why they should be doing so, but let us focus on the spam issue. You do not want bad answers to get promoted by bad/evil users. The way around it is to not give these users the power to promote answers. And you can do this quite easily by assigning trust values to users. It would take 100 votes by "level 1" users to get an answer to the level of another one with just one vote by a "level 100" user. Of course, as I was mentioning before, this trust level could be topic-sensitive. Makes sense, doesn't it?<br /><br />But there are a number of issues that are still unsolved in how to implement this trust network. The first one is: who decides to promote and demote users? My answer is quite simple: users themselves. Whenever your answers get voted up/down, so would your trust level. And again, how much this level would go up/down would depend on the trust level of the voting user.<br /><br />The only important remaining issue with such an approach is how to deal with the cold-start issue. But the answer to this would come from the integration with other OSNs I was mentioning at the beginning. If I were implementing this kind of system, I would give users an initial trust level based on their <a href="http://tunkrank.com/">TunkRank</a> or their <a href="http://klout.com/">Klout</a> Score.<br /><br /><span style="font-weight: bold;font-size:130%;" >Gamification<br /></span><br />The other major issue that still needs to be tackled is how to guarantee that users do not become tired of the system and abandon it.
I hope it is clear by now that the approach I described above would make things much more interesting for users interested in promoting their trust level. In fact, this is very close to what is known as <a href="http://gamification.org/wiki/Gamification">gamification</a> (see also game dynamics or game mechanics for very related concepts). Attach a badge to given levels of trust for some topics and you can start competing with Foursquare check-ins.<br /><br />The use of badges, or game dynamics in general, in Q&amp;A sites is by no means new. Actually, stackoverflow, which I was referring to earlier in the post, delivers topical <a href="http://stackoverflow.com/badges">badges</a>. And obtaining the first badge on a given topic can be an important accomplishment worth noting in your resume. But stackoverflow did not come up with this idea out of the blue: levels of expertise in forums have been used for a long time (see the <a href="http://ubuntuforums.org/announcement.php?f=48">Coffee Cups/Beans in Ubuntu forums</a>, for instance).<br /><br />I am not saying that implementing these two approaches would guarantee Quora's success. But not implementing them will probably guarantee the opposite. We have all seen potential in Quora. Apart from the quick integration with existing OSNs and the ability to follow, there is the real-time component that brings it closer to a Q&amp;A Twitter. If they don't fix these potential issues, somebody else will come up with an improved version that can very well be the "next big thing" that some social media gurus were seeing in Quora just a few weeks ago.Xavier Amatriainhttp://www.blogger.com/profile/14166119485952054870noreply@blogger.com2tag:blogger.com,1999:blog-17171206.post-6708587030781721352010-11-16T16:12:00.000-08:002010-11-25T12:44:27.306-08:00Did you prepare your talk?I don't consider myself to be a great presenter.
As a matter of fact, every time I finish a presentation, I find myself thinking about how many things I screwed up and could have done much better. However, whenever I attend a conference I face the cruel reality: my presentations are way better than most research presentations. If I am really not that good, it can only mean one thing: researchers generally suck at presenting their work. (This is in fact one of the reasons I am against organizing research conferences around oral presentations. But that is another discussion I will leave for another post.)<br /><br />So, if you have any doubts about whether you could be in that category of good researcher/poor presenter, you can do a quick test: watch the video below. If you think your last presentation is well summarized in the video, you definitely fit into the group. Even if you don't, you might find some tips or advice of interest to you in the rest of this post.<br /><br /><br /><div style="text-align: center;"><object height="385" width="480"><embed src="http://www.youtube.com/v/yL_-1d9OSdk?fs=1&amp;hl=es_ES" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" height="385" width="480"></embed></object><br /></div><br />OK, so what are the three basic rules to make a decent presentation? Easy: (1) prepare yourself, (2) prepare yourself, and (3) prepare yourself.<br /><br />At this point, you might already be tempted to stop reading because you disagree with what I am saying. I have found several reasons why people disagree with something as obvious as the fact that making a good presentation requires preparation, but I think all of them are summarized in the two following:<br /><br /><span style="font-weight: bold;">(a)</span> <span style="font-weight: bold;">I'm a natural</span>: Maybe you are the kind of self-assured person who thinks they have great presentation skills and that those shine best the more they improvise. I was pretty close to this myself some time ago.
But if you fit into this category, there is a very easy test you can do: tape yourself on video in several presentations. If you still think you are great and need no preparation or further skills, congratulations! But chances are you will realize how many things you have been doing wrong and how much you can improve. All great presenters I know stress the fact that preparation is key, period.<br /><br /><span style="font-weight: bold;">(b)</span> <span style="font-weight: bold;">I'm a researcher, not a TV Star</span>: On the other extreme, you might be aware of your limitations but might think that this is not such a big deal. You are a researcher and live in the world of formulas, theories, or code. You couldn't care less about what people get from your talks and you would be happy standing up and doing the chicken, chicken, chicken presentation. And this is not an exaggeration: I have seen junior researchers who are still editing slides a couple of hours before their scheduled presentation at a top conference. My take on this is the following: if you think presentation skills are not part of what is required of a researcher, you are wrong.<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiOUxGgeIzLQnFRAMgfQEUrhCZ6kt8mlfGC0IWkrg3ha0By6yxhmHRwRHDwCCZyrQitIgnkaGsTupAizP8kb_Q_a-ceF-WNfuOlVQCmHX6ZCE2hkguhTzwC9Dxyn34YVN1AJ34F2g/s1600/ted-presenter.jpg"><img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 445px; height: 297px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiOUxGgeIzLQnFRAMgfQEUrhCZ6kt8mlfGC0IWkrg3ha0By6yxhmHRwRHDwCCZyrQitIgnkaGsTupAizP8kb_Q_a-ceF-WNfuOlVQCmHX6ZCE2hkguhTzwC9Dxyn34YVN1AJ34F2g/s200/ted-presenter.jpg" alt="" id="BLOGGER_PHOTO_ID_5543278758758227938" border="0" /></a><br />So by this point I will suppose that you are convinced of the importance of preparing research presentations.
Ideally, you have also taped yourself and found that there are many things to improve. The question is what to do next. Obviously, I cannot pretend to summarize a presentation skills course in a post. There are thousands of resources out there in the form of books, videos, or similar that you will find without problem. But I do think that I can pinpoint a few issues that are important and tricks that might help.<br /><br />First, I think it is important to separate two kinds of "preparation": (1) mid/long-term preparation aimed at improving your skills, and (2) short-term preparation for your next presentation.<br /><br /><br /><span style="font-weight: bold;font-size:130%;" >Improving your skills</span><br /><br />Again, you can find many books and resources on how to do this. But here are some of the things that you should at least consider:<br /><br />(1) Tape yourself:<br /><br />This will make you aware of the weak points and where you need to focus your efforts.<br /><br />(2) Enjoy the stage:<br /><br />Some people have a really hard time every time they go on stage, and it shows. There are many things you can do to learn techniques and improve on this, ranging from playing in a band to taking some acting and performance lessons (I did this and found it very useful and enjoyable).<br /><br />(3) Read about it:<br /><br />No need to become obsessed. But reading a couple of books or watching some videos with tips is not going to hurt. And remember, this is part of your expected skill set as a researcher. If you want a starting point, I can recommend you read a short 12-page essay, "<a href="http://pne.people.si.umich.edu/PDF/howtotalk.pdf">How to give an academic talk v4.0</a>" by Paul N. Edwards from U. Michigan.<br /><br />(4) Rehearse the techniques:<br /><br />It is very good if you have situations where you can rehearse what you learn from the previous points. Actually, many of the techniques can be applied in "real" life (e.g. when talking to your boss).
Others require a more realistic setting. I have been lucky to use the courses at the university as a rehearsal playground for improving my skills.<br /><br />(5) If you need help, look for it:<br /><br />I have seen many cases of researchers with severe communication problems when presenting. Maybe I sound too harsh here, but I don't think this is acceptable. If you really want to be a researcher but don't think you can get to an acceptable level of presenting, either (a) have a co-author present for you or (b) find some professional help. And the latter would be my preferred option. It is not so hard nowadays to find coaches or places that can help you out, and if you agree that this is an ability you need in your job (and, again, you should agree), it is worth investing in it.<br /><span style="font-weight: bold;font-size:130%;" ><br /></span><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjP6a1LElVnzZMWIrBNHhDTem4d9bohEm8GdSk10aALT1ak3AgWJxlE86xt32jG4VewTqRJw1rOU9tJlgJJaUh81R-0rqe5xagBdlAhdC-R28yi2F-JoKcoqTkHjhy3gWggR1j32g/s1600/chairs.jpg"><img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 440px; height: 281px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjP6a1LElVnzZMWIrBNHhDTem4d9bohEm8GdSk10aALT1ak3AgWJxlE86xt32jG4VewTqRJw1rOU9tJlgJJaUh81R-0rqe5xagBdlAhdC-R28yi2F-JoKcoqTkHjhy3gWggR1j32g/s320/chairs.jpg" alt="" id="BLOGGER_PHOTO_ID_5543279576234072706" border="0" /></a><br /><span style="font-weight: bold;font-size:130%;" ><br />Preparing your next presentation</span><br /><br />Regardless of whether you manage to improve your general presentation skills or not, you will have to face your next presentation sooner or later.
When preparing the talk, you should focus on its two main components: the slides, and the talk itself.<br /><br /><span style="font-weight: bold;">The slides</span><br /><br />Again, there are many resources out there on how to prepare your slides, the style, the design... I was lucky to attend a course on Zen presentation style by the guys at <a href="http://www.presentacionesartesanas.com/">Presentaciones Artesanas</a>. Zen-style presentations are <a href="http://www.presentationzen.com/presentationzen/2009/05/making-presentations-in-the-ted-style.html">the kind of slides</a> you will see at TED, for instance. I try to apply some of those techniques in mine, but (a) I am not a professional presenter (that is, although presentations are important in a researcher's life, I have other things to do), and (b) sometimes, transmitting scientific rigor in a very graphical style is not easy (actually, <a href="http://www.wired.com/wired/archive/11.09/ppt2.html">according to Tufte</a>, even Powerpoint should be banned from scientific publications). However, I do recommend understanding some of the design concepts behind the Zen style and maybe using some of them as a basis.<br /><br />Once you have found your style, you will need to do the following tasks:<br /><br />(1) Know your audience:<br /><br />Before you start preparing the presentation, take some time to understand who you will be talking to. It's not the same to give a talk at a conference as it is to pitch your work to business people, present to a prospective employer, or, like I did last week, try to convince high-schoolers of how cool Computer Science is.<br /><br />Even if you are only focusing on research presentations at conferences, they are not all the same! Sometimes you will be giving a talk in a setting where everybody is an expert in what you are talking about, while on other occasions only a tiny fraction of the audience is working in your field.
In my case, I won't use the same kind of approach if I am presenting at a Recsys conference, where everybody knows about Recommender Systems, as at a generic one like WWW, where I can only assume that most of the audience does not know the topic in depth.<br /><br />It is also important to look at the program schedule. The name of your session and the talks immediately before and after yours are going to give you more information about who might be sitting in. If you are presenting at a conference with multiple tracks, the talks scheduled at the same time as yours will give you some hint about who is *not* going to be attending yours.<br /><br />(2) Find "the message"<br /><br />Find a simple take-away message that you want to get through to your audience. In many cases it will be something along the lines of "look how important and interesting my research is, please go ahead and look more into it by reading the paper... and don't forget to cite it in your next publication". But in order to transmit that idea you need to make your point. Therefore, find the answers to: (a) what problem does your work solve, (b) what makes your work different from other solutions, and (c) why should anybody care about it. These three questions should help you find the message. Stick to one or two ideas and refer the audience to the paper for more details. Trying to squeeze too many messages into too little time is a recipe for disaster.<br /><br />Some researchers like to add another secondary message thread: (d) I am really smart and what I did is so complicated you might not even grasp it... I particularly dislike this kind of presentation and find it pretentious and boring (maybe because I am not so smart). But hey, I know some people have made quite a career of this, so you should be aware.<br /><br />(3) Prepare a script<br /><br />Once you have identified the "main message," you are ready to prepare the script of the slides.
I usually start off by having a bunch of empty slides with only the title on them. By having this, I can see if I might be going over time, need to sort things out differently... The script will depend on the kind of talk and the time you have to speak. But in general, it will have a structure such as:<br /><ol><li>Introduce context and situation</li><li>Formulate the problem and why it is important to solve</li><li>Main message (solution to the problem, consequences, details on the solution...)</li><li>Summary of problem and solution</li><li>Future work and things to do<br /></li></ol>The script is important, but be ready and willing to change it. You are likely not to get it perfect from the start, and as soon as you start adding more detail you will see a clearer picture. Don't let "sticking to the plan" come back to bite you.<br /><br />(4) Make the visuals<br /><br />Maybe you think this is the least important part of your presentation. In my experience, I have come to value the visuals very much. Actually, most of the time I spend preparing some presentations goes into looking for appropriate visuals that back up and reinforce the "main message". The less familiar the audience is with your topic, or the less hardcore-researchy it is, the more time you will want to spend choosing appropriate visuals. Some well-chosen pictures will make your message stickier. And you might find some images that are so powerful that they make you go back to your script and twist it a bit.<span style="font-weight: bold;"><br /><br />The talk</span><br /><br />Once you have the slides more or less ready, you can start preparing the talk itself. Bear this in mind: if you prepare the slides but not the talk, your presentation is likely to suck. Some ideas and tips that can help you in the process:<br /><br />(1) Tape yourself<br /><br />As I mentioned before, taping yourself is one of the best tools I have found for improving your presentation skills.
It is also an amazing tool for preparing your next talk. If you watch a video of yourself rehearsing the talk, you will be able to analyze what you are explaining wrong, where you are wasting your time, what jokes don't make sense... Besides, it is a perfect timing tool: not only will you get the exact duration of your talk from the video but also how you distributed the time. This will allow you to make sure that you are devoting the right amount of time to getting the "main message" through.<br /><br />(2) Test technical issues as many times as possible<br /><br />I don't need to mention <a href="http://en.wikipedia.org/wiki/Murphy%27s_law">Murphy's law</a>, I suppose, but if something can go wrong, it will. No matter how much I check things over and over, there are always technical issues that catch me by surprise.<br /><br />In my last talk, I had a couple of videos that I knew could be problematic. I spent a lot of time making sure they were working in the presentation. I even tried my laptop with a secondary monitor to make sure. I asked the host well in advance to make sure that I had the possibility of connecting the audio output of my laptop. And on the day, I went to the hall and tested the audio. I was even going to test the videos, but the audience was already half in, so I preferred to keep the surprise and tested with some random music instead (big mistake here!). What happened? The videos did not show, so I had to improvise by opening them with another program, and that did not work very well either because of a limitation on the projector's resolution, I think.<br /><br />In my case, I like to put myself in situations of technological risk and I like the feeling of doing a complex live demo in a presentation that I know can fail (I guess it is like the adrenaline rush I used to have when playing in a concert). But I have to admit that the safest advice is to keep technical challenges as simple as possible.
And be anal about checking, many times over, those that you know are likely to fail.<br /><br />Of course, checking technical details means, among other things, that you need to be in the room for your talk well in advance and test the presentation in the same conditions you are going to give it in later (even if that means missing one of the coffee breaks of the conference!).<br /><br />(3) Be ready to improvise<br /><br />And my last piece of advice may seem to contradict the rest. I have been talking about the importance of preparing many details of the talk. However, a presentation should always leave room for improvisation and adaptation. There is nothing worse than the feeling that the speaker has learned the talk by heart and is not making any attempt to connect with the audience and the context. Besides, there might be elements during the talk that force you to improvise: a technical issue, a different audience than you expected, a reaction from somebody...<br /><br />You should be able to incorporate any kind of external element into your presentation while not losing the main message. I don't like having to skip slides, since it gives the impression that you are in a rush to finish, but many times there is no alternative: you might have lost precious time trying to play that video, or maybe gone too long in the introduction and now you need to cut short.
It is again very important that you have a clear picture of what the "main message" is and improvise by skipping those slides that are not needed to understand it.<br /><br /><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEieRZyIMenJSQ9-XvP7KvVBwzsMXDplUpjLZ8SyPrUcM3gejMn5fGwlOEDj1gQb1Z0d0SLgyZiQyw2C9pM8LsFfKPYsaTiG4k9aC5Lmxbi6tMZzRi8lc-n4maehvez7o3OlAFBM8Q/s1600/microphone.jpg"><img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 433px; height: 289px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEieRZyIMenJSQ9-XvP7KvVBwzsMXDplUpjLZ8SyPrUcM3gejMn5fGwlOEDj1gQb1Z0d0SLgyZiQyw2C9pM8LsFfKPYsaTiG4k9aC5Lmxbi6tMZzRi8lc-n4maehvez7o3OlAFBM8Q/s320/microphone.jpg" alt="" id="BLOGGER_PHOTO_ID_5543279313606197858" border="0" /></a><br />I hope that some of this advice is useful in your next presentations. But I would like to hear from you: how do you prepare your talks? Any tips or suggestions you want to share in the comments?Xavier Amatriainhttp://www.blogger.com/profile/14166119485952054870noreply@blogger.com5tag:blogger.com,1999:blog-17171206.post-60334185788075807122010-09-29T02:52:00.000-07:002010-09-29T03:14:50.895-07:00Internship positions on Recommender SystemsAt Telefonica Research we are looking for young and talented researchers willing to expand their horizons by working in an exciting environment in beautiful Barcelona. And, as you will know if you follow this blog, I am particularly interested in working with PhD students whose research focus is Recommender Systems but also neighboring areas such as Data Mining, User Modeling, Social Networks, and Information Retrieval. We offer three month internships and interesting conditions.<br /><br />Work from previous interns has been published in top conferences such as SIGIR, WWW, Recsys, Web Intelligence... 
(see my <a href="http://xavier.amatriain.net/index_publications.html">list</a> of recent publications, most of which include interns)<br /><br />And, if you want references of what an internship in Telefonica is like, you might want to contact some of our previous interns in the group:<br /><ul><li><a href="http://an.kaist.ac.kr/%7Emycha/">Meeyoung Cha</a>, currently Assistant Professor at KAIST</li><li><a href="http://an.kaist.ac.kr/%7Ehaewoon/">Haewoon Kwak</a>, PhD student at KAIST</li><li><a href="http://www.cs.ucl.ac.uk/staff/n.lathia/">Neal Lathia</a>, currently Researcher at UCL<br /></li><li><a href="http://www.sis.pitt.edu/%7Ejahn/homepage/Home.html">Jae-wook Ahn</a>, PhD student at U. Pittsburgh </li><li><a href="https://www.inf.unibz.it/%7Elbaltrunas/research.html">Linas Baltrunas</a>, PhD student at U. Bolzano</li><li><a href="http://www.sis.pitt.edu/%7Edparra/">Denis Parra</a>, PhD student at U. Pittsburgh</li><li><a href="http://www.dtic.upf.edu/%7Emramirez/">Miguel Ramirez</a>, PhD student at U. Pompeu Fabra</li></ul>Please contact me for more details.Xavier Amatriainhttp://www.blogger.com/profile/14166119485952054870noreply@blogger.com2tag:blogger.com,1999:blog-17171206.post-15323635302068982662010-09-23T15:53:00.000-07:002010-09-24T05:02:08.467-07:00Contextual Movie Recommendations on an iPhone based on Expert Collaborative FilteringIf you follow this blog, you probably have already read about Expert Collaborative Filtering and <a href="http://technocalifornia.blogspot.com/2009/05/wisdom-of-few.html">The Wisdom of the Few</a>. Maybe you also read about our recent implementation of the approach to <a href="http://technocalifornia.blogspot.com/2010/07/music-recommendation-through-expert.html">recommend music</a>. 
Well, if you are around at the <a href="http://recsys.acm.org/2010">Recsys 2010 conference</a> next week, you will get to see a demo of yet another prototype in Monday's Demo Session.<br /><br />We are presenting an iPhone application based on the Expert Collaborative Filtering approach. The application is the result of Josep Bach's undergrad final thesis, and you can read the full-blown description of the project in <a href="http://xavier.amatriain.net/pubs/GeolocatedRecommendations.pdf">his dissertation</a>. The application, however, is much more than yet another implementation of Expert CF. The main highlights for me are that (a) you can offer personalized recommendations on a phone with 100% privacy guarantees, and (b) you can run a recommendation algorithm on the device, with minimum intervention from the server side.<br /><br />Both these issues can be explained by the client-server architecture depicted below. The server is in charge of compiling all the public information available on the web by crawling critic websites like Rottentomatoes. It also gathers information about local cinemas and their schedules. All this information, which again is public, is stored in a SQL database and shared with devices through a RESTful API.<br /><br />The device, in this case an iPhone but it could be anything else, connects to the server and syncs a local database through the RESTful API. Once this is done, all the needed information is local to the device. Plus, so is all the personal information about the user (i.e. ratings on movies in this case).
The recommendation algorithm can then run locally and return results in a reasonable time because the set of experts is limited.<br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgiHiKPALc_pBj6HZMeGkhKwrO1B9yZzbk9KUjHwZF0PVz363Hyfg3utILDwnMIgHjkPerbqjPjPemxFTxzbOSccxiefMTMkNY0ePm1vCg70CJgJCMJNn5mewZ4u6bZdc3UMyzlbw/s1600/Architechture.jpg"><img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 585px; height: 364px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgiHiKPALc_pBj6HZMeGkhKwrO1B9yZzbk9KUjHwZF0PVz363Hyfg3utILDwnMIgHjkPerbqjPjPemxFTxzbOSccxiefMTMkNY0ePm1vCg70CJgJCMJNn5mewZ4u6bZdc3UMyzlbw/s320/Architechture.jpg" alt="" id="BLOGGER_PHOTO_ID_5520248902156154770" border="0" /></a><br />Another important addition to the application is that we have added contextual features. The recommendations you will get on the app depend on your location and the time of the day. Therefore, it will recommend things that match your taste according to the expert-based prediction but also are playing in a cinema nearby now.<br /><br />We haven't done a full user evaluation yet, but informal results are very encouraging. We hope you can come and test it in Recsys and give us your feedback. We will soon post a video. 
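For the curious, the core of an expert-based prediction like the one that runs on-device can be sketched in a few lines of Python. This is an illustrative toy with made-up data and a deliberately simple similarity measure, not the app's actual algorithm: a user's score for a movie is predicted as an average of the experts' ratings, weighted by how similar each expert is to the user on co-rated movies.

```python
# Toy expert-CF predictor: the user's ratings never leave the device; only
# the (public) expert ratings are synced from the server.

def similarity(user_ratings, expert_ratings):
    """Inverse mean absolute difference over co-rated movies (0 if none)."""
    common = set(user_ratings) & set(expert_ratings)
    if not common:
        return 0.0
    mad = sum(abs(user_ratings[m] - expert_ratings[m]) for m in common) / len(common)
    return 1.0 / (1.0 + mad)

def predict(user_ratings, experts, movie):
    """Similarity-weighted average of expert ratings for the target movie."""
    num = den = 0.0
    for expert_ratings in experts:
        if movie not in expert_ratings:
            continue
        w = similarity(user_ratings, expert_ratings)
        num += w * expert_ratings[movie]
        den += w
    return num / den if den else None

# Hypothetical expert ratings crawled from critic websites:
experts = [
    {"Inception": 9, "Avatar": 6, "Up": 8},
    {"Inception": 7, "Avatar": 9, "Up": 7},
]
user = {"Inception": 9, "Up": 8}  # the user's private, on-device ratings
print(predict(user, experts, "Avatar"))
```

Because the set of experts is small and fixed, this loop is cheap enough to run entirely on the phone; contextual filters (location, show times) would then simply restrict which movies are candidates.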
But for now, here are some screenshots of the main app screens.<br /><br /><div style="text-align: center;"><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj7qXrZOQQ3BRIi9GPLfkDlwNj73AMLBSszUwdPGxaqeS_EM80VEMFbyPRPSv4uMhd1i11rYxk1seNJA7S6t3wXpaB8drRxRntGf4lilJdYOGrLEMsdgliGGZTRFkbQfAXjZGrKng/s1600/Recommendation.png"><img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 320px; height: 284px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj7qXrZOQQ3BRIi9GPLfkDlwNj73AMLBSszUwdPGxaqeS_EM80VEMFbyPRPSv4uMhd1i11rYxk1seNJA7S6t3wXpaB8drRxRntGf4lilJdYOGrLEMsdgliGGZTRFkbQfAXjZGrKng/s320/Recommendation.png" alt="" id="BLOGGER_PHOTO_ID_5520248717566231618" border="0" /></a><span style="font-style: italic;">List of Recommendations given your preferences, critics ratings but also your location</span><br /><br /></div><div style="text-align: center;"><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhbKaxDq8m8zSGI1fovTvOUGO8Rb1pFeQEiuIUvCIFWvouUxHtEd2r8Xd_3s5LN0PeDjvUOZA6UBu7xaCfzmgOo2X4R-VQ9UoL_hostmvXc_t8FdyqIbrFDJadp5zmfNk9UM5UhlA/s1600/MovieInfo.png"><img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 320px; height: 294px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhbKaxDq8m8zSGI1fovTvOUGO8Rb1pFeQEiuIUvCIFWvouUxHtEd2r8Xd_3s5LN0PeDjvUOZA6UBu7xaCfzmgOo2X4R-VQ9UoL_hostmvXc_t8FdyqIbrFDJadp5zmfNk9UM5UhlA/s320/MovieInfo.png" alt="" id="BLOGGER_PHOTO_ID_5520248578706739874" border="0" /></a><span style="font-style: italic;">Information on a movie, including critics ratings, your ratings, and also closest cinema that is playing with next show times<br /></span></div><br /><div style="text-align: center;"><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" 
href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJ1XXUTfWukQq_P3UEGEFGLxkeUKQtXEu5jE6w3lfGEy_4Gb3YR3SryoTW2EqNh_OJptOgd4T5CWBKkgdzWJHLCAt28QxWnWbJpbrvrg4yCXkJW4KFH_lQnj5W9EYSR58y6v6xYQ/s1600/CinemaScreen.png"><img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 320px; height: 310px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJ1XXUTfWukQq_P3UEGEFGLxkeUKQtXEu5jE6w3lfGEy_4Gb3YR3SryoTW2EqNh_OJptOgd4T5CWBKkgdzWJHLCAt28QxWnWbJpbrvrg4yCXkJW4KFH_lQnj5W9EYSR58y6v6xYQ/s320/CinemaScreen.png" alt="" id="BLOGGER_PHOTO_ID_5520248375148891458" border="0" /></a><span style="font-style: italic;">Screen showing cinemas near your current location (in blue)<br /><br /></span></div><div style="text-align: center;"><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjWWNpyGGjLaC-F_gjFaqpsg3H2qeysND6K-ilbF7wka_LfJDys18nFUcyY875DfjSPFFVpzYpykYmWUYoV0wjOES3Vnpib53L4vd4SapBheMZycrp8BB0tGL4F4Hb5gTnTGbfmLg/s1600/CinemaInfo.png"><img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 320px; height: 300px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjWWNpyGGjLaC-F_gjFaqpsg3H2qeysND6K-ilbF7wka_LfJDys18nFUcyY875DfjSPFFVpzYpykYmWUYoV0wjOES3Vnpib53L4vd4SapBheMZycrp8BB0tGL4F4Hb5gTnTGbfmLg/s320/CinemaInfo.png" alt="" id="BLOGGER_PHOTO_ID_5520248244158237058" border="0" /></a><span style="font-style: italic;">Information on the closest cinema near you<br /></span></div>Xavier Amatriainhttp://www.blogger.com/profile/14166119485952054870noreply@blogger.com1tag:blogger.com,1999:blog-17171206.post-66627499762078819972010-09-15T13:26:00.000-07:002010-09-17T01:32:03.454-07:00The end of the Age of Search?A couple of days ago, there was an interesting <a href="http://www.nytimes.com/2010/09/13/technology/13search.html?_r=1">article in the </a><a 
href="http://www.nytimes.com/2010/09/13/technology/13search.html?_r=1">New York Times</a> on how social networks are changing the search experience. The truth is that the article is a bit confusing and mixes up several different issues. As a matter of fact, most of the article ends up being an introduction to <a href="http://www.hunch.com/">Hunch</a>, a very interesting recommendation site (thus not "search") based on different technologies including social recommendations.<br /><br /><a href="http://hunch.com/media/img/graph-bg.png"><img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 367px; height: 220px;" src="http://hunch.com/media/img/graph-bg.png" alt="" border="0" /></a><br />I tweeted the link to the article and presented it as further proof of the "End of the Age of Search". That got me into a very <a href="http://www.google.com/buzz/xavier.amatriain/TRHavZcgPte/xamat-social-recsys-and-the-end-of-the-Age-of">interesting Buzz conversation</a> with <a href="http://glinden.blogspot.com/">Greg Linden</a> on why I thought the age of search was coming to an end. I promised that I would try to write a more elaborate post to make my point if I had the time... and here I am.<br /><br />The first time I read about the "end of the age of search" was in an article titled <a href="http://money.cnn.com/magazines/fortune/fortune_archive/2006/11/27/8394347/">the race to create a smart Google</a> at CNN Money. 
As a matter of fact, the discussion on how recommender systems were going to render search engines obsolete was cited by Recsys 2009 organizers and turned into <a href="http://recsys.acm.org/2009">their homepage</a> motto.<br /><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjDIGXDAI5ebsSBhno63jbe7SpUzVxRmfnDwABBXw1WBQSuSJDe1pPNToqvPA4go9SORe8bEW76qH9xP1f8O_vylLpzgM2g7cwLbbiMtCmmATxmtf0DuN3rSM7VnGIsC6iF8Ga6Ew/s1600/some-questions-cant-be-answered-by-google.jpg"><img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 320px; height: 257px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjDIGXDAI5ebsSBhno63jbe7SpUzVxRmfnDwABBXw1WBQSuSJDe1pPNToqvPA4go9SORe8bEW76qH9xP1f8O_vylLpzgM2g7cwLbbiMtCmmATxmtf0DuN3rSM7VnGIsC6iF8Ga6Ew/s320/some-questions-cant-be-answered-by-google.jpg" alt="" id="BLOGGER_PHOTO_ID_5517650791197486018" border="0" /></a><br />So, credit given to Jeffrey O'Brien at CNN Money and Recsys09 organizers, I picked up on this idea and have been elaborating on it in several of my presentations. Two years ago, for instance, I gave a presentation on <a href="http://www.slideshare.net/xamat/recommendations-as-the-future-of-search">Recommendations as the Future of Search</a> in an open research day organized by our lab. The main story, which I have repeated several times since then, is the<br />following:<br /><story><br />"<br />Think about it: Search is not an ultimate need for people. What people need is information. The fact that they have been using search, and that it has been so successful, is (mainly) because that is the only tool we gave them.<br /><br />Search by itself is not enough to compensate for the ever-growing information overload. First, there is the issue that for most queries, you will get many more results than a user can ever go through. So you are faced with the problem of how to turn that huge set into the "ten blue links" (i.e. the first results page). 
But there are more or less smart ways to do so by taking into account context, user preferences and so on.<br /><br />The main issue, however, is a different one: every search action requires users to explicitly formulate a query. From our geek perspective, we usually forget how difficult it is for a regular user to formulate a query given an information need. Even if you are fairly proficient, it might be complicated to turn a seemingly trivial information need into something you can formulate in a simple query (take a look at <a href="http://technocalifornia.blogspot.com/2010/07/being-social.html">the experiment</a> I did with some SIGIR attendees when I asked them to search for my daughter's name, which is actually written on <a href="http://xavier.amatriain.net/">my homepage</a>).<br /><br />So, of course, the bottom line of my story has "traditionally" been that Recommender Systems represent a step forward since (a) they provide ways to assess relevance taking into account personal preferences and context, and (b) they can provide results without the need for explicit queries. I still believe Recommender Systems will have much to say in the way search is handled in the future. Of course, I work in the area, so you might think my opinion is a bit biased. But you don't have to take my word for it. In this year's <a href="http://www.eurospider.com/acm-sigir-industry-track-2010.html">industry track at the SIGIR conference</a>, both Yahoo and Google mentioned "implicit search" as one of the most important trends. Now I hear that Google's Schmidt is talking about <a href="http://musically.com/blog/2010/09/08/eric-schmidt-talks-google-music-autonomous-search-and-the-launch-of-google-tv-in-the-us-this-autumn/">autonomous search</a>. 
They are all different ways of talking about <span style="font-weight: bold;">Recommender Systems</span> (which maybe, as my friend @mramirez suggested, is not the sexiest name for a research/technology area).<br /></story>"<br /><br />But we have recently seen some data that can be used as supporting evidence of the end of the Age of Search. Nielsen just published some results that show a <a href="http://blog.nielsen.com/nielsenwire/online_mobile/top-us-search-sites-for-july-2010/">16% drop in web searches</a> over the last year. And this is something pretty symptomatic! I disagree with some of the comments in that same post by Nielsen saying that this drop is due to the use of mobile devices. Unfortunately, I cannot say that this is due to the huge success of recommender systems either. In my opinion, there is one main reason for this: <span style="font-weight: bold;">Search is being replaced by Social Networks</span>.<br /><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjcDWet-RmuyMCMfofFi4CJ0bPpxAtoIxSXonRLltrjIqOjncoRZyNXUFYDtgQ8NqyY8skSdUhgWrmV9iI674qRT2WJkSQLE3nuwztjGNupSS5zK-GSoBZx22DuMsw1YHP2FYz4Lw/s1600/NielsenSearchEngines.png"><img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 473px; height: 200px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjcDWet-RmuyMCMfofFi4CJ0bPpxAtoIxSXonRLltrjIqOjncoRZyNXUFYDtgQ8NqyY8skSdUhgWrmV9iI674qRT2WJkSQLE3nuwztjGNupSS5zK-GSoBZx22DuMsw1YHP2FYz4Lw/s320/NielsenSearchEngines.png" alt="" id="BLOGGER_PHOTO_ID_5517660421944902754" border="0" /></a><br />If you think about it, most information needs users have are not about a particular and concrete piece of information such as "who wrote War and Peace". They are actually much less precise needs such as "what is there to do this weekend" or "is there a cool music album I could listen to while I go to work tomorrow". 
Or even things like "what important stuff has happened in the world today" or "I need to find a job better than the one I have". If you consider information needs like these, you will realize that the answer is much more likely to come out of your social network than out of an artificially formulated query.<br /><br />But not only that, I think we would agree that most of the time, when people go on the Internet, they don't have any information need beyond the prototypical "see what's up" or "catch up". And what do they do? They log into Facebook or check Twitter. It is clear that in these cases, search is out of the picture.<br /><br />Yet another worrying trend for the future of search is the decrease in the use of the web browser and the increase of "walled internet gardens". This idea made headlines last month with <a href="http://www.wired.com/magazine/2010/08/ff_webrip/all/1">Chris Anderson's piece in Wired</a> and has been widely covered, so I won't go into it (but this is the main reason I am trying to avoid the use of the word "Web" in favor of "Internet" in this post).<br /><br />As a finishing note, I don't want anybody to get the wrong impression that, by talking about the end of search as the driver for web development, I am implying that search-oriented companies like Google are doomed. Of course not. Google knows most of what is in this post as well as I do. As I mentioned, they are more and more talking about implicit or autonomous search as a proxy for recommender systems. And in the case of social networks taking over search, I am pretty convinced they would agree with such a vision for the future. That is why they have been trying so hard lately to get into the social scene with Buzz, Wave... 
and more recently the rumors are that they are working on a Facebook killer called <a href="http://www.pcworld.com/article/205471/google_gets_serious_about_social_networking_is_google_me_coming_in_2010.html?tk=hp_new">Google Me</a>, or at least that they are looking into adding more and more <a href="http://mashable.com/2010/09/15/google-social-networking/">social features into their search engine</a>.<br /><br />These are interesting times to be doing research in this area because, as Search supremacy comes to an end, we will have more space to fill the void with newer and much cooler ideas.<br /><br />As always... looking forward to your feedback.Xavier Amatriainhttp://www.blogger.com/profile/14166119485952054870noreply@blogger.com11tag:blogger.com,1999:blog-17171206.post-39188819663875550702010-08-25T01:55:00.000-07:002010-08-25T06:36:30.094-07:00Study on online music taste: call for participationAre you a music listener and <a href="http://www.lastfm.com">lastfm</a> user? Are you interested in helping out research while having the chance to win a $600 Amazon gift card? Please help us understand online music tastes by completing a survey that will only take around 15 minutes of your time and might even be fun!<br /><br />All you need to do to participate is go to <a href="http://musicsurvey.webhop.net/MusicSurvey/">this page</a> and provide your last.fm username and a valid email. We will check if you meet the requirements (at least 18 y.o. 
and 5000 scrobbles on lastfm) and we will then send you a link to your personalized survey.<br /><br />Thanks for your time!Xavier Amatriainhttp://www.blogger.com/profile/14166119485952054870noreply@blogger.com0tag:blogger.com,1999:blog-17171206.post-58873247706050684332010-08-08T14:29:00.001-07:002010-08-09T01:34:16.854-07:00Multiverse Recommendations (aka using n-dimensional tensor factorization for context-aware collaborative filtering)This post is the first of several in which I will be explaining some of the things we are presenting in the upcoming <a href="http://recsys.acm.org/">Recsys 2010</a> conference. The project I will talk about is led by <a href="http://www.ci.tuwien.ac.at/%7Ealexis/Welcome.html">Alexandros Karatzoglou</a> and presents a new approach to context-aware recommendations that we have named Multiverse. You can access the full paper <a href="http://xavier.amatriain.net/pubs/karatzoglu-recsys-2010.pdf">here</a>, but I will give you a brief description in this post.<br /><br />The introduction of context in recommender systems is an area of growing interest. The reason is simple: while we all value the fact that Recommender Systems are able to infer our tastes and recommend new things, it is clear that whatever we like - and are willing to receive - depends on the context. E.g., we do not want to receive the same movie recommendations on TV when we are sitting with the kids on a Sunday afternoon as when we are alone in a late-night session. There is a growing body of literature on contextual recommendations. Without going any further, I already <a href="http://technocalifornia.blogspot.com/2009/09/context-aware-recommendations.html">posted</a> about context-aware recommendations with micro-profiles on this blog. Also, there is a very good chapter on the topic in the upcoming <a href="http://www.springer.com/computer/ai/book/978-0-387-85819-7">Recommender Systems Handbook</a>. 
But, while we wait for it, you might want to look at some of the publications by <a href="http://ids.csom.umn.edu/faculty/gedas/">Adomavicius</a> and <a href="http://pages.stern.nyu.edu/%7Eatuzhili/">Tuzhilin</a>.<br /><br />Context takes the recommender problem from a two-dimensional one, where we have users and items, to an n-dimensional one, where we can add many contextual dimensions. In our work, we have generalized the successful matrix factorization approach to this n-dimensional case. In order to do this, we have used the idea of <a href="http://en.wikipedia.org/wiki/Tensor">tensors</a>, which are precisely a generalization of matrices to n dimensions. The following figure illustrates the idea (note that, for simplicity, we are illustrating the 3-dimensional case with just one contextual variable).<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiT-3jhUe6VJ1pwH85_QeAfo5AafO4vTn-PzdF-rslahTt-As-r2tbdUDkdncmQBWOyD1mf6bc_84cks310JdCfW46ojPgtotiHLXnVJudv6IwfRmTxIC2UGQtq0ZuEeeYXZ71uOQ/s1600/hosvd-tensor.png"><img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 320px; height: 174px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiT-3jhUe6VJ1pwH85_QeAfo5AafO4vTn-PzdF-rslahTt-As-r2tbdUDkdncmQBWOyD1mf6bc_84cks310JdCfW46ojPgtotiHLXnVJudv6IwfRmTxIC2UGQtq0ZuEeeYXZ71uOQ/s320/hosvd-tensor.png" alt="" id="BLOGGER_PHOTO_ID_5503155080308247282" border="0" /></a><br />In the paper, we show how this approach outperforms previously existing methods on a number of different datasets. One of these results is illustrated in the figure below. Note how Tensor Factorization (in green) not only outperforms the other methods, but also performs better the more contextual information we add. It is also interesting to note how not observing context information (black line) results in worse performance. 
When we add contextual information to 80% of our data, not using this information yields a result that is almost 50% worse.<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi1UrKAuDJ60S1PukzP8KinOChinCkeyQtQiPKTUJb2oG6f9voavbIPYudyH-ExndlQArVTF1cPozVQ1X3mkPUoa3fEesKlTDIqdMGDN7Ad79gteGcCyubYd4t7XAJ9oLOuH_9d9A/s1600/mae-vs-change-prob.png"><img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 449px; height: 336px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi1UrKAuDJ60S1PukzP8KinOChinCkeyQtQiPKTUJb2oG6f9voavbIPYudyH-ExndlQArVTF1cPozVQ1X3mkPUoa3fEesKlTDIqdMGDN7Ad79gteGcCyubYd4t7XAJ9oLOuH_9d9A/s320/mae-vs-change-prob.png" alt="" id="BLOGGER_PHOTO_ID_5503154840723258786" border="0" /></a>The use of context in recommender systems and other areas of information retrieval is a very interesting topic that is likely to get even more attention in the near future. We will surely contribute to this.Xavier Amatriainhttp://www.blogger.com/profile/14166119485952054870noreply@blogger.com1
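To make the n-dimensional generalization concrete, here is a minimal, hypothetical sketch in Python of the underlying idea: a 3-way factorization over (user, item, context) trained with stochastic gradient descent on observed ratings. Note the caveats: this is a simple CP-style (parallel factors) model, not the actual Multiverse model from the paper, which uses a richer Tucker/HOSVD-style decomposition with a core tensor; all function names, toy data, and hyperparameters here are made up for illustration.

```python
import numpy as np

def tensor_factorization(ratings, n_users, n_items, n_ctx, k=8,
                         lr=0.02, reg=0.001, epochs=500, seed=0):
    """Fit one latent factor matrix per dimension via SGD.

    ratings: list of (user, item, context, rating) tuples.
    A rating is predicted as sum_f U[u,f] * M[i,f] * C[c,f],
    i.e. the 3-way generalization of the matrix-factorization dot product.
    """
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(n_users, k))
    M = rng.normal(scale=0.1, size=(n_items, k))
    C = rng.normal(scale=0.1, size=(n_ctx, k))
    for _ in range(epochs):
        for u, i, c, r in ratings:
            # Snapshot current factors so all three updates use the same state.
            uu, mi, cc = U[u].copy(), M[i].copy(), C[c].copy()
            err = r - np.sum(uu * mi * cc)
            # Gradient step on squared error with L2 regularization.
            U[u] += lr * (err * mi * cc - reg * uu)
            M[i] += lr * (err * uu * cc - reg * mi)
            C[c] += lr * (err * uu * mi - reg * cc)
    return U, M, C

def predict(U, M, C, u, i, c):
    return float(np.sum(U[u] * M[i] * C[c]))

# Toy example: the same user-item pair rated differently in two contexts
# (say, "with kids" vs. "alone"), which a 2D model could not distinguish.
ratings = [(0, 0, 0, 5.0), (0, 0, 1, 2.0), (1, 1, 0, 4.0)]
U, M, C = tensor_factorization(ratings, n_users=2, n_items=2, n_ctx=2, k=4)
```

Because the context dimension gets its own factor matrix, the same user-item pair can receive different predictions in different contexts, which is exactly the motivation described above.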