Saturday, March 10, 2007

Netflix Prize and Privacy

Netflix has released a sample of data publicly and has offered a significant prize to predict peoples' rating for movies better than their baseline solution. The data comprises customers and their ratings of movies and the times the ratings were posted. Customers are anonymized, by removal of immediately identifying information.  Some noise is introduced by some perturbations of the data, and a claim that only a small amount of information from any one customer is included.

In a recent study, the authors experimentally evaluate the following: for each customer c and k of their movie rating and times given, determine the number N(c,k) of other customers who posted "nearly" same ratings at "nearly" same times for these movies? The argument is that for small k, if N(c,k) were small, then customer c can be disambiguated with some effort or knowledge.

The paper appears to be in a preliminary form. From a theoretical point of view, there are nice problems here. Can we propose a probabilistic model for customer ratings and derive an estimate for N(c,k)?