Netflix has released a sample of data publicly and has offered a significant prize to predict peoples' rating for movies better than their baseline solution. The data comprises customers and their ratings of movies and the times the ratings were posted. Customers are anonymized, by removal of immediately identifying information. Some noise is introduced by some perturbations of the data, and a claim that only a small amount of information from any one customer is included.
In a recent study, the authors experimentally evaluate the following: for each customer c and k of their movie rating and times given, determine the number N(c,k) of other customers who posted "nearly" same ratings at "nearly" same times for these movies? The argument is that for small k, if N(c,k) were small, then customer c can be disambiguated with some effort or knowledge.
The paper appears to be in a preliminary form. From a theoretical point of view, there are nice problems here. Can we propose a probabilistic model for customer ratings and derive an estimate for N(c,k)?
Sunday, March 4, 2007
What blogs reveal about one is a deeply technical or deeply philosophical question. They have attracted a lot of interest academically. In computer science, people have abstracted problems based on text attributes, analogies with epidemiology on the spread of "memes" through blogs, studied graph properties and so on. In doing such study, one becomes quite familiar with the literature from people in CS and related fields, but also start to learn how they are approached by other fields such as sociology and psychology. An example is a study on gender, identity and language in teenage blogs, done a couple of years ago. A long (exhaustive?) list of blog related research from all discplines appears on the homepage of a blogging researcher / researching blogger.
Stephen Fienberg gave a nice talk with a statistical perspective on data privacy at the kickoff meeting of DyDAn. He introduced the need for privacy awareness in data handling, and some of the hazards if it is ignored (slides are here). He spoke in detail about giving bounds on cells of a contingency table given certain marginals; this was based on his PNAS paper. He also publicized a upcoming WWW 2007 paper by Backstrom, Dwork and Kleinberg on theoretical analysis of privacy in social networks or lack thereof. Both these papers raise a lot of fascinating questions, and lead to a lot of scope for further study, in statistics and theoretical computer science.