Because of the sensitivity of the data encoded in social networks such as Facebook and MySpace, there has been a lot of interest recently in techniques to anonymize such data. There are many potential applications for studying such social networks, such as understanding how information and disease spread, as well as capturing how people tend to interact with each other. But when a site has collected network information, it is often not desirable to release the data unmodified: the goal of the release is to allow inference about the group as a whole, not about any one person, yet raw data can allow inference about individuals. In medical and other private databases, there has been a lot of study of how to anonymize private data so that it can be published, but network data brings additional challenges.
More formally, model the network as a simple graph G. Even publishing solely the nodes and edges of G, stripped of all identifying information, may still reveal information, since a node of sufficiently high degree may be uniquely identifying. In some recent work, Backstrom, Dwork, and Kleinberg consider both active and passive attacks on such data, and show that with the collusion of a small group of people represented in the graph, additional information can be extracted. The results follow by considering a variation of the random graph model, and arguing that certain structures, small but not too small, are likely to be unique within the graph. Other work in this model has studied further attacks on anonymized data, from a machine learning perspective or via a more graph-theoretic approach. The conclusion of the work so far seems to be that many natural anonymization schemes do not guarantee privacy against reasonably strong attack models, though neither do they guarantee exposure. Even various perturbations of the input data seem insufficient to limit the efficacy of an attack while maintaining the utility of the published data.
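To make the degree-uniqueness point concrete, here is a minimal sketch of a passive attack (the graph, the names, and the relabeling are invented for illustration): if an attacker knows a target's number of friends, and that degree happens to be unique in the published graph, the "anonymized" node is immediately pinned down.

```python
# Sketch: re-identifying nodes in a "de-identified" graph purely from
# degree. Labels are removed before publication, but any node whose
# degree is unique in the graph can still be picked out.
from collections import Counter

# A small hypothetical friendship graph (names invented for illustration).
edges = [
    ("alice", "bob"), ("alice", "carol"), ("alice", "dave"),
    ("alice", "erin"), ("bob", "carol"), ("dave", "erin"),
]

def degrees(edge_list):
    """Map each node to its degree in the given edge list."""
    d = Counter()
    for u, v in edge_list:
        d[u] += 1
        d[v] += 1
    return d

# The publisher releases the same graph with labels replaced by numbers.
relabel = {"alice": 0, "bob": 1, "carol": 2, "dave": 3, "erin": 4}
anon_edges = [(relabel[u], relabel[v]) for u, v in edges]

deg = degrees(edges)            # the attacker's background knowledge
anon_deg = degrees(anon_edges)  # observed in the published graph

# Any degree value held by exactly one node is uniquely identifying.
unique_degrees = {d for d, count in Counter(deg.values()).items() if count == 1}
reidentified = {}
for name, d in deg.items():
    if d in unique_degrees:
        (match,) = [n for n, dd in anon_deg.items() if dd == d]
        reidentified[name] = match

print(reidentified)  # alice, with her unique degree of 4, maps to node 0
```

Here only "alice" has a unique degree, so only she is exposed; the real attacks referenced above are stronger, exploiting small subgraph structures rather than single degrees, but the mechanism is the same.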
This leaves the question: if the strongest privacy guarantees are not possible, is it sufficient to publish "best-effort" anonymized data and hope that no serious damage is done? As we discussed in a prior post, within a few days of the release of the famed Netflix data set, weaknesses in its anonymization were claimed, but this does not seem to have led to any major revelations or upset as a consequence.