Monday, August 27, 2007

Anonymizing Social Network Data

Because of the sensitivity of the data encoded in social networks such as Facebook and MySpace, there has been a lot of interest recently in techniques to anonymize such data. There are many potential applications for studying such social networks, such as understanding how information and disease can spread, as well capturing how people tend to interact with each other. But when information about networks has been collected by a site, it is often not desirable to release the data unmodified, since this could allow inference about individuals, while the goal of the information release is to allow inference about the group as a whole but not any one person. In medical and other private databases, there has been a lot of study of how to anonymize private data in order to allow publishing of anonymized data, but network data brings additional challenges.

More formally, model the network as a simple graph G. Even publishing solely the nodes and edges of G, without any identifying information, may still reveal information, since a node of sufficiently high degree may be uniquely identifying. In some recent work, Dwork et al consider both active and passive attacks on such data, and show that with the collusion of a small group of people represented in the graph, some additional information can be extracted. The results follow by considering a variation of the random graph model, and arguing that some small but large enough structures are likely to be unique within the graph. Other work in this model has studied other attacks on anonymized data, using machine learning perspectives, or a more graph theoretic approach. The conclusions of the work so far seems to be that many natural anonymization schemes do not guarantee privacy against some reasonably strong attack models, but nor do they guarantee exposure. Even various perturbations of the input data seem to be insufficient to limit the efficacy of an attack while maintaining utility of the published data.

This leaves the question, if the strongest privacy guarantees are not possible, is it sufficient to publish "best-effort" anonymized data, and hope that no serious damage is done? As we discussed in a prior post, within a few days of being released, it was claimed that there were weaknesses in the anonymization of the famed Netflix data set, but this does not seem to have led to any major revelations or upset as a consequence.

A Warning on Privacy: How Private Is Your Social Networking Site?

[Posting by Fred Roberts]

If you think that what you put on MySpace, Facebook, and other social networking sites is private, think again. According to an August 2007 Star Ledger article, social networking sites cooperate with police, even if an individual’s site is “private”. The police are going to such sites to get clues to solve crimes, curb underage drinking, seek clues to vandalism, and search out gang membership. Intelligence agencies are making use of these sites in their efforts to monitor potential terrorists before they can implement their plans. According to Time Magazine (August 20, 2007), 12% of employers admit to using these sites to investigate potential employees, and we are told of cases where students have lost potential jobs because of what employers found on their social network sites. Before posting information to your social network site, think about whether you really want the police, prospective employers, or your parents to read it – it is more public than you may think!

Saturday, June 30, 2007

Visualizing social interactions

Everyday we are part of multiple communication graphs (emails, tel calls, blogs, IP traces) and play different roles there. What is the structure in them, and what are suitable signatures we display? A recent study of usenet groups provides some answers.

Thursday, May 17, 2007

Virtual Network Analysis

As a part of the DyDAn seminar series, Victor Asal and Karl Rethemeyer gave a talk on Empirical and Virtual Terrorist Networks. The talk was about determining and analyzing factors indicating how dangerous a terrorist group is. Using statistical analysis, the speakers showed that factors such as connectivity of the organizations and their ideology have strong correlations with their lethality. They supported their claims using both historic terrorist data from MIPT Terrorism Knowledge Base and that extracted from websites of terrorist organizations. The speakers emphasized on the potential of monitoring web-based interactions (possibly hidden or disguised) between terrorists and terrorist groups for predicting and preventing future attacks.

The speakers had an impressive visualization of the interactions between 200 terrorist organizations in the form of a graph. This graph was packed with information like lethality and of the organizations, religious-orientation, known and predicted alliances between the groups. Currently, this information is manually created by painstaking data compilation. It remains a challenge to automate this process, or provide some automated assistance to more easily compile the information.

Saturday, May 5, 2007

Nature of Blogs

There are many different blog publish/ponder services out there (myspace, blogger, xanga, livejournal, facebook, friendster, etc); they seem to attract and foster different types of communities. Studies of the bloggers in these different services reveal inherent properties: people tend to have different profiles in age and geographic location, connections within one blog service are more extensive than across services, and there is a lot of information that relates to services other than blogs. Details at:

Thursday, May 3, 2007

Analyzing Semantic Graphs

A lot of data analysis work has studied various aspects of graph structured data (comprising entitites, or nodes, with some pairs linked together by edges). But a simple graph is often insufficient to capture the complex relations seen in practice: although connectivity within a computer network can be represented by a graph, there are many different kinds of nodes (computers, routers, switches) and different kinds of links between them (DSL lines, fiber links etc.). Consequently, the idea of "semantic graphs"-- graphs which include this richer information on types of entities and different links between them--are of increasing interest. Moreover, in practice these graphs are not static, by changing over time as new connections are formed or discovered, leading to dynamic semantic graphs.

Back in March, Tina Eliassi-Rad visited DyDAn and talked about some of her recent work on Dynamic Semantic Graph analysis. Many natural questions she introduced lead to concrete research problems:
  • How to visualize a semantic graph, and adequately convey the different relationships therein?
  • How to find paths of connections in semantic graphs, and determine if these are significant?
  • How to efficiently find instances of a particular subgraph within a large Semantic Graph (with wildcard matching, etc.)?
  • How to find patterns or anomalies within the data, indicating unexpected relations or features to investigate in more detail?
She talked in detail about the problem of learning functions within the graph. The key insights are to be able to combine ideas from both machine learning and graph theory, rather than try to reduce to one alone. The topological features of a graph are important features to provide for a learning algorithm. Fully incorporating the time-dynamics remains a major challenge.

A recent workshop at DyDAn also studied the problems of semantic graphs from a variety of perspectives -- the slides are available from the workshop website.

Saturday, April 28, 2007

Privacy preserving data analysis

Theoretical computer science formulated the basic notions of how to analyze data without revealing any more information than is necessary for the analysis (more or less). The bottomline is a lot of computations can be done in that manner, although with some computational overhead. A nice set of algorithmic tools have also been developed for specific problems. A collection of papers can be found here. These techniques are now being adopted to other communities, eg.,
  • databases, see eg., here.
  • networking, see eg., here by Matt Roughan and Yin Zhang.
There is a real need to formulate and address problems with social network analysis using privacy preserving methods.

Saturday, March 10, 2007

Netflix Prize and Privacy

Netflix has released a sample of data publicly and has offered a significant prize to predict peoples' rating for movies better than their baseline solution. The data comprises customers and their ratings of movies and the times the ratings were posted. Customers are anonymized, by removal of immediately identifying information.  Some noise is introduced by some perturbations of the data, and a claim that only a small amount of information from any one customer is included.

In a recent study, the authors experimentally evaluate the following: for each customer c and k of their movie rating and times given, determine the number N(c,k) of other customers who posted "nearly" same ratings at "nearly" same times for these movies? The argument is that for small k, if N(c,k) were small, then customer c can be disambiguated with some effort or knowledge.

The paper appears to be in a preliminary form. From a theoretical point of view, there are nice problems here. Can we propose a probabilistic model for customer ratings and derive an estimate for N(c,k)?

Sunday, March 4, 2007

What blogs reveal?

What blogs reveal about one is a deeply technical or deeply philosophical question. They have attracted a lot of interest academically. In computer science, people have abstracted problems based on text attributes, analogies with epidemiology on the spread of "memes" through blogs, studied graph properties and so on. In doing such study, one becomes quite familiar with the literature from people in CS and related fields, but also start to learn how they are approached by other fields such as sociology and psychology. An example is a study on gender, identity and language in teenage blogs, done a couple of years ago. A long (exhaustive?) list of blog related research from all discplines appears on the homepage of a blogging researcher / researching blogger.

Privacy and Confidentiality

Stephen Fienberg gave a nice talk with a statistical perspective on data privacy at the kickoff meeting of DyDAn. He introduced the need for privacy awareness in data handling, and some of the hazards if it is ignored (slides are here). He spoke in detail about giving bounds on cells of a contingency table given certain marginals; this was based on his PNAS paper. He also publicized a upcoming WWW 2007 paper by Backstrom, Dwork and Kleinberg on theoretical analysis of privacy in social networks or lack thereof. Both these papers raise a lot of fascinating questions, and lead to a lot of scope for further study, in statistics and theoretical computer science.