Yesterday, security researcher Ron Bowes published a 2.8GB database of information collected from public Facebook pages. These pages list all users whose privacy settings enable a public search listing for their profile. Bowes wrote a program to scan through the listings and save the first name, last name, and profile URI of each user (though only if their last name began with a Latin character). The database includes this data for about 171 million profiles.
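To appreciate how little engineering such a scrape requires, here’s a rough Python sketch of the general approach. To be clear, this is my own illustration, not Bowes’ code: the directory URL, the page markup, and the naive last-name split are all placeholder assumptions.

```python
# A minimal sketch of a public-directory scraper in Python 3. The URL and
# markup pattern below are made-up placeholders, not Facebook's actual
# structure or Bowes' actual code.
import re
import urllib.request

DIRECTORY_URL = "https://www.facebook.com/directory/people/A"  # hypothetical

def scrape_listing(url):
    """Fetch one public listing page and extract (name, profile URI) pairs."""
    with urllib.request.urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    # Assumes each listing appears as a simple anchor to a profile page.
    links = re.findall(r'<a href="(https://www\.facebook\.com/[^"]+)">([^<]+)</a>', html)
    results = []
    for uri, name in links:
        parts = name.split()
        # Keep only entries whose last name begins with a Latin character,
        # mirroring the filter described above (naive last-token split).
        if parts and re.match(r"[A-Za-z]", parts[-1]):
            results.append((name, uri))
    return results

for name, uri in scrape_listing(DIRECTORY_URL):
    print(name, uri)
```

Repeat that over every listing page and you arrive, 171 million rows later, at something like Bowes’ database. Nothing here requires more than an afternoon of scripting.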
On the one hand, I wasn’t entirely surprised by this news – it was only a matter of time before someone started building up such a dataset. I’ve previously mentioned that developer Pete Warden had planned on releasing public profile information for 210 million Facebook users until the company’s legal team stepped in. But nothing technical prevented someone else from attempting the task and posting data without notice. I imagine Facebook may not be too happy with Bowes’ data, but I’m not going to delve into the legal issues surrounding page scraping.
However, the event did remind me of a related issue I’ve pondered over the last few months: the notion of “security through obscurity” as it relates to privacy issues.
I’ve often referenced the work of danah boyd, a social media researcher whom I highly respect. In a talk at WWW2010 earlier this year entitled “Privacy and Publicity in the Context of Big Data,” she outlined several excellent considerations for handling massive collections of data about people. One in particular is worth remembering in the context of public Facebook information: “Just because data is accessible doesn’t mean that using it is ethical.” Michael Zimmer at the University of Wisconsin-Milwaukee has made similar arguments, noting that mass harvesting of Facebook data goes against the expectations of users who maintain a public profile for discovery by friends, among other issues. Knowing some of the historical issues with academic research involving human subjects, I tend to agree with these positions.
But a related point from boyd’s talk concerns me from a security perspective: “Security Through Obscurity Is a Reasonable Strategy.” As an example, she notes that people in public settings may still discuss personal matters, relying on being one conversation among hundreds to maintain privacy. If they knew someone was listening in on their particular conversation, they would adjust the topic accordingly.
In this “offline” example, taking advantage of obscurity makes sense. But boyd applies the same idea online: “You may think that they shouldn’t rely on being obscure, but asking everyone to be paranoid about everyone else in the world is a very very very unhealthy thing…. You may be able to stare at everyone who walks by but you don’t. And in doing so, you allow people to maintain obscurity. What makes the Internet so different? Why is it OK to demand the social right to stare at everyone just because you can?”
I would respond that at least three aspects make the Internet different. First, you rarely have any way of knowing if someone is “staring at you” online. Public content on Facebook gets transferred to search engines, application developers, and individual web surfers every day without any notification to the creators of that content. Proxies and anonymizers can spoof or remove information that might otherwise help identify the source of a request. And as computing power increases, tracking down publicly accessible resources becomes ever easier.
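On that first point, consider how little a server actually knows about who is asking for a page. Here’s a short Python sketch; the proxy address (drawn from a documentation IP range) and the target URL are placeholders, not real services:

```python
# Sketch: a request whose identifying details are trivially forged.
# The proxy address and target URL are placeholders, not real services.
import urllib.request

proxy = urllib.request.ProxyHandler({"https": "https://203.0.113.10:8080"})
opener = urllib.request.build_opener(proxy)

request = urllib.request.Request(
    "https://www.facebook.com/some.public.profile",  # hypothetical page
    headers={
        # Claim to be an ordinary browser rather than a script.
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36",
        # Claim to have arrived from a search result.
        "Referer": "https://www.google.com/",
    },
)
page = opener.open(request).read()
```

The server sees the proxy’s address and whatever headers I chose to send; the person whose page I’m reading sees nothing at all.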
Second, the nature of online data means that recording, parsing, and redistributing it tends to be far simpler than in the offline world. If I want to record someone’s in-person conversations, I have to acquire a recording device, place it in a convenient location, retrieve the audio, type up a transcript of the person’s words, and then send it to another person to read. But if I want to record someone’s conversations on Twitter (as an example), I can have all of them in a format understandable to various computer-based analysis tools in just a few clicks. In fact, I could set up an automated system which monitors the person’s Twitter account and alerts me whenever certain words of interest appear. Add the fact that this is true of any public Twitter account, and the capabilities for online monitoring grow enormously.
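As a rough illustration of that last idea, here’s what such a monitor might look like in Python. The endpoint URL and the response format are assumptions made for the sketch, not Twitter’s actual API:

```python
# Sketch of an automated keyword monitor over a public timeline. The
# endpoint and response format are assumed for illustration only.
import json
import time
import urllib.request

TIMELINE_URL = "https://api.example.com/users/some_user/timeline.json"  # hypothetical
WATCHWORDS = {"vacation", "new job", "moving"}

def fetch_posts():
    """Assumed to return a list of {"id": ..., "text": ...} objects."""
    with urllib.request.urlopen(TIMELINE_URL) as response:
        return json.load(response)

seen = set()
while True:
    for post in fetch_posts():
        if post["id"] in seen:
            continue  # skip anything we've already processed
        seen.add(post["id"])
        if any(word in post["text"].lower() for word in WATCHWORDS):
            print("Alert:", post["text"])
    time.sleep(60)  # poll once a minute, indefinitely
```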
Finally, while digital content is in some ways more ephemeral than other media, web data tends to persist well beyond a creator’s ability to control it. Search engine caches, archival sites, and user redistribution all help keep content alive. If someone records a spoken conversation on tape, the tape can be destroyed before copies are made. But if you (or a friend of yours) post a sentence or photo on a social networking site, you may never be able to erase it fully from the Internet. Several celebrities have learned this the hard way lately.
From a privacy perspective, I wholeheartedly agree with boyd that we can’t expect users to become paranoid sysadmins. The final point of my own guide to Facebook privacy admonished, “You Have to Live Your Life.” But from a security perspective, I know that there will always be people and automated systems which are “staring at you” on the Internet. I’ve seen time and again that if data is placed where others can access it online, someone will access it – perhaps even unintentionally (Google indexes many pages that were obviously not meant for public consumption).
In my opinion, the only way to create anything online that resembles the sort of “private in public” context boyd described is some form of walled garden, such as limiting your Facebook profile to logged-in users. Even that doesn’t provide the same degree of privacy, since many fake profiles exist and applications may still have access to your data. But while “security through obscurity” (or perhaps more accurately, privacy through obscurity) may be a decent strategy in many “offline” social situations, it simply can’t be relied on to protect users and data online.
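For what it’s worth, the technical difference between a walled garden and mere obscurity fits in a few lines. A toy Python sketch, with purely illustrative session and profile data:

```python
# Toy sketch of the "walled garden" model: the same profile data, but the
# server refuses to serve it without an authenticated session. All names
# and values here are illustrative, not any real API.
SESSIONS = {"token-abc123": "alice"}  # session token -> logged-in user
PROFILES = {"bob": {"name": "Bob", "city": "Example City"}}

def get_profile(username, session_token=None):
    """Serve profile data only to logged-in users."""
    if SESSIONS.get(session_token) is None:
        return None  # anonymous caller: the profile simply isn't served
    return PROFILES.get(username)

print(get_profile("bob"))                  # None: anonymous request
print(get_profile("bob", "token-abc123"))  # data: authenticated request
```

Obscurity, by contrast, amounts to deleting that first check: everyone gets the data, and you simply hope nobody asks.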
Facebook users are starting to discover the limits of obscurity firsthand. I’ve seen several reactions to Bowes’ release that characterize it as a security issue or a privacy issue, and people have seemed quite surprised that building such a dataset was even possible. Yet it really shouldn’t surprise anyone familiar with current technology and the ways of accessing Facebook data. And it won’t be the last time we see someone make use of “public” data in surprising ways. Some of these uses may be unfortunate or unethical (see above), but we’ve often seen technology steam ahead in pursuit of fortune, and the web has many users with differing ideas on ethics. Reversing the effects of such actions may prove impossible, which is why I would argue we need to prevent them by not trusting obscurity for protection. And how do we balance this perspective with avoiding unhealthy paranoia? I’m honestly not sure – but if content is publicly accessible online without any technical limitations, we can hardly consider it immune from being publicized.