Open datasets demand robust privacy protections

There is significant potential for abuse when privacy protections in open data are insufficient or nonexistent.

Machine learning systems and other algorithms increasingly rely on open datasets on sites like Kaggle to run data science applications and train machine learning models. This isn't limited to one specific area of work; it holds true for applications from medical analytics to crime prediction to natural language processing.

When downloading enormous files with thousands, hundreds of thousands, or even millions of data points, it's tempting to forget about the individuals behind each piece of information. But human beings are behind these datasets, and as more and more data is publicly and openly released by private and public institutions—whether to aid research, comply with disclosure agreements, or something else—we need robust privacy protections to safeguard the information of people included in a dataset with or without consent.

For a while, so-called "anonymization" was the answer to adding privacy protections to datasets. In this process, an individual's name, for instance, would be replaced with a random number, while the rest of the attributes associated with that individual would remain unaltered. Totally set, right? Wrong.

Anonymization is not a robust way to ensure an individual's data in a larger dataset is protected. As security expert Bruce Schneier writes, there are "inherent security problems" with this approach; it's flawed thinking that simply swapping out a name with a numeric string would remove all possible identifiers or links to the individual. Real-world case studies prove this fact.

In 2006, Netflix released 100 million movie ratings created by roughly half a million customers to encourage people to develop a superior recommendation system. Researchers at the University of Texas-Austin were able to partially de-anonymize the data by linking Netflix's data points to "auxiliary information" found on the Internet Movie Database (IMDb), "personal blogs, Google searches, and so on." Around the same time, AOL released 20 million web searches online, after which The New York Times cross-referenced the data with phone book listings to similarly identify the individuals behind the numbers. You can find other examples online.
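To make the linking idea concrete, here is a minimal sketch of a linkage attack on toy data. It is not the method the researchers used (theirs was statistical and tolerated noisy, partial matches); it is a simple exact-match join, and every record, field name, and function name below is invented for illustration.

```python
import random

# Toy records, invented for illustration only.
RAW = [
    {"name": "Alice", "zip": "27701", "birth_year": 1984, "favorite": "Film A"},
    {"name": "Bob", "zip": "27705", "birth_year": 1990, "favorite": "Film B"},
    {"name": "Carol", "zip": "27707", "birth_year": 1975, "favorite": "Film C"},
]

# Hypothetical public side information, such as profiles posted under real names.
AUXILIARY = [
    {"name": "Alice", "zip": "27701", "birth_year": 1984},
    {"name": "Bob", "zip": "27705", "birth_year": 1990},
    {"name": "Carol", "zip": "27707", "birth_year": 1975},
    {"name": "Dan", "zip": "27707", "birth_year": 1975},  # same attributes as Carol
]


def pseudonymize(records, seed=0):
    """Swap each name for a random number and leave every other attribute untouched."""
    rng = random.Random(seed)
    names = sorted({r["name"] for r in records})
    ids = dict(zip(names, rng.sample(range(10**6), len(names))))
    return [
        {"id": ids[r["name"]], **{k: v for k, v in r.items() if k != "name"}}
        for r in records
    ]


def link(pseudonymized, auxiliary, keys=("zip", "birth_year")):
    """Recover a name whenever the remaining attributes match exactly one known person."""
    index = {}
    for person in auxiliary:
        index.setdefault(tuple(person[k] for k in keys), []).append(person["name"])
    reidentified = {}
    for row in pseudonymized:
        candidates = index.get(tuple(row[k] for k in keys), [])
        if len(candidates) == 1:  # an unambiguous match re-identifies the row
            reidentified[row["id"]] = candidates[0]
    return reidentified


released = pseudonymize(RAW)
reidentified = link(released, AUXILIARY)
print(reidentified)  # maps the random IDs for Alice and Bob back to their names
```

In this toy case Alice and Bob are re-identified even though their names never appear in the released data. Carol stays hidden only because someone else happens to share her attributes.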

These so-called privacy attacks allow researchers and malicious adversaries to discover who is behind the mask, so to speak, in open datasets, linking seemingly anonymous or randomly sampled information to specific people. To address this problem, differential privacy, which involves adding "noise" to databases, is an emerging standard in computer science to protect an individual's privacy while still maintaining a dataset's relative utility. And the absence of this protection is particularly menacing when open datasets involve sensitive personal information. One public data portal, for example, provides an easily searchable index of thousands of datasets. Want information on adult tobacco consumption? Measures of rehospitalization, emergency room visits, and community discharges? It's all there.
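For readers curious what "adding noise" looks like in practice, here is a minimal sketch of one common differential privacy technique, the Laplace mechanism, applied to a counting query. The survey rows, the function names, and the choice of epsilon are all assumptions made for illustration; a real deployment would also need to track a privacy budget across queries and use a vetted library rather than hand-rolled sampling.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(records, predicate, epsilon, rng):
    """Answer "how many records match?" with noise of scale 1/epsilon.

    Adding or removing one person changes a count by at most 1,
    so the query's sensitivity is 1.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1 / epsilon, rng)


# Hypothetical survey rows, invented for illustration: 250 of 1,000 are smokers.
SURVEY = [{"id": i, "smoker": i % 4 == 0} for i in range(1000)]

rng = random.Random(42)
noisy = private_count(SURVEY, lambda r: r["smoker"], epsilon=0.5, rng=rng)
print(round(noisy, 1))  # close to 250, but not exactly 250
```

A smaller epsilon means more noise and stronger protection for any one person in the data; a larger epsilon means a more accurate answer. That tradeoff is the "relative utility" mentioned above.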

Even cities are publishing data online as they increasingly use machine learning systems and other algorithms to bolster their existing administrative functionalities and build out new ones: identifying potholes in roads, conducting risk assessments on the homeless, reducing transit congestion, minimizing traffic accidents, predicting the occurrence of flash floods, fighting rodents, predicting illegal grease disposal, and much more. Forbes has counted at least 90 municipalities with open data portals; while most are larger cities, I imagine this list will expand to include smaller areas within a few years.

To use a current example, New York City releases thousands of publicly available datasets online via its Open Data project. For instance, New York City's Taxi and Limousine Commission publicly releases data, by month, on taxi and limousine trips around the city. "The yellow and green taxi trip records," the site reads, "include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts." And for-hire vehicle trip records "include fields capturing the dispatching base license number and the pick-up date, time, and taxi zone location ID."

Most of these datasets do not implement robust privacy protections. And with all this data out there and in the open, there is significant potential for abuse when privacy protections are insufficient or nonexistent. This occurred with the taxi data in New York City when researchers examined how the dataset could reveal information about drivers' home addresses and income, as well as the detailed travel manifests of passengers, which could also be compromising.

Obviously, the organization releasing any dataset already has access to the raw, original, unprotected information, and while there are concerns to be raised about this fact—such as how the data is collected or the ethics of its use—that's not our focus here. Instead, think about how other organizations (besides the discloser) could use the data. A corporation could publish GPS logs from fitness wearables, which a government could use to track people's movements post-facto or in near-real-time. Or a city could release data on some of its residents, which a corporation could then use to spike individuals' insurance rates or derive detailed travel histories. There is potential for real harm to occur.

Arguing why data privacy matters can be challenging in a world where a) many are uninformed about just how pervasively they're being watched, b) others are ambivalent about whether the surveillance matters, and c) others yet make such proclamations as "privacy is dead" and conclude that we should just move on. These are all dangerous realities, because not caring about privacy is a privilege: "Privacy violations harm the most vulnerable among us," I've written, and "a belief that data privacy protections 'might not matter' is only a lack of fear that the information won't be used against you." We might not care if our information is inferred by an algorithm and made accessible to decision-makers, but that doesn't hold true for everyone. And, more broadly, perhaps we didn't consent to these potentially compromising disclosures in the first place.

So, as we think about issues like algorithmic bias and building representative datasets, we also need to think about how our societies can mandate minimum privacy protections in openly published datasets, especially in the case of government institutions, which are already subject to laws around information disclosure. This could involve watchdog groups, laws around minimum privacy thresholds on datasets, and more; and as public concerns about data privacy grow (albeit perhaps too concentrated on Facebook and not concentrated enough on other companies as well), market pressure may also play a role.

To build robust privacy protections into open datasets, therefore, government entities at the federal, state, and municipal levels need to hold formal and informal dialogues between policymakers and technical experts on this issue. Because the last thing we need is a technically uninformed policy that doesn't help, or even hurts, the people it's meant to protect.

About the author

Justin Sherman - Justin Sherman is a student at Duke University double-majoring in computer science and political science. He is a Fellow at Interact; a Fellow at the Duke Center on Law & Technology; and a Cybersecurity Policy Fellow at New America. He is the Co-Founder and Vice President of Ethical Tech, which works to empower ALL people to have a voice in technology innovation, consumption, and regulation. He has written extensively on...