Wednesday, May 9, 2012
Andrew D. Selbst, New York University School of Law & New York University Department of Culture and Communication, has published Datasets as Reporters’ Confidential Sources. Here is the abstract.
We live in the age of Big Data. Researchers, industry, and government are all marshalling data to accomplish big tasks. Journalists are also embracing data more and more each day, changing what it means to be a reporter. Whereas before journalists might have acted as anthropologists, they are rapidly becoming social scientists and computer programmers looking for patterns in datasets. The privacy implications of the growing collection, assembly, and use of large datasets by journalists have been largely unrecognized and inadequately analyzed. This article begins the task of studying the privacy implications of growing journalistic reliance on large datasets by studying how and whether current legal protections for reporters’ relationships with more traditional sources – the so-called “shield laws” – address the issue of data privacy and considering how these laws might be amended to deal with the privacy challenges of journalistic use of personal data.
As journalists become accustomed to collecting and analyzing quantitative data, they will probably become more and more interested in using data mining and other statistical methods to uncover newsworthy stories. Just as FICO is now able to use job status and home ownership data to predict how likely patients are to take their prescriptions and Target can use an array of data about purchases to identify shoppers who are likely to be pregnant, collecting and analyzing a wide variety of types of data can yield ideas for fascinating stories.
As with any industry that embraces data, though, data journalism creates privacy concerns. For example, a story about drug use throughout a city might prove dangerous for even anonymous survey takers if it is based on a dataset that includes IP addresses and the raw data is made available to law enforcement. Moreover, as news sources increasingly seek to combine their traditional reporting with interactive online materials, they may want to make their data publicly available to readers. Journalists may not be trained in de-identification techniques for public distribution of datasets and, as many studies have shown, the re-identification of data is often possible, particularly if poor de-identification techniques are employed. Additionally, some journalists are surely using identified data without well-thought-out policies about how to store such data or when to disclose it to those who wish to use it for fact-checking a journalist’s story, for research purposes, or, in some cases, for law enforcement purposes.
Luckily, most states have a tool designed to enable newsgathering while protecting privacy: reporters’ shield laws. These laws allow reporters to resist subpoenas that would require production of their sources. How these laws apply to datasets is a critical question as data journalism becomes more prominent. Laws variably refer to identifying, producing, and disclosing sources, and it is unclear how these terms would apply to datasets as opposed to people. Some shield laws explicitly protect data, though on the reasoning that it is journalistic “work product” rather than a source. Finally, the most common rationale for shield laws – encouraging people to speak to journalists – simply does not apply to all types of datasets.
This article analyzes the privacy problems raised by journalistic use of large datasets and compares them to privacy issues raised by more familiar journalistic practices. It surveys current reporters’ shield laws to predict how likely datasets are to garner protection and what form the protection would take under current law. The article will also address the questions that arise out of this analysis and the role of shield laws in a broader system of privacy protections designed to enable data journalism to thrive while limiting potential problems and abuses.
The full text is not available from SSRN.