September 5, 2008
Genetics Datasets Closed Due to Forensic DNA Discovery
Until last Friday, the National Institutes of Health (NIH) and other groups had posted large amounts of aggregate human DNA data for easy access to researchers around the world. On Aug. 25, however, NIH removed the aggregate files of individual Genome-Wide Association Studies (GWAS). (The files, which include the Database of Genotypes and Phenotypes (dbGaP), run by the National Center for Biotechnology Information, and the Cancer Genetic Markers of Susceptibility database, run by the National Cancer Institute, remain available for use by researchers who apply for access and who agree to protect confidentiality using the same approach they use for individual-level study data.) The Wellcome Trust Case Control Consortium and the Broad Institute of MIT and Harvard also withdrew aggregate data.
The reason? The data keepers fear that police or other curious organizations or individuals might deduce whose DNA is reflected in the aggregated data, and hence, who participated in a research study. These data consist of SNPs -- Single Nucleotide Polymorphisms. These are differences in the base-pair sequences from different people at particular points in their genomes. Many SNPs are neutral -- they do not have any impact on gene expression. Nonetheless, they can be helpful in determining the locations of nearby disease-related mutations.
The event that prompted the data keepers to act was the discovery at the Translational Genomics Research Institute (TGen) of a new way to check whether an individual's DNA is a part of a complex mixture of DNA (possibly from hundreds of people). According to the TGen report, "Resolving Individuals Contributing Trace Amounts of DNA to Highly Complex Mixtures Using High-Density SNP Genotyping Microarrays," a statistic applied to intensity data from SNP microarrays (chips that detect tens of thousands of SNPs simultaneously) reveals whether the signals from an individual's many SNPs are consistent with the possibility that the individual is not in the mixture. (Sorry for the wordiness, but the article uses hypothesis testing, and "not in the mixture" is the null hypothesis.)
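For readers who want to see the shape of the test, here is a rough Python sketch of the statistic as I read it in the Homer et al. paper. For each SNP j, it takes D_j = |Y_j - Pop_j| - |Y_j - M_j|, where Y is the individual's allele frequency, Pop is a reference population's, and M is the mixture's, and then applies a one-sample t-test to the mean of D. Everything else is my own assumption for illustration: the function names, the simulated pools, the pool size and SNP count, and the use of idealized genotypes (0, 0.5, or 1) in place of real chip intensity data.

```python
import math
import random
import statistics


def membership_t(person, reference, mixture):
    """One-sample t statistic on D_j = |Y_j - Pop_j| - |Y_j - M_j|.

    Under the null hypothesis ("not in the mixture") the mean of D is
    about zero; a large positive value suggests the person's DNA is
    closer to the mixture than to the reference population.
    """
    d = [abs(y - pop) - abs(y - mix)
         for y, pop, mix in zip(person, reference, mixture)]
    return statistics.mean(d) / (statistics.stdev(d) / math.sqrt(len(d)))


def simulate(num_snps=10000, pool_size=50, seed=1):
    """Hypothetical simulation: returns (T for a member, T for an outsider)."""
    rng = random.Random(seed)
    # Assumed population allele frequency at each SNP.
    freqs = [rng.uniform(0.05, 0.95) for _ in range(num_snps)]

    def person():
        # Two independent allele draws per SNP give 0, 0.5, or 1.
        return [((rng.random() < p) + (rng.random() < p)) / 2 for p in freqs]

    def pooled(people):
        # Aggregate allele frequency per SNP across a group of people.
        return [sum(col) / len(col) for col in zip(*people)]

    mixture_people = [person() for _ in range(pool_size)]
    reference_people = [person() for _ in range(pool_size)]
    outsider = person()

    mixture = pooled(mixture_people)
    reference = pooled(reference_people)
    return (membership_t(mixture_people[0], reference, mixture),
            membership_t(outsider, reference, mixture))


if __name__ == "__main__":
    member_t, outsider_t = simulate()
    print("member T:  ", round(member_t, 2))
    print("outsider T:", round(outsider_t, 2))
```

The point of the toy is only that one person's contribution to each SNP's pooled frequency is tiny, but summed over thousands of SNPs it can push the member's statistic well away from zero while an outsider's stays near it. Real data bring complications (chip noise, ancestry mismatch between the reference and the mixture, relatives) that this sketch ignores.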
How could this compromise the research databases? As best I understand it, the scenario is that someone first would acquire a sample from somewhere. Your neighbor might check your garbage, isolate some of your DNA, get a SNP-chip readout, and check it against the public database to see if you were a research subject who donated DNA. Or, the police might have a crime-scene sample. Then they would use a SNP-chip to get a profile to compare to the record in the public database to see if the profile probably is part of the mixture data there. Finally, if they got a match, the police would approach the researchers to get the matching individual's name.
Kathy Hudson, a public policy analyst at Johns Hopkins University, stated in an email that “While a fairly remote concern, and there are some protections even against subpoena, NIH did the right thing in acting to protect research participants.” However, scientists such as David Balding in the U.K. are complaining that the restrictions on the databases are an overreaction. Indeed, an author of the TGen study is quoted as stating that the new policy is "a bit premature." See http://www.nature.com/news/2008/080904/full/news.2008.1083.html.
It seems doubtful that anonymity of the research databases has been breached, or will be in the immediate future, by this convoluted procedure. Of course, the longer-term implications remain to be seen, and the technique has obvious applications in forensic science. If the technique works as advertised, police will be able to take a given suspect and determine whether his DNA is part of a mixture from a large number of individuals that was recovered at a crime scene. Analyzing complex mixtures for identity is difficult to do with standard (STR-based) technology.
Homer N, Szelinger S, Redman M, Duggan D, Tembe W, et al., Resolving Individuals Contributing Trace Amounts of DNA to Highly Complex Mixtures Using High-Density SNP Genotyping Microarrays, PLoS Genetics (2008). 4(8):e1000167. doi:10.1371/journal.pgen.1000167
DNA databases shut after identities compromised, Nature 455:13. Sept. 3, 2008
Natasha Gilbert, Researchers criticize genetic data restrictions, Nature Sept. 4, 2008, <http://www.nature.com/news/2008/080904/full/news.2008.1083.html>