July 20, 2008
DNA Database Woes and the Birthday Problem
The Los Angeles Times has reported that "A discovery leads to questions about whether the odds of people sharing genetic profiles are sometimes higher than portrayed. Calling the finding meaningless, the FBI has sought to block such inquiry." Actually, the discovery is not new, but the story is still unfolding.
According to the article,
State crime lab analyst Kathryn Troyer was running tests on Arizona's DNA database when she stumbled across two felons with remarkably similar genetic profiles.
The men matched at nine of the 13 locations on chromosomes, or loci, commonly used to distinguish people.
The FBI estimated the odds of unrelated people sharing those genetic markers to be as remote as 1 in 113 billion. But the mug shots of the two felons suggested that they were not related: One was black, the other white.
In the years after her 2001 discovery, Troyer found dozens of similar matches -- each seeming to defy impossible odds.
The key word here is "seeming." This is not the first time partial or even complete matches have appeared in a search of all pairs of DNA profiles in a law-enforcement database. Eight years ago, the National Commission on the Future of DNA Evidence (2000, 25 n.13) reported that
Although brothers and twins are rare in databases, they can be common among those pairs that are found by profile matching. John Buckleton (2000 personal communication) found that, among ten 6-locus matches in a New Zealand database of 10,907 records, all but 2 were brothers (including twins). This shows that the possibility of sibs cannot be ignored in database searches. We should note, however, that these could usually be identified as brothers, either by further investigation or by testing additional loci.
So close relatives are one possible explanation for a seeming surplus of partial matches.
A second consideration is statistical. The random-match probability of 1 in 113 billion quoted in the Times applies to a single comparison between a particular profile and a randomly selected, unrelated individual. It is not the probability that a search through all pairs of profiles in a database composed entirely of records from unrelated people will show a match. Because there are so many pairs to compare, that probability is much greater.
Suppose that there are 500,000 profiles in the database. How many ordered pairs can be formed? Answer: 500,000 x 500,000 = 2.5 x 10^11 = 250 billion. How many of these pair two different records? Answer: Subtract the 500,000 self-pairs [(1,1), (2,2), ... , (500,000, 500,000)]. That hardly changes anything, since 500,000 is negligible compared to 250,000,000,000. How many distinct pairs of people are there? Answer: Half, since the pair (1,2) is the same as (2,1), etc. Conclusion: There are almost 125 billion pairs to search.
How many comparisons would be expected to match if, for every comparison, the chance of a match is 1 in 113 billion? Answer: About 1, since 125 billion divided by 113 billion is roughly 1.1. Even without relatives, the observation of a partial match in such a database would not be so surprising.
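The arithmetic above can be checked with a few lines of Python. This is only a sketch of the hypothetical 500,000-profile example, not a model of the Arizona database. The function names are mine, and the last function adds an assumption that is not in the argument above: it treats the comparisons as independent in order to approximate the chance of at least one match.

```python
import math

def distinct_pairs(n):
    """Number of unordered pairs of different records among n profiles."""
    return n * (n - 1) // 2

def expected_matches(n, p):
    """Expected number of matching pairs if every comparison matches with chance p."""
    return distinct_pairs(n) * p

def prob_at_least_one_match(n, p):
    """Chance of one or more matches, ASSUMING the comparisons are independent.

    Independence is only an approximation, since pairs share profiles.
    Computes 1 - (1 - p)**pairs in a numerically stable way.
    """
    return -math.expm1(distinct_pairs(n) * math.log1p(-p))

n = 500_000        # hypothetical database size from the example
p = 1 / 113e9      # random-match probability quoted in the Times

print(distinct_pairs(n))                       # just under 125 billion
print(round(expected_matches(n, p), 2))        # about 1.1
print(round(prob_at_least_one_match(n, p), 2)) # roughly two chances in three
```

Under these assumptions, a trawl of the hypothetical database would be more likely than not to turn up at least one partial match among unrelated people.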
Of course, these numbers do not pertain to the Arizona database. I do not know how large it was, and the chance of a match in each comparison was not constant. But the example shows why the random-match probability grossly understates the chance of a partial match in an all-pairs trawl of a large database.
In probability theory, this situation is known as a birthday problem. The chance that one randomly selected person has the same birthday as mine is about 1/365. The chance that at least two people in a room full of people have the same birthday (whatever it might be) is much, much larger.
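For readers who want to see the size of the gap, here is a minimal sketch of the classic birthday calculation. It assumes 365 equally likely birthdays (no leap years, no seasonal variation) and independent people; the function name is mine.

```python
def prob_shared_birthday(k, days=365):
    """Chance that at least two of k people share a birthday.

    Assumes every day is equally likely and birthdays are independent.
    """
    prob_all_distinct = 1.0
    for i in range(k):
        # Person i+1 must avoid the i birthdays already taken.
        prob_all_distinct *= (days - i) / days
    return 1.0 - prob_all_distinct

for k in (10, 23, 50):
    print(k, round(prob_shared_birthday(k), 3))
```

With only 23 people in the room the chance of some shared birthday already exceeds one half, even though the chance that any one of them shares my particular birthday is still only about 1/365.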
We can expect further studies testing whether the databases are consistent with the estimated random-match probabilities. The article reports on several that have taken place so far. My prediction is that when the dust settles, the results will be inconclusive. Judges will struggle a bit with the birthday problem, and it will be difficult or impossible to identify all the close relatives in the database. Scientists who accept the existing random-match probabilities as reasonable estimates won't change their minds. Well, maybe they'll give up a power of ten or so. Individuals who distrust the estimates will continue to distrust them.
Felch, Jason, and Maura Dolan. 2008. "How Reliable Is DNA in Identifying Suspects?" Los Angeles Times, July 20, 2008. <http://www.latimes.com/news/local/la-me-dna20-2008jul20,0,5133446.story>
National Commission on the Future of DNA Evidence. 2000. The Future of Forensic DNA Testing: Predictions of the Research and Development Working Group. Washington, DC: National Institute of Justice.