July 20, 2008
DNA Database Woes and the Birthday Problem
The Los Angeles Times has reported that "A discovery leads to questions about whether the odds of people sharing genetic profiles are sometimes higher than portrayed. Calling the finding meaningless, the FBI has sought to block such inquiry." Actually, the discovery is not new, but the story is still unfolding.
According to the article,
State crime lab analyst Kathryn Troyer was running tests on Arizona's DNA database when she stumbled across two felons with remarkably similar genetic profiles.
The men matched at nine of the 13 locations on chromosomes, or loci, commonly used to distinguish people.
The FBI estimated the odds of unrelated people sharing those genetic markers to be as remote as 1 in 113 billion. But the mug shots of the two felons suggested that they were not related: One was black, the other white.
In the years after her 2001 discovery, Troyer found dozens of similar matches -- each seeming to defy impossible odds.
The key word here is "seeming." This is not the first time partial or even complete matches have appeared in a search of all pairs of DNA profiles in a law-enforcement database. Eight years ago, the National Commission on the Future of DNA Evidence (2000, 25 n.13) reported that
Although brothers and twins are rare in databases, they can be common among those pairs that are found by profile matching. John Buckleton (2000 personal communication) found that, among ten 6-locus matches in a New Zealand database of 10,907 records, all but 2 were brothers (including twins). This shows that the possibility of sibs cannot be ignored in database searches. We should note, however, that these could usually be identified as brothers, either by further investigation or by testing additional loci.
So close relatives are one possible explanation for a seeming surplus of partial matches.
A second consideration is statistical. The random-match probability of 1 in 113 billion quoted in the Times applies to a single comparison between a particular profile and a randomly selected, unrelated individual. It is not the probability that a search through all pairs of profiles in a database composed entirely of records from unrelated people will show a match. Because there are so many pairs to compare, that probability is much greater.
Suppose that there are 500,000 profiles in the database. How many possible pairs can be formed? The answer: 500,000 x 500,000 = 2.5 x 10^11 = 250 billion. How many of these are from different individuals? Answer: Subtract the 500,000 pairs [(1,1), (2,2), ... , (500,000, 500,000)]. That hardly changes anything, since 500,000 is nothing compared to 250,000,000,000. How many are from distinct pairs of people? Answer: Half, since the pair (1,2) is the same as (2,1), etc. Conclusion: There are almost 125 billion pairs to search.
How many comparisons would be expected to match if, for every comparison, the chance of a match is 1 in 113 billion? Answer: About 1. Even without relatives, the observation of a partial match in such a database would not be so surprising.
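The arithmetic in this example can be checked with a short Python sketch. The database size of 500,000 and the constant per-comparison probability of 1 in 113 billion are the hypothetical figures used above, not properties of any real database. The function names are illustrative, and the last step assumes the comparisons are approximately independent (a Poisson approximation).

```python
from math import comb, exp

def distinct_pairs(n):
    """Number of unordered pairs of different profiles among n profiles: n(n-1)/2."""
    return comb(n, 2)

def expected_matches(n, p):
    """Expected number of matching pairs if every comparison matches with probability p."""
    return distinct_pairs(n) * p

n = 500_000          # hypothetical database size from the example
p = 1 / 113e9        # random-match probability quoted in the Times

pairs = distinct_pairs(n)
expected = expected_matches(n, p)

# Poisson approximation (assumes roughly independent comparisons):
# chance that an all-pairs search turns up at least one match.
p_at_least_one = 1 - exp(-expected)

print(f"distinct pairs:        {pairs:,}")
print(f"expected matches:      {expected:.2f}")
print(f"P(at least one match): {p_at_least_one:.2f}")
```

Under these assumptions the search covers just under 125 billion pairs, the expected number of matches is slightly above 1, and the chance of seeing at least one match is roughly two in three.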
Of course, these numbers do not pertain to the Arizona database. I do not know how large it was, and the chance of a match in each comparison was not constant. But the example shows why the random-match probability grossly understates the chance of a partial match in an all-pairs trawl of a large database.
In probability theory, this situation is known as a birthday problem. The chance that one randomly selected person has the same birthday as mine is about 1/365. The chance that at least two people in a room full of people have the same birthday (whatever it might be) is much, much larger.
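The gap between the two birthday probabilities can be sketched in a few lines of Python. This is a minimal illustration that assumes 365 equally likely, independent birthdays (no leap years, no twins); the function names are made up for the example.

```python
def p_shared_birthday(k, days=365):
    """Chance that at least two of k people share a birthday (any day)."""
    p_all_distinct = 1.0
    for i in range(k):
        # The (i+1)-th person must avoid the i birthdays already taken.
        p_all_distinct *= (days - i) / days
    return 1 - p_all_distinct

def p_matches_mine(k, days=365):
    """Chance that at least one of k other people has one specific birthday."""
    return 1 - (1 - 1 / days) ** k

for k in (10, 23, 50):
    print(f"{k:>3} people: any shared birthday {p_shared_birthday(k):.3f}, "
          f"someone matches mine {p_matches_mine(k):.3f}")
```

With 23 people, the chance that some pair shares a birthday is just over one half, while the chance that any of 23 other people matches one particular birthday stays near 6 percent. An all-pairs database trawl corresponds to the first calculation; the quoted random-match probability corresponds to a single comparison of the second kind.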
We can expect further studies of the databases for consistency with the estimated random-match probabilities. The article reports on several that have taken place so far. My prediction is that when the dust settles, the results will be inconclusive. Judges will struggle a bit with the birthday problem, and it will be difficult or impossible to determine all the close relatives in the database. Scientists who accept the existing random-match probabilities as reasonable estimates won't change their minds. Well, maybe they'll give up a power of ten or so. Individuals who distrust the estimates will continue to distrust them.
Felch, Jason, and Maura Dolan. 2008. "How Reliable Is DNA in Identifying Suspects?" Los Angeles Times: July 20, 2008. <http://www.latimes.com/news/local/la-me-dna20-2008jul20,0,5133446.story>
National Commission on the Future of DNA Evidence. 2000. The Future of Forensic DNA Testing: Predictions of the Research and Development Working Group. Washington, DC: National Institute of Justice.
Hey you should have at least given the fun answer to the birthday problem. There is about a 50% chance that someone will share your birthday in a group of 23 people. A much smaller group size than most people would estimate.
Posted by: rcubbon | Jul 21, 2008 3:32:11 PM
The problem is when the FBI testifies about DNA. The FBI lies to juries by suggesting that there's just no possible way really that two people could share the same DNA profile of 13 loci.
It's quite possible. And the FBI knows it, and wants to conceal this fact from juries.
The FBI, when it needs to, will testify in exactly the same way about 12 loci. Or 11 loci. Or 10 loci. Or 8 loci. The FBI will tell a jury that the statistical odds that any two people share that 8 loci profile are statistically insignificant.
This causes juries to convict people even though other evidence may tend to point to other possible defendants. Juries will ignore other evidence in favor of FBI testimony.
The only problem is that the FBI is lying. It's statistically very easy to have two people with the same 8-loci DNA profile, and the FBI knows it, and is trying to hide this fact from judges and juries.
And, if truth be told, the FBI's tactics in hiding this information from judges and juries amount to a conspiracy to obstruct justice. They have been threatening state database administrators and local judges to inhibit searches through its database.
That's right folks, the FBI has been threatening judges.
Where there's fire?
Posted by: courtwatcher | Jul 21, 2008 3:48:32 PM
The LAT story says that the expected number of matches in the Arizona study is around 100, but Troyer ended up with 144. It doesn't say anything about the expected range, though, so it's impossible to tell from the article if the findings mean the FBI's estimates are off or not. What I do find troubling is that the FBI apparently doesn't want these studies to take place at all. If they do, and the only result is to confirm that they were right, great. If they show that the estimates were wrong, we're better off knowing that.
Posted by: Xrlq | Jul 21, 2008 3:54:52 PM
The math is very straightforward to those with any exposure to probability calculations. The numbers can easily be explained, and the claimed probabilities checked, by almost anyone who's interested. An "appeal to authority" is not needed.
But the FBI doesn't seem to be handling the crisis that way. Full disclosure should be the proper reaction - how large is each database, and what are the claimed probabilities of matches for various numbers of loci. With that info, sceptics can check the numbers for themselves.
On the other hand, considering the excitement caused by general misunderstanding of the infamous "Monty Hall" problem, maybe they CAN'T check the numbers for themselves. But it seems extreme to be asked to take FBI's word for it.
Posted by: tom swift | Jul 21, 2008 4:31:58 PM
Perhaps the critical factor in this debate is that those who're in favor of DNA matching have been more than a little deceptive in the statistics they're using. A number like one in 113 billion leaves jurors with the impression that it is unlikely that anyone else in our world of six billion has a match. These unrelated pairs at the state and national level show that isn't true.
Keep in mind an important factor. The pool of potential criminals isn't everyone in the world. For a murder, it's typically the circle of acquaintances and relatives that person has or, drawing the circle a bit wider, the local community in which both the victim and defendant live. That is a far narrower genetic pool than one assuming (bizarrely) that the criminal could live anywhere on the planet.
Imagine a murder in a long-established town with 10,000 people, a town where few people move in or out. Imagine there is a DNA match with someone who has a criminal history that means he has DNA in the state database. He seems to fit the 'gold standard' with a 13-point match. Case closed?
Hardly. In this hypothetical case, the crime was actually committed by a second cousin to the one charged, someone who shares not one but three common ancestors in the previous four generations. But because the latter has a clean record, he's not in any DNA database to provide a match. He's never suspected because the police have what they think is a perfect DNA match and the FBI is telling them they need look no further.
In short, DNA matching needs less stonewalling by FBI experts and more objective research. Some of that research should focus on groups that, because of geography, race or other factors, are likely to have less genetic diversity. And in any investigation the police should keep in mind that the real pool of those who might have committed that crime is likely to be much less genetically diverse than the population as a whole. Statistics that assume otherwise are bogus.
--Michael W. Perry, editor of Eugenics and Other Evils.
Posted by: Mike Perry | Jul 21, 2008 4:50:48 PM
Nitpick ... specifically, the number of possible pairings is given by (N * (N-1))/2, which if N = 500,000 works out to just under 125 billion. The important thing to remember is that pairings scale with the square of the set size, which is why your overall point is well taken.
Posted by: Jay Manifold | Jul 21, 2008 5:01:23 PM
If these long odds were really impossible, I doubt that life would exist.
The ACLU, however, is working on an infinite improbability generator which will make such statements of statistical validity totally invalid.
Posted by: AST | Jul 21, 2008 5:36:25 PM
I subscribe to the LA Times and was very much interested in this article. If you refer to the printed version, you will find additional data in a sidebar. For Maryland, there were 33 matches in a database of 20,000, or 1.65 x 10^-3. For Arizona, there were 144 matches in a database of 65,000, or 2.2 x 10^-3. For Illinois, there were 903 matches in a database of 230,000, or 3.9 x 10^-3. I don't see the n^2 dependence here - in fact it looks more like a constant value about 3 x 10^-3. It seems that here is direct experimental evidence of the individual probability of a 9 point match. Unfortunately, I can't get the observed approximately uniform matching rate, either.
Posted by: Jeff Thomson | Jul 21, 2008 5:56:36 PM
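The sidebar figures quoted in the comment above can be put on two footings: matches per profile (what the comment computes) and matches per distinct pair of profiles (what an all-pairs search actually compares). The Python sketch below uses only the three counts quoted in that comment; the variable and function names are illustrative. Whether the three searches used the same matching criteria is not stated here, so the rates may not be directly comparable.

```python
from math import comb

# (matches found, database size) as quoted from the printed sidebar
sidebar = {
    "Maryland": (33, 20_000),
    "Arizona": (144, 65_000),
    "Illinois": (903, 230_000),
}

def match_rates(matches, n):
    """Return (matches per profile, matches per distinct pair of profiles)."""
    return matches / n, matches / comb(n, 2)

for state, (matches, n) in sidebar.items():
    per_profile, per_pair = match_rates(matches, n)
    print(f"{state:<9} per profile: {per_profile:.2e}   per pair: {per_pair:.2e}")
```

Dividing by the number of pairs rather than the number of profiles changes the picture considerably, because the pair count grows with the square of the database size.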
If you're not careful one of you guys is going to blow one of my favorite bets. I offer to bet that of the next 10 cars to pass, two will share the same last two digits of their license plate number. Since there are 100 possibilities (00-99) it would appear that the odds are 1 in 100, so I ask for 20-1 odds. In actuality, per the birthday problem above, the actual odds are more like 10-1. People fall for this one all the time.
Posted by: JorgXMcKie | Jul 21, 2008 7:20:05 PM
rcubbon: "There is about a 50% chance that someone will share your birthday in a group of 23 people."
It's been many years since my last statistics class, but I believe you mis-stated this. It's not that there's a 50% chance that someone in a group of 23 people will share *your* birthday; there's a 50% chance that some two people in a group of 23 will have a common birthdate.
Posted by: JSR | Jul 21, 2008 8:03:16 PM
How does this impact those who have been cleared and released from prison on DNA evidence?
Posted by: Donna B. | Jul 21, 2008 8:05:12 PM
The birthday analogy is one I remember well. I had a Jr. High School math teacher that would take bets that there were two kids in the class -- any class -- that were born on the same date. He seldom paid out, because, as he explained the math, if there were more than 36 kids in the class, the odds were in his favor.
Posted by: Hoystory | Jul 21, 2008 9:37:01 PM
Xrlq wrote ...
"... it's impossible to tell from the article if the findings mean the FBI's estimates are off or not."
This is not accurate; the FBI's estimate was correct.
144 *is* about 100 when taken in the context of 125 billion possibilities. To state otherwise reveals either a conceptual misunderstanding or a deliberate attempt to mislead.
Note -- I am NOT attempting to justify or excuse the bad (unethical?) behavior of the FBI around this issue. In fact, just the opposite. It appears that the truth was sacrificed to win convictions.
Editorial comment added by DHK:
I think the figure of 100 is the expected number of partial matches already taking into account the number of comparisons. (The latter number is not 125 billion. That came from a hypothetical example I used to illustrate the combinatorial explosion.)
To determine whether the excess of 44 partial matches is improbable (when the random-match probability is only 1 in 113 billion), we need the standard error associated with the estimated number of 100. The article did not give that number.
In addition to the statistical uncertainty, the figure of 100 presumes that everyone in the database is unrelated. What if 20% of them are close relatives? 10%? 5%? Could that account for the excess? Without more information, who can say?
Posted by: Tom J | Jul 22, 2008 10:35:50 AM