## April 18, 2009

### Taking Liberties with the Numbers

This month's issue of the California Lawyer perpetuates the confusion in the media about DNA database trawls. In an article entitled "Guilt by the Numbers: How Fuzzy is the Math that Makes DNA Evidence Look So Compelling to Jurors?," award-winning journalist Edward Humes discusses the unusual case of People v. Puckett, No. A121368 (Cal. Ct. App., 1st Dist., May 1, 2008). John Puckett, now an elderly man, is appealing his recent conviction for the 1972 murder of Diane Sylvester, a San Francisco nurse. The conviction rests on a cold hit in California’s convicted-offender database at a small number of STR loci (genetic locations). Humes writes that in Puckett, "the prosecution's expert estimated that the chances of a coincidental match between the defendant's DNA and the biological evidence found at the crime scene were 1 in 1.1 million." Id. at 22. Then he adds that "there's another way to run the numbers," which shows that "the odds of a coincidental match in Puckett's case are a whopping 1 in 3." Id. "Both calculations," he maintains, "are accurate. The problem is that they answer different questions." Id. The explanation, he believes, lies in "a classic statistical puzzle known as the 'birthday problem.'" Id.

Surely the probability of "a coincidental match" cannot have such fantastically different "accurate" values. Moreover, the birthday problem has almost nothing to do with these numbers. The fuzziness is in the words of the article, not in the math. Only if we define "a coincidental match" can we begin to see what its probability would be and how unlike the birthday problem it is.

Definition 1. The probability of a coincidental match is the chance that Mr. Puckett is innocent and the match to him is just a coincidence.

The average reader might think that a coincidental match means that Mr. Puckett is innocent and the match to him is just a coincidence.  If this is what it means, however, its probability is neither 1 in 1.1 million nor 1 in 3.  The former figure is the probability that Puckett's DNA would match if he were the only one whose DNA had been checked and if he were unrelated to the killer. The latter figure is the probability that at least one profile in the California database -- not necessarily Puckett's -- would match if no one in the database were the killer.  Notice that both probabilities are conditional -- they depend on assumptions about who the real killer is or is not.  They cannot readily be inverted or transposed into the probability of who the real killer is. Under Definition 1, therefore, neither number is an "accurate" statement of the probability of a coincidental match.  Neither one expresses the chance that the match to Mr. Puckett is just a coincidence.

A technical note: This description of the probabilities of 1 in 1.1 million and 1 in 3 assumes, for simplicity, that it was the killer's DNA that was found near the victim and later typed and that there was no possibility of error in the DNA typing, no ambiguity in the test results, and no selectivity in presenting them. Statisticians will immediately recognize that Bayes' rule could be used to arrive at the posterior probability of Puckett's innocence.
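To make the reference to Bayes' rule concrete, here is a minimal sketch in Python. It rests on the same simplifying assumptions as the technical note, treats the likelihood ratio as simply 1 divided by the random-match probability, and uses a prior probability that is purely hypothetical. Nothing in the Puckett record supplies such a prior, and the function name is mine.

```python
def posterior_innocence(prior_guilt: float, random_match_prob: float) -> float:
    """Posterior probability of innocence after a reported match.

    Assumes the match is certain if the defendant is the source and occurs
    with probability random_match_prob if he is an unrelated innocent person.
    """
    prior_odds = prior_guilt / (1.0 - prior_guilt)   # odds of guilt before the DNA evidence
    likelihood_ratio = 1.0 / random_match_prob       # P(match | source) / P(match | not source)
    posterior_odds = prior_odds * likelihood_ratio   # Bayes' rule in odds form
    return 1.0 / (1.0 + posterior_odds)


RANDOM_MATCH_PROB = 1 / 1_100_000  # figure quoted from the prosecution's expert

# Hypothetical priors only; they illustrate how strongly the answer depends on them.
for prior in (1e-6, 1e-4, 1e-2):
    print(prior, posterior_innocence(prior, RANDOM_MATCH_PROB))
```

The point of the sketch is only that the posterior probability depends on the prior, which is why neither 1 in 1.1 million nor 1 in 3 can be read directly as the chance that Puckett is innocent.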

Definition 2. The probability of a coincidental match means the chance that Mr. Puckett's DNA would match (and no other DNA in the database would) if he were not the killer and if he were unrelated to the killer.

This definition refers to the probability of the DNA evidence given the hypothesis of coincidence. Again, neither 1 in 1.1 million nor 1 in 3 expresses this value, but 1 in 1.1 million is a far closer estimate than is 1 in 3. The reason is that the DNA evidence includes not merely the datum that Puckett's DNA matches, but the additional information that no one else's does. If Puckett were the only one tested (a database of size 1) and if he were innocent, then the chance that he would match would be 1 in 1.1 million. Now we test an unrelated second person. The chance that this individual would match if he were innocent also is 1 in 1.1 million, and the chance that he would match if he were the killer is 1. The chance that Puckett matches and the other man does not is therefore either (1/1,100,000) x (1 - 1/1,100,000) (if both men are innocent) or (1/1,100,000) x 0 (if Puckett is innocent and the other man is the killer, since the killer would be certain to match). In other words, the probability that Puckett matches just by coincidence (he matches if he is innocent) in a search of a database of size 2 is, at most, 1 in 1.1 million. Searching the database and finding that only Puckett matches is better evidence than testing only Puckett.  (This reasoning is developed more fully, for a database of any size, in, e.g., David H. Kaye, Rounding Up the Usual Suspects: A Legal and Logical Analysis of DNA Database Trawls, 87 N. Car. L. Rev. 425 (2009).)
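The same bound can be checked numerically for larger databases. The sketch below assumes that everyone in the database is innocent and unrelated to the killer and to one another, so that each profile matches independently with probability 1 in 1.1 million. The function name and the database sizes are illustrative only.

```python
def prob_only_one_specified_match(p: float, n: int) -> float:
    """Chance that one specified innocent person matches and the other
    n - 1 innocent, unrelated people in the database do not."""
    return p * (1.0 - p) ** (n - 1)


P = 1 / 1_100_000  # random-match probability quoted in the article

# Illustrative database sizes; none of these is taken from the case record.
for n in (1, 2, 1_000, 100_000):
    value = prob_only_one_specified_match(P, n)
    print(n, value, value <= P)
```

For every database size, the result is no larger than 1 in 1.1 million, which is the sense in which the single-suspect figure is an upper bound under Definition 2.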

Definition 3. The probability of a coincidental match means the chance that one or more DNA profiles in the database would match if no one in the database is the killer.

This definition refers to the probability of one or more hits in the database given that the database is innocent. This probability is approximately 1 in 3. What it has to do with the probability that the DNA in the bedroom was Mr. Puckett's is obscure.  It is not even the expected rate at which searches of innocent databases would lead to prosecutions. After all, the 1 in 3 figure includes people who were not even born in 1972, when Puckett allegedly killed Diane Sylvester. If the probability that applies under Definition 3 were to be admitted, it should be adjusted so that it is not so misleadingly large. See id.; David H. Kaye, People v. Nelson: A Tale of Two Statistics, 7 L., Probability, & Risk 247 (2008).
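For readers who want to see where a number of this size comes from, here is a rough sketch of the Definition 3 calculation. The database size used is an assumption chosen for illustration and is not stated in this post. The sketch also assumes that every profile in the database is innocent, unrelated, and matches independently with probability 1 in 1.1 million.

```python
import math


def prob_at_least_one_match(p: float, n: int) -> float:
    """Chance of one or more matches among n innocent, unrelated profiles."""
    # 1 - (1 - p)**n, written with log1p/expm1 to limit rounding error for tiny p.
    return -math.expm1(n * math.log1p(-p))


P = 1 / 1_100_000           # random-match probability quoted in the article
ASSUMED_DB_SIZE = 338_000   # hypothetical database size, for illustration only

exact = prob_at_least_one_match(P, ASSUMED_DB_SIZE)
expected_matches = ASSUMED_DB_SIZE * P  # simple "database size times p" approximation

print(exact, expected_matches)
```

Under these assumed inputs, both quantities fall between roughly one in four and one in three. Neither says anything about whether the one person who did match is the source of the crime-scene DNA.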

The Birthday Problem

Also contrary to the claim in the California Lawyer, the birthday problem is not involved in Puckett. The birthday problem, in its simplest form, asks for the smallest number of people in a room such that the probability that at least two of them will have birthdays on the same day of the same month exceeds one-half. The answer (23) is surprisingly small because no particular birthday is specified. In the Puckett search, however, a particular DNA profile -- the one from the crime scene -- is specified. Finding that this particular profile matches at least one in the database is much less likely than finding at least one match between all pairs of profiles in the database. The latter event is the kind that is at issue in the birthday problem.  See David H. Kaye, DNA Database Woes: What Is the FBI Afraid Of? (under review). It is not involved in a cold hit to a crime-scene profile.
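The difference between the two kinds of events is easy to see with birthdays themselves. The sketch below assumes 365 equally likely birthdays and ignores leap years. It compares the classic question (do any two people share a birthday?) with the analogue of a database trawl (does anyone match one specified birthday?). The function names are mine.

```python
def prob_any_shared_birthday(n: int, days: int = 365) -> float:
    """Chance that at least two of n people share a birthday (any day)."""
    prob_all_distinct = 1.0
    for k in range(n):
        prob_all_distinct *= (days - k) / days
    return 1.0 - prob_all_distinct


def prob_specified_birthday(n: int, days: int = 365) -> float:
    """Chance that at least one of n people has one particular, specified birthday."""
    return 1.0 - (1.0 - 1.0 / days) ** n


def smallest_group(prob_fn, threshold: float = 0.5) -> int:
    """Smallest group size n for which prob_fn(n) exceeds the threshold."""
    n = 1
    while prob_fn(n) <= threshold:
        n += 1
    return n


print(smallest_group(prob_any_shared_birthday))   # the classic birthday problem
print(smallest_group(prob_specified_birthday))    # matching one specified birthday
```

The first answer is 23, and the second is 253. Specifying the target in advance, as a crime-scene profile does, makes a match far less likely than a match between any pair.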

There are other errors in the California Lawyer article, but I hope I have said enough to caution readers to be wary. The media portrait of the database-trawl issue bears but a faint resemblance to the peer-reviewed statistical literature on the subject.

--DHK

References

Edward Humes, Guilt by the Numbers: How Fuzzy is the Math that Makes DNA Evidence Look So Compelling to Jurors?, California Lawyer, Apr. 2009, at 21-24.

This blog --
The Birthday Problem in Las Vegas, Aug. 11, 2008
DNA Database Woes and the Birthday Problem, July 20, 2008
Rounding Up the Usual Suspects III: People v. Nelson, June 22, 2008
The Transposition Fallacy in the Los Angeles Times, June 8, 2008
The Transposition Fallacy in Brown v. Farwell, May 3, 2008
Rounding Up the Usual Suspects II, May 5, 2008
Rounding Up the Usual Suspects, April 5, 2008

Recent law review articles

David H. Kaye, People v. Nelson: A Tale of Two Statistics, 7 L., Probability, & Risk 247 (2008)
David H. Kaye, Rounding Up the Usual Suspects: A Legal and Logical Analysis of DNA Database Trawls, 87 N. Car. L. Rev. 425 (2009)
