Tuesday, June 19, 2012

Why cocaine users should learn Bayes' Theorem

Diagnostic tests for diseases and drugs are not perfect. Two common measures of test efficacy are sensitivity and specificity. Sensitivity is the probability that the test correctly identifies a drug user as positive. Specificity is the probability that a drug-free person tests negative. Even if the sensitivity and specificity of a drug test are remarkably high, false positives can outnumber true positives when drug use in the population is low.

As an illustrative example, consider a test for cocaine that has 99% specificity and 99% sensitivity. Given a population in which 0.5% of people are cocaine users, what is the probability that a person who tested positive for cocaine is actually a cocaine user? The answer: about 33%. In this scenario with reasonably high sensitivity and specificity, two thirds of the people who test positive for cocaine are not cocaine users.

To calculate this counter-intuitive result, we need Bayes' Theorem. A geometric derivation uses a Venn Diagram representing the event that a person is a drug user and the event that a person tests positive as two circles, each of area equal to the probability of the particular event occurring when one person is tested: $P(\mbox{user})$ and $P(+)$, respectively. Since these events can both happen when a person is tested, the circles overlap, and the area of the overlapping region is the probability that the events both occur [$P(\mbox{user and }+)$].

We are interested in the probability that a person who tests positive is indeed a drug user, $P(\mbox{user} | +)$. (Read the bar as "given that"; this is a 'conditional probability'.) To write a formula for it, we restrict ourselves to the world of the positive test circle. The fraction of positive tests that come from actual drug users is the fraction of the '+ test' circle that is overlapped by the 'drug user' circle:
$P(\mbox{user} | +) = \dfrac{P(\mbox{user and } +)}{ P(+)}$.

We bring the sensitivity into the picture by considering the fraction of the drug users circle that is occupied by positive test results:
$P(+ | \mbox{user}) = \dfrac{P(\mbox{user and }+)}{P(\mbox{user})}$.

Equating the two different ways of writing the joint probability $P(\mbox{user and }+)$, we derive Bayes' Theorem:
$P(\mbox{user} | +) = \dfrac{P(+ | \mbox{user}) P(\mbox{user})}{P(+)}$.

We can already see that, in a population with low drug use, the sensitivity is multiplied by a small number, $P(\mbox{user})$. Since we do not directly know $P(+)$, we rewrite it by considering the two mutually exclusive and exhaustive ways a person can test positive: as a drug user or as a non-user. We weight each conditional probability by the probability of the corresponding group:
$P(+) = P(+ | \mbox{user}) P(\mbox{user}) + P(+ | \mbox{non-user}) P(\mbox{non-user})$
$= P(+ | \mbox{user}) P(\mbox{user}) + [1 - P(- | \mbox{non-user})] [1-P(\mbox{user})]$
The specificity enters as $P(- | \mbox{non-user})$, and substituting the known values gives $P(+) = (0.99)(0.005) + (0.01)(0.995) = 0.0149$. Finally, using Bayes' Theorem, we calculate the probability that a person who tests positive is actually a drug user:
$P(\mbox{user} | +) = \dfrac{(99\%) (0.5\%) }{ (1.49\%) } \approx 33\%$.
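
The arithmetic above can be checked numerically. Here is a minimal Python sketch; the function name `posterior_user_given_positive` and its parameter names are made up for illustration and do not come from any library.

```python
def posterior_user_given_positive(sensitivity, specificity, prevalence):
    """Return P(user | +) using Bayes' Theorem."""
    # P(+ | non-user) is the false positive rate, 1 - specificity.
    false_positive_rate = 1.0 - specificity
    # P(+), summed over the two ways of testing positive (user, non-user).
    p_positive = sensitivity * prevalence + false_positive_rate * (1.0 - prevalence)
    return sensitivity * prevalence / p_positive


p = posterior_user_given_positive(sensitivity=0.99, specificity=0.99, prevalence=0.005)
print(f"P(user | +) = {p:.1%}")  # prints: P(user | +) = 33.2%
```

Changing `prevalence` in the call shows how strongly the answer depends on how common drug use is in the tested population.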

The reason for this surprising result is that most people who are tested (99.5%) are not actually drug users, so even the small probability that the test incorrectly identifies a non-user as positive produces a substantial number of false positives. While the test is good at correctly identifying the cocaine users, this group is so small in the population that the total number of positives from cocaine users ends up being smaller than the number of positives from non-users. This result has important implications when zero-tolerance drug policies based on drug tests are implemented in the workforce.
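
One way to make this concrete is to count people instead of multiplying probabilities. The sketch below assumes a hypothetical group of 100,000 people tested (the group size is an arbitrary choice for illustration) and applies the same 0.5% prevalence, 99% sensitivity, and 99% specificity.

```python
population = 100_000  # hypothetical number of people tested

users = population * 5 // 1000          # 0.5% prevalence -> 500 users
non_users = population - users          # 99,500 non-users

true_positives = users * 99 // 100      # 99% sensitivity -> 495
false_positives = non_users * 1 // 100  # 1% false positive rate -> 995

# Fraction of all positive tests that come from actual users.
share = true_positives / (true_positives + false_positives)
print(true_positives, false_positives, round(share, 3))  # prints: 495 995 0.332
```

Roughly two false positives appear for every true positive, which is the 33% figure seen from a different angle.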

The same idea holds for diagnostic tests for rare diseases: the number of false positives can exceed the number of positives from people who actually have the disease.

[1] http://en.wikipedia.org/wiki/Bayes'_theorem See 'drug testing'. This is where I obtained the example.

1. I'm probably being pedantic here, but what you call selectivity is usually called sensitivity.

1. Thanks for correcting this.

2. So why should cocaine addicts learn the theorem? Did I miss something?

Based on what you're saying, it seems like random drug test enforcers are the ones who should learn the theorem...

1. The cocaine addicts should learn it presumably so that when they test positive they can use Bayes' Theorem to protest their innocence.

3. I think e meant that if you test positive, you can claim that it was a false positive?

But then they would just test you again... what are the chances of two false positives?

1. Yes, this was my justification for choosing the glamorous title.

That is a good question. The result would depend on what the actual cause of a false positive may be. If you allow me to make something up: Say two people that are not drug users tested positive. Person 1 tested positive because of a combination of something he drank and ate the night before. Person 2 tested positive because there is something chemically different about his blood (?) that makes the testing kit react the way it would if he were a drug user. Then, person 1 would probably pass the second test, but person 2 would probably fail the second test again.
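
For the simplest case, suppose the two tests are independent given whether the person is a user. This is a strong assumption: it holds for someone like person 1 but fails for person 2. Under that assumption, the posterior after the first positive becomes the prior for the second test. A rough Python sketch, with a hypothetical helper `update` written just for this illustration:

```python
def update(prior, sensitivity=0.99, specificity=0.99):
    """One Bayesian update of P(user) after a single positive test."""
    true_pos = sensitivity * prior
    false_pos = (1.0 - specificity) * (1.0 - prior)
    return true_pos / (true_pos + false_pos)


after_first = update(0.005)         # prior is the 0.5% prevalence
after_second = update(after_first)  # assumes the retest is independent
print(f"{after_first:.1%}, {after_second:.1%}")  # prints: 33.2%, 98.0%
```

So if false positives were purely random, a second positive would raise the probability of being a user to about 98%. If the false positive has a persistent cause, as with person 2, the second test adds much less information than this.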

4. "... most (99.5%) people that are tested are not actually drug users ..."

This is a different statement than "Given a population of 0.5% cocaine users..." since (especially in the criminal context) drug tests aren't performed on random members of the general population. What you're really assuming is "Given a population where 99.5% of drug tests are performed on non drug users."

What you really want is the proportion of test subjects who are drug users, which I think you'd have to derive from the test results themselves, controlled for the expected # of false positives.

1. That's a good point, though there are scenarios in which the tested population is close to a random sampling of the general population. For example, pre-employment drug screening, or testing to establish eligibility for life insurance.
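
If the fraction of tests that come back positive is known, the formula for $P(+)$ in the post can be solved for $P(\mbox{user})$ to estimate the proportion of users among the people tested. This inversion is sometimes called the Rogan-Gladen estimator. A minimal sketch, assuming the sensitivity and specificity are known exactly; the function name `estimate_prevalence` is hypothetical:

```python
def estimate_prevalence(positive_rate, sensitivity, specificity):
    """Solve P(+) = sens * p + (1 - spec) * (1 - p) for p = P(user)."""
    false_positive_rate = 1.0 - specificity
    estimate = (positive_rate - false_positive_rate) / (sensitivity - false_positive_rate)
    # Sampling noise can push the raw estimate outside [0, 1], so clamp it.
    return min(1.0, max(0.0, estimate))


print(round(estimate_prevalence(0.0149, 0.99, 0.99), 4))  # prints: 0.005
```

Plugging in the 1.49% positive rate from the post recovers the 0.5% prevalence, as it should.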

5. A great real-life problem to illustrate Bayes' law.
I am suggesting Venn Pie diagrams to show the ratios:

http://oracleaide.wordpress.com/2012/12/26/a-venn-pie/