Friday, August 24, 2012

The arithmetic mean is not always appropriate

Although many think of statistics as a very objective field, applying statistics to a data set requires care; otherwise, the resulting conclusions may be misleading. Here we provide examples of how important it is to choose a proper descriptive statistic when measuring the central tendency of a variable.

Restricting ourselves to a scalar variable $x$ (e.g., salaries, the speed of cars on a highway, body weight), let's assume that we have collected data for every member of the population in question. With our data set, the first question one usually asks is 'What is the typical value of our variable?'. Usually, one would think of the arithmetic mean, where we add up all of the numbers and divide by how many numbers are present in the data set ($\frac{1}{N}\displaystyle \sum _{i=1}^N x_i$, where $x_i$ is the value of the $i$th observation or measurement of $x$). While in many cases the arithmetic mean gives a perfectly reasonable measure of the typical value in a data set, the following examples serve to break the stereotype that the arithmetic mean is always an appropriate measure of central tendency.

Case 1: What is the typical salary of an employee at a small company X?
Here, we have the observations $x_i$ of the salary of each person in company X. The arithmetic mean aggregates all of the money that everyone makes in one year into a large pool, and then divides the money equally among the employees. Given the salary data in the table below, the arithmetic mean of the annual salary is around \$108,000. Is this really a reasonable measure of the central tendency of what the typical employee makes at company X? Only one of the 14 employees (the CEO) makes even half of \$108,000. The arithmetic mean is clearly not an appropriate descriptive statistic for the typical annual salary at company X.

Employee                          Annual Salary
CEO                               $1,000,000
Computer Scientists (10 of them)  $45,000
Accountant                        $30,000
Janitor                           $20,000
Intern                            $10,000

Instead, the descriptive statistic almost always used to report the central tendency of salaries is the median. The arithmetic mean is very sensitive to outliers, as this example illustrates. The median salary is the one that divides the employees into two equally sized groups: those with lower salaries and those with higher salaries. Sorting the list of salaries and choosing the one in the middle, we get a median salary of \$45,000, which seems a much more reasonable 'average' salary at company X.
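The gap between the two statistics is easy to check; here is a quick sketch in Python using the salary table above:

```python
import statistics

# Salary data from the table above: company X has 14 employees
salaries = [1_000_000] + [45_000] * 10 + [30_000, 20_000, 10_000]

mean_salary = statistics.mean(salaries)      # dragged upward by the CEO's outlier
median_salary = statistics.median(salaries)  # middle value of the sorted list

print(f"mean:   ${mean_salary:,.0f}")
print(f"median: ${median_salary:,.0f}")
```

The mean comes out to roughly \$107,857 while the median is \$45,000: the CEO's salary moves the mean dramatically but does not move the median at all.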

Case 2: What is the typical speed of a set of cars cruising on a highway?
Speed is defined as the distance traveled in a given unit of time, and any reasonable average speed should reflect the total distance traveled by the cars divided by the total time they spent traveling. Depending on how the data are collected, the arithmetic mean may or may not be appropriate. Assume that each driver is cruising at a constant speed.

One way to collect data is to observe the distance $d_i$ traveled by each car on the highway after one hour of traveling. Then, the mean speed is the total distance traveled over the total time traveled among all of the cars:
$\dfrac{\displaystyle \sum_{i=1}^N d_i \mbox{ km}}{N \mbox{ traveling hours}} = \frac{1}{N} \displaystyle \sum_{i=1}^N v_i \mbox{ km/hr}$,
which is the same as the arithmetic mean of the velocity $v_i$ of each car on the highway during the hour long journey.

Another way is to measure the time $t_i$ taken by each car to travel a distance of 1 km. Then the mean speed is the total distance traveled over the total time traveled among all of the cars, but we arrive at a different formula:
$\dfrac{N \mbox{ km}}{\displaystyle \sum_{i=1}^N t_i \mbox{ hr}} = \dfrac{N}{\displaystyle \sum_{i=1}^N \frac{1}{v_i}} \mbox{ km/hr}$.
The latter formula is called the harmonic mean of the velocity. It turns out that the harmonic mean is the better choice for finding the typical rate of a process when what we sample are the times taken to complete a fixed amount of the process.
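To see the two formulas disagree on the same set of cars, here is a small sketch; the two speeds are made-up illustrative numbers:

```python
import statistics

# Two cars cruising at constant speeds (illustrative numbers, in km/h)
speeds = [60, 120]

# Sampling distances after one hour -> the arithmetic mean applies
arithmetic = statistics.mean(speeds)         # (60 + 120) / 2 = 90 km/h

# Sampling times per 1 km -> the harmonic mean applies
harmonic = statistics.harmonic_mean(speeds)  # 2 / (1/60 + 1/120) = 80 km/h

# Sanity check: total distance / total time for the per-km sampling
total_time = 1 / 60 + 1 / 120                # hours for each car to cover 1 km
assert abs(2 / total_time - harmonic) < 1e-9
```

The slower car spends more time on its kilometer and so carries more weight, which is why the harmonic mean sits below the arithmetic mean.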

Case 3: The Human Development Index (HDI)
The Human Development Index (HDI) is "a single statistic which was to serve as a frame of reference for both social and economic development" [1] that ranks the development level of countries around the world. The index incorporates three factors: life expectancy at birth, years of education, and gross national income per capita. [Norway is #1 and the US is #4, look it up.] The old HDI combined the three factors with an arithmetic mean. However, it recently changed to use the geometric mean to combine the life expectancy, years of education, and income per capita into the amalgamated HDI.

The geometric mean is defined as:
 $GM(x_1,x_2,\ldots,x_N)=(x_1 \cdot x_2 \cdots x_N)^{\frac{1}{N}}$.
It is called the "geometric" mean because, in two dimensions, the geometric mean of two numbers $a$ and $b$ is the side length of a square whose area equals that of a rectangle with sides of length $a$ and $b$.

The reason the HDI is now computed with the geometric mean stems from a useful property that the arithmetic mean lacks: the geometric mean is invariant to normalization. Think about the drastic difference in scale between the amount of money someone makes (e.g., \$60,000) and a life expectancy (e.g., 60 years). The arithmetic mean of the life expectancy and income would place a greater emphasis on differences in income between countries, since a 1% change in income is large (e.g., 600) in comparison to a 1% change in life expectancy (e.g., 0.6). The geometric mean is somewhat magical in that it "ensures that a 1% decline in index of say life expectancy at birth has the same impact on the HDI as a 1% decline in education or income index" [1] by virtue of its mathematical property: 

$GM \left(\dfrac{X_i}{Y_i} \right)=\dfrac{GM(X_i)}{GM(Y_i)}$.

A person reasonably good with numbers would attempt to normalize the data on the life expectancy, years of education, and income, and then compute the arithmetic mean. But, the normalization reference chosen is somewhat arbitrary here, and it can be shown that the ranking using the arithmetic mean changes depending on the reference value chosen for the normalization, while the ranking under a geometric mean is invariant to the normalization reference. See [2] for an example.
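A sketch of that ranking experiment, with made-up figures for two hypothetical countries and two arbitrary income reference values (only life expectancy and income are included, to keep it short):

```python
import statistics  # geometric_mean requires Python 3.8+

# Hypothetical countries: (life expectancy in years, income per capita in $)
countries = {"A": (80, 10_000), "B": (70, 20_000)}

def am_score(life, income, ref_life, ref_income):
    # Normalize each component by a reference value, then take the arithmetic mean
    return statistics.mean([life / ref_life, income / ref_income])

def gm_score(life, income, ref_life, ref_income):
    # Same normalization, but combined with the geometric mean
    return statistics.geometric_mean([life / ref_life, income / ref_income])

for ref_income in (20_000, 100_000):  # two arbitrary normalization references
    am = {c: am_score(*v, 85, ref_income) for c, v in countries.items()}
    gm = {c: gm_score(*v, 85, ref_income) for c, v in countries.items()}
    print(f"ref income ${ref_income:,}: "
          f"AM puts B first: {am['B'] > am['A']} | "
          f"GM puts B first: {gm['B'] > gm['A']}")
```

With these numbers the arithmetic-mean ranking flips when the income reference changes, while the geometric-mean ranking does not: dividing both components by reference constants scales every country's geometric-mean score by the same factor, leaving the ordering untouched.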

 [1] [2]

Saturday, August 18, 2012

How a statistically inept jury led to a wrongful conviction

In 1999, Sally Clark of Britain was wrongly convicted of murdering her two infant sons, who actually died of sudden infant death syndrome (SIDS). "SIDS is the unexpected, sudden death of a child under age 1 in which an autopsy does not show an explainable cause of death" [1]. The case was overturned a little more than three years later and Sally was released. In 2007, Sally was tragically found dead in her home from acute alcohol intoxication.

An "expert" witness pediatrician Roy Meadow served for the case, and he is thought to have played a major role in convincing the jury of the Sally Clark's guilty verdict by making two major statistical errors in accessing the probability of Sally Clark's innocence. [In my opinion, using arguments of probability instead of hard evidence for convicting criminals is not so just, but let us proceed with this premise regardless.]

The assumption of independence.
Roy Meadow's Claim 1: Data indicate that 1/8,543 infants born from mothers in class A* die of SIDS. Sally belongs to class A. Thus, the probability of Sally's first and second child dying of SIDS is (1/8,543)(1/8,543) = 1 in 73 million.

The data say that, given a randomly selected infant born from a mother in class A, the probability that this newborn will die of SIDS is 1/8,543. The probability that Sally's first child would die of SIDS is then 1/8,543. However, the probability that her second infant would die of SIDS is not 1/8,543 (which is what Roy Meadow assumed to obtain the 1 in 73 million) because we now know something more about Sally: her first child had died of SIDS. In fact, since SIDS is likely due to genetic factors, the probability of Sally's second child dying of SIDS is almost certainly much higher than 1/8,543 because we now know Sally might carry a genetic element that predisposes her children to SIDS. Instead, Roy Meadow made the assumption that Sally's second infant dying of SIDS is completely independent of the event of her first infant's death by SIDS and declared the probability of her second son dying as 1/8,543 in calculating the 1 in 73 million probability.
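The arithmetic is easy to check. In the sketch below, the 1-in-100 conditional figure is purely hypothetical, inserted only to show how sensitive the result is to the independence assumption:

```python
p_first = 1 / 8_543          # P(an infant in class A dies of SIDS), from the data
independent = p_first ** 2   # Meadow's assumption: the two deaths are independent
print(f"assuming independence: 1 in {1 / independent:,.0f}")

# If a first SIDS death raised the risk for a sibling to, say, a purely
# hypothetical 1 in 100, the joint probability would be vastly larger:
dependent = p_first * (1 / 100)
print(f"with dependence:       1 in {1 / dependent:,.0f}")
```

The first line reproduces Meadow's 1-in-73-million figure; the second shows that under the (hypothetical) dependent risk, the two deaths are roughly a one-in-854,300 event, nearly a hundred times more likely.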

The prosecutor's fallacy.
Roy Meadow's Claim 2: The probability that two infants of the same mother both naturally die of SIDS (not by murder) is very small. Thus, the probability that Sally Clark is innocent is comparably small.

For simplicity, let us neglect all other possible explanations for the deaths of Sally Clark's two infants and consider that one of two things happened: (1) Sally Clark murdered both of her infants. (2) Both infants died of SIDS. Let us denote two events as:
$I$: Sally Clark is innocent.
$E$: the evidence that two of Sally's infants died is observed.

$P(E | I)$ is the probability that, given that Sally did not murder her two infants, the evidence would be observed. Since we are neglecting any other explanations of the death of Sally's two infants, this corresponds to hypothesis (2). Even if Roy Meadow's underestimate of $P(E|I)$ in Claim 1 were correct, we can still refute his Claim 2.

$P(I | E)$ is the probability that, given the evidence, Sally Clark is innocent. $P(I | E)$ is the most important quantity for the jury to know as a basis for its decision, and it is the quantity that Roy Meadow wrongly assumed to be equal to $P(E | I)$. This is precisely the prosecutor's fallacy: assuming that $P(I | E) = P(E | I)$. In words, the fallacy is that, because the likelihood of Sally's two children both dying without her murdering them is very small, the probability of Sally's innocence given the observed evidence is comparably small.

We show that they are not equal using Bayes' Theorem, derived in Why cocaine users should learn Bayes' Theorem. "Bayes Theorem allows you to separate how likely alternative explanations of an event are, from how likely it was that the event should have happened in the first place. [2]" Relating the conditional probabilities $P(E | I)$ and $P(I | E)$, we immediately see that they are not equal, but carry on to investigate how exactly they differ:

$P(I | E) = \dfrac{P(E | I) P(I) }{ P(E)}$.

Next, rewriting the probability that the evidence is observed by considering the only two ways it may be observed, namely Sally being innocent or not innocent: $P(E) = P(E | I) P(I) + P(E | \neg I)P(\neg I)$, where $\neg$ denotes negation, so that $P(\neg I) = 1 - P(I)$ is the probability that Sally is not innocent. Substitute this expression for $P(E)$ into Bayes' Theorem above to arrive at:

$P(I | E)=\dfrac{P(E | I)P(I)}{P(E | I) P(I) + P(E | \neg I) P(\neg I)}$.

Well, $P(E | \neg I)$ is the probability that the evidence would be observed given that Sally murdered her two children-- under our simplification, this is one. Dividing the numerator and denominator of the right-hand side by $P(\neg I)$ brings in the ratio of the probability of an event happening to that of it not happening-- the odds, $\mathrm{odds}(I) = P(I)/P(\neg I)$:

$P(I | E) = \dfrac{P(E | I)\,\mathrm{odds}(I)}{P(E | I)\,\mathrm{odds}(I) + 1}$,

and, according to the prosecutor's claim, $P(E | I)$ is very small; as long as the product $P(E | I)\,\mathrm{odds}(I)$ remains well below one, the denominator is close to one and:

$P(I | E) \approx P(E | I)\,\mathrm{odds}(I)$

$\mathrm{odds}(I)$ is not conditioned on any evidence. Although we don't know its value, the odds that a random mother from class A (Sally is essentially this random mother, since we have no other evidence about her) will not murder her two children are much larger than one. The large $\mathrm{odds}(I)$ makes $P(I | E) \gg P(E | I)$-- invalidating Roy Meadow's Claim 2. Just because the probability of observing the evidence if the defendant were innocent is very small, the probability of the defendant being innocent given the evidence-- the quantity actually valuable to a jury-- is not necessarily small. The analysis in [2] estimates $P(I | E)$ to be 2/3 in Sally's case-- a far cry from the 1 in 73 million figure obtained from making two serious statistical errors that ruined someone's life.
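The relation above is easy to play with numerically. A minimal sketch-- the odds value below is purely hypothetical (double infanticide is presumably extremely rare), chosen only to show how far $P(I|E)$ can sit from $P(E|I)$:

```python
def posterior_innocence(p_e_given_i, odds_i):
    # Exact form derived above: P(I|E) = P(E|I)*odds(I) / (P(E|I)*odds(I) + 1)
    return p_e_given_i * odds_i / (p_e_given_i * odds_i + 1)

p_e_given_i = 1 / 73_000_000  # Meadow's (flawed) figure for P(E|I)
odds_i = 1_000_000_000        # hypothetical prior odds that a mother is innocent

print(posterior_innocence(p_e_given_i, odds_i))
```

With these numbers the posterior probability of innocence is about 0.93-- even taking Meadow's 1-in-73-million figure at face value, the quantity the jury actually cares about points the other way entirely.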

[2] A much better (but longer) account than mine here.
*Class A: infants born from non-smoking, affluent families with a mother over 26 [2].

Friday, August 10, 2012

The Kelly Betting Strategy

The Kelly betting strategy maximizes the expected growth rate of an investment when making a series of bets in which one has an advantage. The strategy is well known in economics (as well as in gambling) and plays a huge role in real-life investing. 
"the Kelly criterion is integral to the way we manage money." -Legg Mason Capital Management CEO Bill Miller [1].
As an example, consider a game where a single die is rolled. An investor is willing to make a series of bets with us on each roll: we bet that a 6 will be rolled; the investor bets that it will not. If we put down a bet for a particular roll and win, the investor gives us back $x$ times the amount we bet (this includes the money we initially put down). If we lose, the investor keeps our money. Let $p$ be the probability that we win the bet-- in this case $\frac{1}{6}$.

Before we think of a betting strategy, we first need to confirm that the expected money that we win in one bet is positive. Otherwise, it would be improvident for us to make any bets against this investor. Taking a basis of a one dollar bet and considering that we gain \$($x-1$) with probability $p$ (a win) and lose \$1 with probability $1-p$:
E(\$ you earn in a single bet of \$1) $= p \cdot \$(x-1) - (1-p) \cdot \$1 = \$(px-1)$.

We need $px>1$ for the law of large numbers to dictate that we will have a net gain after placing many bets. Thinking of $p$ as fixed, if $x$ is too small and drives $px<1$-- the investor is not willing to give us a good payoff* if we win-- then we are expected to lose money, because the investor has the advantage: he risks little of his money in the bet but gets a relatively large reward from our bet if he wins. Thinking of $x$ as fixed, we need the probability of winning a single bet, $p$, to be large enough to drive $px>1$-- if we are very unlikely to win the bet, it would be shortsighted for us to bet at all. 

Let’s assume that the investor offers us a payoff such that $px>1$. You have a bank account with $V_0$ dollars that you intend to invest. The dilemma is this: you definitely want to bet that a 6 is rolled, since $px>1$ guarantees a positive return after a large number of bets. If you win a bet, you should bet some of the extra money you just won in the next bet to increase the magnitude of your return; $px>1$ after all (akin to compounding interest). However, you don’t want to bet your entire investment pool each time, or you will eventually lose your initial $V_0$ investment as well as any money you won up to that point when a number other than 6 is rolled (which is quite likely to happen on any given roll).

At the two extremes, (1) you bet nothing on every bet $k$ and lose the opportunity to gain money, or (2) you bet everything on every bet $k$ and risk losing your entire savings (you certainly will eventually), after which you cannot place more bets. The Kelly betting system concerns the strategy in which, for every bet $k$, you bet a fixed fraction $\alpha$ of your current investment pool**. Clearly, there is an optimum fraction $\alpha$ of your bank account to bet each round in order to maximize the long-run growth rate of your investment. Let’s find it, following the derivation in the excellent book [2].
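Before deriving the optimum, a quick simulation illustrates the two extremes; the payoff $x=8$ (so that $px = 4/3 > 1$) and the fraction 0.05 are arbitrary choices of mine:

```python
import random

random.seed(0)
p, x = 1 / 6, 8.0  # win probability and payoff multiple, so px = 4/3 > 1

def final_wealth(alpha, n_bets=1_000, v0=1.0):
    v = v0
    for _ in range(n_bets):
        stake = alpha * v
        v -= stake                # put the stake on the table
        if random.random() < p:   # a 6 is rolled: we win back stake * x
            v += stake * x
        if v == 0.0:              # ruined: no further bets are possible
            break
    return v

print(final_wealth(alpha=1.0))   # betting everything: ruined at the first loss
print(final_wealth(alpha=0.05))  # a moderate fixed fraction stays in the game
```

Betting the whole pool is wiped out the first time any number other than 6 comes up, while a modest fixed fraction survives to compound its wins.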

The random variable in this process is $R_k$, which we define by:
$R_k= \begin{cases} 1, & \mbox{a 6 is rolled on the } k\mbox{th roll} \\ 0, & \mbox{a 6 is not rolled} \end{cases}$.

Let $V_k$ be the size of our investment pool after the $k$th bet ($V_0$ fits in this notation). After the first bet, we have an investment pool of $V_1= V_0 ( 1-\alpha) + V_0 \alpha x R_1$ since we keep the $V_0(1-\alpha)$ of our bank account that we did not bet and get back $V_0 \alpha x$ only if we win ($R_1=1$ in this case). That is, $V_1=V_0(1-\alpha +\alpha x R_1)$. After the second bet, $V_2=V_1(1-\alpha) +V_1 \alpha x R_2$ since we keep the $V_1(1-\alpha)$ that we did not bet and get back $V_1 \alpha x$ only if we win. We trace back to $V_0$ by our expression for $V_1$:
$V_2=V_1 (1-\alpha+ \alpha x R_2)$
$V_2=V_0 (1-\alpha+ \alpha x R_1) (1-\alpha+ \alpha x R_2)$ and if we continue:
$V_k=V_0 (1-\alpha+ \alpha x R_1) (1-\alpha+ \alpha x R_2)\cdots (1-\alpha+ \alpha x R_k)$.

This is where a trick comes in (sorry). Let us define a growth rate $G_k$ such that we can write an exponential formula for the growth of our investment pool:
$V_k=V_0 e^{kG_k}$.
Taking the natural logarithm of both sides, we find that $G_k=\frac{1}{k} \log \left(\dfrac{V_k}{V_0} \right)$. We know exactly what $\dfrac{V_k}{V_0}$ is from two expressions above! Plugging this in and using a property of logarithms, that the log of a product of terms is the sum of the log of each term, we elucidate why we defined our growth factor this way:
$G_k = \frac{1}{k} \log \left((1-\alpha+ \alpha x R_1) (1-\alpha+ \alpha x R_2)\cdots (1-\alpha+ \alpha x R_k) \right)$
     $= \frac{1}{k} \displaystyle \sum_{n=1}^{k} \log (1-\alpha +\alpha x R_n)$.
The above is just the average value that $\log (1-\alpha +\alpha x R_n)$ takes on after $k$ trials. Using the law of large numbers and letting $k \rightarrow \infty$, this is the expected value of $\log (1-\alpha +\alpha x R)$. We calculate this by knowing that $R=1$ with probability $p$ and $R=0$ with probability $1-p$:
$G_k=E(\log (1-\alpha +\alpha x R))= p \log (1-\alpha +\alpha x)+ (1-p) \log (1-\alpha)$
Noting that $G_k=G_k(\alpha)$ is a function of the betting fraction $\alpha$, we differentiate the growth rate with respect to $\alpha$ and set the derivative to zero (get out your paper): $\dfrac{dG_k}{d\alpha} = \dfrac{p(x-1)}{1-\alpha+\alpha x} - \dfrac{1-p}{1-\alpha} = 0$. Solving for $\alpha$, we finally get the optimum betting fraction in terms of the payoff for the bet and the probability of winning each bet:
$\boxed{\alpha = \dfrac{px-1}{x-1}}$.
The numerator $px-1>0$ is the expected return on a bet of one dollar: the larger the expected return, the larger the optimum betting fraction, which is intuitive. Of course, $x>1$, or we would be giving out money. As the investor is willing to pay off more, $\alpha$ approaches $p$ (take the limit as $x \rightarrow \infty$).
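As a numerical check of the boxed formula, we can evaluate $G(\alpha)$ on a grid and confirm the maximizer; the die game with an assumed payoff of $x=8$ gives $\alpha = \frac{1}{21}$:

```python
import math

def growth_rate(alpha, p, x):
    # G(alpha) = p*log(1 - alpha + alpha*x) + (1 - p)*log(1 - alpha)
    return p * math.log(1 - alpha + alpha * x) + (1 - p) * math.log(1 - alpha)

p, x = 1 / 6, 8.0              # die game with an assumed payoff multiple of 8
kelly = (p * x - 1) / (x - 1)  # boxed formula: (px - 1)/(x - 1) = 1/21

# Brute-force check: scan betting fractions and keep the best growth rate
grid = [i / 10_000 for i in range(1, 10_000)]
best = max(grid, key=lambda a: growth_rate(a, p, x))

print(kelly, best)  # the grid maximizer lands next to 1/21 ≈ 0.0476
```

The same scan also makes it easy to see how costly over-betting is: $G(\alpha)$ turns negative well before $\alpha$ reaches 1.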

* the payoff odds are defined to be $x-1$, since this is really the money that you gain in the case of a win.
**investment pool is $V_0$ + whatever cash you won from all previous bets – whatever cash you lost from all previous bets. It is akin to buying stock and reinvesting dividends.

[2] Understanding Probability by Henk Tijms.