Friday, August 24, 2012

The arithmetic mean is not always appropriate

Although many think of statistics as a very objective field, applying statistics to a data set requires care; otherwise, the resulting conclusions may be misleading. Here we provide examples of how important it is to choose a proper descriptive statistic when measuring the central tendency of a variable.

Restricting ourselves to a scalar variable $x$ (e.g., salaries, the speed of cars on a highway, body weight), let's assume that we have collected data for every member of the population in question. With our data set, the first question one usually asks is 'What is the typical value of our variable?'. Usually, one would think of the arithmetic mean, where we add up all of the numbers and divide by how many numbers are present in the data set ($\frac{1}{N}\displaystyle \sum _{i=1}^N x_i$, where $x_i$ is the value of the $i$th observation or measurement of $x$). While in many cases the arithmetic mean gives a perfectly reasonable measure of the typical value in a data set, the following examples serve to dispel the notion that the arithmetic mean is always an appropriate measure of central tendency.

Case 1: What is the typical salary of an employee at a small company X?
Here, we have the observations $x_i$ of the salary of each person in company X. The arithmetic mean pools all of the money that everyone makes in one year and then divides that pool equally among the employees. Given the salary data in the table below, the arithmetic mean of the annual salary is around \$108,000. Is this really a reasonable measure of what the typical employee makes at company X? Only one of the 14 employees makes more than half of \$108,000. The arithmetic mean is clearly not an appropriate descriptive statistic for the typical value of the annual salary at company X.


Employee                            Annual Salary
CEO                                 \$1,000,000
Computer Scientists (10 of them)    \$45,000 each
Accountant                          \$30,000
Janitor                             \$20,000
Intern                              \$10,000

Instead, the descriptive statistic almost always used to report the central tendency of salaries is the median. The arithmetic mean is very sensitive to outliers, as this example illustrates. The median salary is the one that divides the employees into two equally sized groups: those with lower salaries and those with higher salaries. Sorting the list of salaries and taking the middle (with 14 employees, the two middle values are both \$45,000), we get a median salary of \$45,000, which seems a much more reasonable 'average' salary at company X.
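
The comparison can be checked in a few lines of Python. This is only a sketch using the standard-library `statistics` module; the list name `salaries` is introduced here for illustration and is built directly from the table above.

```python
from statistics import mean, median

# Annual salaries at company X, taken from the table above:
# one CEO, ten computer scientists, an accountant, a janitor, and an intern.
salaries = [1_000_000] + [45_000] * 10 + [30_000, 20_000, 10_000]

mean_salary = mean(salaries)      # pulled upward by the single CEO salary
median_salary = median(salaries)  # middle of the sorted list

print(f"employees: {len(salaries)}")
print(f"mean:   ${mean_salary:,.0f}")
print(f"median: ${median_salary:,.0f}")

# How many employees earn more than half of the mean?
print(sum(s > mean_salary / 2 for s in salaries))
```

Removing the CEO from the list moves the mean dramatically but leaves the median at \$45,000, which is the sensitivity to outliers described above.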

Case 2: What is the typical speed of a set of cars cruising on a highway?
Speed is the distance traveled per unit of time, so any reasonable average speed should equal the total distance traveled by all of the cars divided by the total time they spent traveling. Depending on how the data are collected, the arithmetic mean of the individual speeds may not give this value. Assume that each driver is cruising at a constant speed.

One way to collect data is to observe the distance $d_i$ traveled by each car on the highway after one hour of traveling. Then, the mean speed is the total distance traveled over the total time traveled among all of the cars:
$\dfrac{\displaystyle \sum_{i=1}^N d_i \mbox{ km}}{N \mbox{ traveling hours}} = \frac{1}{N} \displaystyle \sum_{i=1}^N v_i \mbox{ km/hr}$,
which is the same as the arithmetic mean of the speed $v_i$ of each car on the highway during the hour-long journey.

Another way is to measure the time $t_i$ taken by each car to travel a distance of 1 km. Then the mean speed is the total distance traveled over the total time traveled among all of the cars, but we arrive at a different formula:
$\dfrac{N \mbox{ km}}{\displaystyle \sum_{i=1}^N t_i \mbox{ traveling hours}} = \dfrac{N}{\displaystyle \sum_{i=1}^N \frac{1}{v_i}} \mbox{ km/hr}$.
The latter formula is called the harmonic mean of the speeds. In general, the harmonic mean is the appropriate average of a rate when each observation covers the same amount of work (here, the same distance) and what we sample is the time taken to complete it.
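
To make the two data-collection schemes concrete, here is a small Python sketch. The three cruising speeds are made-up values chosen for illustration, and the variable names are not from the text; `mean` and `harmonic_mean` come from the standard-library `statistics` module.

```python
from statistics import mean, harmonic_mean

# Hypothetical constant cruising speeds in km/hr (illustrative values only).
speeds = [60.0, 80.0, 120.0]

# Scheme 1: every car drives for one hour, so d_i = v_i * 1 hr.
distances = [v * 1.0 for v in speeds]
speed_fixed_time = sum(distances) / len(speeds)  # total km / total hours

# Scheme 2: every car drives 1 km, so t_i = 1 km / v_i.
times = [1.0 / v for v in speeds]
speed_fixed_distance = len(speeds) / sum(times)  # total km / total hours

print(speed_fixed_time, mean(speeds))               # both are the arithmetic mean
print(speed_fixed_distance, harmonic_mean(speeds))  # both are the harmonic mean
```

With these numbers the fixed-time average is about 86.7 km/hr while the fixed-distance average is 80 km/hr: the slower cars spend more time covering their kilometer, so they weigh more heavily in the second scheme.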

Case 3: The Human Development Index (HDI)
The Human Development Index (HDI) is "a single statistic which was to serve as a frame of reference for both social and economic development" [1]; it is used to rank the development level of countries around the world. The index incorporates three factors: life expectancy at birth, years of education, and gross national income per capita. [Norway is #1 and the US is #4; look it up.] The old HDI combined these components with an arithmetic mean. However, it recently switched to the geometric mean for combining life expectancy, years of education, and income per capita into the amalgamated HDI.

The geometric mean is defined as:
$GM(x_1,x_2,\ldots,x_N)=(x_1 \cdot x_2 \cdots x_N)^{\frac{1}{N}}$.
It is called the "geometric" mean because the geometric mean of two numbers $a$ and $b$ is the side length of a square whose area equals that of a rectangle with side lengths $a$ and $b$.

The reason the HDI is now computed with the geometric mean stems from a useful property that the arithmetic mean lacks: rankings based on the geometric mean are invariant to how the components are normalized. Think about the drastic difference in scale between the amount of money someone makes (e.g., 60,000) and life expectancy (e.g., 60 years). The arithmetic mean of life expectancy and income would place far greater emphasis on the differences in income between countries, since a 1% change in income (e.g., 600) is large in absolute terms compared with a 1% change in life expectancy (e.g., 0.6 years). The geometric mean is somewhat magical in that it "ensures that a 1% decline in index of say life expectancy at birth has the same impact on the HDI as a 1% decline in education or income index" [1] by virtue of its mathematical property:

$GM \left(\dfrac{X_i}{Y_i} \right)=\dfrac{GM(X_i)}{GM(Y_i)}$.
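
As a quick check of the quoted 1% claim, take three component indices $I_1$, $I_2$, $I_3$ (symbols introduced here for illustration) and reduce any one of them by 1%:

$GM(0.99\, I_1, I_2, I_3) = (0.99 \cdot I_1 \cdot I_2 \cdot I_3)^{\frac{1}{3}} = 0.99^{\frac{1}{3}} \, GM(I_1, I_2, I_3) \approx 0.9967 \, GM(I_1, I_2, I_3)$.

The factor $0.99^{\frac{1}{3}}$ is the same no matter which of the three indices declined or how large that index was to begin with.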

A person reasonably good with numbers would attempt to normalize the data on life expectancy, years of education, and income, and then compute the arithmetic mean. But the choice of normalization reference is somewhat arbitrary here, and it can be shown that the ranking under the arithmetic mean can change depending on the reference value chosen, while the ranking under the geometric mean is invariant to the normalization reference. See [2] for an example.
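
The dependence on the reference can be seen in a toy example. The sketch below is not the UNDP's actual HDI procedure: the three countries, their two indicators, and both sets of reference values are invented for illustration, and normalization is simply division by the reference. It uses `mean` and `geometric_mean` from Python's standard-library `statistics` module (the latter requires Python 3.8 or newer).

```python
from statistics import mean, geometric_mean

# Hypothetical (life expectancy in years, income per capita) pairs; made-up numbers.
countries = {
    "X": (80.0, 10_000.0),
    "Y": (50.0, 30_000.0),
    "Z": (70.0, 15_000.0),
}

def ranking(reference, combine):
    """Rank countries (best first) after dividing each indicator by its reference."""
    scores = {
        name: combine([value / ref for value, ref in zip(values, reference)])
        for name, values in countries.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Two arbitrary choices of reference values (life expectancy, income).
ref_a = (85.0, 40_000.0)
ref_b = (85.0, 100_000.0)

for ref in (ref_a, ref_b):
    print(ref)
    print("  arithmetic:", ranking(ref, mean))
    print("  geometric: ", ranking(ref, geometric_mean))
```

With these numbers the arithmetic-mean ranking reverses completely when the income reference changes from 40,000 to 100,000, while the geometric-mean ranking stays the same, because changing the reference multiplies every country's geometric-mean score by the same constant.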

[1] http://hdr.undp.org/en/statistics/hdi/
[2] http://en.wikipedia.org/wiki/Geometric_mean