Tuesday, June 12, 2012

Simpson's Paradox

The Simpson's Paradox is a non-intuitive phenomena where a correlation that is present in several groups is the opposite of what is found when the groups are amalgamated together. The Simpson's Paradox elucidates the need to be skeptical of reported statistics that may be drastically dependent upon how the data are aggregated [1] and to be aware of lurking variables that may negate a conclusion about what causes the correlation in the data.

The most interesting example comes from a case in 1973 where UC Berkeley was sued for discrimination against women in graduate school admissions. The data of percent acceptance indisputably show that, if a male applies, it is more likely for him to be admitted than if a female applies (44% vs. 35%). At first glace, one may propose the causal conclusion that Berkeley is biased against females.

However, if we partition the data by department to investigate the most discriminatory department, we reveal that, in 4/6 of the departments, a female applicant is more likely to be accepted than a male applicant. In the remaining two departments, the disparity between men and women is not nearly as drastic as the amalgamated data above. This data refute the causal conclusion that Berkeley has a significant bias against women.

The reason for this reversal of correlation in the aggregated data set by partitioning it [Simpson's paradox] is because of a lurking variable that had not been considered when the law suit was filed, namely the department to which one applies. Let us look at the number of males and females that apply to each particular department. We see that the least competitive departments A and B are heavily dominated by male applicants, while the most competitive departments E and F are dominated by female applicants.

The reason that, in the amalgamated data, a significantly higher percentage of male applicants are accepted than women, is that females applied to more competitive departments than the males did. Thus, as a whole, it was more likely that a male applicant would be accepted to Berkeley. But, this is because, according to the data, a woman was more likely to apply to a department that has a lower average acceptance rate.

Several other examples, such as batting averages, kidney stone treatments, and birth weights, of a real-life Simpson's paradox can be found on the Wikipedia page [2] where this data were taken from.

[1] P. J. Bickel, E. A. Hammel, J. W. O'Connell. Sex Bias in Graduate Admissions: Data from Berkeley. Science 187, (4175). 1975. pp. 398-404.
[2] http://en.wikipedia.org/wiki/Simpson's_paradox


  1. I spent a few fun hours coming up with synthetic numbers so that the reversal could, with further variables, be reversed, and then spun a story around them


    1. Nice-- I thought about making up some numbers for this post too, but I wanted to use real data. The Simpson's paradox here is not perfect because departments C and E actually do have higher percentages of the women admitted than men.