Monday, December 24, 2012

The Principle of Maximum Entropy

The principle of maximum entropy is invoked when we have some piece(s) of information about a probability distribution, but not enough to characterize it completely-- likely because we do not have the means or resources to do so. As an example, if all we know about a distribution is its average, we can imagine infinite shapes that yield a particular average. The principle of maximum entropy says that we should humbly choose the distribution that maximizes the amount of unpredictability contained in the distribution, under the constraint that the distribution matches the average that we measured. Taking the idea to the extreme, it wouldn't be scientific to choose a distribution that simply yields the average value 100% of the time. Below, we define entropy and show how it can be interpreted as unpredictability or uninformativeness.

Take a sample space of $n$ events, where event $i$ occurs with probability $p_i$. The surprisal of event $i$ is defined as $-\log{p_i}$. Since $p_i \in [0,1]$, the surprisal runs monontonically from infinity to zero. This is intuitive because, if an event $i$ will occur with certainty ($p_i=1$), we will be zero surprised when we see it occur; if an event $i$ cannot possibly occur ($p_i=0$), we will be infinitely surprised. Why choose the logarithm as the monotonically increasing function from zero to infinity? If we have two independent events $i$ and $j$, the probability of, after two observations, seeing event $i$ and then event $j$ is $p_i p_j$. Via the famous property of the logarithm, the surprisal to see this occur is $\log{p_i p_j}=\log{p_i}+\log{p_j}$, making the surprisals additive.

Entropy is a characteristic of not an event, but of the entire probability distribution. Entropy is defined as the average surprisal in the entire distribution $<-\log{p_i}>=-\displaystyle \sum _{i=1}^n p_i \log{p_i}$. The entropy is a measure of how uninformative a given probability distribution is-- a high entropy translates to high unpredictability. Thus, maximizing entropy is consistent with maximizing unpredictability, given the little information we may know about a distribution. The most informative distribution we can imagine is where we know that an event will occur 100% of the time, giving an entropy of zero. The least informative distribution we can imagine is a uniform distribution, where each event in the sample space has an equal chance of occurring, giving an entropy of $\log{n}$. The uniform distribution is the least informative because it treats each event in the sample space equally and gives no information about one event being more likely to occur than another. Next, we show mathematically that, when we know nothing about a probability distribution, the distribution that maximizes the entropy is the uniform distribution. This is the principle of equal a priori probabilities: "in the absence of any reason to expect one event rather than another, all the possible events should be assigned the same probability" [1].

Something we always know about a probability distribution is that it must be normalized so that $\displaystyle \sum _{i=1}^n p_i =1$. So, we maximize the entropy $-\displaystyle \sum _{i=1}^n p_i \log{p_i}$ under the normalization constraint. Using a Lagrangian multiplier, we recast this problem as:

$\max \left( \displaystyle -\sum _{i=1}^n p_i \log{p_i} +\lambda ( \sum _{i=1}^n p_i-1) \right).$

Taking the derivative with respect to $p_i$ and setting it equal to zero for a maximum, we find that $p_i$ is the same for every event $i$. By the normalization constraint, this gives us a uniform distribution $p_i=\frac{1}{n}$ and an entropy of $\log{n}$. So the principle of equal a priori probabilities is really a subcase of the principle of maximum entropy!

[1] http://www.thefreedictionary.com/Principle+of+equal+a-priori+probability.