Conditional probabilities and Bayes' theorem

If we have a probability space S and two events A and B, the probability of A given B is called the conditional probability, and it's defined as:

P(A|B) = P(A, B) / P(B), assuming P(B) > 0

As P(A, B) = P(B, A), and therefore P(A|B) · P(B) = P(B|A) · P(A), it's possible to derive Bayes' theorem:

P(A|B) = P(B|A) · P(A) / P(B)

This theorem allows us to express a conditional probability as a function of the opposite one and of the two marginal probabilities P(A) and P(B). This result is fundamental to many machine learning problems because, as we're going to see in this and in the next chapters, it's normally easy to estimate one conditional probability (for example, P(B|A)), while it's hard to compute the opposite one (P(A|B)) directly. A common form of this theorem can be expressed as:

P(A|B) ∝ P(B|A) · P(A)

Let's suppose that we need to estimate the probability of an event A given some observations B or, using the standard notation, the posterior probability of A; the previous formula expresses this value as proportional to the term P(A), which is the marginal probability of A, called the prior probability, and to the conditional probability of the observations B given the event A. P(B|A) is called the likelihood, and it describes how likely the event A is to produce the observations B. Therefore, we can summarize the relation as posterior probability ∝ likelihood · prior probability. The proportionality is not a limitation, because the term P(B) is only a normalizing constant that can be omitted; of course, the reader must remember to normalize P(A|B) so that its values always sum up to one.
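
As a quick illustration, the following Python snippet is a minimal sketch of the relation posterior ∝ likelihood · prior for two competing hypotheses; the hypothesis names and all the probability values are hypothetical and chosen only for this example:

# A minimal sketch of the relation posterior ∝ likelihood · prior,
# using hypothetical priors and likelihoods for two competing hypotheses.
priors = {'H1': 0.6, 'H2': 0.4}          # P(A): prior probabilities
likelihoods = {'H1': 0.2, 'H2': 0.7}     # P(B|A): likelihood of the observations B

# Unnormalized posteriors (proportional to likelihood · prior)
unnormalized = {h: likelihoods[h] * priors[h] for h in priors}

# The normalizing constant is P(B), the sum over all hypotheses
evidence = sum(unnormalized.values())
posteriors = {h: unnormalized[h] / evidence for h in unnormalized}

print(posteriors)  # Approximately {'H1': 0.3, 'H2': 0.7}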

This is a key concept of Bayesian statistics, where we don't directly trust the prior probability, but we reweight it using the likelihood of some observations. As an example, we can think of tossing a coin 10 times. We know that the probability of getting a head in a single toss is P(Head) = 0.5 if the coin is fair. If we'd like to know the probability of getting 10 heads, we can employ the binomial distribution, obtaining P(10 heads) = 0.5^10 ≈ 0.001; however, let's suppose that we don't know whether the coin is fair or not, but we suspect it's loaded in favor of tails (P(Tail) = 0.7), and we express this suspicion with a prior probability P(Loaded) = 0.7. We can define a complete prior probability P(Coin status) using the indicator functions:

P(Coin status) = 0.5 · I_{Coin=Fair} + 0.7 · I_{Coin=Loaded}

Here, P(Fair) = 0.5 and P(Loaded) = 0.7; the indicator I_{Coin=Fair} is equal to 1 only if the coin is fair, and 0 otherwise, and the same happens with I_{Coin=Loaded} when the coin is loaded. Our goal now is to determine the posterior probability P(Coin status|B1, B2, ..., Bn), to be able to confirm or reject our hypothesis.

Let's imagine that we observe n = 10 events, with B1 = Head and B2, ..., Bn = Tail. We can express the probability using the binomial distribution, where the factor 10 is the binomial coefficient of 1 head out of 10 tosses and the loaded coin is assumed to yield tails with probability 0.7 (hence heads with probability 0.3):

P(Coin status|B1, B2, ..., Bn) ∝ [10 · 0.5^1 · 0.5^9] · 0.5 · I_{Coin=Fair} + [10 · 0.3^1 · 0.7^9] · 0.7 · I_{Coin=Loaded}

After simplifying the expression, we get:

P(Coin status|B1, B2, ..., Bn) ∝ 0.0049 · I_{Coin=Fair} + 0.0847 · I_{Coin=Loaded}

We still need to normalize by dividing both terms by their sum (about 0.09), so we get the final posterior probability P(Coin status|B1, B2, ..., Bn) ≈ 0.05 · I_{Coin=Fair} + 0.95 · I_{Coin=Loaded}. This result confirms and strengthens our hypothesis: the probability of a loaded coin is now about 95%, thanks to the sequence of nine tail observations after a single head.
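
The following Python snippet is a minimal sketch of the same computation, assuming, as above, that the loaded coin yields tails with probability 0.7; the variable names are arbitrary and math.comb (Python 3.8+) supplies the binomial coefficient:

from math import comb

# A minimal sketch of the coin example: the priors and the assumed head
# probabilities (P(Head) = 0.3 for the loaded coin) are the ones used above.
priors = {'Fair': 0.5, 'Loaded': 0.7}    # Unnormalized prior beliefs
p_head = {'Fair': 0.5, 'Loaded': 0.3}    # P(Head) under each hypothesis

n, k = 10, 1                             # 10 tosses, 1 head observed

# Binomial likelihood of observing k heads in n tosses under each hypothesis
likelihoods = {c: comb(n, k) * p_head[c] ** k * (1.0 - p_head[c]) ** (n - k)
               for c in priors}

# Posterior ∝ likelihood · prior, followed by normalization
unnormalized = {c: likelihoods[c] * priors[c] for c in priors}
evidence = sum(unnormalized.values())
posterior = {c: unnormalized[c] / evidence for c in unnormalized}

print(posterior)  # Approximately {'Fair': 0.05, 'Loaded': 0.95}

Changing the priors or the assumed bias of the loaded coin shows how strongly the posterior depends on both the observations and the initial belief.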

This example was presented to show how the data (observations) is plugged into the Bayesian framework. If the reader is interested in studying these concepts in more detail, Introduction to Statistical Decision Theory, Pratt J., Raiffa H., Schlaifer R., The MIT Press, contains many interesting examples and explanations; however, before introducing Bayesian networks, it's useful to define two other essential concepts.

The first concept is called conditional independence, and it can be formalized by considering two variables, A and B, that are conditioned on a third one, C. We say that A and B are conditionally independent given C if:

P(A, B|C) = P(A|C) · P(B|C)

Now, let's suppose we have an event A that is conditioned on a series of causes C1, C2, ..., Cn; the conditional probability is, therefore, P(A|C1, C2, ..., Cn). Applying Bayes' theorem, we get:

P(A|C1, C2, ..., Cn) = P(C1, C2, ..., Cn|A) · P(A) / P(C1, C2, ..., Cn)

If the causes are conditionally independent given A, the previous expression can be simplified and rewritten as:

P(A|C1, C2, ..., Cn) = P(C1|A) · P(C2|A) · ... · P(Cn|A) · P(A) / P(C1, C2, ..., Cn)

This property is fundamental in Naive Bayes classifiers, where we assume that the causes C1, C2, ..., Cn are conditionally independent given the event A (the class), so that none of them influences the others once the class is known. For example, in a spam detector, we could say that the length of the mail and the presence of some particular keywords are conditionally independent given the class, so we only need to compute P(Length|Spam) and P(Keywords|Spam), without considering the joint probability P(Length, Keywords|Spam).
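
The following Python snippet is a minimal sketch of this factorization for the spam example; all the probability values are hypothetical and chosen only for illustration:

# A minimal sketch of the Naive Bayes factorization for the spam example.
# All the probabilities below are hypothetical values chosen for illustration.
p_class = {'Spam': 0.3, 'Ham': 0.7}        # P(Class), prior
p_length = {'Spam': 0.6, 'Ham': 0.2}       # P(Length=short | Class)
p_keywords = {'Spam': 0.8, 'Ham': 0.1}     # P(Keywords=present | Class)

# Thanks to conditional independence, the joint likelihood factorizes:
# P(Length, Keywords | Class) = P(Length | Class) · P(Keywords | Class)
unnormalized = {c: p_length[c] * p_keywords[c] * p_class[c] for c in p_class}

evidence = sum(unnormalized.values())
posterior = {c: unnormalized[c] / evidence for c in unnormalized}

print(posterior)  # P(Class | Length, Keywords), e.g. Spam ≈ 0.91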

Another important element is the chain rule of probabilities. Let's suppose we have the joint probability P(X1, X2, ..., Xn). It can be expressed as:

P(X1, X2, ..., Xn) = P(X1|X2, ..., Xn) · P(X2, ..., Xn)

Repeating the procedure with the joint probability on the right-hand side, we get:

P(X1, X2, ..., Xn) = P(X1|X2, ..., Xn) · P(X2|X3, ..., Xn) · ... · P(Xn-1|Xn) · P(Xn)

In this way, it's possible to express a full joint probability as the product of hierarchical conditional probabilities, until the last term, which is a marginal distribution. We are going to use this concept extensively in the next paragraph when exploring Bayesian networks.
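
As a quick numerical check, the following Python sketch verifies the chain rule on a small, hypothetical joint distribution of three binary variables (the probability values are arbitrary):

import itertools

# A hypothetical joint distribution P(X1, X2, X3) over three binary variables,
# stored as a dictionary mapping each configuration to its probability.
joint = {(0, 0, 0): 0.10, (0, 0, 1): 0.05, (0, 1, 0): 0.15, (0, 1, 1): 0.10,
         (1, 0, 0): 0.05, (1, 0, 1): 0.20, (1, 1, 0): 0.05, (1, 1, 1): 0.30}

def marginal(joint, indices, values):
    # P(X_i = v_i for the selected indices), obtained by summing out the rest
    return sum(p for x, p in joint.items()
               if all(x[i] == v for i, v in zip(indices, values)))

for x1, x2, x3 in itertools.product((0, 1), repeat=3):
    # Chain rule: P(X1, X2, X3) = P(X1|X2, X3) · P(X2|X3) · P(X3)
    p_x1_given_x2_x3 = joint[(x1, x2, x3)] / marginal(joint, (1, 2), (x2, x3))
    p_x2_given_x3 = marginal(joint, (1, 2), (x2, x3)) / marginal(joint, (2,), (x3,))
    p_x3 = marginal(joint, (2,), (x3,))
    product = p_x1_given_x2_x3 * p_x2_given_x3 * p_x3
    assert abs(product - joint[(x1, x2, x3)]) < 1e-12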