This article is part of a series of articles I’m writing for machine learning practitioners who want to brush up on the fundamentals of machine learning. I’m covering concepts, algorithms and models that are not widely known to people with limited or no formal education in statistics and probability theory. I hope it helps. This article is dedicated to Bayesian Networks. You can find more articles here.

## Motivating Example

A Bayesian network is a method for reasoning with probabilities. We can use it to define the dependence assumptions between different events or random variables. For example, the figure below shows a Bayesian network where every node represents a random variable. An arrow (directed edge) from one node to another means that the second node depends on the first.

The network says that on a sunny day, I will probably visit the beach. If I do so, I’m also likely going to swim. However, the probability of me swimming depends on me not feeling sick. I also like to eat fish after swimming.

**If I told you that I ate fish yesterday, what’s the probability that I also visited the beach?**

To figure this out, we need to augment the network with the prior and conditional probabilities of the random variables, as shown in the figure below. A prior probability measures how likely an event is to occur before taking any evidence into account. A conditional probability measures the probability of an event given that one or more other events (also called evidence) took place.

The top level nodes, *sunny* and *not-sick* do not depend on any other node. Their prior probabilities show that it’s sunny 60% of the time where I live, and that I’m not sick 95% of the time.

The lower level nodes, *visit-beach*, *swim* and *eat-fish*, each have conditional probabilities that define how likely each one of them is to occur for all possible combinations of their parent nodes. For example, the conditional probabilities for *visit-beach* show that there’s a 30% chance I will visit the beach given that it’s a sunny day. The chance is 10% if it’s not sunny that day. The node *swim* has four conditional probability values due to the four possible value combinations of its parent nodes *visit-beach* and *not-sick*.

The prior and conditional probabilities for the nodes of the network uniquely define the network. We can use them to calculate the joint probability distribution for all the variables in the network *P(sunny, vb, ns, s, ef)*. If we have the joint distribution, we can answer the question we posed earlier.

The probability of me visiting the beach if I ate fish on the same day can then be calculated by:

*P(vb | ef) = P(vb, ef)/P(ef)*

Where *P(vb, ef)* can be obtained by marginalizing (summing out) over *sunny, s, ns*, and *P(ef)* can be obtained by marginalizing over all other variables in the network.
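This calculation can be sketched in Python by enumerating the full joint distribution. The priors (60% sunny, 95% not-sick) and the *visit-beach* table (30%/10%) come from the text; the *swim* and *eat-fish* entries below are illustrative stand-ins, since those figure values are not reproduced in the text.

```python
from itertools import product

# Probabilities from the article; entries marked "assumed" are
# illustrative stand-ins for CPT values the text does not spell out.
p_sunny = 0.6
p_ns = 0.95                       # P(not-sick)
p_vb = {True: 0.3, False: 0.1}    # P(visit-beach | sunny)
p_s = {(True, True): 0.8, (True, False): 0.1,    # assumed: P(swim | vb, ns)
       (False, True): 0.05, (False, False): 0.01}
p_ef = {True: 0.7, False: 0.2}    # assumed: P(eat-fish | swim)

def joint(sunny, ns, vb, s, ef):
    """P(sunny, ns, vb, s, ef) via the chain rule, using the network's
    conditional independence assumptions."""
    p = p_sunny if sunny else 1 - p_sunny
    p *= p_ns if ns else 1 - p_ns
    p *= p_vb[sunny] if vb else 1 - p_vb[sunny]
    p *= p_s[(vb, ns)] if s else 1 - p_s[(vb, ns)]
    p *= p_ef[s] if ef else 1 - p_ef[s]
    return p

bools = (True, False)
# P(vb, ef): sum out sunny, ns, s.  P(ef): sum out vb as well.
p_vb_ef = sum(joint(su, ns, True, s, True)
              for su, ns, s in product(bools, repeat=3))
p_ef_total = sum(joint(su, ns, vb, s, True)
                 for su, ns, vb, s in product(bools, repeat=4))
print(p_vb_ef / p_ef_total)  # P(visit-beach | eat-fish), ~0.423 with these values
```

With the real figure values for *swim* and *eat-fish* plugged in, the same marginalization answers the original question exactly.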

## The Complexity of Solving Bayesian Networks

The complete specification of the joint probability distribution of a Bayesian network with *n* binary nodes requires *2^n – 1* values, and exact inference in general Bayesian networks is NP-hard. In our simple example, that would be 31 values. However, we were able to solve the network earlier with only the 10 values shown in the figure above, because we made certain conditional independence assumptions. The savings here are not significant, but it is easy to see how such savings can turn a practically unsolvable network with hundreds of nodes into one that can be solved in a reasonable time.

The full specification of a joint distribution requires *2^n – 1* values in networks with *n* binary random variables. Each node has two possible values, which gives *2^n* possible combinations, and since the joint distribution values must sum to one, only *2^n – 1* of them need to be specified.
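The counting above is easy to verify: a factored network only needs one CPT value per parent-value combination (per node), so the cost is a sum of small powers of two rather than one large one. A quick sketch for the example network:

```python
# Values needed for n binary variables: full joint vs. factored network.
n = 5
full_joint = 2**n - 1  # 31 values for the complete joint distribution

# In the factored network, each node needs 2^(number of parents) values.
num_parents = {"sunny": 0, "not-sick": 0,
               "visit-beach": 1, "swim": 2, "eat-fish": 1}
factored = sum(2**k for k in num_parents.values())  # 1 + 1 + 2 + 4 + 2

print(full_joint, factored)  # 31 10
```

The gap between *2^n – 1* and the factored count is what makes large, sparsely connected networks tractable.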

## Independence Assumptions

The structure of a Bayesian network gives us a few rules for the independence assumptions we can make in order to simplify it.

In a Bayesian network, **a node is conditionally independent of its non-descendants given its parents**. In our example, *swim* is conditionally independent of *sunny* given *visit-beach* and *not-sick*. That means *sunny* adds nothing to our knowledge of the possible value of *swim* if we already know the values of *visit-beach* and *not-sick*.

Another example of an independence assumption is that *eat-fish* is conditionally independent of *sunny*, *visit-beach* and *not-sick* given *swim*. This simplifying assumption means that *P(ef | sunny, vb, ns, s) = P(ef | s)*, so the conditional probability table for *P(ef | sunny, vb, ns, s)*, which has *2^4 = 16* values, can be reduced to a table for *P(ef | s)* with only 2 values.

### D-separation

A more formal way of making independence assumptions in a Bayesian network is d-separation. The d in d-separation stands for directional, since we are dealing with directed graphs. Two variables *x* and *y* are said to be d-separated, or independent, if there are no connecting paths between them. Paths follow the network’s edges, regardless of the direction of the arrows.

#### Unconditional Separation

**Two variables x and y are d-connected if there is an unblocked path between them.**

A path is blocked by two consecutive head-to-head arrows, that is, two arrows meeting at the same node. Two consecutive back-to-back or chained arrows do not block a path.

For example, in the network shown below, the path from *a* to *c* is not blocked. Therefore, *a* and *c* are d-connected and not independent. The same goes for the path from *a* to *b* and the path from *a* to *e*.

On the other hand, the path from *a* to *d* is blocked by the two head-to-head arrows pointing to *c* (a collider). Therefore, *a* and *d* are d-separated and independent. Notice that we are not conditioning on any variable here.

In our original example, *sunny* and *eat-fish* are d-connected. Knowing whether it’s sunny or not on a given day affects the probability of me eating fish. On the other hand, *sunny* and *not-sick* are d-separated. Knowing whether it’s sunny or not does not affect the likelihood of me not feeling sick on that day, and vice versa.

#### Blocking by Conditioning

**Two variables x and y are d-connected given a set of nodes Z if there is a path between them that is not blocked by two head-to-head arrows and that does not include any node in Z.**

In the graph shown below, the shaded node *b* is observed, or conditioned on. This conditioning blocks the path from *a* to *c*. Therefore, nodes *a* and *c* are d-separated. This means that variables *a* and *c* are independent given *b*. Also, *a* and *e* are independent given *b*.

In our original example, if *visit-beach* is given, then the path from *sunny* to *swim* is blocked by it. Knowing whether it’s sunny or not does not affect the likelihood of me swimming once we know whether or not I visited the beach that day.

#### Conditioning on Collider Nodes

A collider node is a node along a path with two head-to-head arrows.

**Conditioning on a collider node or any of its descendant nodes neutralizes its effect in blocking a path.**

In the network shown below, the path between *a* and *d* has a collider where two head-to-head arrows point to node *c*. However, since node *c* is observed (conditioned on), it does not block the path any more. The path could have been unblocked as well if node *e* (a descendant of *c*) were observed (conditioned on).

In our original example, the path from *sunny* to *not-sick* can be unblocked by observing *swim*. The knowledge of whether it’s sunny or not affects the likelihood of me not feeling sick if we know whether or not I swam that day.

By applying these rules of d-separation, we can use our independence assumptions to significantly reduce the number of values needed to solve our network and evaluate its joint probability distribution *P(a, b, c, d, e)*. Once the joint distribution is known, we can calculate the probability of any variable in the network, *x*, given the evidence or observations *E* we have (*P(x|E)*).
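The three d-separation rules above can be checked mechanically: enumerate the simple undirected paths between two nodes and test every consecutive triple of nodes along each path. Below is a minimal sketch, assuming the network is represented as a dict mapping each node to a list of its children.

```python
def descendants(graph, node):
    """All nodes reachable from `node` along directed edges."""
    seen, stack = set(), [node]
    while stack:
        for child in graph.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def d_separated(graph, x, y, given=frozenset()):
    """True if x and y are d-separated given the nodes in `given`.
    `graph` maps each node to a list of its children."""
    nodes = set(graph) | {c for cs in graph.values() for c in cs}
    children = {n: set(graph.get(n, ())) for n in nodes}
    neighbors = {n: set() for n in nodes}
    for n in nodes:
        for c in children[n]:
            neighbors[n].add(c)
            neighbors[c].add(n)

    def blocked(prev, mid, nxt):
        # Head-to-head arrows (a collider) block the path unless the
        # collider or one of its descendants is observed.
        if mid in children[prev] and mid in children[nxt]:
            return not (({mid} | descendants(graph, mid)) & set(given))
        # Chains and forks block the path when the middle node is observed.
        return mid in given

    def connected(path):
        # Depth-first search over simple undirected paths from x to y,
        # pruning any extension whose middle triple is blocked.
        if path[-1] == y:
            return True
        for nxt in neighbors[path[-1]]:
            if nxt in path:
                continue
            if len(path) >= 2 and blocked(path[-2], path[-1], nxt):
                continue
            if connected(path + [nxt]):
                return True
        return False

    return not connected([x])
```

For the article’s network, `d_separated(g, "sunny", "not-sick")` holds unconditionally, while conditioning on `"swim"` (or on its descendant `"eat-fish"`) d-connects them, matching the collider rule.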

## Bayesian Networks Evaluation

The conditional independence assumptions mentioned earlier help us reduce the complexity of the problem, but can only take us so far. The more connected the network is, the less the savings we can get with our conditional independence assumptions. The problem is still NP-hard, and the exponential time limitation shows up in realistic networks solving real life problems.

All hope is not lost, though. There are two classes of solutions (inference methods) for these networks: exact and approximate.

### Exact Solutions

There is a special class of Bayesian networks, known as *singly connected networks*, that has a relatively low degree of connectedness and can be solved efficiently, in time linear in the number of nodes in the network. In a singly connected network, there is only one path between any two nodes when we ignore the direction of the arrows. Our original network example satisfies this condition, and thus is singly connected.

The network shown below, on the other hand, is not singly connected, since there are two paths connecting *b* and *e*: a direct one, and one that goes through *c*.
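Checking this condition amounts to checking that the undirected skeleton of the network has no cycles, since a cycle is exactly what creates a second path between some pair of nodes. A minimal sketch using union-find, with the same child-list representation as before:

```python
def singly_connected(graph):
    """True iff the network's undirected skeleton has no cycle,
    i.e. at most one path exists between any two nodes."""
    # Collect undirected edges, deduplicating a->b / b->a pairs.
    edges = {frozenset((a, b)) for a, bs in graph.items() for b in bs}
    nodes = set(graph) | {n for e in edges for n in e}

    # Union-find: an edge between two already-connected nodes would
    # create a second path (a cycle in the skeleton).
    parent = {n: n for n in nodes}
    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path compression
            n = parent[n]
        return n

    for a, b in (tuple(e) for e in edges):
        ra, rb = find(a), find(b)
        if ra == rb:
            return False
        parent[ra] = rb
    return True
```

The article’s example network passes this check; a network with both a direct *b* → *e* edge and a *b* → *c* → *e* path fails it.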

We can use dynamic programming for exact inference in Bayesian networks. For more details on how to do that, refer to chapter 8 of Christopher Bishop’s book: “Pattern Recognition and Machine Learning.”

### Approximate Solutions

There are multiple approximate solutions for Bayesian networks where accuracy is traded for speed. They broadly fall under two classes depending on whether they use stochastic or deterministic approximations.

Stochastic methods, like Markov chain Monte Carlo, are sampling based. They iteratively refine their solution, reducing the error and approaching the exact solution given unlimited time (or computational resources). These methods are computationally demanding for most realistic problems.
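The simplest sampling-based scheme is rejection sampling: draw joint samples from the network in topological order, discard those inconsistent with the evidence, and estimate the query from what remains. A sketch for the article’s network follows; as before, the *swim* and *eat-fish* CPT values are illustrative assumptions, not the figure’s actual numbers.

```python
import random

random.seed(0)

# CPTs for the article's network; "assumed" entries are stand-ins
# for values the text does not spell out.
P_SUNNY, P_NS = 0.6, 0.95
P_VB = {True: 0.3, False: 0.1}
P_S = {(True, True): 0.8, (True, False): 0.1,    # assumed
       (False, True): 0.05, (False, False): 0.01}
P_EF = {True: 0.7, False: 0.2}                   # assumed

def sample():
    """Draw one joint sample, sampling each node given its parents."""
    sunny = random.random() < P_SUNNY
    ns = random.random() < P_NS
    vb = random.random() < P_VB[sunny]
    s = random.random() < P_S[(vb, ns)]
    ef = random.random() < P_EF[s]
    return sunny, ns, vb, s, ef

# Rejection sampling: keep only samples consistent with the evidence
# ef=True, then count how often vb=True among them.
kept = [smp for smp in (sample() for _ in range(100_000)) if smp[4]]
estimate = sum(smp[2] for smp in kept) / len(kept)
print(round(estimate, 3))  # approximates P(visit-beach | eat-fish)
```

The estimate converges to the exact marginalization answer as the sample count grows, which is precisely the accuracy-for-speed trade mentioned above; rejection sampling also wastes every sample that misses the evidence, which is why more refined methods like MCMC exist.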

Deterministic methods, like variational inference, on the other hand, are more computationally efficient for large-scale applications. They cannot find exact solutions regardless of the computational resources available. However, their error margin is reasonable for most practical applications.

## Bayesian Networks Applications

Bayesian networks make it easier to integrate expert knowledge into the model we are building than many other methods do. They have been used widely in diagnosis problems, especially medical diagnosis. In such an application, the network structure represents the dependence and causal relationships between diseases and symptoms. The expert’s, or physician’s, knowledge is incorporated into the model in the form of conditional probability values for some nodes (symptoms) given other nodes (diseases). The diagnosis problem then involves finding the probabilities of disease nodes given the observed symptom nodes.

## References

Bayesian Networks without Tears — Eugene Charniak

Pattern Recognition and Machine Learning — Christopher Bishop

A Brief Introduction to Graphical Models and Bayesian Networks — Kevin Murphy

An Introduction to Variational Methods for Graphical Models — Michael I. Jordan, et al.

Graphical Models in a Nutshell — Daphne Koller, et al.

Pomegranate, a Python library for probabilistic graphical models.