What is overfitting, and how to mitigate it?


Overfitting: A machine learning model overfits a dataset when it has more parameters than the data can justify.

An overfitted model’s performance declines significantly when evaluated on data it has not seen before.

There are multiple ways to mitigate overfitting, depending on the model type. They include k-fold cross validation, regularization, ensembling, and early stopping of training.

Long answer

According to The Cambridge Dictionary of Statistics (page 314), overfitted models are models “that contain more unknown parameters than can be justified by the data.”

Overfitting is a property of a complex machine learning model that fails to generalize well to unseen data. It can happen in supervised and unsupervised learning when the model fits the training data so closely that it learns its noise.

Overfitting is what happens when you look at your teammates and conclude that all data scientists have to be men between the ages of 25 and 35, who never wear dress pants or exercise.

An overfitting model produces high variance error: its performance varies a lot when presented with unseen data. However, it has low bias error: its predictions are not biased toward a specific wrong answer.

How to detect overfitting

Test the model on unseen data. If its performance degrades significantly (10% or more) compared to how well it performed on your training data, then it’s overfitting to the noise in your training data, not the patterns you want it to learn.
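As a rough sketch, the degradation check above can be written as a simple threshold rule (the 10-point tolerance and the higher-is-better score convention are taken from the text; everything else here is illustrative):

```python
def is_overfitting(train_score, test_score, tolerance=0.10):
    """Flag likely overfitting when the test score drops more than
    `tolerance` (10 percentage points here) below the training score."""
    return (train_score - test_score) > tolerance

print(is_overfitting(0.95, 0.78))  # large gap: likely overfitting
print(is_overfitting(0.90, 0.88))  # small gap: probably fine
```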

How to mitigate overfitting

Do one or more of the following, if your model allows it, to alleviate overfitting.

Train on more data

If your training sample is too small and has a rich set of features, a complex model can come up with an elaborate set of rules that only holds for the few examples you have. Training on more data will expose the model to more patterns and help it generalize.

Cross-validation (CV)

If your training data set is large enough, split it into k mini data sets (say k=10). Hold out mini set #1 for testing and train your model on the combined remaining sets (2, 3, … 10). Repeat this process 9 more times while holding out mini set #2, then #3, … #10 for testing, and training on the remaining sets. Average out the model performance on the 10 held out mini test sets to get a measure of how it performs on unseen data.

Source: Wikipedia

By optimizing the model’s performance on different (although overlapping) data sets, you’re reducing its variance.
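The splitting scheme described above can be sketched in plain Python (a minimal illustration; libraries like scikit-learn provide production-ready versions with shuffling and stratification):

```python
def k_fold_splits(n_samples, k):
    """Yield (test_indices, train_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for i in range(k):
        start = i * fold_size
        end = start + fold_size if i < k - 1 else n_samples  # last fold absorbs the remainder
        test = indices[start:end]
        train = indices[:start] + indices[end:]
        yield test, train

# With 20 samples and k=10, each model trains on 18 samples and is tested on 2.
for test_idx, train_idx in k_fold_splits(20, 10):
    pass  # train on train_idx, evaluate on test_idx, then average the k scores
```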


Regularization

Regularization is the process of penalizing a model for its complexity. It adds a regularization term to the cost or loss function that grows in value as the model complexity increases.

For example, the regularization term could be the L1 norm, L2 norm, or L-infinity norm of the model's parameter vector, giving us L1, L2, and L-infinity regularization.
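As an illustration, an L2-regularized loss is just the data loss plus a penalty on the squared parameter values (a minimal sketch; the lambda value is an arbitrary assumption, not a recommendation):

```python
def l2_regularized_loss(errors, weights, lam=0.1):
    """Mean squared error plus an L2 penalty that grows with model complexity."""
    mse = sum(e * e for e in errors) / len(errors)
    penalty = lam * sum(w * w for w in weights)  # L2 norm squared of the parameters
    return mse + penalty

# The same prediction errors cost more when the parameter vector is larger.
small_model = l2_regularized_loss([0.5, -0.5], [1.0])
large_model = l2_regularized_loss([0.5, -0.5], [3.0, -2.0, 4.0])
print(small_model, large_model)
```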

Early stopping

Early stopping of training machine learning models to mitigate overfitting.
Source: Wikipedia

As you iteratively train your model, you tune its parameters to reduce its prediction error on the training data (blue curve above). While doing so, the model prediction error on the test data goes down as well (red curve above), until it reaches an inflection point where overfitting kicks in.

To mitigate overfitting, you need to stop training your model at this point. You need to evaluate your model’s performance on your training and test data sets at each iteration. Stop training when the error gap starts growing.

Early stopping is usually used to counter overfitting in deep learning models.
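The stopping rule can be sketched as a patience-based check on the validation error (a simplified illustration; deep learning frameworks ship equivalent callbacks):

```python
def early_stop_epoch(val_errors, patience=2):
    """Return the epoch with the best validation error, stopping once the
    error has failed to improve for `patience` consecutive epochs."""
    best_error = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, error in enumerate(val_errors):
        if error < best_error:
            best_error, best_epoch, wait = error, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                break  # the error gap is growing: stop training
    return best_epoch

# Validation error falls, bottoms out at epoch 2, then overfitting kicks in.
print(early_stop_epoch([1.0, 0.8, 0.7, 0.75, 0.9, 0.95]))  # -> 2
```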

Feature pruning (selection)

One way to reduce a model's complexity is to drop features with low predictive power, along with some of the highly correlated features, from your model. This can be done manually or with an automated procedure.

Some algorithms, like Lasso regression, have built-in feature selection and regularization. During model training, some feature coefficients may gradually drop to zero, effectively eliminating their influence on the model output.

If the algorithm you use doesn’t have built-in feature selection capability, you can do it separately. You can use algorithms such as the Fast Correlation-Based Filter or correlation tests such as Chi-squared to identify irrelevant features so you can consider removing them.

Alternatively, you can inspect the features manually and remove extraneous features if they don’t make sense to you.
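A minimal sketch of correlation-based pruning follows (illustrative only; the 0.95 threshold is an arbitrary assumption, and methods like FCBF are considerably more sophisticated):

```python
def pearson(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def prune_correlated(features, threshold=0.95):
    """features: dict of name -> values. Keep the first of each highly
    correlated pair, dropping the redundant feature."""
    kept = []
    for name in features:
        if all(abs(pearson(features[name], features[k])) < threshold for k in kept):
            kept.append(name)
    return kept

features = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 1, 3, 2]}
print(prune_correlated(features))  # "b" duplicates "a", so it is dropped
```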


Ensembling

Ensembling involves combining the outputs of multiple base models into one. The premise is that the base models will complement each other, and you will end up with an ensemble that's more powerful than its individual models. The main types of ensembling are:

Bagging (bootstrap aggregation) — where you train multiple strong learners in parallel on different training data subsets drawn using bootstrap sampling from the original training data set. You then combine the outputs of the base learners using voting for classification problems and averaging for regression problems.

Boosting — where you train multiple weak learners sequentially. You give more weight to misclassified data points when you use them to train subsequent base models. You then combine the output of the base models using weighted voting for classification problems, and weighted sum for regression problems. The weights reflect the accuracy of the base models.

Stacking — where the base models are typically of different types (e.g. decision tree, naive Bayes, logistic regression). The outputs of the base models are used as features to train a higher level model.
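The combination step for the bagging and boosting variants above can be sketched as follows (a toy illustration of the voting rules, not a full ensemble implementation):

```python
from collections import Counter

def majority_vote(predictions):
    """Bagging-style combination: the most common class label wins."""
    return Counter(predictions).most_common(1)[0][0]

def weighted_vote(predictions, weights):
    """Boosting-style combination: votes are weighted by base-model accuracy."""
    scores = {}
    for label, weight in zip(predictions, weights):
        scores[label] = scores.get(label, 0.0) + weight
    return max(scores, key=scores.get)

print(majority_vote(["cat", "dog", "cat"]))                    # -> cat
print(weighted_vote(["cat", "dog", "dog"], [0.9, 0.3, 0.3]))   # -> cat
```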

What is underfitting?

Underfitting is a property of a model that hasn’t picked up patterns from the data. It performs poorly on the training and test data sets.

Bayesian Networks Introduction

This article is part of a series I'm writing for machine learning practitioners who want to brush up on the fundamentals. I'm covering concepts, algorithms, and models that may be unfamiliar to readers with limited or no formal education in statistics and probability theory. I hope it helps. This article is dedicated to Bayesian networks.

Motivating Example

A Bayesian network is a method for reasoning with probabilities. We can use it to define the dependence assumptions between different events or random variables. For example, the figure below shows a Bayesian network where every node represents a random variable. An arrow (directed edge) from one node to another means the second node depends on the first.

The network says that on a sunny day, I will probably visit the beach. If I do so, I’m also likely going to swim. However, the probability of me swimming depends on me not feeling sick. I also like to eat fish after swimming.

A simple Bayesian Network

If I told you that I ate fish yesterday, what’s the probability that I also visited the beach?

To figure this out, we need to augment the network with the prior and conditional probabilities of the random variables, as shown in the figure below. A prior probability measures how likely an event is before taking any evidence into account. A conditional probability measures the probability of an event given that one or more other events (also called evidence) took place.

A simple Bayesian Network with probabilities

The top-level nodes, sunny and not-sick, do not depend on any other node. Their prior probabilities show that it's sunny 60% of the time where I live, and that I'm not sick 95% of the time.

The lower-level nodes, visit-beach, swim, and eat-fish, each have conditional probabilities that define how likely each one is to occur for all possible value combinations of their direct ancestor nodes. For example, the conditional probabilities for visit-beach show that there's a 30% chance I will visit the beach given that it's a sunny day. The chance is 10% if it's not sunny that day. The node swim has four conditional probability values due to the four possible value combinations of its parent nodes, visit-beach and not-sick.

The prior and conditional probabilities for the nodes of the network uniquely define the network. We can use them to calculate the joint probability distribution for all the variables in the network P(sunny, vb, ns, s, ef). If we have the joint distribution, we can answer the question we posed earlier.

The probability of me visiting the beach if I ate fish on the same day can then be calculated by:

P(vb | ef) = P(vb, ef)/P(ef)

where P(vb, ef) can be obtained by marginalizing (summing out) over sunny, s, and ns, and P(ef) can be obtained by marginalizing over all other variables in the network.
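The marginalization above can be carried out by brute-force enumeration over the joint distribution. The sunny, not-sick, and visit-beach probabilities below come from the figure described in the text; the swim and eat-fish tables are not spelled out there, so those values are illustrative assumptions:

```python
from itertools import product

P_SUNNY = 0.6                      # prior from the figure
P_NOT_SICK = 0.95                  # prior from the figure
P_VB = {True: 0.3, False: 0.1}     # P(visit-beach | sunny) from the figure
# The next two tables are NOT given in the text -- assumed for illustration.
P_SWIM = {(True, True): 0.8, (True, False): 0.1,
          (False, True): 0.05, (False, False): 0.01}   # P(swim | vb, not-sick)
P_EF = {True: 0.7, False: 0.2}     # P(eat-fish | swim)

def joint(sunny, ns, vb, s, ef):
    """P(sunny, not-sick, visit-beach, swim, eat-fish) via the chain rule."""
    p = P_SUNNY if sunny else 1 - P_SUNNY
    p *= P_NOT_SICK if ns else 1 - P_NOT_SICK
    p *= P_VB[sunny] if vb else 1 - P_VB[sunny]
    p *= P_SWIM[(vb, ns)] if s else 1 - P_SWIM[(vb, ns)]
    p *= P_EF[s] if ef else 1 - P_EF[s]
    return p

# P(vb, ef): marginalize over sunny, not-sick, swim; P(ef): over everything else.
p_vb_ef = sum(joint(su, ns, True, s, True) for su, ns, s in product([True, False], repeat=3))
p_ef = sum(joint(su, ns, vb, s, True) for su, ns, vb, s in product([True, False], repeat=4))
print(p_vb_ef / p_ef)  # P(visit-beach | eat-fish)
```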

The Complexity of Solving Bayesian Networks

The complete specification of the joint probability distribution of a Bayesian network with n nodes requires 2**n – 1 values, and solving the network is an NP-hard problem. In our simple example, that would be 31 values. However, we were able to solve the network earlier with only the 10 values shown in the figure above, because we used certain conditional independence assumptions. The savings here were not significant, but it's easy to see how such savings in computation can turn a practically unsolvable network with hundreds of nodes into one that can be solved in a reasonable time.

The full specification of a joint distribution requires 2**n – 1 values in a network with n binary random variables: each node has two possible values, which gives 2**n possible combinations, and the joint distribution values must sum to one. Therefore, 2**n – 1 values are needed for a full specification of the joint probability distribution.

Independence Assumptions

Based on a Bayesian network's structure, there are a few rules for the independence assumptions we can make in order to simplify it.

In a Bayesian network, a node is conditionally independent of its non-descendants given its parents. In our example, swim is conditionally independent of sunny given visit-beach and not-sick. That means sunny will not add to our knowledge of the possible value for swim if we know the values of visit-beach and not-sick.

A simple Bayesian Network.

Another independence assumption example is that eat-fish is conditionally independent of sunny, visit-beach, and not-sick given swim. This simplifying assumption means that P(ef|sunny,vb,ns,s) = P(ef|s). The conditional probability table for P(ef|sunny,vb,ns,s), which has 2**4 = 16 values, can thus be reduced to a table for P(ef|s) with only 2 values.


D-separation

A more formal way of making independence assumptions in a Bayesian network is d-separation. The d in d-separation stands for directional, since we are dealing with directed graphs. Two variables x and y are said to be separated, or independent, if there are no connecting paths between them. Paths are traversed through the network edges.

Unconditional Separation

Two variables x and y are d-connected if there is an unblocked path between them.

A path is blocked by two consecutive head-to-head arrows. Two consecutive back-to-back arrows do not block a path.

For example, in the network shown below, the path from a to c is not blocked. Therefore, a and c are d-connected and not independent. The same goes for the path from a to b and the path from a to e.

On the other hand, the path from a to d is blocked by the two head-to-head arrows pointing to c (a collider). Therefore, a and d are d-separated and independent. Notice that we are not conditioning on any variable here.

Another simple Bayesian Network

In our original example, sunny and eat-fish are d-connected. Knowing whether it's sunny or not on a given day affects the probability of me eating fish. On the other hand, sunny and not-sick are d-separated. Knowing whether it's sunny or not does not affect the likelihood of me not feeling sick on that day, and vice versa.

Blocking by Conditioning

Two variables x and y are d-connected given a set of nodes Z if there is a path between them that is not blocked by two head-to-head arrows and does not include any node in Z.

In the graph shown below, the shaded node b is observed, or conditioned on. This conditioning blocks the path from a to c. Therefore, nodes a and c are d-separated. This means that variables a and c are independent given b. Also, a and e are independent given b.

A simple Bayesian Network with blocking nodes.

In our original example, if visit-beach is given, then the path from sunny to swim is blocked by it. Knowing whether it's sunny or not does not affect the likelihood of me swimming if we know whether or not I visited the beach that day.

Conditioning on Collider Nodes

A collider node is a node along a path with two head-to-head arrows.

Conditioning on a collider node or any of its descendant nodes neutralizes its effect in blocking a path.

In the network shown below, the path between a and d has a collider where two head-to-head arrows point to node c. However, since node c is observed (conditioned on), it does not block the path any more. The path could have been unblocked as well if node e (a descendant of c) were observed (conditioned on).

Another simple network with blocking nodes.

In our original example, the path from sunny to not-sick can be unblocked by observing swim. The knowledge of whether it’s sunny or not affects the likelihood of me not feeling sick if we know whether or not I swam that day.

By applying these rules of d-separation, we can use our independence assumptions to significantly reduce the number of values needed to solve our network and evaluate its joint probability distribution P(a, b, c, d, e). Once the joint distribution is known, we can calculate the probability of any variable in the network, x, given the evidence or observations E we have: P(x|E).

Bayesian Networks Evaluation

The conditional independence assumptions mentioned earlier help us reduce the complexity of the problem, but can only take us so far. The more connected the network is, the less the savings we can get with our conditional independence assumptions. The problem is still NP-hard, and the exponential time limitation shows up in realistic networks solving real life problems.

All hope is not lost, though. There are two classes of solutions (inference methods) for these networks: exact and approximate.

Exact Solutions

There is a special class of Bayesian networks, known as singly connected networks, that have a relatively low degree of connectedness and can be solved efficiently, in time linear in the number of nodes. In a singly connected network, there is only one path between any two nodes when we ignore the direction of the arrows. Our original network example satisfies this condition, and is thus singly connected.

The network shown below, on the other hand, is not singly connected, since there are two paths connecting b and e: a direct one, and one that goes through c.

A simple network for exact inference.

We can use dynamic programming for exact inference in Bayesian networks. For more details on how to do that, refer to chapter 8 of Christopher Bishop’s book: “Pattern Recognition and Machine Learning.”

Approximate Solutions

There are multiple approximate solutions for Bayesian networks where accuracy is traded for speed. They broadly fall under two classes depending on whether they use stochastic or deterministic approximations.

Stochastic methods, like Markov chain Monte Carlo, are sampling based. They iteratively refine their solution to reduce the error and approach the exact solution after an infinite amount of time (or computational resources). These methods are computationally demanding for most realistic real life problems.

Deterministic methods, like variational inference, on the other hand, are more computationally efficient for large-scale applications. They cannot find exact solutions regardless of the computational resources available. However, their error margin is reasonable for most practical applications.

Bayesian Networks Applications

Bayesian networks simplify the integration of expert knowledge into the model we are building when compared to other methods. They have been used widely in diagnosis problems, especially medical diagnosis. In such an application, the network structure represents the dependence and causal relationships between diseases and symptoms. The expert's (physician's) knowledge is incorporated into the model in the form of conditional probability values for some nodes (symptoms) given other nodes (diseases). The diagnosis problem then involves finding the probabilities of disease nodes given the observed symptom nodes.


References

Bayesian Networks without Tears — Eugene Charniak

Pattern Recognition and Machine Learning — Christopher Bishop

A Brief Introduction to Graphical Models and Bayesian Networks — Kevin Murphy

An Introduction to Variational Methods for Graphical Models — Michael I. Jordan, et al.

Graphical Models in a Nutshell — Daphne Koller, et al.

Pomegranate, a Python library for probabilistic graphical models.

Deep Neural Networks Misclassification of Images

Deep neural networks (DNNs) have been used in commercial image classification applications for some time now, with varying degrees of success. Many of these applications assign class labels and confidence scores (shirt 90%, TV 80%, book 50%, …) to the images they classify.

Most of the time, the performance of the DNN models I build is satisfactory. However, I’m really interested in understanding how a DNN may confuse non-related objects. For example, it confused running shoes for a water bottle in a dataset I was working on recently.

I used three deep neural network (DNN) models: ResNet50, InceptionV3, and Xception, all pre-trained on the ImageNet dataset. ImageNet is a research project to develop a large image dataset with annotations, such as standard labels and descriptions. The dataset has been used in the annual ILSVRC image classification challenge. A few of the winners published their pre-trained models with the research community, and I used them here.

The three models were used to classify images from ads posted on Avito. Here is how they fared. I'm not drawing conclusions from this simple exercise; the objective is to show the three most likely classes, since most of the time we only see the top one.
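As a small aside, a top-3 display like the one in the figures below can be produced by sorting the classifier's confidence scores (a generic sketch, not the exact code used for these figures; the labels and scores are hypothetical):

```python
def top_k(labels, scores, k=3):
    """Return the k most likely (label, score) pairs from a classifier's output."""
    ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical confidence scores for one image.
preds = top_k(["book", "shirt", "TV", "chair"], [0.05, 0.90, 0.80, 0.50])
print(preds)  # -> [('shirt', 0.9), ('TV', 0.8), ('chair', 0.5)]
```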

Deep Neural Networks (DNN) classification of a shirt
Deep Neural Networks (DNN) classification of a complex image
DNN classification of a pair of sneakers
Deep Neural Networks (DNN) classification of a book cover
DNN classification of a pair of boots
Deep Neural Networks (DNN) classification of a chair
DNN classification of baby sneakers


Legaltech and Machine Learning — A Good Match

In the rush to automate everything a lawyer can do, some machine learning practitioners forget tacit knowledge, which is hard to transfer into the systems they build. As Michael Polanyi puts it, “we can know more than we can tell.” Riding a bike, speaking a language, or playing a musical instrument are tasks we can perform well, yet can’t easily describe or codify. Keeping Polanyi’s observation in mind, here is a list of tasks machine learning can help Legaltech with.


E-discovery

Discovery is an initial phase of litigation during which the parties in a dispute are required to provide each other with relevant information and documents. The annual budget for discovery in the US in 2010 was approximately forty billion dollars. Sanctions for discovery violations have mushroomed over the last decade.

Most discovery work is done electronically; this is termed e-discovery. E-discovery can't be done efficiently using simple keyword-based search methods. The criteria for document relevancy are more sophisticated, as they rely on the semantics and context of the documents being assessed.

A more efficient set of approaches to e-discovery that uses machine learning is called technology-assisted review (TAR). In it, a human reviewer labels a small set of documents as relevant and another set as irrelevant, and feeds both to the system. An algorithm analyzes both sets of documents, finds correlations between their words, and determines which features affect relevance and how. It then classifies the remaining documents as relevant or irrelevant based on their similarity to the two document sets.
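The label-then-classify loop described above can be caricatured with a bag-of-words similarity model (a toy sketch with made-up documents; real TAR systems use far richer features and learning algorithms):

```python
from collections import Counter

def bag_of_words(doc):
    """Word-count vector for one document."""
    return Counter(doc.lower().split())

def centroid(docs):
    """Combined word counts for a reviewer-labeled set of documents."""
    total = Counter()
    for doc in docs:
        total += bag_of_words(doc)
    return total

def cosine(c1, c2):
    """Cosine similarity between two word-count vectors."""
    dot = sum(c1[w] * c2[w] for w in set(c1) & set(c2))
    n1 = sum(v * v for v in c1.values()) ** 0.5
    n2 = sum(v * v for v in c2.values()) ** 0.5
    return dot / (n1 * n2) if n1 and n2 else 0.0

def classify(doc, relevant, irrelevant):
    """Label a new document by its similarity to the two labeled sets."""
    vec = bag_of_words(doc)
    if cosine(vec, centroid(relevant)) >= cosine(vec, centroid(irrelevant)):
        return "relevant"
    return "irrelevant"

relevant = ["breach of contract damages", "contract breach claim"]
irrelevant = ["office lunch menu", "holiday party schedule"]
print(classify("damages for the contract breach", relevant, irrelevant))  # -> relevant
```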

There are multiple open source tools for software developers that greatly simplify building a professional e-discovery system in legaltech. They include Apache OpenNLP and NLTK.

Legal research

Another time and resource intensive task lawyers engage in is finding relevant cases. This involves combing through huge volumes of cases. It may result in too many relevant cases, leaving the lawyer to decide which one best supports an argument. It may also result in too few cases, which could indicate the lawyer is pursuing the wrong line of inquiry.

To mitigate this, natural language processing (NLP) and machine learning can be used to parse and analyze legal questions or documents, find relevant cases, and rank them by relevancy score.

The tools and methods used here are similar to the ones used in e-discovery. The difference is that case documents could be uniformly structured, making them easier to process and analyze than generic documents in e-discovery.

Some machine learning based legal research platforms, like ROSS Intelligence, use NLP to narrow down the jurisdiction and date range of relevant cases. They then use part-of-speech tags and word embeddings to score the relevant documents.

Free open source software, like Stanford's Log-linear Part-Of-Speech Tagger, Apache OpenNLP, and NLTK, can be used as the backbone of a legal research platform.

Machine Learning in Legaltech

Predicting odds of winning

One of the factors lawyers and clients consider before pursuing a case is the odds of winning. Typically, an experienced lawyer compares the potential case to similar cases litigated in the same court under similar circumstances and reviews their outcomes.

Given digitized case records and the documents and facts from a potential case, a machine learning algorithm can perform a similar analysis. For example, it can analyze existing insurance case law and predict decisions in tort liability or accident benefit cases.

There have been a few academic studies on this topic. One predicted the outcomes of case-based legal arguments with a 91% success rate. Another famous study predicted the outcomes of the US Supreme Court decisions going back to 1816. The algorithm correctly predicted 70% of the court’s 28,000 decisions. It outperformed both the popular strategy of always guessing reversal, which was correct 63% of the time, and legal experts, who were correct 66% of the time. The algorithm used 16 features in making its prediction, including the justice, the term, the issue, and the court of origin.

There have been several commercial applications in this space. Blue J Legal uses machine learning to ask its clients questions about their tax filing to “rapidly resolve in advance how courts will rule in new tax situations, based on their unique factual scenarios” using 26 different factors.

London law firm Hodge Jones & Allen uses models to assess the viability of its personal injury caseload. The features used in prediction include the claimant’s demographics, the nature and cause of the injury, and the quality of the defendant’s solicitors, among others.

Contract review

Another time consuming and possibly tedious task for law departments is contract review. Analysis of contracts can identify risks, anomalies, and future financial obligations that could be costly to omit.

Machine learning is a good candidate for this information retrieval and extraction task. The system takes a list of clauses to require, accept, or reject in contracts. It then scans all contracts for variants of these clauses and brings them to the attention of the reviewer.

Kira Systems and LawGeex provide commercial platforms for contract review and analysis.

Free open source software like Apache OpenNLP and NLTK can help data scientists and software engineers in building a commercial grade contract review system.

What machine learning cannot do for Legaltech

Implicit knowledge cannot be easily identified. To help us distinguish between suitable and non-suitable tasks for machine learning, Erik Brynjolfsson and Tom Mitchell identified eight key criteria for suitable tasks. They include tolerance for error, “no need for detailed explanation of how the decision was made”, and “no long chains of logic or reasoning that depend on diverse background knowledge or common sense.”

Keeping these criteria in mind, here are some examples of non-suitable machine learning tasks in Legaltech:

  1. Writing briefs
  2. Giving legal advice
  3. Negotiating deals

Where can Legaltech go from here?

Investments in Legaltech reached $1 billion in 2018, with about one third of that spent on artificial intelligence. This spending was largely driven by demand from lawyers to automate the mundane tasks they performed, so they could spend their time on more cognitively challenging work. Consumers of legal services also benefit from these new technologies. They are willing to pay for expensive legal advice, but not $200 an hour for routine work.

The science and technology building blocks needed to automate the long list of machine learning suitable tasks in Legaltech are available today. As more investments and entrepreneurial skills rush into this industry, the list will grow even further and faster.
