What is Ensemble Learning?

  • Why should one consider using an ensemble?
  • What are the categories of ensemble learning?
  • What are commonly used ensemble learning algorithms?

Suppose you have produced a song or written a book. Would you ask a single friend to rate it on a scale of 1 to 5, would you ask only your colleagues, or would you ask 200 people to critique your work and rate it on the same scale? The last option is essentially what ensemble learning does: instead of making predictions based on one model, we train multiple models and combine their predictions.

Ensemble learning is the process by which multiple models, such as classifiers or experts, are strategically generated and combined to solve a particular computational intelligence problem. Ensemble learning is primarily used to improve the performance of a model (in classification, prediction, function approximation, etc.), or to reduce the likelihood of an unfortunate selection of a poor one. Other applications of ensemble learning include assigning a confidence to the decision made by the model, selecting optimal (or near-optimal) features, data fusion, incremental learning, non-stationary learning and error correction.

Why should we consider using an ensemble?
Ensemble methods increase computational cost and complexity, because multiple models, rather than a single model, must be trained and maintained, which takes more expertise and time. Despite this cost, there are two main reasons to use an ensemble over a single model:

  • Performance: An ensemble can make better predictions and achieve better performance than any single contributing model.
  • Robustness: An ensemble reduces the spread or dispersion of the predictions and model performance.

Bias, Variance and Ensembles

To learn what bias and variance are, check out What is bias-variance trade-off?

The errors made by a machine learning model are often described in terms of two properties, namely, bias and variance.
Ideally, we would prefer a model with low bias and low variance, although in practice this is very challenging. In fact, achieving it could be described as the goal of applied machine learning for a given predictive modelling problem. Reducing the bias can often be achieved easily, but at the cost of increasing the variance. Conversely, reducing the variance can often be achieved easily, but at the cost of increasing the bias. Some models naturally have a high bias or a high variance, which can often be reduced or increased using hyperparameters that change the learning behaviour of the algorithm.
Ensembles provide a way to reduce the variance of the predictions. Empirical and theoretical evidence shows that some ensemble techniques (such as bagging) act as a variance reduction mechanism, i.e., they reduce the variance component of the error. Moreover, empirical results suggest that other ensemble techniques (such as AdaBoost) reduce both the bias and the variance parts of the error.
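A small simulation can illustrate the variance-reduction effect. The sketch below is not tied to any particular ensemble algorithm. It assumes, purely for illustration, that each model's prediction equals the true value plus independent Gaussian noise; the constants and the function names (noisy_prediction, ensemble_prediction, prediction_variance) are hypothetical. Under that assumption, averaging n predictions should shrink the variance by roughly a factor of n.

```python
import random
import statistics

TRUE_VALUE = 3.0   # hypothetical quantity every model tries to predict
NOISE_STD = 1.0    # assumed spread of a single model's error

def noisy_prediction(rng):
    """One model's prediction: the true value plus independent Gaussian noise."""
    return rng.gauss(TRUE_VALUE, NOISE_STD)

def ensemble_prediction(rng, n_models):
    """Average the predictions of n_models independent models."""
    return statistics.fmean(noisy_prediction(rng) for _ in range(n_models))

def prediction_variance(n_models, n_trials=5000, seed=0):
    """Estimate the variance of the (ensemble) prediction over repeated trials."""
    rng = random.Random(seed)
    preds = [ensemble_prediction(rng, n_models) for _ in range(n_trials)]
    return statistics.pvariance(preds)

single_var = prediction_variance(1)
ensemble_var = prediction_variance(25)
print(f"single model variance:      {single_var:.3f}")
print(f"25-model ensemble variance: {ensemble_var:.3f}")
```

Real base learners are trained on overlapping data and are therefore correlated, so the reduction seen in practice is smaller than in this idealised setting.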

Basic Types of Ensemble Learning
Ensemble methods fall into two broad categories, i.e., sequential ensemble techniques and parallel ensemble techniques.
Sequential ensemble techniques generate base learners in a sequence, e.g., Adaptive Boosting (AdaBoost). Sequential generation exploits the dependence between the base learners: each new learner is trained with knowledge of the errors made by its predecessors. The performance of the model is then improved by assigning higher weights to previously misclassified examples.
In parallel ensemble techniques, base learners are generated in parallel, e.g., random forest. Parallel methods use the parallel generation of base learners to encourage independence between them. When the errors of the base learners are largely independent, averaging their predictions can significantly reduce the error.
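The benefit of independence can be quantified in an idealised case. The sketch below assumes a binary task, an odd number of learners, and that every learner is correct with the same probability independently of the others; these are simplifying assumptions, and majority_vote_accuracy is a hypothetical helper name. It computes the probability that a strict majority of the learners is correct, using the binomial distribution.

```python
from math import comb

def majority_vote_accuracy(n_learners, p_correct):
    """Probability that a strict majority of n independent learners is correct.

    Assumes a binary task, an odd number of learners, and that every learner
    is correct with the same probability, independently of the others.
    """
    if n_learners % 2 == 0:
        raise ValueError("use an odd number of learners to avoid ties")
    needed = n_learners // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_learners, k) * p_correct**k * (1 - p_correct) ** (n_learners - k)
        for k in range(needed, n_learners + 1)
    )

# Accuracy of the vote as the ensemble grows, for learners that are 60% accurate.
for n in (1, 5, 25, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

When the learners are better than chance, the accuracy of the vote rises with the ensemble size. The gain shrinks as the learners' errors become correlated, which is why parallel methods try to keep base learners diverse.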

Commonly used Ensemble Learning Algorithms

  1. Bagging, which stands for bootstrap aggregating, is one of the earliest, most intuitive and perhaps the simplest ensemble-based algorithms, with surprisingly good performance. The diversity of classifiers in bagging is obtained by using bootstrapped replicas of the training data. That is, different training data subsets are randomly drawn, with replacement, from the entire training dataset. Each training data subset is used to train a different classifier of the same type. Individual classifiers are then combined by taking a simple majority vote of their decisions. For any given instance, the class chosen by the largest number of classifiers is the ensemble decision. Since the training datasets may overlap substantially, additional measures can be used to increase diversity, such as using a subset of the training data for training each classifier, or using relatively weak classifiers.

  2. Boosting: Similar to bagging, boosting also creates an ensemble of classifiers by resampling the data, which are then combined by majority voting. However, in boosting, resampling is strategically geared to provide the most informative training data for each consecutive classifier. In essence, each iteration of boosting creates three weak classifiers: the first classifier B1 is trained with a random subset of the available training data. The training data subset for the second classifier B2 is chosen as the most informative subset, given B1. Specifically, B2 is trained on training data only half of which is correctly classified by B1, while the other half is misclassified. The third classifier B3 is trained with instances on which B1 and B2 disagree. The three classifiers are combined through a three-way majority vote.

  3. Stacking: Here, an ensemble of classifiers is first trained using bootstrapped samples of the training data, creating Tier 1 classifiers, whose outputs are then used to train a Tier 2 classifier (meta-classifier). The underlying idea is to learn whether the training data have been properly learned. For example, if a particular classifier incorrectly learned a certain region of the feature space, and hence consistently misclassifies instances coming from that region, then the Tier 2 classifier may be able to learn this behaviour, and along with the learned behaviours of other classifiers, it can correct such improper training. Cross-validation type selection is typically used for training the Tier 1 classifiers: the entire training dataset is divided into T blocks, and each Tier 1 classifier is first trained on (a different set of) T-1 blocks of the training data. Each classifier is then evaluated on the Tth (pseudo-test) block, not seen during training. The outputs of these classifiers on their pseudo-test blocks, along with the actual correct labels for those blocks, constitute the training dataset for the Tier 2 classifier.
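The cross-validation step of stacking can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes one threshold rule (a "stump") per feature as the Tier 1 classifiers, a simple lookup table as the Tier 2 classifier, and a made-up toy dataset whose label is 1 only when both features are 1. All function names are hypothetical, and the bootstrapping of Tier 1 training data is omitted for brevity.

```python
from collections import Counter, defaultdict

def fit_stump(rows, labels, feature):
    """Tier 1 learner: best single-threshold rule on one feature."""
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        for left, right in ((0, 1), (1, 0)):
            errors = sum(
                (right if row[feature] > threshold else left) != y
                for row, y in zip(rows, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, left, right)
    _, threshold, left, right = best
    return lambda row: right if row[feature] > threshold else left

def fit_meta(meta_rows, labels):
    """Tier 2 learner: majority label for each combination of Tier 1 outputs."""
    table = defaultdict(Counter)
    for key, y in zip(meta_rows, labels):
        table[key][y] += 1
    default = Counter(labels).most_common(1)[0][0]
    lookup = {key: counts.most_common(1)[0][0] for key, counts in table.items()}
    return lambda key: lookup.get(key, default)

def fit_stacking(rows, labels, n_blocks=4):
    """Build the Tier 2 training set from held-out Tier 1 predictions."""
    features = range(len(rows[0]))
    meta_rows = [None] * len(rows)
    for t in range(n_blocks):
        held_out = [i for i in range(len(rows)) if i % n_blocks == t]
        train = [i for i in range(len(rows)) if i % n_blocks != t]
        tier1 = [
            fit_stump([rows[i] for i in train], [labels[i] for i in train], f)
            for f in features
        ]
        for i in held_out:  # predictions on the pseudo-test block only
            meta_rows[i] = tuple(model(rows[i]) for model in tier1)
    meta = fit_meta(meta_rows, labels)
    # Refit Tier 1 on all the data for use at prediction time.
    final_tier1 = [fit_stump(rows, labels, f) for f in features]
    return lambda row: meta(tuple(model(row) for model in final_tier1))

# Toy data (made up): the label is 1 only when both features are 1.
pattern = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (1.0, 1.0)]
rows = pattern * 4
labels = [int(a == 1.0 and b == 1.0) for a, b in rows]

predict = fit_stacking(rows, labels)
print([predict(row) for row in pattern[:4]])
```

On this toy data, each Tier 1 stump sees only one feature and so cannot represent the rule by itself, while the Tier 2 table can learn that both stumps must fire. Because every Tier 2 training row comes from a block the Tier 1 models did not see, the meta-classifier learns from honest rather than memorised predictions.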

Bagging and boosting are two of the most commonly used ensemble techniques in machine learning. The following are a few example algorithms of each kind:

Bagging algorithms:

  • Bagging meta-estimator
  • Random forest
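The bagging procedure described earlier (bootstrap replicas of the training data, one classifier per replica, then a majority vote) can be sketched in a few lines. This is an illustrative sketch under stated assumptions: one-dimensional inputs, binary labels, threshold rules as the base classifiers, and a made-up toy dataset. The function names are hypothetical and do not correspond to any library's API.

```python
import random
from collections import Counter

def train_stump(points):
    """Fit a 1-D threshold classifier: predict `right` if x > threshold, else `left`.

    Tries every observed x as a threshold and both label orientations,
    keeping the combination with the fewest training errors.
    """
    best = None
    for threshold in sorted({x for x, _ in points}):
        for left, right in ((0, 1), (1, 0)):
            errors = sum((right if x > threshold else left) != y for x, y in points)
            if best is None or errors < best[0]:
                best = (errors, threshold, left, right)
    _, threshold, left, right = best
    return lambda x: right if x > threshold else left

def bagging_fit(points, n_estimators=25, seed=0):
    """Train one stump per bootstrap replica (sampling with replacement)."""
    rng = random.Random(seed)
    return [
        train_stump(rng.choices(points, k=len(points)))
        for _ in range(n_estimators)
    ]

def bagging_predict(models, x):
    """Combine the individual decisions by simple majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Toy data (made up): the label is 1 when x >= 20, with two deliberately flipped labels.
data = [(float(x), int(x >= 20)) for x in range(40)]
data[3] = (3.0, 1)
data[35] = (35.0, 0)

ensemble = bagging_fit(data)
print([bagging_predict(ensemble, x) for x in (5.0, 10.0, 30.0, 38.0)])
```

Each stump sees a slightly different replica of the data, so the thresholds differ from model to model, and the vote smooths out those differences. A random forest adds a further source of diversity by also randomising the features considered at each split.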

Boosting algorithms:

  • AdaBoost
  • GBM
  • XGBoost
  • LightGBM
  • CatBoost
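As an illustration of the first algorithm in this list, the following is a minimal sketch of AdaBoost for binary classification. It differs from the three-classifier boosting scheme described earlier: instead of resampling, it keeps a weight per training example and increases the weights of misclassified examples after every round. The sketch assumes one-dimensional inputs, labels in {+1, -1}, threshold rules as weak learners, and a small made-up dataset; the function names are hypothetical.

```python
import math

def fit_weighted_stump(xs, ys, weights):
    """Find the threshold rule with the lowest weighted error (labels are +1/-1)."""
    best = None
    for threshold in sorted(set(xs)):
        for left, right in ((1, -1), (-1, 1)):
            error = sum(
                w for x, y, w in zip(xs, ys, weights)
                if (right if x > threshold else left) != y
            )
            if best is None or error < best[0]:
                best = (error, threshold, left, right)
    return best

def adaboost_fit(xs, ys, n_rounds):
    """Return a list of (alpha, threshold, left, right) weak learners."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(n_rounds):
        error, threshold, left, right = fit_weighted_stump(xs, ys, weights)
        error = min(max(error, 1e-12), 1 - 1e-12)  # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, threshold, left, right))
        # Increase the weight of misclassified examples, decrease the rest.
        weights = [
            w * math.exp(-alpha * y * (right if x > threshold else left))
            for x, y, w in zip(xs, ys, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def adaboost_predict(ensemble, x):
    """Weighted vote of the weak learners; the sign of the score is the class."""
    score = sum(
        alpha * (right if x > threshold else left)
        for alpha, threshold, left, right in ensemble
    )
    return 1 if score >= 0 else -1

# Toy data (made up): no single threshold separates the two classes.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

for rounds in (1, 3):
    model = adaboost_fit(xs, ys, rounds)
    mistakes = sum(adaboost_predict(model, x) != y for x, y in zip(xs, ys))
    print(f"{rounds} round(s): {mistakes} training mistakes")
```

On this toy data a single stump misclassifies three points, whereas the weighted vote of three stumps fits the training set exactly. Gradient boosting libraries build on the same sequential idea but fit each new learner to the gradient of a loss function rather than to reweighted examples.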
