Ensemble learning is a machine learning approach that combines the predictions of multiple models to achieve better performance than any individual model could alone. Everyday life offers plenty of analogies: a class farewell party succeeds when many classmates contribute ideas, whereas a single person would struggle to manage the whole event. Here are some interesting facts about ensemble methods:
- Ensemble methods can be used for both classification and regression problems.
- The idea behind ensemble learning is to reduce the bias and variance of single models by leveraging the predictive power of each model.
- There are three popular strategies for ensemble learning: bagging, boosting, and stacking, each described below (a short scikit-learn sketch of all three follows this list).
- Bagging, or bootstrap aggregating, is an ensemble strategy in which each model is trained on a random sample of the dataset drawn with replacement; aggregating over these models reduces overfitting and improves accuracy. Bagging finds applications in various fields, one notable example being weather forecasting: meteorologists train multiple models on different subsets of historical weather data and aggregate their predictions to stabilize and improve forecasts, aiding effective disaster management and planning.
- Boosting is an ensemble strategy that trains weak models sequentially, with each model trying to correct the errors of the previous ones; this improves accuracy chiefly by reducing bias. Boosting techniques are used extensively in finance, particularly in credit scoring: institutions assess credit risk from customer data such as transaction history, credit scores, and financial behavior, and by sequentially refining models on previously misclassified instances they make more accurate credit decisions, minimizing the risk of defaults and keeping the loan portfolio healthy.
- Stacking is a more complex ensemble strategy that trains multiple base models and then uses a meta-model to combine their predictions, which can outperform both bagging and boosting. It has found application in marketing analytics: a retail company aiming to optimize its marketing can stack models trained on customer data, purchasing patterns, and market trends, then use the combined insights to tailor targeted campaigns, improve customer engagement, and lift sales and revenue.
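To make the three strategies concrete, here is a minimal sketch using scikit-learn's built-in ensemble estimators on synthetic data. The dataset, model choices, and hyperparameters here are illustrative assumptions rather than a recipe, and the estimator= argument assumes scikit-learn 1.2 or later (older releases call it base_estimator):
# A minimal sketch of bagging, boosting, and stacking with scikit-learn
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
# Synthetic stand-in data; swap in your own feature matrix and labels
X, y = make_classification(n_samples=500, n_features=13, random_state=42)
# Bagging: each tree is trained on a different bootstrap sample of the rows
bagging = BaggingClassifier(estimator=DecisionTreeClassifier(max_depth=3),
                            n_estimators=10, random_state=42)
# Boosting: each new shallow tree focuses on previously misclassified rows
boosting = AdaBoostClassifier(n_estimators=50, random_state=42)
# Stacking: a logistic-regression meta-model combines the base models' predictions
stacking = StackingClassifier(
    estimators=[('tree', DecisionTreeClassifier(max_depth=3)),
                ('logreg', LogisticRegression(max_iter=1000))],
    final_estimator=LogisticRegression(),
)
for name, clf in [('bagging', bagging), ('boosting', boosting), ('stacking', stacking)]:
    print(f'{name}: mean CV accuracy = {cross_val_score(clf, X, y, cv=5).mean():.2f}')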
Let us see an example of how to use ensemble methods in Python. We will use the Heart Attack Analysis & Prediction dataset from Kaggle and implement the averaging ensemble method. Here is the Python code:
# Import required packages
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
# Load the dataset
data = pd.read_csv("C:\\Users\\Mrudula\\Downloads\\ML_lab_datasets\\heart disease.csv")
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data.drop('target', axis=1), data['target'], test_size=0.2, random_state=42)
# Train a single decision tree model
single_model = DecisionTreeClassifier(max_depth=3)
single_model.fit(X_train, y_train)
single_predictions = single_model.predict(X_test)
# Calculate the accuracy of the single model
single_accuracy = accuracy_score(y_test, single_predictions)
print(f'Accuracy of the single decision tree model: {single_accuracy:.2f}')
# Train multiple decision tree models with ensemble methods
ensemble_models = []
for i in range(10):
    model = DecisionTreeClassifier(max_depth=3)
    # Draw a bootstrap sample: 80% of the training rows, sampled with replacement
    sample = X_train.sample(frac=0.8, replace=True, random_state=i)
    model.fit(sample, y_train.loc[sample.index])
    ensemble_models.append(model)
# Make predictions on the testing set using each model
ensemble_predictions = []
for model in ensemble_models:
    ensemble_predictions.append(model.predict(X_test))
# Combine the predictions by averaging; with 0/1 labels, rounding the
# mean prediction amounts to a majority vote across the ten trees
final_predictions = pd.DataFrame(ensemble_predictions).T.mean(axis=1).round().astype(int)
# Calculate the accuracy of the ensemble model
ensemble_accuracy = accuracy_score(y_test, final_predictions)
print(f'Accuracy of the ensemble model: {ensemble_accuracy:.2f}')
In this example, we trained 10 decision tree models using bagging and combined their predictions using averaging. The accuracy of the ensemble model was 0.80, which is higher than the accuracy of the individual model.
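For 0/1 labels, this manual loop plus rounded averaging is essentially what scikit-learn's BaggingClassifier automates. Here is a minimal sketch assuming the same X_train/y_train split from above (and, as before, scikit-learn 1.2 or later for the estimator= argument):
# The manual bagging loop expressed with scikit-learn's BaggingClassifier;
# max_samples=0.8 with bootstrap=True mirrors the 80% with-replacement sampling
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
bagged = BaggingClassifier(estimator=DecisionTreeClassifier(max_depth=3),
                           n_estimators=10, max_samples=0.8,
                           bootstrap=True, random_state=42)
bagged.fit(X_train, y_train)
print(f'Accuracy of BaggingClassifier: {bagged.score(X_test, y_test):.2f}')
Since BaggingClassifier also combines its trees by voting over their predictions, its output should closely track the averaging scheme above.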
Ensemble methods are a powerful tool for improving the performance of machine learning models. By combining the predictions of multiple models, we can reduce bias and variance and obtain more accurate results.