How to build your first ensemble
April 12, 2024 02:05 PM
If you work in AI, data science, research, or a similar field (or are interested in those areas), one of the most reliable ways to improve your results is to use ensembles: combinations of models that often outperform any single model. This blog post will show you how to build your first ensemble.
Our finished result will look like this. Here are the error rates (RMSE) for the three models we will build; smaller is better:
> results
$Linear
[1] 6.108005
$Trees
[1] 5.478017
$Ensembles_Linear
[1] 3.776555
As you can see, the Ensembles_Linear model has by far the lowest error rate of the three. How was that done?
1. It all starts with the data set. In this case we will use the Boston Housing data set.
library(MASS) # Needed for the Boston Housing data set
library(tree) # Needed to build tree models
# The Metrics package must also be installed; we call Metrics::rmse() below to measure error
head(MASS::Boston, n = 10) # look at the first ten (out of 506) rows of the Boston Housing data set
dim(MASS::Boston)
Let’s look at the first ten rows of the data set:

We are modeling the median home value, the medv column in the data set.
2. Next we are going to break the data set into two groups: Train (about 80%) and test (about 20%). There is nothing special about the 80/20 split.
df <- MASS::Boston
train <- df[1:400, ] # the first 400 rows
test <- df[401:505, ] # rows 401 through 505 (105 rows; the 506th row is not used here)
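Since nothing is special about taking the first rows for training, a randomized split is a common alternative. Here is a minimal sketch; the seed value is arbitrary and only makes the split reproducible:

```r
# Randomized ~80/20 split of the Boston data (an alternative to the row-order split)
df <- MASS::Boston
set.seed(42)                                   # arbitrary seed, for reproducibility
train_rows   <- sample(nrow(df), size = round(0.8 * nrow(df)))
train_random <- df[train_rows, ]               # about 80% of rows, chosen at random
test_random  <- df[-train_rows, ]              # the remaining rows
```

A randomized split can avoid any ordering effects in the data; the Boston rows are not stored in random order.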
Let’s look at the train and test sets:


3. We’ll start with the linear model. We’ll follow three steps, in this order:
A. Fit the model on the training data
B. Make predictions on the testing data
C. Calculate error rate of the predictions
# Linear model
Boston_lm <- lm(medv ~ ., data = train)
# Predictions for the linear model using the test data (required for the ensemble)
Boston_lm_predictions <- predict(object = Boston_lm, newdata = test)
# Error rate for the linear model using actual vs predicted results
Boston_lm_RMSE <- Metrics::rmse(actual = test$medv, predicted = Boston_lm_predictions)
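Metrics::rmse() computes the root mean squared error: the square root of the mean of the squared differences between actual and predicted values. A quick base-R check on made-up numbers shows the arithmetic:

```r
# RMSE by hand: sqrt of the mean squared difference (toy values, for illustration only)
actual       <- c(24.0, 21.6, 34.7)
predicted    <- c(23.0, 22.6, 33.7)
rmse_by_hand <- sqrt(mean((actual - predicted)^2))
rmse_by_hand   # every error here is exactly 1, so the RMSE is 1
```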
4. We’ll do exactly the same steps with a different modeling method. We’ll use trees, but there are many other options available. We will follow the same three steps we used for the linear model:
A. Fit the model on the training data
B. Make predictions on the testing data
C. Calculate error rate of the predictions
# Tree model
Boston_tree <- tree(medv ~ ., data = train)
# Predictions for the tree model using the test data (required for the ensemble)
Boston_tree_predictions <- predict(object = Boston_tree, newdata = test)
# Error rate for the tree model using actual and predicted results
Boston_tree_RMSE <- Metrics::rmse(actual = test$medv, predicted = Boston_tree_predictions)
5. The ensemble is built from the predictions of the two methods (linear and trees). It also needs the true values of the response, and those are in the medv column of the testing data set.
# Create the ensemble
ensemble <- data.frame(
'linear' = Boston_lm_predictions,
'tree' = Boston_tree_predictions,
'y_ensemble' = test$medv
)
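Before fitting a model on top of these predictions, it is worth noting that the simplest possible ensemble is just the row-by-row average of the base models. A self-contained sketch with made-up predictions:

```r
# The simplest ensemble: average the base-model predictions row by row.
# These are hypothetical predictions for three test houses, for illustration only.
ensemble_demo <- data.frame(
  linear     = c(25.1, 19.8, 31.0),
  tree       = c(23.9, 21.2, 29.0),
  y_ensemble = c(24.0, 20.0, 30.0)
)
average_prediction <- rowMeans(ensemble_demo[, c("linear", "tree")])
average_prediction   # 24.5 20.5 30.0
```

Fitting a linear model on the predictions, as we do next, generalizes this idea by learning the best weights for each base model instead of fixing them at 1/2.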
The ensemble looks like this:

6. Next we break the ensemble into its own train (the first 80 rows) and test (the remaining 25 rows) sets.
ensemble_train <- ensemble[1:80, ]
ensemble_test <- ensemble[81:105, ]
Here is what each looks like:
The ensemble training set:

The ensemble testing set:

7. Now we’ll follow the exact same steps we used above, but with the ensemble data set.
A. Fit the model on the training data
B. Make predictions using the testing data
C. Calculate the error rate of the predictions.
# Ensemble linear modeling
ensemble_lm <- lm(y_ensemble ~ ., data = ensemble_train)
# Predictions for the ensemble linear model
ensemble_prediction <- predict(ensemble_lm, newdata = ensemble_test)
# Root mean squared error for the ensemble linear model
ensemble_lm_RMSE <- Metrics::rmse(actual = ensemble_test$y_ensemble, predicted = ensemble_prediction)
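One nice property of a linear ensemble is that its coefficients show how much weight each base model receives. A self-contained sketch with simulated predictions (the data here is made up, only to show the shape of the output):

```r
# Inspecting the weights a linear ensemble assigns to its base models
set.seed(1)
y <- rnorm(50, mean = 22, sd = 5)              # simulated "true" values
stack <- data.frame(
  linear     = y + rnorm(50, sd = 2),          # noisy predictions from model 1
  tree       = y + rnorm(50, sd = 3),          # noisier predictions from model 2
  y_ensemble = y
)
stack_lm <- lm(y_ensemble ~ ., data = stack)
coef(stack_lm)   # an intercept plus one weight per base model
```

In practice you would call coef(ensemble_lm) on the model fitted above; expect the less noisy base model to receive the larger weight.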
8. Let’s put it all together by collecting the three error rates in one list:
results <- list(
'Linear' = Boston_lm_RMSE,
'Trees' = Boston_tree_RMSE,
'Ensembles_Linear' = ensemble_lm_RMSE
)
9. Let’s look at the results. Keep in mind these are error rates, so lower is better.

10. Last, let’s check for any errors or warnings:
warnings() # There are no warnings returned.
See if you can build your own ensemble. You may use the Boston Housing data set, or any other data set you wish. Future blog posts will highlight more examples and ways to make ensembles, and show how ensembles produce excellent results.
Complete code for this blog post:
https://raw.githubusercontent.com/InfiniteCuriosity/HighestAccuracy/master/Most_basic_ensemble_demo.R