HW 3 part 1

HW 3 part 1#

Ensembling: Foundations

—TODO your name here

Collaboration Statement

TODO brief statement on the nature of your collaboration.
TODO your collaborators’ names here

Part 1 Table of Contents and Rubric#

Section	Points
Variance reduction with ensembling	1.25
Decision trees and bias-variance	1.25
Implementing tree ensembles	3
Total	5.5 pts

Notebook and function imports#

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.base import BaseEstimator
from sklearn.tree import DecisionTreeRegressor
from typing import Self
from ipywidgets import interact_manual

# random number generator
rng = np.random.RandomState(42)

1. Variance and ensembles [1.25 pts]#

In Class 12, we saw that random forests combine many trees and that “individual errors cancel out” when predictions are independent. In this section we’ll make that idea precise using variance, a quantity that measures how spread out a set of values is.

1.1 Variance definition [0.25 pts]#

In Worksheet 2, we saw the expectation of a random variable $Y$ is the average value of $Y$ over all possible outcomes. It is given by:

\[ E[Y] = \sum_{y \in \mathcal{Y}} y P(Y=y) \]

The variance of a random variable $Y$ measures how spread out the values of $Y$ are around its mean. It is defined as:

\[ \text{Var}(Y) = E\left[(Y - E[Y])^2\right] \]

When we have a finite set of $n$ values $y_1, y_2, \ldots, y_n$ (where each value is equally likely), this simplifies to:

\[ \text{Var}(Y) = \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2 \]

where $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ is the mean.

To make this concrete, suppose we have trained four different decision trees $f_1, f_2, f_3, f_4$, each time on a different random subset of the training data to predict housing prices. For a single test house with true value $y=6$, the four trees make the following predictions:

\[ f_1(x) = 3, \quad f_2(x) = 9, \quad f_3(x) = 5, \quad f_4(x) = 7 \]

What is the mean prediction $\bar{f}$ across the four trees?

\[ \bar{f} = TODO \]

What is the variance of the predictions?

\[ \text{Var}(f) = TODO \]

1.2 Why bagging reduces variance [1 pt]#

The core idea behind bagging (and random forests) for regression tasks is that averaging multiple independent predictions produces a more stable result than relying on any single prediction. This averaging for regression is equivalent to the majority voting we saw in classification.

Let’s see why this is the case mathematically. Suppose we have $T$ independent random variables $Y_1, Y_2, \ldots, Y_T$, each with the same variance $\sigma^2$:

\[ \text{Var}(Y_t) = \sigma^2 \quad \text{for all } t = 1, 2, \ldots, T \]

Note that this $\sigma^2$ term represents the “inherent” variability of a single random variable. In the ensemble context, each $Y_t$ represents the prediction of a different tree. We want to find the variance of the average of these predictions $\overline{Y}$:

\[ \overline{Y} = \frac{1}{T}\sum_{t=1}^{T} Y_t \]

To complete our derivation, we’ll use the following two properties of variance:

Note

Scaling: For any constant $c$ and random variable $Y$:

\[\text{Var}(cY) = c^2 \, \text{Var}(Y)\]

Sum of independent variables: If $X$ and $Y$ are independent:

\[\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)\]

This extends to $T$ independent variables: $\text{Var}(Y_1 + Y_2 + \cdots + Y_T) = \text{Var}(Y_1) + \text{Var}(Y_2) + \cdots + \text{Var}(Y_T)$

1.2.1 Complete the derivation of the variance of the average of the predictions by filling in each step. We start with:

\[\begin{split} \text{Var}(\bar{Y}) &= \text{Var}\left(\frac{1}{T}\sum_{t=1}^{T} Y_t\right) \\ &= \text{TODO step 1} \\ &= \text{TODO step 2} \\ &= \text{TODO step 3} \end{split}\]

Derivation tips

For step 1, use the scaling property of variance.
For step 2, use the sum of independent variables property of variance.
For step 3, simplify the summation using the fact that the variance of each $Y_t$ is $\sigma^2$.

Click to check derivation

After applying the variance properties and simplifying the summation, we get:

\[ \text{Var}(\bar{Y}) = \frac{\sigma^2}{T} \]

1.2.2 Using this result for $\text{Var}(\bar{Y})$, suppose that the variance of a single decision tree is $\sigma^2 = 5$. What would the variance of an ensemble of $T = 4$ independent trees be? What about $T = 100$ trees?

Ensemble variance with $T = 4$: TODO
Ensemble variance with $T = 100$: TODO

1.2.3 Based on your results above, what happens to the variance of the ensemble prediction as the number of trees $T$ increases? What does that tell us about the “stability” of predictions made by an ensemble?

Your response: TODO

2. Decision trees and bias-variance [1.25 pts]#

We’ll now fit our own decision trees to verify these results empirically. First, let’s get familiar with sklearn’s implementation of decision trees. Specifically, sklearn provides two decision tree classes:

DecisionTreeClassifier for classification (predicting categories)
DecisionTreeRegressor for regression (predicting continuous values)

Both follow the same .fit() / .predict() interface we’ve used and practiced all semester. The key hyperparameter is max_depth, which controls the maximum depth of the tree:

from sklearn.tree import DecisionTreeRegressor

# max_depth controls complexity: deeper = more complex
tree = DecisionTreeRegressor(max_depth=3, random_state=42)
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)

A few things to note:

max_depth=None (the default) means the tree grows until no more splits are possible that reduce the loss function. This can memorize the training data and overfit.
max_depth=1 creates a “decision stump”: a tree with a single split. Very simple, but often underfits. Decision stumps are often used as the “base model” in boosting algorithms.
random_state controls the random number generator. When there are ties between equally good splits, sklearn breaks them randomly, so setting random_state ensures reproducibility.
No StandardScaler is needed to prepare the data. Unlike logistic and linear regression, decision trees split on raw feature values and don’t rely on gradient descent, so scaling doesn’t change the result.

Regression vs. Classification

Decision tree regressors predict continuous values (house prices). At each leaf, the prediction is the mean of the training examples that fall into that leaf.

Decision tree classifiers predict categories (penguin species). At each leaf, the prediction is the majority class, and .predict_proba() returns the proportion of each class.

We’ll use DecisionTreeRegressor in this section to connect with the variance math from the questions above.

2.1 Visualizing decision trees [0.25 pts]#

Let’s practice the API with a small housing dataset: predicting house prices from the number of rooms. Complete the cell below to fit a depth-1 decision stump and a deep tree with no depth limit:

if __name__ == "__main__":
    from sklearn.tree import DecisionTreeRegressor

    # A small toy dataset: 6 houses with one feature (number of total rooms)
    X_toy = np.array([[2], 
                      [3], 
                      [4], 
                      [6], 
                      [7], 
                      [8]])
    y_toy = np.array([1, 2, 3, 5, 7, 6])

    # TODO initialize and fit a depth-1 stump
    stump = None
    stump_predictions = None
    #print("Stump predictions:", stump_predictions)

    # TODO initialize and fit a deep tree (no depth limit)
    deep = None
    deep_predictions = None
    #print("Deep tree predictions:", deep_predictions)

Visualizing decision trees with `plot_tree`#

sklearn provides plot_tree to visualize the splits and predictions of a fitted tree. It works with matplotlib’s object-oriented interface that we practiced in Worksheet 3: we create a fig and ax using plt.subplots(), then pass the ax to plot_tree.

Some useful parameters:

feature_names=feature_names: a list of feature name strings, so the plot shows the name of the features
fontsize: controls the text size inside each node
filled=True: colors each node by its predicted value (darker = higher)
impurity=False: hides the splitting metric (called “impurity” in the sklearn docs) to reduce visual clutter
ax: a matplotlib axis object, so the plot is drawn on the axis

We add an additional variation to the plt.subplots() call to create a figure with two subplots side by side: by specifying the first two parameters as 1, 2, we create a figure with 1 row and 2 columns of subplots:

# first parameter is the number of rows, second is the number of columns
fig, axs = plt.subplots(1, 2, figsize=(14, 4))

Then, we can access each axis by indexing into axs:

axs[0] # left subplot
axs[1] # right subplot

Complete the cell below to plot the stump and deep tree side by side using this approach. You can adjust the fontsize parameter to make the plot more readable by setting it to a smaller value if there are too many nodes.

if __name__ == "__main__":
    from sklearn.tree import plot_tree
    import matplotlib.pyplot as plt

    feature_names = ["# of rooms"]

    # Create a figure with two subplots
    fig, axs = plt.subplots(1, 2, figsize=(14, 4))


    # TODO plot the stump you fitted in the previous cell on the left (axs[0])
    #plot_tree(TODO)
    axs[0].set_title("Stump (max_depth=1)")

    # TODO plot the deep tree you fitted in the previous cell on the right (axs[1])
    #plot_tree(TODO)
    axs[1].set_title("Deep tree (max_depth=None)")

    plt.show()

Click to check sample plot

Your calls to plot_tree that place the stump on axs[0] and the deep tree on axs[1] should look approximately like the following:

2.2 Bootstrap Sampling [0.25 pts]#

A key ingredient in ensemble methods is bootstrap sampling: drawing $n$ examples with replacement from a dataset of $n$ examples. Because we sample with replacement, each bootstrap sample contains some duplicate rows and omits others.

Why do we want duplicates? The idea here is that by treating the original dataset as a “true” population to sample from, drawing samples with replacement gives us a way to simulate “training on different datasets” when we only have one dataset. Each bootstrap sample is a slightly different view of the data, which leads to slightly different trees, which is exactly the diversity we need for ensembling.

To draw a bootstrap sample in numpy, we can use rng.choice to sample indices with replacement, then use those indices to select the rows of X and y:

seed = 42
# Initialize a random number generator with a specific seed for reproducibility
rng = np.random.RandomState(seed)

# setting replace=True means we can sample the same index multiple times (with replacement)
# n is the number of examples in the dataset
bootstrap_idx = rng.choice(n, size=n, replace=True)  
X_bootstrap = X[bootstrap_idx]
y_bootstrap = y[bootstrap_idx]

Visit the NumPy documentation for rng.choice to understand what all three parameters shown in the code above do, then implement the bootstrap_sample function below. We’ll reuse this function throughout the rest of part 1.

def bootstrap_sample(X: np.ndarray, y: np.ndarray, random_state: int) -> tuple[np.ndarray, np.ndarray]:
    """Create a bootstrap sample from feature matrix X and targets y.

    Args:
        X: feature 2D array of shape (n, p)
        y: target 1D array of shape (n,)
        random_state: random seed for reproducibility

    Returns:
        X_boot: bootstrapped features of shape (n, p)
        y_boot: bootstrapped targets of shape (n,)
    """

    # TODO your code here
    return None, None

if __name__ == "__main__":
    # Test with small data
    X_small = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
    y_small = np.array([10, 20, 30, 40])

    X_boot, y_boot = bootstrap_sample(X_small, y_small, random_state=0)

    assert X_boot.shape == X_small.shape, f"Expected shape {X_small.shape}, got {X_boot.shape}"
    assert y_boot.shape == y_small.shape, f"Expected shape {y_small.shape}, got {y_boot.shape}"

    # Verify reproducibility
    X_boot2, y_boot2 = bootstrap_sample(X_small, y_small, random_state=0)
    assert np.array_equal(X_boot, X_boot2), "Same seed should give same result"

pandas equivalent

If you’re working with a pandas DataFrame instead of numpy arrays, you can use DataFrame.sample() to generate a bootstrap sample:

boot_df = df.sample(n=len(df), replace=True, random_state=42)

2.3 Bias and Variance [0.75 pts]#

Based on the tree plots from 2.1, we can see that a stump (max_depth=1) and a deep tree (max_depth=None) make very different predictions. But they make errors in different ways:

The stump will only ever predict two possible values. It’s consistently off with predictions that are too “coarse” because a single split can’t capture the full pattern.
The deep tree will predict the exact values of the training data. But if we trained it on a different random sample, it would memorize that sample and give very different predictions.

These two types of errors can be quantified by two concepts:

Bias measures how far off a model’s average prediction is from the truth. A model with high bias is consistently wrong in the same direction: like the stump, which systematically oversimplifies. Mathematically, for a single datapoint $(\vec{x}, y)$ and a given model $f$ with fixed hyperparameters, the bias is expressed as:

\[ \text{Bias} = \bar{f}(\vec{x}) - y \]

Where $\bar{f}(\vec{x})$ is the mean model output over the given predictions.

Variance measures how much a model’s predictions fluctuate when trained on different data. A model with high variance is “unstable”: like the deep tree, which gives very different answers depending on the training set. Mathematically, we use the same variance calculation from the previous section, this time over all of the model predictions:

\[ \text{Var} = \frac{1}{n}\sum_{i=1}^{n}\left(f(\vec{x}_i) - \bar{f}(\vec{x}_i)\right)^2 \]

This choice of shallow vs deep tree illustrates the bias-variance tradeoff that is present in many machine learning models: simpler models (stumps) tend to have high bias and low variance, while complex models (deep trees) tend to have low bias and high variance.

Let’s explore these concepts empirically. Complete the compute_bias_variance function below. Note: you do not need to implement the variance equation from scratch, as it is already implemented in numpy as np.var.

def compute_bias_and_variance(predictions: np.ndarray, y_true: float) -> tuple[float, float]:
    """Compute the bias and variance of predictions for a single test point.

    Args:
        predictions: array of shape (n_bootstrap,) containing predictions
                     from models trained on different bootstrap samples
        y_true: the true value for this test point

    Returns:
        bias: the difference between mean prediction and true value
        variance: the variance of the predictions
    """
    # TODO your code here
    bias = None
    variance = None

    return bias, variance

if __name__ == "__main__":
    # Test with a simple known example
    test_preds = np.array([3.0, 5.0, 4.0, 4.0])
    bias, var = compute_bias_and_variance(test_preds, y_true=6.0)
    assert np.isclose(bias, -2.0), f"Expected bias=-2.0, got {bias}"
    assert np.isclose(var, 0.5), f"Expected variance=0.5, got {var}"

    # Test with constant predictions (zero variance)
    const_preds = np.array([3.0, 3.0, 3.0])
    bias_c, var_c = compute_bias_and_variance(const_preds, y_true=5.0)
    assert np.isclose(bias_c, -2.0), f"Expected bias=-2.0, got {bias_c}"
    assert np.isclose(var_c, 0.0), f"Expected variance=0.0, got {var_c}"

2.3.1 We will use our bootstrap sampling method to quantify the bias and variance of the stump and deep tree. Complete the cell below to take 50 bootstrap samples of X_train_sim and y_train_sim, train a stump and deep tree on each sample, then compute the bias and variance of the predictions on a single test point where the price of the house is $400k: y=4.

if __name__ == "__main__":
    # Generates some synthetic regression data: 100 houses, 5 features
    rng_sim = np.random.RandomState(42)
    X_train_sim = rng_sim.uniform(1, 10, size=(100, 5))
    y_train_sim = (2 * np.sin(X_train_sim[:, 0]) + 0.5 * X_train_sim[:, 0]
                   + 3.4 + rng_sim.randn(100) * 0.8)

    x_test_point = np.array([[5, 5, 5, 5, 5]])  # a new house with features [5, 5, 5, 5, 5]
    y_true_point = 4.0              # true value for this house

    # store the list of predictions for each model
    stump_predictions = []
    deep_predictions = []

    n_bootstrap = 50

    # loop over 50 bootstrap samples
    for i in range(n_bootstrap):

        # TODO take a bootstrap sample of the training data X_train_sim and y_train_sim
        # NOTE: make sure to pass the random_state=i to bootstrap_sample() so that the seed is different
        bootstrap_result = None  # X_boot, y_boot = bootstrap_sample(...)

        # TODO train a stump on the bootstrap sample

        # TODO add the stump's prediction for this bootstrap sample to the list
        stump_predictions.append(None)

        # TODO train a deep tree on the bootstrap sample

        # TODO add the deep tree's prediction for this bootstrap sample to the list
        deep_predictions.append(None)

    # TODO call compute_bias_and_variance() with the list of predictions and the true value for the test point
    stump_bias, stump_variance = 0, 0
    deep_bias, deep_variance = 0, 0

    print(f"Stump bias: {stump_bias:.2f}, Stump variance: {stump_variance:.2f}")
    print(f"Deep tree bias: {deep_bias:.2f}, Deep tree variance: {deep_variance:.2f}")

2.3.2 Which model has higher bias and which has higher variance?

Higher bias: TODO
Higher variance: TODO

Most non-neural network models need to balance bias and variance to achieve good overall performance with lower loss. In fact, if we use our usual MSE loss function it can be shown that the overall error of a model can be decomposed into bias and variance terms:

\[ \text{MSE} \approx \text{Bias}^2 + \text{Var} \]

Compute the approximate MSE of the stump and deep tree using the bias and variance values from the previous cell, making sure to square the bias values:

Stump MSE: TODO
Deep tree MSE: TODO

Based on these empirical results on this small synthetic dataset, does a more complex or a simpler model appear to perform better in terms of overall loss?

Your response: TODO

Click to check takeaway

Exact numbers will vary, but for this small synthetic dataset, the deep tree has lower MSE, meaning it performs better in terms of overall loss. This is despite the fact that the deep tree has higher variance. In practice, if we wanted to use a decision tree, we would tune the max depth hyperparameter through cross-validation to find the sweet spot between bias and variance that minimizes overall loss.

Generally in modern machine learning, the overall engineering approach is to build a complex model that can appropriately fit the relationship between features and targets, then use regularization to control the variance. This approach carries forward into deep learning with neural networks.

3. Implementing Tree Ensembles [3 pts]#

Now that we’ve explored the variance of trees mathematically (Section 1) and empirically (Section 2), let’s implement the bagging and boosting ensemble strategies from scratch using DecisionTreeRegressor.

3.1 Data Setup#

Run the cell below to generate a synthetic dataset for us to use in this section, with our usual names of X_train, X_test, y_train, and y_test. There are 80 data examples in the training set and 20 in the test set.

if __name__ == "__main__":
    from sklearn.model_selection import train_test_split

    rng_sim = np.random.RandomState(42)
    X_all = rng_sim.uniform(1, 10, size=(100, 5))
    y_all = (2 * np.sin(X_all[:, 0]) + 0.5 * X_all[:, 0]
             + 3.4 + rng_sim.randn(100) * 1.0)

    X_train, X_test, y_train, y_test = train_test_split(
        X_all, y_all, test_size=0.2, random_state=42
    )
    feature_names_sim = ["rooms", "lot_size", "age", "dist_school", "neighborhood"]
    print(f"# training examples: {X_train.shape[0]}, # test examples: {X_test.shape[0]}")

3.2 Tree Bagging [1.5 pts]#

In Section 1.2, we showed that averaging $T$ independent predictions reduces variance. Let’s now implement that idea directly with a bagging ensemble.

Bagging model:

fit(): For each tree $t = 1, \ldots, T$: draw a bootstrap sample using bootstrap_sample(), then train a DecisionTreeRegressor(max_depth=max_depth, random_state=t) model on it.
predict(): average all $T$ trees’ predictions to produce the final prediction.

Read through the MHCBaggingRegressor docstring and __init__ method below. Then, complete the fit() and predict() methods.

Note

This is exactly the mechanism behind RandomForestRegressor in sklearn. The only difference is that RandomForestRegressor also adds random feature subsets on top of bootstrap sampling.

class MHCBaggingRegressor(BaseEstimator):
    """Bagging ensemble of decision trees for regression.
    
    This class implements bootstrap aggregating (bagging) by training multiple
    decision trees on bootstrap samples and averaging their predictions.
    """
    
    def __init__(self, n_trees: int = 50, max_depth: int | None = None):
        """Initialize the bagging regressor.

        Args:
            n_trees: number of trees T to train
            max_depth: max depth for each tree (None = unlimited)
        """
        
        self.n_trees = n_trees
        self.max_depth = max_depth
        self.trees = []
    
    def fit(self, X: np.ndarray, y: np.ndarray) -> Self:
        """Fit the bagging ensemble on training data.
        
        Args:
            X: training features, shape (n_train, p)
            y: training targets, shape (n_train,)
            
        Returns:
            self: the fitted model
        """
        # TODO for each tree t = 1, ..., T:
        for t in range(self.n_trees):
            # Draw a bootstrap sample using `bootstrap_sample()`
            X_boot, y_boot = None, None
            
            # Fit a decision tree on the bootstrap sample with max_depth=self.max_depth and random_state=t
            tree = None

            # Add the tree to the list of trees
            self.trees.append(tree)
        
        return self
    
    def predict(self, X: np.ndarray) -> np.ndarray:
        """Predict using the bagging ensemble.
        
        Args:
            X: features to predict on, shape (n, p)
            
        Returns:
            predictions: averaged predictions from all trees, shape (n,)
        """

        # Generates a 2D array of shape (n, T) where each column contains the predictions from each of the T trees on the test data
        all_preds = np.column_stack([tree.predict(X) for tree in self.trees])

        # TODO take the mean of the predictions over all trees
        return None

Hint

For the predict() method, we have to pay attention to which axis to reduce. The line of starter code we give creates an array called all_preds where each column contains the predictions from each of the $T$ trees on the test data. If all_preds has shape (n, T), and the return value of this function is a 1D array of shape (n,), which axis do we need to take the mean over?

if __name__ == "__main__":
    # Test cases
    bagging_regressor = MHCBaggingRegressor(n_trees=50, max_depth=None)
    bagging_regressor.fit(X_train, y_train)
    
    assert len(bagging_regressor.trees) == 50, "Expected 50 trees to be fit, got {}".format(len(bagging_regressor.trees))

    y_pred = bagging_regressor.predict(X_test)
    assert y_pred.shape == (X_test.shape[0],), "Expected predictions to have shape (n,), got {}".format(y_pred.shape)

Now let’s explore how the number of trees and tree depth affect bagging performance. Run the cell below to fit an MHCBaggingRegressor for each combination of n_trees and max_depth:

n_trees: 1, 5, 50
max_depth: 1, None

if __name__ == "__main__":
    from sklearn.metrics import mean_squared_error

    n_trees_list = [1, 5, 50]
    max_depth_list = [1, None]

    df_dict = {
        "n_trees": [],
        "max_depth": [],
        "test_mse": []
    }

    for max_depth in max_depth_list:
        for n_trees in n_trees_list:
            model = MHCBaggingRegressor(n_trees=n_trees, max_depth=max_depth)
            model.fit(X_train, y_train)
            test_mse = mean_squared_error(y_test, model.predict(X_test))
            df_dict["n_trees"].append(n_trees)
            df_dict["max_depth"].append(str(max_depth))
            df_dict["test_mse"].append(test_mse)

    # render the results as a pandas DataFrame
    display(pd.DataFrame(df_dict))

3.2.1 Look at the max_depth=None rows. What happens to test MSE as n_trees increases?

Your response: TODO

3.2.2 Now look at the max_depth=1 rows. Briefly discuss what you observe: does adding more stumps help in the same way? Why might that be?

Your response: TODO

Finally, let’s connect these results back to the bias-variance tradeoff from Section 2. Run the cell below to fit your MHCBaggingRegressor with n_trees=50 that computes the bias and variance of each tree’s prediction for a single test point using your compute_bias_and_variance() function from earlier:

if __name__ == "__main__":
    # Same test point from Section 2.3
    x_test_point = np.array([[5, 5, 5, 5, 5]])
    y_true_point = 4.0

    n_trees = 50

    print(f"n_trees={n_trees} ensemble:")
    for max_depth in [1, None]:
        model = MHCBaggingRegressor(n_trees=n_trees, max_depth=max_depth)
        model.fit(X_train, y_train)

        # Extract each tree's individual prediction for the test point
        per_tree_preds = np.array([tree.predict(x_test_point)[0] for tree in model.trees])
        bias, variance = compute_bias_and_variance(per_tree_preds, y_true_point)

        test_mse = mean_squared_error(y_test, model.predict(X_test))

        print(f"\tmax_depth={max_depth}: " f"bias={bias:.3f}, var={variance:.3f}, MSE={test_mse:.3f}")

3.2.2 Given the results from the above cell and the math we did in Section 1.2, explain in 1-2 sentences why bagging appears to help the deep tree ensemble much more than the stump ensemble in terms of overall test MSE.

Your response: TODO

3.3 Tree boosting [0.75 pts]#

In Class 13, we computed gradient boosting by hand using decision stumps. Let’s implement it here using the DecisionTreeRegressor class.

Algorithm:

Initialize train and test predictions with the mean of the training targets:

\[ \hat{y}_\text{train} = \bar{y}_\text{train} \]

\[ \hat{y}_\text{test} = \bar{y}_\text{train} \]

For $b = 1, 2, \ldots, B$ boosting rounds:

Compute the training residuals from the current ensemble: $$r = y_\text{train} - \hat{y}_\text{train}$$
Fit a decision stump $f_b$ to predict the residuals on the training data $(x_i, r_i)$.
Update the ensemble’s training and testing predictions with learning rate $\eta$:

\[\hat{y}_\text{train} = \hat{y}_\text{train} + \eta * f_b(X_\text{train})\]

\[\hat{y}_\text{test} = \hat{y}_\text{test} + \eta * f_b(X_\text{test})\]

Additionally, compute the training and testing MSE at each round to track boosting performance.

Note

Unlike bagging, gradient boosting does not use bootstrap sampling. Each stump is trained on the full training set, but targeting the current residuals.

Complete the function below that translates the above algorithm into code:

from sklearn.metrics import mean_squared_error

def gradient_boosting_regressor(
    X_train: np.ndarray, 
    y_train: np.ndarray, 
    X_test: np.ndarray, 
    y_test: np.ndarray, 
    n_rounds: int = 50, 
    learning_rate: float = 0.5,
) -> tuple[list[float], list[float]]:
    """Implement gradient boosting for regression using decision stumps (depth-1 trees).

    Args:
        X_train: training features, shape (n_train, p)
        y_train: training targets, shape (n_train,)
        X_test: test features, shape (n_test, p)
        y_test: test targets, shape (n_test,)
        n_rounds: number of boosting rounds B
        learning_rate: step size eta

    Returns:
        train_mse_list: list of train MSE at each round
        test_mse_list: list of test MSE at each round
    """

    # Make the initial predictions of mean of training targets, with shape (n_train,) and (n_test,) respectively
    train_preds = np.array([np.mean(y_train)] * len(y_train))
    test_preds = np.array([np.mean(y_train)] * len(y_test))


    # Track the train and test MSE at each round
    train_mse_list = []
    test_mse_list = []

    # TODO complete the boosting loop according to the algorithm above
    for b in range(n_rounds):
        # Update the training residuals from the current train predictions
        residuals_train = None

        # Fit a decision stump to predict residuals_train
        stump = None

        # Update the ensemble's training and testing predictions with learning rate
        # NOTE: make sure to call predict() on X_train for the train preds, and X_test for the test preds
        train_preds = None
        test_preds = None

        # Compute the training and testing MSE at each round to track boosting performance
        train_mse_list.append(mean_squared_error(y_train, train_preds))
        test_mse_list.append(mean_squared_error(y_test, test_preds))
        
    return train_mse_list, test_mse_list

if __name__ == "__main__":
    # Small example that can be computed by hand
    X_train_small = np.array([[1], 
                              [3]])
    y_train_small = np.array([2.0, 6.0])
    X_test_small = np.array([[2.5]])
    y_test_small = np.array([6.0])

    train_mse_list, test_mse_list = gradient_boosting_regressor(
        X_train_small, y_train_small, X_test_small, y_test_small, n_rounds=2, learning_rate=0.5
    )

    # Round 1 predictions:
    # Initial predictions are 4.0 (mean of y_train_small: (2+6)/2 = 4)
    # Residuals: [2-4, 6-4] = [-2, 2]
    # Decision stump with max_depth=1 will split at X=2.0
    # Left (X<=2.0): mean residual = -2
    # Right (X>2.0): mean residual = 2
        
    assert len(train_mse_list) == 2, "Should have 2 train MSE values"
    assert len(test_mse_list) == 2, "Should have 2 test MSE values"
    assert np.isclose(train_mse_list[0], 1.0, atol=0.01), f"Train MSE round 1 should be 1, got {train_mse_list[0]}"
    assert np.isclose(test_mse_list[0], 1.0, atol=0.01), f"Test MSE round 1 should be 1, got {test_mse_list[0]}"

3.4 Boosting MSE Widget [0.75 pts]#

Let’s now visualize the train and test MSE over boosting rounds. Complete the explore_boosting_mse() function below by plotting the train and test MSE using ax.plot(boosting_rounds, mse_list). Include:

both training and test MSE curves with labels
a legend
appropriate axis labels.

Then also add an interact_manual() decorator for the learning rate parameter:

learning_rate=(0.1, 1.0, 0.1)

You can refer back to the matplotlib quickstart guide as well as how we used the interact_manual() decorator in Worksheet 3 as references.

if __name__ == "__main__":
    # TODO add interact_manual() decorator for the learning rate parameter
    @interact_manual(learning_rate=(0.1, 1.0, 0.1))
    def explore_boosting_mse(learning_rate):
        n_rounds = 40
        train_mse_list, test_mse_list = gradient_boosting_regressor(X_train, y_train, X_test, y_test, n_rounds=n_rounds, learning_rate=learning_rate)

        boosting_rounds = range(1, n_rounds + 1)

        fig, ax = plt.subplots()

        # TODO: plot train and test MSE over boosting rounds

        #plt.show()

3.4.1: Set the learning rate to 0.7. At approximately which round does test MSE stop improving? Describe any visual evidence of “overfitting” you see in the plot.

Your response: TODO

3.4.2: Use your widget to try out different learning rates. Applying some of our principles from the choices of learning rate for gradient descent we saw back on HW 1, what learning rate would you prefer for this dataset? It is okay to speculate here as long as you justify your reasoning.

Your response: TODO

Click here to verify your plot

There may be some variation in the exact numbers due to randomness, but your plot for learning rate 0.7 should look something like this:

How to submit

Follow the instructions on the course website to submit your work. For part 1, you will submit hw3_foundations.ipynb and hw3_foundations.py.

HW 3 part 1

Contents

HW 3 part 1#

Part 1 Table of Contents and Rubric#

Notebook and function imports#

1. Variance and ensembles [1.25 pts]#

1.1 Variance definition [0.25 pts]#

1.2 Why bagging reduces variance [1 pt]#

2. Decision trees and bias-variance [1.25 pts]#

2.1 Visualizing decision trees [0.25 pts]#

Visualizing decision trees with plot_tree#

2.2 Bootstrap Sampling [0.25 pts]#

2.3 Bias and Variance [0.75 pts]#

3. Implementing Tree Ensembles [3 pts]#

3.1 Data Setup#

3.2 Tree Bagging [1.5 pts]#

3.3 Tree boosting [0.75 pts]#

3.4 Boosting MSE Widget [0.75 pts]#

Visualizing decision trees with `plot_tree`#