HW 2 part 1#

Classification: Foundations

—TODO your name here

Collaboration Statement

  • TODO brief statement on the nature of your collaboration.

  • TODO your collaborators' names here

Part 1 Table of Contents and Rubric#

| Section | Points |
| --- | --- |
| Deriving Gradients for Logistic Regression | 1.5 |
| Logistic Regression Gradient Descent Update | 1.5 |
| Logistic Regression Model Class | 2 |
| Total | 5 pts |

Notebook and function imports#

import numpy as np
# random number generator
rng = np.random.RandomState(42)

1. Deriving Gradients for Logistic Regression [1.5 pts]#

In this section, we’ll derive the gradient update rules for logistic regression.

From class, we saw that logistic regression uses a two-step model to produce predicted probabilities:

Step 1: Compute the linear combination \(h_i\):

\[ h_i = w_0 + w_1 x_{i,1} + w_2 x_{i,2} + \cdots + w_p x_{i,p} \]

Step 2: Pass \(h_i\) through the sigmoid function \(\sigma\) to produce a predicted probability:

\[ \sigma(h_i) = \frac{1}{1 + e^{-h_i}} \]

We pair this model with the log loss (also called binary cross-entropy) for a single data point \((\vec{x}_i, y_i)\):

\[ \ell_{\log, i} = -y_i \log(\sigma(h_i)) - (1 - y_i) \log(1 - \sigma(h_i)) \]
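To build some intuition for this loss (this is just an illustration, not part of the graded work), here is a quick numerical check showing that confident wrong predictions are penalized far more heavily than confident correct ones:

```python
import numpy as np

def log_loss_single(y, p):
    """Log loss for one example with label y and predicted probability p."""
    return -y * np.log(p) - (1 - y) * np.log(1 - p)

# True label is 1: a confident correct prediction has low loss...
print(log_loss_single(1, 0.9))   # about 0.105
# ...while a confident wrong prediction has high loss.
print(log_loss_single(1, 0.1))   # about 2.303
```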

Sigmoid derivative identity

To compute the gradients of the log loss, we’ll need the following identity for the derivative of the sigmoid function (if you enjoy algebra, feel free to try deriving it yourself!):

\[ \frac{d}{dh}\sigma(h) = \sigma(h)(1 - \sigma(h)) \]

We’ll use this identity throughout the derivations below.
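If you'd like to sanity-check the identity numerically before using it (optional, and not a substitute for the derivation), a central finite-difference approximation of the derivative should closely match \(\sigma(h)(1 - \sigma(h))\):

```python
import numpy as np

def sigma(h):
    """The sigmoid function."""
    return 1 / (1 + np.exp(-h))

h = 0.7      # an arbitrary test point
eps = 1e-6   # small step for the finite difference

# Numerical derivative via central differences
numeric = (sigma(h + eps) - sigma(h - eps)) / (2 * eps)
# What the identity predicts
identity = sigma(h) * (1 - sigma(h))

print(abs(numeric - identity) < 1e-8)  # True
```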

1.1 Some helpful derivatives [0.5 pts]#

We first compute a couple of derivatives that will be helpful for the log loss derivation. Show that:

\[ \frac{d}{dh}\log(\sigma(h)) = 1 - \sigma(h) \]

Calculus hints and reminders

The derivative of \(\log(x)\) is \(\frac{1}{x}\):

\[ \frac{d}{dx}\log(x) = \frac{1}{x} \]

You’ll then need to use the chain rule and substitute the sigmoid derivative identity from above. You don’t need to take the derivative of \(\sigma(h)\) from scratch!

Your response:

\[\begin{split} \frac{d}{dh}\log(\sigma(h)) &= \text{ TODO step 1 } \\ &= \text{ TODO step 2 } \\ &= 1 - \sigma(h) \end{split}\]

Next, also show that:

\[ \frac{d}{dh}\log(1 - \sigma(h)) = -\sigma(h) \]

Your response:

\[\begin{split} \frac{d}{dh}\log(1 - \sigma(h)) &= \text{ TODO step 1 } \\ &= \text{ TODO step 2 } \\ &= -\sigma(h) \end{split}\]

1.2 Gradient of log loss with respect to \(w_0\) [0.5 pts]#

Using your results from 1.1, derive the partial derivative of the single-point log loss with respect to \(w_0\):

\[\ell_{\log, i} = -y_i \log(\sigma(h_i)) - (1 - y_i) \log(1 - \sigma(h_i))\]

Your response (you can just write the final expression):

\[ \frac{\partial \ell_{\log, i}}{\partial w_0} = \text{ TODO } \]

Hint

Use the chain rule in two steps: first differentiate \(\ell_{\text{log}, i}\) with respect to \(h_i\) (using your results from 1.1), then differentiate \(h_i\) with respect to \(w_0\).

The final result should be quite simple after some cancellation: just two terms.

If you get stuck at any point of this derivation, please don’t hesitate to reach out on Ed or come to TA/office hours!

1.3 Gradient of log loss with respect to \(w_j\) [0.5 pts]#

Now derive the partial derivative of the single-point log loss \(\ell_{\text{log}, i}\) with respect to \(w_j\) (for \(j \in \{1, 2, \ldots, p\}\)).

Your response:

\[ \frac{\partial \ell_{\text{log}, i}}{\partial w_j} = \text{ TODO } \]

Hint

This should be very similar to 1.2, with the only difference being what \(\frac{\partial h_i}{\partial w_j}\) is as opposed to \(\frac{\partial h_i}{\partial w_0}\). The final result should have three terms.

2. Implementing Gradient Descent for Logistic Regression [1.5 pts]#

Now, let’s translate the math into code to build our own logistic regression model.

2.1 Sigmoid function [0.25 pts]#

We begin by implementing the sigmoid function that we saw in class:

\[ \sigma(h) = \frac{1}{1 + e^{-h}} \]

In NumPy, \(e^{h}\) is computed as np.exp(h).

Python Type Hint Unions

You might notice that the sigmoid() function has a type hint that looks like:

def sigmoid(h: np.ndarray | float)

The np.ndarray | float indicates that the function can take in either a numpy array or a scalar. This is an example of a type hint union, which indicates that multiple types are acceptable for the function parameter.

To be clear, you don’t need to do anything special in your implementation to handle this. A single line of code with np.exp will work for both scalar and array inputs!
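As a quick illustration of why no special handling is needed, np.exp (like most NumPy ufuncs) accepts either a scalar or an array and applies element-wise:

```python
import numpy as np

# Scalar input -> scalar output
print(np.exp(0.0))                    # 1.0

# Array input -> element-wise output with the same shape
arr = np.exp(np.array([0.0, 1.0]))
print(arr)                            # [1.0, e] ~ [1.0, 2.71828183]
```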

def sigmoid(h: np.ndarray | float) -> np.ndarray | float:
    """Compute the sigmoid function.

    Args:
        h: a scalar or numpy array of any shape

    Returns:
        The sigmoid applied element-wise, same shape as input
    """
    # TODO your code here
    return None
if __name__ == "__main__":
    # Test sigmoid with scalar input
    assert np.isclose(sigmoid(0), 0.5), "sigmoid(0) should be 0.5"

    # Test sigmoid with array input
    result = sigmoid(np.array([-1, 0, 1]))
    expected = np.array([1 / (1 + np.exp(1)), 0.5, 1 / (1 + np.exp(-1))])
    assert np.allclose(result, expected), f"sigmoid([-1, 0, 1]) should be {expected}"

    # Test that sigmoid outputs are between 0 and 1
    assert np.all(result >= 0) and np.all(result <= 1), "sigmoid outputs should be between 0 and 1"

2.2 Logistic regression predictions [0.5 pts]#

Next, implement the logreg_predictions function that computes predicted probabilities for all examples in X at once.

Hint

This should look very similar to the linreg_predictions function from the last exercise in Worksheet 2. The only difference is that we now pass the linear regression result into the sigmoid function.

def logreg_predictions(X: np.ndarray, w: np.ndarray, w0: float) -> np.ndarray:
    """Compute logistic regression predicted probabilities for all examples in X.

    Args:
        X: data examples of shape (n, p)
        w: weights of shape (p,)
        w0: scalar intercept term

    Returns:
        np.ndarray of shape (n,) of predicted probabilities
    """

    # TODO your code here
    return None
if __name__ == "__main__":
    # Test logreg_predictions with simple data
    X = np.array([[1, 2], [3, 4]])
    w = np.array([0.5, -0.5])
    w0 = 0.0

    preds = logreg_predictions(X, w, w0)
    assert preds.shape == (2,), "Predictions should have shape (n,)"

    # h = [1*0.5 + 2*(-0.5), 3*0.5 + 4*(-0.5)] = [-0.5, -0.5]
    # sigmoid(-0.5) = 1 / (1 + exp(0.5))
    expected = sigmoid(np.array([-0.5, -0.5]))
    assert np.allclose(preds, expected), f"Predictions should be {expected}"

    # Test with zero weights: predictions should all be sigmoid(w0)
    preds_zero = logreg_predictions(X, np.array([0.0, 0.0]), 1.0)
    assert np.allclose(preds_zero, sigmoid(1.0)), "With zero weights, predictions should be sigmoid(w0)"

2.3 Batch gradient update rule with regularization [0.75 pts]#

Now let’s implement a single batch gradient descent step for logistic regression with ridge (L2) regularization. The regularized log loss with ridge over the full dataset is:

\[ \mathcal{L}_{\text{ridge log}} = \frac{1}{n}\sum_{i=1}^{n} \left(-y_i \log(\sigma(h_i)) - (1 - y_i) \log(1 - \sigma(h_i))\right) + \lambda \sum_{j=1}^{p} w_j^2 \]

This is just like how we added the L2 regularization term to the linear regression MSE loss function.

Taking the gradients using your results from Question 1, averaged over the dataset, plus the regularization term, we get:

\[ \frac{\partial \mathcal{L}_{\text{ridge log}}}{\partial w_0} = \frac{1}{n}\sum_{i=1}^{n}\underbrace{\frac{\partial \ell_{\log, i}}{\partial w_0}}_{\text{answer from 1.2}} \]
\[ \frac{\partial \mathcal{L}_{\text{ridge log}}}{\partial w_j} = \left( \frac{1}{n}\sum_{i=1}^{n}\underbrace{\frac{\partial \ell_{\log, i}}{\partial w_j}}_{\text{answer from 1.3}} \right) + 2\lambda w_j \]

Note that the regularization term \(2\lambda w_j\) only applies to the feature weights \(w_j\), not to the intercept \(w_0\).

We’ll now write a function to perform a single gradient descent update for the logistic regression weights, with regularization. Specifically, the outputs will be:

\[ w_{0, \text{new}} = w_{0, \text{old}} - \alpha \frac{\partial \mathcal{L}_{\text{ridge log}}}{\partial w_0} \]

For each \(j = 1, \ldots, p\), we have:

\[ w_{j, \text{new}} = w_{j, \text{old}} - \alpha \frac{\partial \mathcal{L}_{\text{ridge log}}}{\partial w_j} \]

If we define the gradient vector as: \(\nabla \mathcal{L}_{\text{ridge log}} = (\frac{\partial \mathcal{L}_{\text{ridge log}}}{\partial w_0}, \frac{\partial \mathcal{L}_{\text{ridge log}}}{\partial w_1}, \ldots, \frac{\partial \mathcal{L}_{\text{ridge log}}}{\partial w_p})\), then our code can perform the gradient descent simultaneously for all the weights:

\[\vec{w}_{\text{new}} = \vec{w}_{\text{old}} - \alpha \nabla \mathcal{L}_{\text{ridge log}}\]

where \(\vec{w}_{\text{new}}\) and \(\vec{w}_{\text{old}}\) are vectors of shape \((p+1,)\) holding the new and old parameters (the intercept plus the \(p\) feature weights), respectively. In the code below, we keep the intercept \(w_0\) and the weight vector \(\vec{w}\) as separate variables, but the idea is the same.
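To see the simultaneous update in isolation (using a made-up gradient vector, not the actual log-loss gradient), subtracting \(\alpha\) times the gradient updates every entry at once with no loop:

```python
import numpy as np

alpha = 0.1
w_old = np.array([0.5, -1.0, 2.0])   # hypothetical current parameters
grad = np.array([0.2, -0.4, 1.0])    # hypothetical gradient vector

# One gradient descent step for all entries simultaneously
w_new = w_old - alpha * grad
print(w_new)  # [0.48, -0.96, 1.9]
```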

def logreg_grad_update(w0_old: float, w_old: np.ndarray, X: np.ndarray, y: np.ndarray, alpha: float, lam: float) -> tuple:
    """Perform a single gradient update for logistic regression with ridge regularization.

    Args:
        w0_old: the old intercept term
        w_old: the old weights of shape (p,)
        X: the feature matrix of shape (n, p)
        y: the target vector of shape (n,)
        alpha: the learning rate
        lam: the regularization parameter lambda

    Returns:
        (w0_new, w_new): the new intercept term and weights of shape (p,)
    """

    # Compute the predicted probabilities and their difference from the targets; y_hat is the same as sigma(h)!
    y_hat = logreg_predictions(X, w_old, w0_old)
    diff = (y_hat - y).reshape(-1, 1)

    w0_new = None
    w_new = None

    assert isinstance(w0_new, (float, np.floating)), "The new intercept term should be a scalar"
    assert w_new.shape == w_old.shape, "The new weights should have the same shape as the old weights"

    return w0_new, w_new

Hints

Use the diff variable to compute both updates. Both w0_new and w_new can be computed in a single line of code each, with no loops needed.

For w_new: you’ll need to broadcast diff against X to compute the gradient for all weights simultaneously. Since X is an (n, p) shape array, diff needs to be reshaped to a 2D array with shape (n, 1) via reshape(-1, 1) in order for the broadcasting to work:

diff.reshape(-1, 1) (2d array): n x 1
X                   (2d array): n x p
result              (2d array): n x p

Don’t forget to add the regularization term to w_new as well!
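To see this broadcasting pattern on its own (with generic arrays, not the actual gradient computation), reshaping a (n,) vector to (n, 1) lets it multiply each column of an (n, p) matrix, after which a mean over axis 0 collapses the rows:

```python
import numpy as np

v = np.array([1.0, 2.0, 3.0])      # shape (n,) = (3,)
M = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])        # shape (n, p) = (3, 2)

# Reshape v to (n, 1) so it broadcasts across M's columns
prod = v.reshape(-1, 1) * M        # shape (3, 2)

# Averaging over axis 0 gives one value per column, shape (p,)
print(prod.mean(axis=0))           # approximately [4.667, 46.667]
```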

if __name__ == "__main__":
    # Test with simple data
    X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    y = np.array([0.0, 1.0, 1.0])
    w_old = np.array([0.0, 0.0])
    w0_old = 0.0
    alpha = 0.1
    lam = 0.0  # no regularization for this test

    w0_new, w_new = logreg_grad_update(w0_old, w_old, X, y, alpha, lam)

    # With zero weights and w0=0, all predictions are sigmoid(0) = 0.5
    # diff = [0.5-0, 0.5-1, 0.5-1] = [0.5, -0.5, -0.5]
    # mean(diff) = -1/6
    # w0_new = 0 - 0.1 * (-1/6) = 1/60
    assert np.isclose(w0_new, 1/60), f"w0_new should be {1/60}, got {w0_new}"

    # diff.reshape(-1, 1) * X = [[0.5, 1.0], [-1.5, -2.0], [-2.5, -3.0]]
    # mean over rows = [-7/6, -4/3]
    # w_new = [0, 0] - 0.1 * [-7/6, -4/3] = [7/60, 2/15]
    assert np.allclose(w_new, np.array([7/60, 2/15])), f"w_new should be [7/60, 2/15], got {w_new}"

    # Test with regularization
    w_old_reg = np.array([10, 10])
    w0_new_reg, w_new_reg = logreg_grad_update(0.0, w_old_reg, X, y, alpha, lam=0.1)
    assert w_new_reg.shape == (2,), "w_new with regularization should have shape (p,)"

    assert np.allclose(w_new_reg, np.array([293/30, 146/15])), f"w_new with regularization should be [293/30, 146/15], got {w_new_reg}"

3. Logistic Regression Model Class [2 pts]#

Below is a gradient descent implementation that uses your logreg_grad_update function from the previous question. It follows the same pattern as the one you wrote in HW 1.

def logreg_grad_descent(X: np.ndarray, y: np.ndarray, alpha: float, lam: float, max_iters: int = 5000) -> tuple:
    """Perform gradient descent for logistic regression with ridge regularization.

    Args:
        X: the feature matrix of shape (n, p)
        y: the target vector of shape (n,)
        alpha: the learning rate
        lam: the regularization parameter
        max_iters: the maximum number of iterations

    Returns:
        (w0, w): the final intercept term and weights of shape (p,)
    """
    p = X.shape[1]
    w0_old = 0.0
    w_old = rng.randn(p)

    # Initialize in case we hit max_iters without converging
    w0_new, w_new = w0_old, w_old

    num_iters = 0

    while True:
        w0_new, w_new = logreg_grad_update(w0_old, w_old, X, y, alpha, lam)

        # Check for convergence using the full parameter vector (including intercept)
        step_norm = np.sqrt((w0_new - w0_old) ** 2 + np.sum((w_new - w_old) ** 2))
        if step_norm < 1e-4:
            break

        if num_iters >= max_iters:
            print(f"Max iterations reached: {max_iters}")
            break

        w0_old, w_old = w0_new, w_new
        
        num_iters += 1

    return w0_new, w_new

Now, let’s put everything together into a MHCLogisticRegressor class. Before we define the class, let’s first take a look at a useful numpy function: np.argmax().

np.argmax() returns the index of the maximum value in an array. When used with axis=1 on a 2D array, it returns the index of the maximum value in each row. For example:

if __name__ == "__main__":
    proba = np.array([[0.8, 0.2],  # class 0 probability is 0.8
                      [0.3, 0.7],  # class 1 probability is 0.7
                      [0.6, 0.4]]) # class 0 probability is 0.6

    # prints the index of the maximum value in each row
    print(np.argmax(proba, axis=1)) # prints: [0 1 0]
[0 1 0]

This is useful for converting predicted probabilities into class predictions: if the class 0 probability (column 0) is higher, argmax returns 0; if the class 1 probability (column 1) is higher, it returns 1.

ML model class documentation

Like HW 1, please also ensure that you document your class and its methods.

from sklearn.base import BaseEstimator
from typing import Self

class MHCLogisticRegressor(BaseEstimator):
    """TODO class description"""

    def __init__(self, alpha: float, lam: float, max_iters: int = 5000):
        """TODO constructor description"""

        # TODO initialize the self parameters

    def fit(self, X: np.ndarray, y: np.ndarray) -> Self:
        """TODO method description"""

        # TODO: call logreg_grad_descent to find the optimal weights
        self.w0_, self.w_ = None, None

        return self

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        """TODO method description"""

        # TODO A (n,) shape array of class 1 predicted probabilities
        class1_proba = None

        # TODO A (n,) shape array of class 0 predicted probabilities
        class0_proba = None

        assert class0_proba.shape == class1_proba.shape, "The class probabilities should have the same shape"

        # Stacks the class 0 and class 1 probabilities into a single array of shape (n, 2)
        return np.column_stack([class0_proba, class1_proba])

    def predict(self, X: np.ndarray) -> np.ndarray:
        """TODO method description"""

        # A (n,) shape array of predictions: use argmax with an axis reduction on the predicted probabilities array
        predictions = None

        assert predictions.shape == (X.shape[0],), "The predictions should have shape (n,)"

        return predictions
if __name__ == "__main__":
    # Some linearly separable data
    X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0],
                  [6.0, 7.0], [7.0, 8.0], [8.0, 9.0]])
    y = np.array([0, 0, 0, 1, 1, 1])

    model = MHCLogisticRegressor(alpha=0.05, lam=0.0, max_iters=10000)
    model = model.fit(X, y)

    # Check predict_proba shape
    proba = model.predict_proba(X)
    assert proba.shape == (6, 2), f"predict_proba should return shape (n, 2), got {proba.shape}"

    # Check that probabilities sum to 1
    assert np.allclose(proba.sum(axis=1), 1.0), "Probabilities should sum to 1 for each example"

    # Check predictions
    preds = model.predict(X)
    assert preds.shape == (6,), f"predict should return shape (n,), got {preds.shape}"
    assert np.array_equal(preds, y), f"Model should correctly classify linearly separable data, got {preds}"

How to submit

Follow the instructions on the course website to submit your work. For part 1, you will submit hw2_foundations.ipynb and hw2_foundations.py.