HW 2 part 1
Classification: Foundations
—TODO your name here
Collaboration Statement
TODO brief statement on the nature of your collaboration.
TODO your collaborators' names here
Part 1 Table of Contents and Rubric

| Section | Points |
|---|---|
| Deriving Gradients for Logistic Regression | 1.5 |
| Logistic Regression Gradient Descent Update | 1.5 |
| Logistic Regression Model Class | 2 |
| Total | 5 pts |
Notebook and function imports
import numpy as np
# random number generator
rng = np.random.RandomState(42)
1. Deriving Gradients for Logistic Regression [1.5 pts]
In this section, we’ll derive the gradient update rules for logistic regression.
From class, we saw that logistic regression uses a two-step model to produce predicted probabilities:
Step 1: Compute the linear combination \(h_i\):

\[ h_i = w_0 + \sum_{j=1}^{p} w_j x_{ij} \]

Step 2: Pass \(h_i\) through the sigmoid function \(\sigma\) to produce a predicted probability:

\[ \hat{p}_i = \sigma(h_i) = \frac{1}{1 + e^{-h_i}} \]

We pair this model with the log loss (also called binary cross-entropy) for a single data point \((\vec{x}_i, y_i)\):

\[ \ell_{\text{log}, i} = -\left[ y_i \log \sigma(h_i) + (1 - y_i) \log\left(1 - \sigma(h_i)\right) \right] \]
Sigmoid derivative identity
To compute the gradients of the log loss, we'll need the following identity for the derivative of the sigmoid function (if you enjoy algebra, feel free to try deriving it yourself!):

\[ \frac{d\sigma(h)}{dh} = \sigma(h)\left(1 - \sigma(h)\right) \]
We’ll use this identity throughout the derivations below.
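For the curious, here is an optional sketch of where the identity comes from (you won't need to reproduce this anywhere in the assignment):

\[
\frac{d\sigma(h)}{dh}
= \frac{d}{dh} \left( 1 + e^{-h} \right)^{-1}
= \frac{e^{-h}}{\left(1 + e^{-h}\right)^2}
= \frac{1}{1 + e^{-h}} \cdot \frac{e^{-h}}{1 + e^{-h}}
= \sigma(h)\left(1 - \sigma(h)\right)
\]

where the last step uses \(\frac{e^{-h}}{1 + e^{-h}} = 1 - \frac{1}{1 + e^{-h}} = 1 - \sigma(h)\).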
1.1 Some helpful derivatives [0.5 pts]

We first compute a couple of derivatives that will be helpful for the log loss derivation. Show that:

\[ \frac{\partial}{\partial h_i} \log \sigma(h_i) = 1 - \sigma(h_i) \]
Calculus hints and reminders
The derivative of \(\log(x)\) is \(\frac{1}{x}\).
You’ll then need to use the chain rule and substitute the sigmoid derivative identity from above. You don’t need to take the derivative of \(\sigma(h)\) from scratch!
Your response:
Next, also show that:

\[ \frac{\partial}{\partial h_i} \log\left(1 - \sigma(h_i)\right) = -\sigma(h_i) \]
Your response:
1.2 Gradient of log loss with respect to \(w_0\) [0.5 pts]

Using your results from 1.1, derive the partial derivative of the single-point log loss with respect to \(w_0\):

\[ \frac{\partial \ell_{\text{log}, i}}{\partial w_0} \]
Your response (you can just write the final expression):
Hint
Use the chain rule in two steps: first differentiate \(\ell_{\text{log}, i}\) with respect to \(h_i\) (using your results from 1.1), then differentiate \(h_i\) with respect to \(w_0\).
The final result should be quite simple after some cancellation: just two terms.
If you get stuck at any point of this derivation, please don’t hesitate to reach out on Ed or come to TA/office hours!
1.3 Gradient of log loss with respect to \(w_j\) [0.5 pts]
Now derive the partial derivative of the single-point log loss \(\ell_{\text{log}, i}\) with respect to \(w_j\) (for \(j \in \{1, 2, \ldots, p\}\)).
Your response:
Hint
This should be very similar to 1.2, with the only difference being what \(\frac{\partial h_i}{\partial w_j}\) is as opposed to \(\frac{\partial h_i}{\partial w_0}\). The final result should have three terms.
2. Implementing Gradient Descent for Logistic Regression [1.5 pts]
Now, let’s translate the math into code to build our own logistic regression model.
2.1 Sigmoid function [0.25 pts]

We begin by implementing the sigmoid function that we saw in class:

\[ \sigma(h) = \frac{1}{1 + e^{-h}} \]
In NumPy, \(e^{h}\) is computed as np.exp(h).
Python Type Hint Unions
You might notice that the sigmoid() function has a type hint that looks like:
def sigmoid(h: np.ndarray | float)
The np.ndarray | float indicates that the function can take in either a numpy array or a scalar. This is an example of a type hint union, which indicates that multiple types are acceptable for the function parameter.
To be clear, you don’t need to do anything special in your implementation to handle this. A single line of code with np.exp will work for both scalar and array inputs!
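As a quick standalone sanity check of that claim (not part of the graded code):

```python
import numpy as np

# np.exp is a NumPy "ufunc": it accepts either a Python scalar or an
# array and applies e**h element-wise, which is why a single expression
# can satisfy the np.ndarray | float type-hint union.
scalar_out = np.exp(0)                     # scalar input -> 1.0
array_out = np.exp(np.array([0.0, 1.0]))   # array input, element-wise

print(scalar_out)   # 1.0
print(array_out)    # array containing 1 and e
```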
def sigmoid(h: np.ndarray | float) -> np.ndarray | float:
"""Compute the sigmoid function.
Args:
h: a scalar or numpy array of any shape
Returns:
The sigmoid applied element-wise, same shape as input
"""
# TODO your code here
return None
if __name__ == "__main__":
# Test sigmoid with scalar input
assert np.isclose(sigmoid(0), 0.5), "sigmoid(0) should be 0.5"
# Test sigmoid with array input
result = sigmoid(np.array([-1, 0, 1]))
expected = np.array([1 / (1 + np.exp(1)), 0.5, 1 / (1 + np.exp(-1))])
assert np.allclose(result, expected), f"sigmoid([-1, 0, 1]) should be {expected}"
# Test that sigmoid outputs are between 0 and 1
assert np.all(result >= 0) and np.all(result <= 1), "sigmoid outputs should be between 0 and 1"
2.2 Logistic regression predictions [0.5 pts]
Next, implement the logreg_predictions function that computes predicted probabilities for all examples in X at once.
Hint
This should look very similar to the linreg_predictions function from the last exercise in Worksheet 2. The only difference is that we now pass the linear regression result into the sigmoid function.
def logreg_predictions(X: np.ndarray, w: np.ndarray, w0: float) -> np.ndarray:
"""Compute logistic regression predicted probabilities for all examples in X.
Args:
X: data examples of shape (n, p)
w: weights of shape (p,)
w0: scalar intercept term
Returns:
np.ndarray of shape (n,) of predicted probabilities
"""
# TODO your code here
return None
if __name__ == "__main__":
# Test logreg_predictions with simple data
X = np.array([[1, 2], [3, 4]])
w = np.array([0.5, -0.5])
w0 = 0.0
preds = logreg_predictions(X, w, w0)
assert preds.shape == (2,), "Predictions should have shape (n,)"
# h = [1*0.5 + 2*(-0.5), 3*0.5 + 4*(-0.5)] = [-0.5, -0.5]
# sigmoid(-0.5) = 1 / (1 + exp(0.5))
expected = sigmoid(np.array([-0.5, -0.5]))
assert np.allclose(preds, expected), f"Predictions should be {expected}"
# Test with zero weights: predictions should all be sigmoid(w0)
preds_zero = logreg_predictions(X, np.array([0.0, 0.0]), 1.0)
assert np.allclose(preds_zero, sigmoid(1.0)), "With zero weights, predictions should be sigmoid(w0)"
2.3 Batch gradient update rule with regularization [0.75 pts]

Now let's implement a single batch gradient descent step for logistic regression with ridge (L2) regularization. The regularized log loss with ridge over the full dataset is:

\[ \mathcal{L}_{\text{ridge log}} = \frac{1}{n} \sum_{i=1}^{n} \ell_{\text{log}, i} + \lambda \sum_{j=1}^{p} w_j^2 \]

This is just like how we added the L2 regularization term to the linear regression MSE loss function.

Taking the gradients using your results from Question 1, averaged over the dataset, plus the regularization term, we get:

\[ \frac{\partial \mathcal{L}_{\text{ridge log}}}{\partial w_0} = \frac{1}{n} \sum_{i=1}^{n} \left( \sigma(h_i) - y_i \right) \]

\[ \frac{\partial \mathcal{L}_{\text{ridge log}}}{\partial w_j} = \frac{1}{n} \sum_{i=1}^{n} \left( \sigma(h_i) - y_i \right) x_{ij} + 2 \lambda w_j \]

Note that the regularization term \(2\lambda w_j\) applies only to the feature weights \(w_j\), not to the intercept \(w_0\).

We'll now write a function to perform a single gradient descent update for the logistic regression weights, with regularization. Specifically, the outputs will be:

\[ w_{0, \text{new}} = w_{0, \text{old}} - \alpha \frac{\partial \mathcal{L}_{\text{ridge log}}}{\partial w_0} \]

For each \(j = 1, \ldots, p\), we have:

\[ w_{j, \text{new}} = w_{j, \text{old}} - \alpha \frac{\partial \mathcal{L}_{\text{ridge log}}}{\partial w_j} \]

If we define the gradient vector over the feature weights as \(\nabla \mathcal{L}_{\text{ridge log}} = \left( \frac{\partial \mathcal{L}_{\text{ridge log}}}{\partial w_1}, \ldots, \frac{\partial \mathcal{L}_{\text{ridge log}}}{\partial w_p} \right)\), then our code can perform the gradient descent step simultaneously for all the weights:

\[ \vec{w}_{\text{new}} = \vec{w}_{\text{old}} - \alpha \nabla \mathcal{L}_{\text{ridge log}} \]

where \(\vec{w}_{\text{new}}\) and \(\vec{w}_{\text{old}}\) are vectors of shape \((p,)\) holding the new and old weights, respectively.
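To see the simultaneous vector update in isolation, here is a toy sketch on a simple quadratic loss (deliberately not the logistic loss, so it doesn't give away the exercise below):

```python
import numpy as np

# Toy illustration of the vectorized step w_new = w_old - alpha * grad.
# The loss here is L(w) = ||w||^2, whose gradient is 2*w -- just a
# stand-in to show that every weight is updated at once, with no loop.
alpha = 0.1
w_old = np.array([1.0, -2.0, 3.0])
grad = 2 * w_old                  # gradient of ||w||^2 at w_old
w_new = w_old - alpha * grad      # one simultaneous descent step

print(w_new)                      # [ 0.8 -1.6  2.4]
```

Each weight shrinks toward the minimizer at the origin; the logistic update you implement next has exactly this shape, just with a different gradient.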
def logreg_grad_update(w0_old: float, w_old: np.ndarray, X: np.ndarray, y: np.ndarray, alpha: float, lam: float) -> tuple:
"""Perform a single gradient update for logistic regression with ridge regularization.
Args:
w0_old: the old intercept term
w_old: the old weights of shape (p,)
X: the feature matrix of shape (n, p)
y: the target vector of shape (n,)
alpha: the learning rate
lam: the regularization parameter lambda
Returns:
(w0_new, w_new): the new intercept term and weights of shape (p,)
"""
# Compute the predicted probabilities and the difference from the targets, y_hat is the same as sigma(h)!
y_hat = logreg_predictions(X, w_old, w0_old)
diff = (y_hat - y).reshape(-1, 1)
w0_new = None
w_new = None
assert isinstance(w0_new, (float, np.floating)), "The new intercept term should be a scalar"
assert w_new.shape == w_old.shape, "The new weights should have the same shape as the old weights"
return w0_new, w_new
Hints

Use the diff variable to compute both updates. Both w0_new and w_new can be computed in a single line of code each, with no loops needed.

For w_new: you'll need to broadcast diff against X to compute the gradient for all weights simultaneously. Since X is an (n, p) array, diff must be a 2D array of shape (n, 1) for the broadcasting to work; the starter code already does this via reshape(-1, 1):

diff (2d array): n x 1
X (2d array): n x p
result (2d array): n x p

Don't forget to add the regularization term to w_new as well!
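Here is a small standalone sketch of that broadcasting pattern, using toy numbers (this only demonstrates the NumPy mechanics, not the full update):

```python
import numpy as np

# Toy shapes matching the hint: diff is (n, 1) after reshaping, X is (n, p).
diff = np.array([0.5, -0.5, -0.5]).reshape(-1, 1)   # shape (3, 1)
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])                          # shape (3, 2)

# Broadcasting stretches the (3, 1) column across the p=2 columns of X,
# giving a per-example, per-feature product of shape (3, 2):
per_example = diff * X
print(per_example.shape)         # (3, 2)

# Averaging over the examples (axis=0) collapses to one value per feature:
print(per_example.mean(axis=0))  # [-7/6, -4/3] approximately
```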
if __name__ == "__main__":
# Test with simple data
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([0.0, 1.0, 1.0])
w_old = np.array([0.0, 0.0])
w0_old = 0.0
alpha = 0.1
lam = 0.0 # no regularization for this test
w0_new, w_new = logreg_grad_update(w0_old, w_old, X, y, alpha, lam)
# With zero weights and w0=0, all predictions are sigmoid(0) = 0.5
# diff = [0.5-0, 0.5-1, 0.5-1] = [0.5, -0.5, -0.5]
# mean(diff) = -1/6
# w0_new = 0 - 0.1 * (-1/6) = 1/60
assert np.isclose(w0_new, 1/60), f"w0_new should be {1/60}, got {w0_new}"
# diff.reshape(-1, 1) * X = [[0.5, 1.0], [-1.5, -2.0], [-2.5, -3.0]]
# mean over rows = [-7/6, -4/3]
# w_new = [0, 0] - 0.1 * [-7/6, -4/3] = [7/60, 2/15]
assert np.allclose(w_new, np.array([7/60, 2/15])), f"w_new should be [7/60, 2/15], got {w_new}"
# Test with regularization
w_old_reg = np.array([10, 10])
w0_new_reg, w_new_reg = logreg_grad_update(0.0, w_old_reg, X, y, alpha, lam=0.1)
assert w_new_reg.shape == (2,), "w_new with regularization should have shape (p,)"
assert np.allclose(w_new_reg, np.array([293/30, 146/15])), f"w_new with regularization should be [293/30, 146/15], got {w_new_reg}"
3. Logistic Regression Model Class [2 pts]
Below is a gradient descent implementation that uses your logreg_grad_update function from the previous question. It follows the same pattern as the one you wrote in HW 1.
def logreg_grad_descent(X: np.ndarray, y: np.ndarray, alpha: float, lam: float, max_iters: int = 5000) -> tuple:
"""Perform gradient descent for logistic regression with ridge regularization.
Args:
X: the feature matrix of shape (n, p)
y: the target vector of shape (n,)
alpha: the learning rate
lam: the regularization parameter
max_iters: the maximum number of iterations
Returns:
(w0, w): the final intercept term and weights of shape (p,)
"""
p = X.shape[1]
w0_old = 0.0
w_old = rng.randn(p)
# Initialize in case we hit max_iters without converging
w0_new, w_new = w0_old, w_old
num_iters = 0
while True:
w0_new, w_new = logreg_grad_update(w0_old, w_old, X, y, alpha, lam)
# Check for convergence using the full parameter vector (including intercept)
step_norm = np.sqrt((w0_new - w0_old) ** 2 + np.sum((w_new - w_old) ** 2))
if step_norm < 1e-4:
break
if num_iters >= max_iters:
print(f"Max iterations reached: {max_iters}")
break
w0_old, w_old = w0_new, w_new
num_iters += 1
return w0_new, w_new
Now, let’s put everything together into a MHCLogisticRegressor class. Before we define the class, let’s first take a look at a useful numpy function: np.argmax().
np.argmax() returns the index of the maximum value in an array. When used with axis=1 on a 2D array, it returns the index of the maximum value in each row. For example:
if __name__ == "__main__":
proba = np.array([[0.8, 0.2], # class 0 probability is 0.8
[0.3, 0.7], # class 1 probability is 0.7
[0.6, 0.4]]) # class 0 probability is 0.6
# prints the index of the maximum value in each row
print(np.argmax(proba, axis=1))  # prints: [0 1 0]
This is useful for converting predicted probabilities into class predictions: if the class 0 probability (column 0) is higher, argmax returns 0; if the class 1 probability (column 1) is higher, it returns 1.
ML model class documentation
Like HW 1, please also ensure that you document your class and its methods.
from sklearn.base import BaseEstimator
from typing import Self
class MHCLogisticRegressor(BaseEstimator):
"""TODO class description"""
def __init__(self, alpha: float, lam: float, max_iters: int = 5000):
"""TODO constructor description"""
# TODO initialize the self parameters
def fit(self, X: np.ndarray, y: np.ndarray) -> Self:
"""TODO method description"""
# TODO: call logreg_grad_descent to find the optimal weights
self.w0_, self.w_ = None, None
return self
def predict_proba(self, X: np.ndarray) -> np.ndarray:
"""TODO method description"""
# TODO A (n,) shape array of class 1 predicted probabilities
class1_proba = None
# TODO A (n,) shape array of class 0 predicted probabilities
class0_proba = None
assert class0_proba.shape == class1_proba.shape, "The class probabilities should have the same shape"
# Stacks the class 0 and class 1 probabilities into a single array of shape (n, 2)
return np.column_stack([class0_proba, class1_proba])
def predict(self, X: np.ndarray) -> np.ndarray:
"""TODO method description"""
# A (n,) shape array of predictions: use argmax with an axis reduction on the predicted probabilities array
predictions = None
assert predictions.shape == (X.shape[0],), "The predictions should have shape (n,)"
return predictions
if __name__ == "__main__":
# Some linearly separable data
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0],
[6.0, 7.0], [7.0, 8.0], [8.0, 9.0]])
y = np.array([0, 0, 0, 1, 1, 1])
model = MHCLogisticRegressor(alpha=0.05, lam=0.0, max_iters=10000)
model = model.fit(X, y)
# Check predict_proba shape
proba = model.predict_proba(X)
assert proba.shape == (6, 2), f"predict_proba should return shape (n, 2), got {proba.shape}"
# Check that probabilities sum to 1
assert np.allclose(proba.sum(axis=1), 1.0), "Probabilities should sum to 1 for each example"
# Check predictions
preds = model.predict(X)
assert preds.shape == (6,), f"predict should return shape (n,), got {preds.shape}"
assert np.array_equal(preds, y), f"Model should correctly classify linearly separable data, got {preds}"
How to submit
Follow the instructions on the course website to submit your work. For part 1, you will submit hw2_foundations.ipynb and hw2_foundations.py.