Activity 4: Implementing linear regression#

2026-02-05

import numpy as np
from sklearn.base import BaseEstimator
from typing import Self

Part 1: Python tuples and more NumPy shape practice#

A Python tuple is similar to a list, but it is immutable: it cannot be modified after creation. Tuples are declared using parentheses ( ).

tup = (1,2,3)

print(type(tup))

# Accessing elements is similar to lists
tup[-1]
<class 'tuple'>
3

A common Python shorthand is tuple unpacking: we can assign the elements of a tuple to multiple variables at once in one line:

# unpacking the tuple into variables a, b, c
a, b, c = tup

print(a)
print(b)
print(c)
1
2
3

This shorthand extends to Python function returns as well:

# a function that returns a tuple
def return_two_values():
    return "hello", 2


tup = return_two_values()
print(tup)

# Can even unpack the function return in one line
a, b = return_two_values()
print(a)
print(b)
('hello', 2)
hello
2
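Tuple unpacking also gives us a clean way to swap two variables without a temporary variable. This idiom isn't used in today's activity, but it comes up constantly in Python code:

```python
a, b = 1, 2

# The right-hand side packs (b, a) into a tuple, which is then unpacked
a, b = b, a

print(a)  # 2
print(b)  # 1
```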

Combining what we’ve learned so far with NumPy slicing, complete the following function:

def split_data(data: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """
    Split the data into X and y.

    Assume that the last column of the data is the target y.

    Args:
        data: a 2D numpy array of shape (n, p+1)

    Returns:
        X: a 2D numpy array of shape (n, p)
        y: a 1D numpy array of shape (n,)
    """
    #        r, c
    X = data[:, 0:-1]
    y = data[:, -1]
    
    return X, y

You can test your function with the housing data we’ve been looking at:

| beds | sqft | year | price ($100k) |
| ---- | ---- | ---- | ------------- |
| 2    | 2000 | 1999 | 6             |
| 4    | 3500 | 1950 | 11            |
| 1    | 1200 | 1980 | 4             |
| 1    | 1000 | 2010 | 3             |

data = np.array([
    #  beds, sqft, year, price (100k)
    [     2, 2000, 1999,      6],
    [     4, 3500, 1950,     11],
    [     1, 1200, 1980,      4],
    [     1, 1000, 2010,      3],
])

X, y = split_data(data)

# Let's write some assertion tests first
assert X.shape == (4,3)
assert y.shape == (4,)

Part 2: Implementing linear regression as a closed-form solution#

A really common gotcha in NumPy: a 1D array of shape (n,) is not the same as a 2D array of shape (n, 1):

# This selects the first column of X as a 1D array
X[:, 0]
array([2, 4, 1, 1])
# This selects the first column of X as a 2D array
X[:, 0:1]
array([[2],
       [4],
       [1],
       [1]])
y
array([ 6, 11,  4,  3])

When multiplying arrays of the same shape, NumPy will perform an element-wise multiplication:

X[:, 0] * y
array([12, 44,  4,  3])

When multiplying arrays of different shapes, NumPy will perform a broadcasting operation where the 1D array is “stretched” to match the shape of the 2D array. More on this in future classes!

X[:, 0:1] * y
array([[12, 22,  8,  6],
       [24, 44, 16, 12],
       [ 6, 11,  4,  3],
       [ 6, 11,  4,  3]])

For today, we need to take an extra step to convert any 2D array to a 1D array:

# flatten() converts an array of any shape to a 1D array
X[:, 0:1].flatten()
array([2, 4, 1, 1])
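As an aside, `flatten()` always returns a copy of the data. NumPy also offers `ravel()` and `reshape(-1)`, which do the same shape conversion but return a view of the original array when possible. A quick sketch of all three on a small column array:

```python
import numpy as np

col = np.array([[2], [4], [1], [1]])  # 2D column of shape (4, 1)

# Three ways to get a 1D array of shape (4,)
print(col.flatten().shape)    # (4,) -- always copies the data
print(col.ravel().shape)      # (4,) -- returns a view when possible
print(col.reshape(-1).shape)  # (4,) -- -1 means "infer this dimension"
```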

Here are the direct solutions for \(w_0\) and \(w_1\):

\[ w_1 = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n \sum_{i=1}^{n} x_i^2 - (\sum_{i=1}^{n} x_i)^2} \]
\[ w_0 = \frac{1}{n} \sum_{i=1}^{n} y_i - \frac{w_1}{n} \sum_{i=1}^{n} x_i \]

And the simple linear regression prediction is:

\[ \hat{y} = w_0 + w_1 x \]
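To make the formulas concrete before implementing them in a class, here is a quick numerical check using the bedrooms column and prices from our housing data (a sketch of the same closed-form equations above):

```python
import numpy as np

x = np.array([2, 4, 1, 1])   # beds
y = np.array([6, 11, 4, 3])  # price ($100k)
n = len(x)

# Closed-form solution for the slope and intercept
w1 = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) - np.sum(x) ** 2)
w0 = np.mean(y) - w1 * np.mean(x)

print(w1)  # 2.5 -- each additional bedroom adds $250k to the predicted price
print(w0)  # 1.0
```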

Complete the fit and predict methods below for the SimpleLinearRegression class.

Tip

We can use np.mean(x) to compute \(\frac{1}{n}\sum_{i=1}^{n} x_i\) in one operation.

# NOTE: this is called "simple" as a statistical term for one feature, not because it's simple to implement
class SimpleLinearRegression(BaseEstimator):

    def __init__(self):
        # There are no (hyper)parameters to set
        pass


    def fit(self, X: np.ndarray, y: np.ndarray) -> Self:
        """Fit the model to training data.

        Args:
            X: a 2D numpy array of shape (n, 1)
            y: a 1D numpy array of shape (n,)

        Returns:
            self: the fitted model
        """
        

        n = X.shape[0]
        x = X.flatten()
        # NOTE: we need to be super careful about the shape of the arrays!
        self.w1_ = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) - (np.sum(x) ** 2))

        # TODO compute w0 and assign it to an instance variable
        self.w0_ = np.mean(y) - self.w1_ * np.mean(x)
        
        return self


    def predict(self, X: np.ndarray) -> np.ndarray:
        """Predict on new data.

        Args:
            X: a 2D numpy array of shape (n_new, 1)

        Returns:
            y_hat: a 1D numpy array of shape (n_new,)
        """
        # TODO: compute the simple linear regression prediction
        # This can be done with one line of code
        return self.w0_ + self.w1_ * X.flatten()
        
from sklearn.linear_model import LinearRegression


# Select the bedrooms feature as a 2D array
X_beds = data[:, 0:1]

# Let's test our implementation against sklearn's implementation
sk_model = LinearRegression()
sk_model.fit(X_beds, y)

# TODO create our SimpleLinearRegression model here
our_model = SimpleLinearRegression()
# TODO fit our model
our_model.fit(X_beds, y)

# TODO predict on the training data
y_hat_ours = our_model.predict(X_beds)

# Check that the predictions are the same
y_hat_sk = sk_model.predict(X_beds)

assert np.allclose(y_hat_sk, y_hat_ours)
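Beyond comparing predictions, we can compare the learned parameters themselves: sklearn's `LinearRegression` stores the slope(s) in `coef_` and the intercept in `intercept_`. A self-contained sketch (recomputing the closed-form weights on the beds feature for comparison):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X_beds = np.array([[2], [4], [1], [1]])
y = np.array([6, 11, 4, 3])

sk = LinearRegression().fit(X_beds, y)

# Closed-form solution on the same data
x = X_beds.flatten()
n = len(x)
w1 = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) - np.sum(x) ** 2)
w0 = np.mean(y) - w1 * np.mean(x)

# The parameters should match, not just the predictions
assert np.isclose(sk.coef_[0], w1)
assert np.isclose(sk.intercept_, w0)
```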

Let’s see how well our model actually fits the data using our loss function. Often, we take the square root of the mean squared error to get the root mean squared error (RMSE), which converts the error back to the units of the target variable – in this case, hundreds of thousands of dollars.

\[ RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - y_i)^2} \]
def rmse(y_hat: np.ndarray, y: np.ndarray) -> float:
    """Root mean squared error."""
    assert y_hat.shape == y.shape

    # TODO: compute the RMSE, using the functions we practiced on WS 1
    return np.sqrt( np.mean( (y_hat - y)**2))

# Multiply by 100k to get dollars
print(rmse(y_hat_ours, y) * 100000)
35355.33905932738

To the nearest whole dollar, what is the RMSE of our model?

Your answer: https://pollev.com/tliu

Finally, let’s compare it to the average prediction model we computed by hand in class on Tuesday. The average house price in our dataset is $600,000:

avg_price = np.mean(y)
avg_preds = np.array([avg_price] * len(y))

print(rmse(avg_preds, y) * 100000)
308220.7001484488