Activity 6: Holdout and model evaluation#
2026-02-12
Imports and previous models#
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.base import BaseEstimator
from typing import Self
def rmse(y_hat: np.ndarray, y: np.ndarray) -> float:
    """Root mean squared error."""
    assert y_hat.shape == y.shape
    # compute the RMSE using the functions we practiced on WS 1
    return np.sqrt(np.mean((y_hat - y) ** 2))
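As a quick sanity check on `rmse` (toy values, not from the housing dataset): if every prediction is off by exactly 1, the RMSE should be exactly 1.

```python
import numpy as np

def rmse(y_hat: np.ndarray, y: np.ndarray) -> float:
    """Root mean squared error."""
    assert y_hat.shape == y.shape
    return np.sqrt(np.mean((y_hat - y) ** 2))

y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = y + 1.0  # every prediction is off by exactly 1
print(rmse(y_hat, y))  # → 1.0
```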
Below are implementations of the models we’ve built so far:
Tip
If you click on the arrow near the blue bar at the top of a heading, you can collapse the code below it, which can help keep the notebook organized.
class MeanRegressor(BaseEstimator):
    """Simple model that predicts the mean of the training data."""

    # constructors in Python are defined using the `__init__` method
    # A quirk of Python OOP: the first argument is always `self`, which refers to the object itself
    def __init__(self):
        pass

    # the fit method trains the model on the given data, and always takes X and y as arguments
    def fit(self, X, y):
        """Fits the mean regressor to the training data.

        Args:
            X: the data examples of shape (n, p)
            y: the answers vector of shape (n,)
        Returns:
            self: the fitted model
        """
        # fitted model parameters are stored in `self` as instance variables and suffixed with `_`
        self.mean_ = np.mean(y)
        return self

    # the predict method makes predictions on new data, and always takes X as an argument
    def predict(self, X):
        """Predicts the values for new points X.

        This model will only predict the mean value of the fitted data for all new points.

        Args:
            X: the new points of shape (n_new, p)
        Returns:
            the predicted values of shape (n_new,)
        """
        predictions = []
        for x in X:
            predictions.append(self.mean_)
        return np.array(predictions)
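To see the fit/predict pattern in action, here is a toy run of the class above (a standalone sketch: the class is re-declared in condensed form so the cell runs on its own, and the numbers are made up):

```python
import numpy as np
from sklearn.base import BaseEstimator

class MeanRegressor(BaseEstimator):
    """Predicts the training mean for every new point (condensed version)."""
    def fit(self, X, y):
        self.mean_ = np.mean(y)
        return self
    def predict(self, X):
        # same result as the loop version: one copy of the mean per row of X
        return np.full(X.shape[0], self.mean_)

X_toy = np.array([[1.0], [2.0], [3.0]])
y_toy = np.array([2.0, 4.0, 6.0])  # mean is 4.0
model = MeanRegressor().fit(X_toy, y_toy)
print(model.predict(np.array([[10.0], [20.0]])))  # → [4. 4.]
```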
def find_k_nearest_indices(x: np.ndarray, X_train: np.ndarray, k: int) -> list:
    """Finds the indices of the k nearest neighbors to a new point x.

    Args:
        x: the new point of shape (m,)
        X_train: the training data of shape (n, m)
        k: the number of nearest neighbors to find
    Returns:
        the indices of the k nearest neighbors to x in X_train
    """
    # Euclidean distance from x to every training point
    dists = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
    sorted_indices = np.argsort(dists)
    return sorted_indices[:k]
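A quick check of the helper on made-up 1D points: the two nearest training points to x = 2.1 are at indices 2 and 1 (distances 0.1 and 1.1).

```python
import numpy as np

def find_k_nearest_indices(x: np.ndarray, X_train: np.ndarray, k: int):
    """Indices of the k nearest training points to x (Euclidean distance)."""
    dists = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
    return np.argsort(dists)[:k]

X_train = np.array([[0.0], [1.0], [2.0], [5.0]])
x = np.array([2.1])
print(find_k_nearest_indices(x, X_train, 2))  # → [2 1]
```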
# Our KNNRegressor class extends the BaseEstimator class
class KNNRegressor(BaseEstimator):
    """KNN regressor model."""

    def __init__(self, n_neighbors: int):
        """Initializes the KNN regressor model.

        Args:
            n_neighbors: the number of neighbors to use for the KNN regressor
        """
        # self.var_name is an instance variable that can be accessed by any method in the class
        self.n_neighbors = n_neighbors

    def fit(self, X: np.ndarray, y: np.ndarray) -> Self:
        """Fits the KNN regressor to the training data.

        Note that KNN models do not have any functions or features that need to be fit,
        so all this method does is store the training data as instance variables.

        Args:
            X: the feature matrix of shape (n, m)
            y: the target vector of shape (n,)
        Returns:
            self: the fitted model
        """
        # use self to store the training data in instance variables
        self.X_ = X
        self.y_ = y
        return self

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Predicts the values of a set of new points X.

        Args:
            X: the new points of shape (n_new, m)
        Returns:
            the predicted values of shape (n_new,)
        """
        assert self.X_.shape[1] == X.shape[1], "X must have the same number of features as the training data"
        predictions = []
        # this loops over the rows of X
        for x in X:
            # find the k nearest neighbors to x
            k_nearest_indices = find_k_nearest_indices(x, self.X_, self.n_neighbors)
            # compute the average of the k nearest neighbors, append to predictions
            predictions.append(np.mean(self.y_[k_nearest_indices]))
        return np.array(predictions)
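Putting the helper and the class together on made-up data (a standalone sketch with condensed re-declarations): with `n_neighbors=2`, the prediction for a new point is the mean target of its two nearest training points.

```python
import numpy as np
from sklearn.base import BaseEstimator

def find_k_nearest_indices(x, X_train, k):
    """Indices of the k nearest training points to x (Euclidean distance)."""
    dists = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
    return np.argsort(dists)[:k]

class KNNRegressor(BaseEstimator):
    """Condensed KNN regressor: stores the data in fit, averages neighbors in predict."""
    def __init__(self, n_neighbors):
        self.n_neighbors = n_neighbors
    def fit(self, X, y):
        self.X_, self.y_ = X, y
        return self
    def predict(self, X):
        return np.array([
            np.mean(self.y_[find_k_nearest_indices(x, self.X_, self.n_neighbors)])
            for x in X
        ])

X_train = np.array([[0.0], [1.0], [2.0], [5.0]])
y_train = np.array([0.0, 10.0, 20.0, 50.0])
model = KNNRegressor(n_neighbors=2).fit(X_train, y_train)
# the two nearest neighbors of 1.9 are 2.0 and 1.0, so we average 20.0 and 10.0
print(model.predict(np.array([[1.9]])))  # → [15.]
```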
# NOTE: this is called "simple" as a statistical term for one feature, not because it's simple to implement
class SimpleLinearRegression(BaseEstimator):
    def __init__(self):
        # There are no (hyper)parameters to set
        pass

    def fit(self, X: np.ndarray, y: np.ndarray) -> Self:
        """Fit the model to training data.

        Args:
            X: a 2D numpy array of shape (n, 1)
            y: a 1D numpy array of shape (n,)
        Returns:
            self: the fitted model
        """
        n = X.shape[0]
        # NOTE: we need to be super careful about the shape of the arrays!
        x = X.flatten()
        self.w1_ = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) - np.sum(x) ** 2)
        self.w0_ = np.mean(y) - self.w1_ * np.mean(x)
        return self

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Predict on new data.

        Args:
            X: a 2D numpy array of shape (n_new, 1)
        Returns:
            y_hat: a 1D numpy array of shape (n_new,)
        """
        return self.w0_ + self.w1_ * X.flatten()
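One way to convince yourself the closed-form slope and intercept are right is to fit a line we already know (a standalone sketch with the class re-declared in condensed form): on data generated from y = 2x + 1 with no noise, the fit should recover slope 2 and intercept 1 exactly.

```python
import numpy as np
from sklearn.base import BaseEstimator

class SimpleLinearRegression(BaseEstimator):
    """Condensed one-feature least-squares fit."""
    def fit(self, X, y):
        n = X.shape[0]
        x = X.flatten()
        self.w1_ = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) - np.sum(x) ** 2)
        self.w0_ = np.mean(y) - self.w1_ * np.mean(x)
        return self
    def predict(self, X):
        return self.w0_ + self.w1_ * X.flatten()

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * X.flatten() + 1.0  # exact line: slope 2, intercept 1
model = SimpleLinearRegression().fit(X, y)
print(model.w1_, model.w0_)  # → 2.0 1.0
```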
Part 1: train/test holdout split#
# Load in data
housing_df = pd.read_csv("~/COMSC-335/data/housing_data.csv")
# TODO examine the first few rows and last few rows of the data
# TODO get the column names
# TODO get the number of rows and columns
The features we have are:
- MedInc: median income in block group
- HouseAge: median house age in block group
- AveRooms: average number of rooms per household
- AveBedrms: average number of bedrooms per household
- Population: block group population
- AveOccup: average number of household members
- Latitude: block group latitude
- Longitude: block group longitude
The answer we want to predict is:
MedHouseVal: median house value in $100,000s
Single columns in pandas can be accessed with the column name:
#housing_df["HouseAge"]
Multiple columns can be accessed with a list of column names:
# housing_df[["AveRooms", "HouseAge"]]
There are many ways to select rows from a pandas DataFrame. Similar to numpy, we can use square brackets and slicing:
# select the first 1000 rows
housing_df[:1000]
# select the last 1000 rows
housing_df[-1000:]
# select the rows starting at index 1000
housing_df[1000:]
Let’s split the data into training and test sets. We’ll use the first 80% of the data for training and the last 20% for testing. Complete the code below to create the X_train, X_test, y_train, and y_test variables.
# TODO split the data into training and test sets
housing_train = None
housing_test = None
# TODO what are some assertions we can do to test our splits?
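One possible way to fill in the split and its sanity checks (a sketch using the 80/20 rule from the text; it is shown on a small synthetic stand-in for `housing_df` so the cell runs on its own):

```python
import numpy as np
import pandas as pd

# synthetic stand-in for housing_df so this sketch runs standalone
housing_df = pd.DataFrame({"MedInc": np.arange(100.0), "MedHouseVal": np.arange(100.0)})

# first 80% of the rows for training, last 20% for testing
split_idx = int(0.8 * len(housing_df))
housing_train = housing_df[:split_idx]
housing_test = housing_df[split_idx:]

# sanity checks: no rows lost or duplicated, and the sizes match 80/20
assert len(housing_train) + len(housing_test) == len(housing_df)
assert len(housing_train) == split_idx
print(len(housing_train), len(housing_test))  # → 80 20
```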
Columns can also be removed with the drop() method:
# remove the y column
X_train = housing_train.drop(columns=["MedHouseVal"])
X_test = housing_test.drop(columns=["MedHouseVal"])
# save the y column
y_train = housing_train["MedHouseVal"]
y_test = housing_test["MedHouseVal"]
Part 2: Feature exploration#
Seaborn is a high-level plotting library that has strong integration with pandas DataFrames. For a given plot, we often specify the following parameters:
- x: the column name of the x-axis
- y: the column name of the y-axis
- data: the pandas DataFrame to plot
Warning
Remember that we should only be looking at the training data when exploring the relationships between features and the target!
We’ll visualize the data using `sns.scatterplot` below:
# plots HouseAge vs MedHouseVal
# alpha controls the transparency of the points
sns.scatterplot(x="HouseAge", y="MedHouseVal", data=housing_train, alpha=0.1)
The features we have are:
- MedInc: median income in block group
- HouseAge: median house age in block group
- AveRooms: average number of rooms per household
- AveBedrms: average number of bedrooms per household
- Population: block group population
- AveOccup: average number of household members
- Latitude: block group latitude
- Longitude: block group longitude
The answer we want to predict is:
MedHouseVal: median house value in $100,000s
In groups of 2-3 around you, split up the features to plot on the x-axis, always keeping the y-axis as MedHouseVal. Compare plots to discuss what the “best” 2-3 features for predicting MedHouseVal are, and vote for them in the PollEverywhere:
# TODO your scatterplots here
sns.scatterplot(x="TODO", y="MedHouseVal", data=housing_train)
Part 3: Model evaluation and benchmarking#
Let’s now fit a MeanRegressor model as a baseline for us to benchmark our other models against:
# TODO update this to change the features we're using
feats_to_include = ["HouseAge"]
# initialize the model
regressor = MeanRegressor()
# fit the model
regressor.fit(X_train[feats_to_include].to_numpy(), y_train.to_numpy())
# make predictions on both the training and test sets
y_hat_train = regressor.predict(X_train[feats_to_include].to_numpy())
y_hat_test = regressor.predict(X_test[feats_to_include].to_numpy())
# compute the RMSE
# convert the pandas Series to numpy arrays, matching the fit/predict calls above
rmse_train = rmse(y_hat_train, y_train.to_numpy())
rmse_test = rmse(y_hat_test, y_test.to_numpy())
# Rounds the RMSE to 2 decimal places
print(f"MeanRegressor RMSE on training set: {rmse_train:.2f}")
print(f"MeanRegressor RMSE on test set: {rmse_test:.2f}")
Now, let’s go model-hunting: find a model that beats the MeanRegressor RMSE (lower is better) on the test set. Copy the cell above and modify it to try new model and feature combinations:
- KNNRegressor: try different values of n_neighbors
- SimpleLinearRegression: try different features, but note that this model only takes in 1 feature
- LinearRegression: try different feature combinations
Again, you can discuss a search strategy with folks around you.
As you try new models and features, discuss with folks around you any discrepancies you see between the training and test set RMSEs. Is one number usually higher than the other?
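One way to organize the search is to compare each candidate against the baseline in a single cell (a sketch on synthetic data so it runs standalone; for the activity, swap in the housing arrays and the classes above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def rmse(y_hat, y):
    return np.sqrt(np.mean((y_hat - y) ** 2))

# synthetic stand-in data: a noisy linear relationship
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=500)

# same 80/20 holdout split as in Part 1
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# baseline: predict the training mean everywhere (what MeanRegressor does)
baseline_test_rmse = rmse(np.full_like(y_test, y_train.mean()), y_test)

# candidate: ordinary linear regression on all features
model = LinearRegression().fit(X_train, y_train)
model_test_rmse = rmse(model.predict(X_test), y_test)

print(f"baseline test RMSE: {baseline_test_rmse:.2f}, linear test RMSE: {model_test_rmse:.2f}")
assert model_test_rmse < baseline_test_rmse  # a useful candidate should beat the baseline
```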
After trying out a few models with your group, submit your best test set RMSE rounded to 2 decimal places to the PollEverywhere: