# HW 1 Part 2: Linear Regression Models

TODO your name here

## Collaboration Statement

TODO a brief statement on the nature of your collaboration.

TODO your collaborators' names here
## Part 2 Table of Contents and Rubric

| Section | Points |
|---|---|
| Linear Regression Implementation | 1.5 |
| Analysis | 2 |
| Ethics | 1.5 |
| Reflection | 0.5 |
| **Total** | **5.5 pts** |
## Notebook and function imports

> **Tip:** If you click on the vertical blue bar on the left of a cell, you can collapse the code, which can help organize the notebook as you work through the project.

If you have tested your implementation in Part 1 against the autograder, you will have generated a file called `hw1_foundations.py`. Let's now import those functions into this notebook for use in Part 2.

If you are running this notebook on the JupyterHub allocated for the course:

1. Open the file browser by going to the menu bar "View -> File Browser".
2. Navigate to `comsc335.github.io/hws/`; you should see your `hw1_models.ipynb` file in that folder.
3. Click on the upload button in the upper right and upload the `hw1_foundations.py` file to this directory.
4. Run the following cell to import the functions.
```python
import numpy as np
from sklearn.base import BaseEstimator
from typing import Self

rng = np.random.RandomState(42)

# import your functions from Part 1
from hw1_foundations import linreg_grad_descent, mse_loss
```
## 3. Linear regression model class [1.5 pts]

Let's put the gradient descent implementation together and create a linear regression model class. Following the same pattern as in Worksheet 1 and Activity 4, we'll create a class that inherits from scikit-learn's `BaseEstimator` and implements the `fit` and `predict` methods.

**ML model class documentation**

As Guido van Rossum, the creator of Python, likes to say:

> Code is read much more often than it is written.

Part of growing as a computer scientist or data scientist is being able to communicate your implementation effectively to others. As such, part of the homework assignments will be completing the documentation of the machine learning model classes you implement in this course. Please make sure that you document every method parameter and provide descriptions of the class and its methods. The docstrings shown in the methods in Part 1 and in Worksheet 1 follow the Google Python Style Guide, and you can use them as examples for your own documentation.
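For reference, here is what a Google-style docstring might look like on a simple helper function. This is an illustrative sketch only; `scale_features` is a made-up function, not part of the assignment:

```python
import numpy as np


def scale_features(X: np.ndarray, factor: float = 1.0) -> np.ndarray:
    """Scale every entry of the design matrix by a constant factor.

    Args:
        X: Design matrix of shape (n_samples, n_features).
        factor: Multiplicative scaling constant. Defaults to 1.0.

    Returns:
        A new array of the same shape as X with every entry
        multiplied by factor.
    """
    return X * factor
```

Note the one-line summary, the `Args:` section documenting every parameter (with shapes where relevant), and the `Returns:` section; your class and method docstrings should follow the same structure.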
```python
class MHCLinearRegressor(BaseEstimator):
    """TODO class description"""

    def __init__(self, alpha: float, max_iters: int = 5000):
        """TODO constructor description"""
        # TODO initialize the hyperparameters of alpha and max_iters
        pass

    def fit(self, X: np.ndarray, y: np.ndarray) -> Self:
        """TODO method description"""
        # TODO save the fitted weights, loss values, and weight history by
        # calling the linreg_grad_descent function with the correct parameters
        self.weights_, self.loss_values_, self.w_history_ = None, None, None
        # TODO remember to return self so that fit() can be chained

    def predict(self, X: np.ndarray) -> np.ndarray:
        """TODO method description"""
        # TODO use the fitted weights to make predictions on the input data
        pass


if __name__ == "__main__":
    # Initialize some simple data for testing, n=3, p=2
    X = np.array([[1, 2],
                  [2, 3],
                  [3, 3]])
    y = np.array([0, 1, 2])
    alpha = 0.05

    # Test the linear regression model
    model = MHCLinearRegressor(alpha=alpha)
    model = model.fit(X, y)
    assert model is not None, "The model should be fitted and returned in fit()"
    assert np.allclose(model.predict(X), y, atol=1e-3), "The predictions should be close to the y targets"
```
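If the `fit`/`predict`/`return self` pattern is unfamiliar, the toy estimator below shows it on a deliberately trivial model (it just predicts the training-set mean). This is a sketch of the scikit-learn estimator conventions, not a solution to the class above; `MeanRegressor` is a made-up name for illustration:

```python
import numpy as np
from sklearn.base import BaseEstimator


class MeanRegressor(BaseEstimator):
    """Toy estimator illustrating the scikit-learn fit/predict pattern."""

    def fit(self, X: np.ndarray, y: np.ndarray) -> "MeanRegressor":
        """Store the mean of the targets as a fitted attribute.

        By scikit-learn convention, attributes learned during fitting
        end in a trailing underscore (e.g. mean_, weights_), and fit()
        returns self so that calls can be chained.
        """
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Predict the stored training mean for every row of X."""
        return np.full(X.shape[0], self.mean_)


if __name__ == "__main__":
    X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0]])
    y = np.array([0.0, 1.0, 2.0])
    model = MeanRegressor().fit(X, y)  # chaining works because fit() returns self
    print(model.predict(X))  # prints [1. 1. 1.], the mean of y
```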
## 4. Analysis [2 pts]

### 4.1 Gradient descent alpha simulation [0.5 pts]
You’ll now use your newly implemented MHCLinearRegressor class to explore the effect of the learning rate \(\alpha\) on gradient descent convergence.
Run the code cell below to see an interactive plot of the gradient descent algorithm path for fitting your MHCLinearRegressor model with different values of \(\alpha\). The plot shows the contour plot along with how the weights update at each iteration (represented by the smaller white circles), with the title showing the final MSE loss and the number of iterations needed to converge.
```python
if __name__ == "__main__":
    import sys; sys.path.insert(0, '..')
    from utils import explore_alpha
    import ipywidgets as widgets

    widgets.interact_manual(
        explore_alpha,
        # Tells the widget to use the MHCLinearRegressor class
        LinModel=widgets.fixed(MHCLinearRegressor),
        # Creates an interactive slider for the learning rate
        alpha=widgets.FloatSlider(value=0.1, min=0.1, max=0.66, step=0.05),
    );
```
**4.1.1:** Increase the learning rate by increments of 0.05 from 0.1 to 0.65. Comment on what you see in both the number of iterations needed to converge as well as the "shape" of the gradient descent path as \(\alpha\) increases.

**4.1.2:** Now, slide the learning rate all the way to 0.66. What do you observe in the ending loss value and the start and end points of the gradient descent path?

**4.1.3:** Summarize the tradeoffs you see between setting a high vs. low \(\alpha\), and propose a potential strategy for picking \(\alpha\) in practice. (It's okay to speculate here as long as you provide a rationale, as this is an open-ended question.)
TODO your responses:
4.1.1:
4.1.2:
4.1.3:
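To build extra intuition for the learning-rate tradeoff, you can reproduce the same convergence-vs-divergence behavior on a 1-D quadratic loss, where it is easy to see by hand. The sketch below (an illustrative example, separate from the assignment's `explore_alpha` widget) runs gradient descent on \(f(w) = w^2\), whose gradient is \(2w\); the update multiplies \(w\) by \(1 - 2\alpha\) each step, so it shrinks toward 0 when \(|1 - 2\alpha| < 1\) and blows up otherwise. The exact divergence threshold depends on the curvature of the loss, which is why the 2-D bikeshare contour plot diverges at a different \(\alpha\):

```python
def gd_final_magnitude(alpha: float, w0: float = 1.0, iters: int = 50) -> float:
    """Run gradient descent on f(w) = w^2 and return |w| after `iters` steps."""
    w = w0
    for _ in range(iters):
        # gradient of w^2 is 2w, so the update is w <- w - alpha * 2w
        w = w - alpha * 2 * w
    return abs(w)


if __name__ == "__main__":
    print(gd_final_magnitude(0.1))   # small alpha: |w| shrinks toward 0
    print(gd_final_magnitude(1.05))  # too-large alpha: |w| grows each step
```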
## 5. “Data Neutrality” Ethics [1.5 pts]
Alongside exploring the technical foundations of machine learning, we will also consider the ethics of machine learning within socio-technical systems. Here, we will examine the notion of “data neutrality.”
Prof. Catherine D’Ignazio interview on data neutrality
Read “Data is never a raw, truthful input, and it is never neutral” and answer the discussion questions below.
**5.1:** When the author says "data is never neutral," what do they mean?

**5.2:** What are "who questions"? Why does the author advocate for using them?

**5.3:** Discuss who you believe has the responsibility to ask the "who questions." Is it the person who funds the research? The researcher or engineer who collected the data? The data scientist or computer scientist who analyzed it? Other parties?
**5.4:** The creators of the bikeshare dataset included the following description:

> Bike sharing systems are new generation of traditional bike rentals where whole process from membership, rental and return back has become automatic. Through these systems, user is able to easily rent a bike from a particular position and return back at another position. Currently, there are about over 500 bike-sharing programs around the world which is composed of over 500 thousands bicycles. Today, there exists great interest in these systems due to their important role in traffic, environmental and health issues.
>
> Apart from interesting real world applications of bike sharing systems, the characteristics of data being generated by these systems make them attractive for the research. Opposed to other transport services such as bus or subway, the duration of travel, departure and arrival position is explicitly recorded in these systems. This feature turns bike sharing system into a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that most of important events in the city could be detected via monitoring these data.
>
> The core data set is related to the two-year historical log corresponding to years 2011 and 2012 from Capital Bikeshare system, Washington D.C., USA which is publicly available in http://capitalbikeshare.com/system-data. We aggregated the data on a daily basis and then extracted and added the corresponding weather and seasonal information. Weather information are extracted from http://www.freemeteo.com.

Also note that Lyft bought the parent company of Capital Bikeshare in 2018. Building on the interview article, reflect on who benefits from using this dataset and a machine learning model that predicts bike rentals to guide decisions (e.g. where new stations go, how many bikes to stock, pricing changes, etc.), as well as who could potentially be harmed or overlooked, and why. Discuss each in a brief paragraph (~2-3 sentences).
TODO your responses:
5.1:
5.2:
5.3:
5.4:
Potential benefits:
Potential harms:
## 6. Reflection [0.5 pts]

**6.1:** How much time did you spend on this assignment?

**6.2:** Were there any parts of the assignment that you found particularly challenging?

**6.3:** What is one thing you have a better understanding of after completing this assignment and going through the class content?

**6.4:** Do you have any follow-up questions about concepts that you'd like to explore further?

**6.5:** Indicate the number of late days (if any) you are using for this assignment.
TODO your responses here:
6.1:
6.2:
6.3:
6.4:
6.5:
**How to submit**

As with Worksheet 1, follow the instructions on the course website to submit your work. For all of Homework 1, your submission will include the files from both parts:

- `hw1_foundations.ipynb` and `hw1_foundations.py`
- `hw1_models.ipynb` and `hw1_models.py`
## Acknowledgements

- The bikeshare dataset is sourced from the ISLP repository.
- The data ethics exercise is sourced from Yaniv Yacoby's Probabilistic Foundations of ML course.