HW 2 part 2#
Classification: Predictions
—TODO your name here
Collaboration Statement
TODO brief statement on the nature of your collaboration.
TODO your collaborators' names here
Part 2 Table of Contents and Rubric#
| Section | Points |
|---|---|
| Datasheets for Datasets | 1.5 |
| Data Preparation | 1 |
| Tuning and Prediction | 2 |
| Reflection | 0.5 |
| **Total** | **5 pts** |
Notebook and function imports#
If you tested your implementation in Part 1 against the autograder, you will have generated a file called `hw2_foundations.py`. Let's now import those functions into this notebook for use in Part 2.
If you are running this notebook on the JupyterHub allocated for the course:

1. Open the file browser by going to the menu bar "View -> File Browser"
2. Navigate to `comsc335.github.io/hws/`; you should see your `hw2_predictions.ipynb` file in that folder
3. Click on the upload button in the upper right and upload the `hw2_foundations.py` file to this directory
4. Run the following cell below to import the functions.
```python
import numpy as np
import pandas as pd
import seaborn as sns

# Import your implementations from Part 1
from hw2_foundations import MHCLogisticRegressor
```
Discussion questions
Whenever a question asks for a discussion, we are not necessarily looking for a particular answer. However, we are looking for engagement with the material, so one-word/one-phrase answers usually don’t give enough space to show your thought process. Try to explain your reasoning in ~1-2 full sentences.
4. Datasheets for Datasets [1.5 pts]#
As machine learning practitioners, we need to understand the development process and intended purpose of the data we work with. Building off the idea from last homework's readings that data is never "neutral," we will now look at the Datasheets for Datasets framework, proposed by Timnit Gebru et al. in 2021, which provides a standardized set of questions for dataset documentation to help increase transparency and accountability in ML systems.
Gebru et al. 2021: Datasheets for Datasets
Read pages 86-89 of the Datasheets for Datasets paper, which cover the Motivation and Composition questions of the datasheet. Then answer the questions below.
4.1: What are the two reasons the authors give for why a model might perform poorly “in the wild,” even if it performs well on a benchmark?
4.2: Describe the two key stakeholder groups and the primary objectives datasheets are designed to serve for each.
TODO your responses:
4.1:
4.2:
Next, we’ll take a closer look at the Adult Income dataset we began to work with in Worksheet 2.
Examining the context of the Adult Income dataset
Watch the first 8:01 of the folktables paper presentation: https://www.youtube.com/watch?v=KP7DhM_ahHI. This video discusses how the folktables package was created as a replacement for the widely-used UCI Adult Income dataset.
Look at the UCI Adult “Dataset Information” on the UCI Machine Learning Repository: https://archive.ics.uci.edu/dataset/2/adult. This is the original documentation for the dataset created back in 1996 and is very sparse.
Identify two specific questions from the Datasheets Motivation or Composition sections (questions 1 - 19) that the original UCI Adult dataset page does not answer. For each, briefly explain (~1-2 sentences) why knowing the answer would be important for someone using this dataset in an ML system.
TODO your response:
4.3:
Question 1:
Question 2:
A note on sex in this dataset
The Adult (reconstructed) dataset, drawn from the 1994 US Census, records sex as a binary variable (Male/Female). While sex and gender are distinct concepts, both are more complex than a binary categorization captures. In the fair machine learning literature, sex and race are commonly studied as protected attributes: characteristics that models should not use to discriminate against individuals.
As ML practitioners, it is important to recognize that the categories in our data are shaped by the social and institutional contexts in which the data was collected. When we train models on data with binary sex categories, we build systems that cannot account for people who don’t fit neatly into those categories. This is one example of how dataset design decisions carry forward into the models we build.
Now that we have some more context on the Adult Income dataset, let’s prepare it for modeling. We’ll practice the foundations of classification in this assignment and then explore fairness topics in the future.
5. Data Preparation [1 pt]#
Before we can train a model, we need to prepare the data into a format our model can use. This involves:
Loading the dataset and creating a binary target column
Encoding categorical features as numeric columns (one-hot encoding)
Standardizing numeric features so they are on a similar scale
Splitting the data into training and test sets
Let’s start by loading the ACS Income (adult reconstructed) dataset.
5.1 prepare_data() [0.5 pts]#
Below is a partial implementation of the `prepare_data()` function. Complete the data preparation steps to:

1. Create a binary `income_>50k` column (1 if income > 50000, 0 otherwise)
2. One-hot encode the categorical columns using `pd.get_dummies()`: see Worksheet 2 as a reference
3. Drop non-feature columns from the dataframe using `drop()`: see Worksheet 2 as a reference
The categorical columns in this dataset are: `workclass`, `education`, `marital-status`, `occupation`, `race`, `sex`, and `native-country`.
We drop the following columns: `income` and `relationship`.
Tip
For both the `drop()` and `get_dummies()` methods, you can pass in the argument `columns=[]` to specify multiple columns to drop or encode in one go.
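The tip above can be sketched on a tiny, hypothetical dataframe (the column names and values here are made up for illustration, not taken from the Adult dataset):

```python
import pandas as pd

# Toy frame to illustrate passing columns=[...] so that drop() and
# get_dummies() handle multiple columns in one call.
toy = pd.DataFrame({
    'age': [25, 40, 33],
    'color': ['red', 'blue', 'red'],
    'shape': ['circle', 'square', 'circle'],
    'id': [1, 2, 3],
})

# Drop one or more non-feature columns at once
toy = toy.drop(columns=['id'])

# One-hot encode several categorical columns at once, as 0/1 integers
toy = pd.get_dummies(toy, columns=['color', 'shape'], dtype=int)

print(toy.columns.tolist())
# ['age', 'color_blue', 'color_red', 'shape_circle', 'shape_square']
```

Note that `get_dummies()` replaces each categorical column with one 0/1 indicator column per category, named `<column>_<category>`.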
To help with gradient descent convergence, we often standardize the features so they are on a similar scale. After standardizing, each numeric feature will have a mean of 0 and a standard deviation of 1. We will cover standardization in more depth when we discuss data preprocessing later in the course, so we provide the code to do so here. The `StandardScaler` object follows the same `fit()` and `transform()` interface as the `PolynomialFeatures` object we saw in Activity 7.
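A minimal sketch of that `fit()`/`transform()` interface on a small, hypothetical array:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy columns on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

scaler = StandardScaler()
scaler.fit(X)               # learn each column's mean and std
X_std = scaler.transform(X)  # rescale: subtract mean, divide by std

print(X_std.mean(axis=0))  # approximately [0, 0]
print(X_std.std(axis=0))   # approximately [1, 1]
```

After transforming, both columns are on the same scale regardless of their original units.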
```python
from sklearn.preprocessing import StandardScaler


def prepare_data(df: pd.DataFrame) -> pd.DataFrame:
    """Prepare the income dataframe for modeling.

    Args:
        df: raw income dataframe

    Returns:
        features_df: the prepared feature dataframe, including the
            binary income_>50k target column
    """
    # Copy the dataframe to avoid modifying the original
    features_df = df.copy()

    # TODO create the binary target column
    features_df['income_>50k'] = None

    # TODO drop non-feature columns. You can use the drop() method with a list of column names.
    features_df = None

    # TODO: One-hot encode categorical columns using pd.get_dummies(dtype=int)
    categorical_cols = ['workclass', 'education', 'marital-status', 'occupation', 'race', 'sex', 'native-country']
    features_df = None

    # Standardize numeric columns to be on a similar scale: mean 0, std 1
    numeric_cols = ['age', 'hours-per-week', 'capital-gain', 'capital-loss', 'education-num']
    scaler = StandardScaler()
    scaler.fit(features_df[numeric_cols])
    features_df[numeric_cols] = scaler.transform(features_df[numeric_cols])

    return features_df
```
```python
if __name__ == "__main__":
    # Test prepare_data
    income_df = pd.read_csv('~/COMSC-335/data/adult_reconstruction_shuffled.csv')
    income_features = prepare_data(income_df)

    assert 'income_>50k' in income_features.columns, "income_>50k column should be created"
    assert 'income' not in income_features.columns, "income column should be dropped"
    assert income_features.shape == (49531, 102), f"Expected feature_df shape (49531, 102), got {income_features.shape}"
```
5.2 holdout_split() [0.5 pts]#
Now we implement the holdout method by splitting our data into a training set (used to fit the model) and a test set (used to evaluate the model on unseen data). In practice, data should always be randomized (shuffled) before splitting into train and test sets. The CSV for this assignment has been pre-shuffled, so we don’t need to shuffle it ourselves. Write a function that performs this split using the index slicing we practiced in Activity 6.
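The holdout idea can be sketched on a hypothetical 10-element array: the first `n_train` entries become the training set and the rest become the test set.

```python
import numpy as np

# Toy "dataset" of 10 examples, already shuffled
data = np.arange(10)

train_frac = 0.5
n_train = int(train_frac * data.shape[0])  # 5

# Index slicing: everything before n_train is train, the rest is test
train, test = data[:n_train], data[n_train:]
print(train)  # [0 1 2 3 4]
print(test)   # [5 6 7 8 9]
```

For a DataFrame, the analogous positional slices are `df[:n_train]` and `df[n_train:]`.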
```python
def holdout_split(features_df: pd.DataFrame, y_column: str, train_frac: float = 0.5) -> tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
    """Split data into training and test sets via the holdout method.

    Args:
        features_df: pandas DataFrame of features
        y_column: name of the target column
        train_frac: fraction of data to use for training

    Returns:
        A tuple of (X_train, X_test, y_train, y_test) as numpy arrays
    """
    # Compute the number of training examples
    n_train = int(train_frac * features_df.shape[0])

    # TODO: split X and y into train and test sets using index slicing and n_train
    X_train = None
    X_test = None
    y_train = None
    y_test = None

    # TODO drop the y_column from X_train and X_test
    X_train = None
    X_test = None

    return X_train.to_numpy(), X_test.to_numpy(), y_train.to_numpy(), y_test.to_numpy()
```
```python
if __name__ == "__main__":
    income_df = pd.read_csv('~/COMSC-335/data/adult_reconstruction_shuffled.csv')
    income_features = prepare_data(income_df)

    # Test holdout_split with 50/50 split
    X_train, X_test, y_train, y_test = holdout_split(income_features, y_column='income_>50k', train_frac=0.5)
    assert X_train.shape[0] == 24765, "Training set should have 24765 examples with a 50/50 split"
    assert X_test.shape[0] == 24766, "Test set should have 24766 examples with a 50/50 split"
    assert X_train.shape[1] + 1 == income_features.shape[1], "income_>50k column should be dropped from X_train"
    assert X_test.shape[1] + 1 == income_features.shape[1], "income_>50k column should be dropped from X_test"
    assert y_train.shape[0] == X_train.shape[0], "y_train should have the same number of examples as X_train"
```
6. Prediction and Tuning [2 pts]#
Now that our data is prepared, let's follow the ML process to train and evaluate our logistic regression model.
6.1 Accuracy [0.5 pts]#
Write a function that computes classification accuracy, which is the fraction of predictions that match the true labels:

\[
\text{accuracy} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\left[\hat{y}_i = y_i\right]
\]
Hint
This expression can be translated almost directly into code by using numpy array boolean indexing and np.mean().
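As a sketch of the hint, on a pair of toy label arrays:

```python
import numpy as np

# Toy predicted and true labels for illustration
y_pred = np.array([1, 0, 1, 1])
y_true = np.array([1, 0, 0, 1])

# Elementwise comparison gives a boolean array ...
matches = (y_pred == y_true)   # [ True  True False  True]

# ... and np.mean() on booleans is the fraction that are True
acc = np.mean(matches)
print(acc)  # 0.75
```

This works because numpy treats `True` as 1 and `False` as 0 when averaging.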
```python
def compute_accuracy(y_pred: np.ndarray, y_true: np.ndarray) -> float:
    """Compute classification accuracy.

    Args:
        y_pred: predicted labels of shape (n,)
        y_true: true labels of shape (n,)

    Returns:
        accuracy as a float
    """
    # TODO your code here
    return None
```
```python
if __name__ == "__main__":
    # Test compute_accuracy
    assert compute_accuracy(np.array([1, 0, 1, 1]), np.array([1, 0, 0, 1])) == 0.75, "compute_accuracy() should return 0.75 for this example"
```
6.2 \(\lambda\) sweep and the ML process [0.75 pts]#
The cell below contains the entire ML process: loading the data, preparing features, splitting into train and test sets, training models, and evaluating them.

For our training process, we saw in Activity 7 that regularization helps prevent overfitting, and we practiced finding a good value for the L2 regularization hyperparameter. Let's tune the `lam` hyperparameter for our logistic regression model by trying several values and picking the one with the best test accuracy. We'll try the following values, generated by `np.logspace(-5, 0, 6)`: \(10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 10^{0}\).
Complete the code below to train and evaluate a model for each value of `lam`. Keep the following hyperparameters for your `MHCLogisticRegressor` models constant:

- `alpha=0.5`
- `max_iters=2000`
Runtime note
Each model may take 10-20 seconds to train depending on JupyterHub available resources.
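As a quick sanity check before the sweep, you can preview the candidate values that `np.logspace` produces:

```python
import numpy as np

# 6 values evenly spaced on a log scale from 10^-5 to 10^0
lams = np.logspace(-5, 0, 6)
print(lams)  # [1.e-05 1.e-04 1.e-03 1.e-02 1.e-01 1.e+00]
```

Spacing candidates on a log scale is the usual choice for regularization strength, since its useful values span several orders of magnitude.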
```python
from hw2_foundations import MHCLogisticRegressor

if __name__ == "__main__":
    # Data: load in the raw data
    income_df = pd.read_csv('~/COMSC-335/data/adult_reconstruction_shuffled.csv')

    # TODO Features: call your prepare_data() function to featurize your data
    income_features = None

    # TODO Data/Features: call your holdout_split() function to split your data into train and test sets
    # y_column='income_>50k', and train_frac=0.5
    X_train, X_test, y_train, y_test = None

    # We'll tune the lambda hyperparameter for our model
    lams = np.logspace(-5, 0, 6)
    best_acc = 0
    # Save the best model to analyze later
    best_model = None

    for lam in lams:
        # TODO Model: initialize your MHCLogisticRegressor model from part 1
        # Use: alpha=0.5, lam=lam, max_iters=2000
        model = None

        # TODO Train: call your model's fit() method to train on the training data

        # TODO Evaluate: compute train and test accuracy using your compute_accuracy() function
        train_acc = 0
        test_acc = 0

        # If the current model has the highest test accuracy so far, save it
        if test_acc > best_acc:
            best_acc = test_acc
            best_model = model

        print(f" lambda={lam:.5f} | Train accuracy: {train_acc:.4f} | Test accuracy: {test_acc:.4f}")
```
Briefly discuss the results of your lambda sweep and report the best value of lambda. What value of lambda gave the best test accuracy? Was this best value clearly better than the other values of lambda, or was there a range of values that performed similarly?
Your response: TODO
Click to check accuracy results
Depending on your MHCLogisticRegressor implementation and the available resources on JupyterHub, you should see a test accuracy in the range of 0.82-0.86 for the best values of lambda.
6.3 Error analysis [0.75 pts]#
A model’s overall accuracy only tells part of the story. To understand how a model is making mistakes, it is useful to examine the examples it gets wrong. In binary classification, there are two types of errors:
False positives: the model predicts income \(>\) $50k, but the true label is \(\leq\) $50k
False negatives: the model predicts income \(\leq\) $50k, but the true label is > $50k
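These two error types can be illustrated with boolean expressions on a tiny, hypothetical dataframe (the values are made up; your `test_df` will have many more rows). Note that `&` binds more tightly than `==` in Python, so each comparison must be wrapped in parentheses:

```python
import pandas as pd

# Toy predictions vs. true labels, one row per example
toy = pd.DataFrame({
    'predicted':   [1, 0, 1, 0],
    'income_>50k': [0, 0, 1, 1],
})

# One way to flag errors: combine parenthesized comparisons with &
toy['false_pos'] = ((toy['predicted'] == 1) & (toy['income_>50k'] == 0)).astype(int)
toy['false_neg'] = ((toy['predicted'] == 0) & (toy['income_>50k'] == 1)).astype(int)

print(toy['false_pos'].tolist())  # [1, 0, 0, 0]
print(toy['false_neg'].tolist())  # [0, 0, 0, 1]
```

Row 0 is a false positive (predicted 1, truly 0) and row 3 is a false negative (predicted 0, truly 1); the correctly classified rows get 0 in both columns.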
Let's use the pandas boolean indexing operations we practiced in Worksheet 2 to examine these errors. Below is starter code to perform this analysis. It saves the predictions from the best model (stored in `best_model`) on the "test" portion of `income_df`, creating a `test_df` dataframe that contains the test set examples and the following columns:

- `predicted`: the model's predicted label (1 if income > 50k, 0 otherwise)
- `income_>50k`: the true label (1 if income > 50k, 0 otherwise)

Complete the TODO below to create the `false_pos` and `false_neg` columns. You may need to re-run the cell above to make sure that the `best_model` variable is up to date.
```python
if __name__ == "__main__":
    income_df = pd.read_csv('~/COMSC-335/data/adult_reconstruction_shuffled.csv')

    # Convert income to binary target: 1 if income > 50k, 0 otherwise
    income_df['income_>50k'] = (income_df['income'] > 50000).astype(int)

    # Create the test portion of the dataframe based on the shape of the training set
    n_train = X_train.shape[0]
    test_df = income_df[n_train:].copy()

    # Save predictions from the best model to the "test" portion of the income_df dataframe into the "predicted" column
    # predicted = 1 if the model predicts income > 50k, 0 otherwise
    test_df['predicted'] = best_model.predict(X_test)

    # TODO create false_pos and false_neg columns
    # The "predicted" column contains the model's predictions, and the "income_>50k" column contains the true labels.
    # Make sure to wrap your boolean expressions in parentheses for correct evaluation.
    test_df['false_pos'] = None
    test_df['false_neg'] = None
```
```python
if __name__ == "__main__":
    # Test that false_pos and false_neg are correct
    false_pos_df = test_df[test_df['false_pos'] == 1]
    false_neg_df = test_df[test_df['false_neg'] == 1]

    assert np.all(false_pos_df['predicted'] == 1), "false_pos_df should only contain examples where the model predicted 1"
    assert np.all(false_pos_df['income_>50k'] == 0), "false_pos_df should only contain examples where the true label is 0"
    assert np.all(false_neg_df['predicted'] == 0), "false_neg_df should only contain examples where the model predicted 0"
    assert np.all(false_neg_df['income_>50k'] == 1), "false_neg_df should only contain examples where the true label is 1"
```
Now, pick one of the following categorical features to examine in the widget below:

- `sex`
- `race`
- `education`
- `occupation`
- `workclass`
```python
if __name__ == "__main__":
    import sys; sys.path.insert(0, '..')
    from utils import explore_categorical_errors
    import ipywidgets as widgets

    categorical_cols = ['workclass', 'education', 'occupation', 'race', 'sex']
    error_types = ['False Positive Rate (%)', 'False Negative Rate (%)']

    widgets.interact(explore_categorical_errors,
                     # Tells the widget to use the test_df dataframe you created above
                     df=widgets.fixed(test_df),
                     # Creates a dropdown for the column name
                     column_name=categorical_cols,
                     # Creates a dropdown for the error type
                     error_type=error_types
                     );
```
Use the widget to look at the false positive rate and false negative rate for each category within your chosen feature:
The false positive rate is the percentage of individuals actually earning \(\leq\) $50k that the model incorrectly predicts as \(>\) $50k.
The false negative rate is the percentage of individuals actually earning \(>\) $50k that the model incorrectly predicts as \(\leq\) $50k.
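These rates can be computed directly with numpy on a small, hypothetical set of labels (the values here are made up for illustration; the widget computes them per category for you):

```python
import numpy as np

# Toy true labels and predictions: 3 actual negatives, 2 actual positives
y_true = np.array([0, 0, 0, 1, 1])
y_pred = np.array([1, 0, 0, 0, 1])

# Count false positives and false negatives
fp = np.sum((y_pred == 1) & (y_true == 0))  # 1
fn = np.sum((y_pred == 0) & (y_true == 1))  # 1

# Each rate is normalized by the size of the relevant true-label group
fpr = fp / np.sum(y_true == 0)  # 1/3 of actual negatives flagged positive
fnr = fn / np.sum(y_true == 1)  # 1/2 of actual positives missed

print(fpr, fnr)
```

The key point is the denominator: the false positive rate is normalized by the number of actual negatives, and the false negative rate by the number of actual positives, so the two rates can differ even for the same model.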
6.3.1: For both the false positive rate and the false negative rate, compare that rate across the categories of your chosen feature. Are the rates similar across categories, or are there notable differences? Which categories have the highest rate, and which have the lowest?
Your response:
Chosen feature: TODO
False positive rate: TODO
False negative rate: TODO
6.3.2: Suppose this model was used to make a real-world decision: approving loan applications based on an income eligibility cutoff, where individuals with income above $50k are more likely to be approved and those below are more likely to be rejected.
Briefly discuss what the differences you observed above might mean for individuals’ chances of loan approval in the affected categories (~2-3 sentences). It’s okay to speculate here as long as you provide some rationale, as this is an open-ended question.
Note
You may notice that the types of errors the model makes are not evenly distributed across groups of individuals. In the upcoming weeks, we’ll discuss ML fairness and evaluation tools for investigating these kinds of disparities, interpreting them in context, and considering the real-world impacts of ML-assisted decision-making.
Your response:
TODO
7. Reflection [0.5 pts]#
How much time did you spend on this assignment?
Were there any parts of the assignment that you found particularly challenging?
What is one thing you have a better understanding of after completing this assignment and going through the class content?
Do you have any follow-up questions about concepts that you’d like to explore further?
Indicate the number of late days (if any) you are using for this assignment.
TODO your responses here:
7.1:
7.2:
7.3:
7.4:
7.5:
How to submit
Follow the instructions on the course website to submit your work. For part 2, you will submit hw2_predictions.ipynb and hw2_predictions.py.