HW 2 part 2#

Classification: Predictions

—TODO your name here

Collaboration Statement

  • TODO brief statement on the nature of your collaboration.

  • TODO your collaborator’s names here

Part 2 Table of Contents and Rubric#

| Section | Points |
| --- | --- |
| Datasheets for Datasets | 1.5 |
| Data Preparation | 1 |
| Prediction and Tuning | 2 |
| Reflection | 0.5 |
| **Total** | **5 pts** |

Notebook and function imports#

If you tested your Part 1 implementation against the autograder, you will have generated a file called hw2_foundations.py. Let’s now import those functions into this notebook for use in Part 2.

If you are running this notebook on the JupyterHub allocated for the course:

  1. Open the file browser by going to the menu bar “View -> File Browser”

  2. Navigate to comsc335.github.io/hws/; you should see your hw2_predictions.ipynb file in that folder

  3. Click on the upload button in the upper right and upload the hw2_foundations.py file to this directory

  4. Run the cell below to import the functions.

import numpy as np
import pandas as pd
import seaborn as sns

# Import your implementations from Part 1
from hw2_foundations import MHCLogisticRegressor

Discussion questions

Whenever a question asks for a discussion, we are not necessarily looking for a particular answer. However, we are looking for engagement with the material, so one-word/one-phrase answers usually don’t give enough space to show your thought process. Try to explain your reasoning in ~1-2 full sentences.

4. Datasheets for Datasets [1.5 pts]#

As machine learning practitioners, we need to understand the development process and intended purpose of the data we work with. Building on the idea from last homework’s readings that data is never “neutral,” we will now look at the Datasheets for Datasets framework, proposed by Timnit Gebru et al. in 2021, which provides a standardized set of questions for dataset documentation to help increase transparency and accountability in ML systems.

Gebru et al. 2021: Datasheets for Datasets

Read pg 86 - 89 of the Datasheets for Datasets paper, which covers through the Motivation and Composition questions of the datasheet. Then answer the questions below.

4.1: What are the two reasons the authors give for why a model might perform poorly “in the wild,” even if it performs well on a benchmark?

4.2: Describe the two key stakeholder groups and the primary objectives datasheets are designed to serve for each.

TODO your responses:

4.1:

4.2:

Next, we’ll take a closer look at the Adult Income dataset we began to work with in Worksheet 2.

Examining the context of the Adult Income dataset

Watch the first 8:01 of the folktables paper presentation: https://www.youtube.com/watch?v=KP7DhM_ahHI. This video discusses how the folktables package was created as a replacement for the widely-used UCI Adult Income dataset.

Look at the UCI Adult “Dataset Information” on the UCI Machine Learning Repository: https://archive.ics.uci.edu/dataset/2/adult. This is the original documentation for the dataset created back in 1996 and is very sparse.

Identify two specific questions from the Datasheets Motivation or Composition sections (questions 1 - 19) that the original UCI Adult dataset page does not answer. For each, briefly explain (~1-2 sentences) why knowing the answer would be important for someone using this dataset in an ML system.

TODO your response:

4.3:

  • Question 1:

  • Question 2:

A note on sex in this dataset

The Adult (reconstructed) dataset, drawn from the 1994 US Census, records sex as a binary variable (Male/Female). While sex and gender are distinct concepts, both are more complex than a binary categorization captures. In the fair machine learning literature, sex and race are commonly studied as protected attributes: characteristics that models should not use to discriminate against individuals.

As ML practitioners, it is important to recognize that the categories in our data are shaped by the social and institutional contexts in which the data was collected. When we train models on data with binary sex categories, we build systems that cannot account for people who don’t fit neatly into those categories. This is one example of how dataset design decisions carry forward into the models we build.


Now that we have some more context on the Adult Income dataset, let’s prepare it for modeling. We’ll practice the foundations of classification in this assignment and then explore fairness topics in the future.

5. Data Preparation [1 pt]#

Before we can train a model, we need to prepare the data into a format our model can use. This involves:

  1. Loading the dataset and creating a binary target column

  2. Encoding categorical features as numeric columns (one-hot encoding)

  3. Standardizing numeric features so they are on a similar scale

  4. Splitting the data into training and test sets

Let’s start by loading the ACS Income (adult reconstructed) dataset.

5.1 prepare_data() [0.5 pts]#

Below is a partial implementation of the prepare_data() function. Complete the data preparation steps to:

  1. Create a binary income_>50k column (1 if income > 50000, 0 otherwise)

  2. One-hot encode the categorical columns using pd.get_dummies(): see Worksheet 2 as a reference

  3. Drop non-feature columns from the dataframe using drop(): see Worksheet 2 as a reference

The categorical columns in this dataset are: workclass, education, marital-status, occupation, race, sex, native-country.

We drop the following columns: income and relationship.

Tip

For both the drop() and get_dummies() methods, you can pass in the argument columns=[] to specify multiple columns to drop or encode in one go.
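To illustrate the columns= argument, here is a toy sketch on a made-up dataframe (the column and category names here are hypothetical, not the ones from the Adult Income dataset):

```python
import pandas as pd

# Toy dataframe (hypothetical values, unrelated to the Adult data)
toy = pd.DataFrame({
    'color': ['red', 'blue', 'red'],
    'size': ['S', 'M', 'S'],
    'id': [1, 2, 3],
    'value': [10, 20, 30],
})

# drop() with columns=[] removes several columns in one go
toy = toy.drop(columns=['id'])

# get_dummies() with columns=[] one-hot encodes several columns in one go;
# dtype=int gives 0/1 columns instead of booleans
toy = pd.get_dummies(toy, columns=['color', 'size'], dtype=int)

print(toy.columns.tolist())
# ['value', 'color_blue', 'color_red', 'size_M', 'size_S']
```

Note that the non-encoded columns come first in the result, followed by one 0/1 column per category of each encoded column.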

To help with gradient descent convergence, we often standardize the features so they are on a similar scale. After standardizing, the numeric features will have a mean of 0 and a standard deviation of 1. We will cover standardization in more depth when we discuss data preprocessing later in the course, so we provide the code here. The StandardScaler object follows the same fit() and transform() interface as the PolynomialFeatures object we saw in Activity 7.
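As a quick sanity check of what StandardScaler does, here is a toy example on made-up data (the values are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy numeric data: two features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

scaler = StandardScaler()
scaler.fit(X)                   # learns each column's mean and std
X_scaled = scaler.transform(X)  # applies (x - mean) / std per column

print(X_scaled.mean(axis=0))  # approximately [0. 0.]
print(X_scaled.std(axis=0))   # approximately [1. 1.]
```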

from sklearn.preprocessing import StandardScaler


def prepare_data(df: pd.DataFrame) -> pd.DataFrame:
    """Prepare the income dataframe for modeling.

    Args:
        df: raw income dataframe

    Returns:
        the prepared feature dataframe, with one-hot encoded categorical
        columns, standardized numeric columns, and the binary income_>50k target
    """
    # Copy the dataframe to avoid modifying the original
    features_df = df.copy()

    # TODO create the binary target column
    features_df['income_>50k'] = None

    # TODO drop non-feature columns. You can use the drop() method with a list of column names.
    features_df = None

    # TODO: One-hot encode categorical columns using pd.get_dummies(dtype=int) 
    categorical_cols = ['workclass', 'education', 'marital-status', 'occupation', 'race', 'sex', 'native-country']
    features_df = None

    # Standardize numeric columns to be a similar range: mean 0, std 1
    numeric_cols = ['age', 'hours-per-week', 'capital-gain', 'capital-loss', 'education-num']
    scaler = StandardScaler()
    scaler.fit(features_df[numeric_cols])
    features_df[numeric_cols] = scaler.transform(features_df[numeric_cols])
    
    return features_df

if __name__ == "__main__":
    # Test prepare_data
    income_df = pd.read_csv('~/COMSC-335/data/adult_reconstruction_shuffled.csv')
    income_features = prepare_data(income_df)
    assert 'income_>50k' in income_features.columns, "income_>50k column should be created"
    assert 'income' not in income_features.columns, "income column should be dropped"
    assert income_features.shape == (49531, 102), f"Expected feature_df shape (49531, 102), got {income_features.shape}"
    

5.2 holdout_split() [0.5 pts]#

Now we implement the holdout method by splitting our data into a training set (used to fit the model) and a test set (used to evaluate the model on unseen data). In practice, data should always be randomized (shuffled) before splitting into train and test sets. The CSV for this assignment has been pre-shuffled, so we don’t need to shuffle it ourselves. Write a function that performs this split using the index slicing we practiced in Activity 6.
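As a reminder of how positional index slicing works, here is a toy sketch on a made-up, pre-shuffled dataframe (the column names are hypothetical; this is not the graded solution):

```python
import pandas as pd

# Toy pre-shuffled dataframe (hypothetical values)
toy = pd.DataFrame({'x': [10, 20, 30, 40], 'y': [0, 1, 0, 1]})

train_frac = 0.5
n_train = int(train_frac * toy.shape[0])  # 2

# iloc slices rows by position: first n_train rows for training,
# the remaining rows for testing
train_df = toy.iloc[:n_train]
test_df = toy.iloc[n_train:]

print(train_df.shape[0], test_df.shape[0])  # 2 2
```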

def holdout_split(features_df: pd.DataFrame, y_column: str, train_frac: float = 0.5) -> tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
    """Split data into training and test sets via the holdout method.

    Args:
        features_df: pandas DataFrame of features
        y_column: name of the target column
        train_frac: fraction of data to use for training

    Returns:
        A tuple of (X_train, X_test, y_train, y_test) as numpy arrays
    """
    # Compute the number of training examples
    n_train = int(train_frac * features_df.shape[0])
    
    # TODO: split X and y into train and test sets using index slicing and n_train
    X_train = None
    X_test = None
    y_train = None
    y_test = None

    # TODO drop the y_column from X_train and X_test
    X_train = None
    X_test = None

    return X_train.to_numpy(), X_test.to_numpy(), y_train.to_numpy(), y_test.to_numpy()

if __name__ == "__main__":
    income_df = pd.read_csv('~/COMSC-335/data/adult_reconstruction_shuffled.csv')
    income_features = prepare_data(income_df)
    # Test holdout_split with 50/50 split
    
    X_train, X_test, y_train, y_test = holdout_split(income_features, y_column='income_>50k', train_frac=0.5)
    
    assert X_train.shape[0] == 24765, "Training set should have 24765 examples with a 50/50 split"
    assert X_test.shape[0] == 24766, "Test set should have 24766 examples with a 50/50 split"
    assert X_train.shape[1] + 1 == income_features.shape[1], "income_>50k column should be dropped from X_train"
    assert X_test.shape[1] + 1 == income_features.shape[1], "income_>50k column should be dropped from X_test"
    assert y_train.shape[0] == X_train.shape[0], "y_train should have the same number of examples as X_train"

6. Prediction and Tuning [2 pts]#

Now that our data is prepared, let’s follow the ML process to train and evaluate our logistic regression model.

6.1 Accuracy [0.5 pts]#

Write a function that computes classification accuracy, which is the fraction of predictions that match the true labels:

\[ \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}(y_i = \hat{y}_i) \]

Hint

This expression can be translated almost directly into code by using numpy array boolean indexing and np.mean().
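To see why the hint works, here is a toy example (with made-up arrays) of how elementwise comparison and np.mean() interact:

```python
import numpy as np

# Comparing two arrays elementwise yields a boolean array;
# np.mean() treats True as 1 and False as 0, so the mean of the
# boolean array is exactly the fraction of positions that agree.
a = np.array([1, 0, 1, 1])
b = np.array([1, 1, 1, 0])

matches = (a == b)
print(matches)           # [ True False  True False]
print(np.mean(matches))  # 0.5
```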

def compute_accuracy(y_pred: np.ndarray, y_true: np.ndarray) -> float:
    """Compute classification accuracy.

    Args:
        y_pred: predicted labels of shape (n,)
        y_true: true labels of shape (n,)

    Returns:
        accuracy as a float
    """
    # TODO your code here
    return None

if __name__ == "__main__":
    # Test compute_accuracy
    assert compute_accuracy(np.array([1, 0, 1, 1]), np.array([1, 0, 0, 1])) == 0.75, "compute_accuracy() should return 0.75 for this example"

6.2 \(\lambda\) sweep and the ML process [0.75 pts]#

The cell below contains the entire ML process:

\[ \text{Data} \rightarrow \text{Features} \rightarrow \text{Model} \rightarrow \text{Train} \rightarrow \text{Evaluate} \]

For our training process, we saw in Activity 7 that regularization helps prevent overfitting, and we practiced finding a good value for the L2 regularization hyperparameter. Let’s tune the lam hyperparameter for our logistic regression model by trying several values and picking the one with the best test accuracy. We’ll try the following values, generated by np.logspace(-5, 0, 6):

\[ \lambda \in \{10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1\} \]
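A quick check of what np.logspace produces:

```python
import numpy as np

# np.logspace(-5, 0, 6) returns 6 values evenly spaced on a log scale,
# from 10**-5 up to 10**0
lams = np.logspace(-5, 0, 6)
print(lams)  # [1.e-05 1.e-04 1.e-03 1.e-02 1.e-01 1.e+00]
```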

Complete the code below to train and evaluate a model for each value of lam. Keep the following hyperparameters for your MHCLogisticRegressor models constant:

  • alpha=0.5

  • max_iters=2000

Runtime note

Each model may take 10-20 seconds to train depending on JupyterHub available resources.

from hw2_foundations import MHCLogisticRegressor

if __name__ == "__main__":
    # Data: load in the raw data
    income_df = pd.read_csv('~/COMSC-335/data/adult_reconstruction_shuffled.csv')

    # TODO Features: call your prepare_data() function to featurize your data
    income_features = None

    # TODO Data/Features: call your holdout_split() function to split your data into train and test sets
    # y_column='income_>50k', and train_frac=0.5
    X_train, X_test, y_train, y_test = None

    # We'll tune the lambda hyperparameter for our model
    lams = np.logspace(-5, 0, 6)

    
    best_acc = 0
    # Save the best model to analyze later
    best_model = None

    for lam in lams:
        # TODO Model: initialize your MHCLogisticRegressor model from part 1 
        # Use: alpha=0.5, lam=lam, max_iters=2000
        model = None

        # TODO Train: call your model's fit() method to train on the training data

        # TODO Evaluate: compute train and test accuracy using your compute_accuracy() function
        train_acc = 0
        test_acc = 0

        # if the current model has the highest test accuracy so far, save it
        if test_acc > best_acc:
            best_acc = test_acc
            best_model = model

        print(f"  lambda={lam:.5f} | Train accuracy: {train_acc:.4f} | Test accuracy: {test_acc:.4f}")

Briefly discuss the results of your lambda sweep and report the best value of lambda. What value of lambda gave the best test accuracy? Was this best value clearly better than the other values of lambda, or was there a range of values that performed similarly?

Your response: TODO

6.3 Error analysis [0.75 pts]#

A model’s overall accuracy only tells part of the story. To understand how a model is making mistakes, it is useful to examine the examples it gets wrong. In binary classification, there are two types of errors:

  • False positives: the model predicts income \(>\) $50k, but the true label is \(\leq\) $50k

  • False negatives: the model predicts income \(\leq\) $50k, but the true label is \(>\) $50k

Let’s use the pandas boolean indexing operations we practiced in Worksheet 2 to examine these errors. Below is starter code to perform this analysis. It saves the predictions from the best model (best_model) on the “test” portion of income_df and creates a test_df dataframe that contains the test set examples and the following columns:

  • predicted: the model’s predicted label (1 if income > 50k, 0 otherwise)

  • income_>50k: the true label (1 if income > 50k, 0 otherwise)

Complete the TODO below to create the false_pos and false_neg columns. You may need to re-run the cell above to make sure that the best_model variable is up to date.
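To see why the parentheses matter when combining pandas boolean masks, here is a toy example (the column names here are hypothetical, not the ones in test_df):

```python
import pandas as pd

# Toy predictions table (hypothetical values)
toy = pd.DataFrame({'predicted': [1, 0, 1, 0],
                    'label':     [0, 0, 1, 1]})

# & binds tighter than ==, so each comparison must be parenthesized;
# writing toy['predicted'] == 1 & toy['label'] == 0 would not do
# what you expect
mask = (toy['predicted'] == 1) & (toy['label'] == 0)
print(mask.tolist())  # [True, False, False, False]
```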

if __name__ == "__main__":
    income_df = pd.read_csv('~/COMSC-335/data/adult_reconstruction_shuffled.csv')

    # Convert income to binary target: 1 if income > 50k, 0 otherwise
    income_df['income_>50k'] = (income_df['income'] > 50000).astype(int)

    # Create the test portion of the dataframe based on the shape of the training set
    n_train = X_train.shape[0]
    test_df = income_df[n_train:].copy()

    # Save predictions from the best model to the "test" portion of the income_df dataframe into the "predicted" column
    # predicted = 1 if the model predicts income > 50k, 0 otherwise
    test_df['predicted'] = best_model.predict(X_test)

    # TODO create false_pos and false_neg columns
    # The "predicted" column contains the model's predictions, and the "income_>50k" column contains the true labels.
    # Make sure to wrap your boolean expressions in parentheses for correct evaluation.
    test_df['false_pos'] = None
    test_df['false_neg'] = None

if __name__ == "__main__":
    # test that false_pos and false_neg are correct
    false_pos_df = test_df[test_df['false_pos'] == 1]
    false_neg_df = test_df[test_df['false_neg'] == 1]
    assert np.all(false_pos_df['predicted'] == 1), "false_pos_df should only contain examples where the model predicted 1"
    assert np.all(false_pos_df['income_>50k'] == 0), "false_pos_df should only contain examples where the true label is 0"
    assert np.all(false_neg_df['predicted'] == 0), "false_neg_df should only contain examples where the model predicted 0"
    assert np.all(false_neg_df['income_>50k'] == 1), "false_neg_df should only contain examples where the true label is 1"

Now, pick one of the following categorical features to examine in the widget:

  • sex

  • race

  • education

  • occupation

  • workclass

if __name__ == "__main__":
    import sys; sys.path.insert(0, '..')
    from utils import explore_categorical_errors
    import ipywidgets as widgets

    categorical_cols = ['workclass', 'education', 'occupation', 'race', 'sex']
    error_types = ['False Positive Rate (%)', 'False Negative Rate (%)']

    widgets.interact(explore_categorical_errors, 
        # Tells the widget to use the test_df dataframe you created above
        df=widgets.fixed(test_df), 
        # Creates a dropdown for the column name
        column_name=categorical_cols,
        # Creates a dropdown for the error type
        error_type=error_types
    );

Use the widget to look at the false positive rate and false negative rate for each category within your chosen feature:

  • The false positive rate is the percentage of individuals whose actual income is \(\leq\) $50k that the model incorrectly predicts as \(>\) $50k.

  • The false negative rate is the percentage of individuals whose actual income is \(>\) $50k that the model incorrectly predicts as \(\leq\) $50k.
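As a small numeric illustration of these definitions, suppose the model made the following (made-up, not from the actual model) numbers of errors:

```python
# Made-up counts for illustration only
fp, tn = 30, 170   # 200 people actually <=50k; 30 wrongly flagged as >50k
fn, tp = 20, 80    # 100 people actually  >50k; 20 wrongly flagged as <=50k

# Each rate divides the errors by the size of the *actual* group
false_positive_rate = 100 * fp / (fp + tn)  # % of actual negatives mislabeled
false_negative_rate = 100 * fn / (fn + tp)  # % of actual positives mislabeled

print(false_positive_rate)  # 15.0
print(false_negative_rate)  # 20.0
```

Notice the two rates have different denominators, which is why a model can have a low false positive rate and a high false negative rate at the same time.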

6.3.1: For both the false positive rate and the false negative rate, compare that rate across the categories of your chosen feature. Are the rates similar across categories, or are there notable differences? Which categories have the highest rate, and which have the lowest?

Your response:

  • Chosen feature: TODO

  • False positive rate: TODO

  • False negative rate: TODO

6.3.2: Suppose this model was used to make a real-world decision: approving loan applications based on an income eligibility cutoff, where individuals with income above $50k are more likely to be approved and those below are more likely to be rejected.

Briefly discuss what the differences you observed above might mean for individuals’ chances of loan approval in the affected categories (~2-3 sentences). It’s okay to speculate here as long as you provide some rationale, as this is an open-ended question.

Note

You may notice that the types of errors the model makes are not evenly distributed across groups of individuals. In the upcoming weeks, we’ll discuss ML fairness and evaluation tools for investigating these kinds of disparities, interpreting them in context, and considering the real-world impacts of ML-assisted decision-making.

Your response:

TODO


7. Reflection [0.5 pts]#

  1. How much time did you spend on this assignment?

  2. Were there any parts of the assignment that you found particularly challenging?

  3. What is one thing you have a better understanding of after completing this assignment and going through the class content?

  4. Do you have any follow-up questions about concepts that you’d like to explore further?

  5. Indicate the number of late days (if any) you are using for this assignment.

TODO your responses here:

7.1:

7.2:

7.3:

7.4:

7.5:

How to submit

Follow the instructions on the course website to submit your work. For part 2, you will submit hw2_predictions.ipynb and hw2_predictions.py.