(Video) Lecture 21: ML engineering “from scratch”#

ML process

Our data comes from the Austin Animal Center, the largest no-kill animal shelter in the US.

Question and prediction task: The shelter wants to know, at the moment an animal walks in, which animals are at risk of a bad outcome (e.g. euthanasia, died, missing) versus a good outcome (e.g. adoption, transfer, return to owner). The goal is to help staff prioritize interventions: foster placement, medical care, or adoption listings.

We will see two new pandas concepts:

  1. pd.merge for combining two tables on a shared key

  2. Missingness with .isna() / .fillna(), plus one-hot indicators for missing categoricals

0. Imports#

We begin with our usual imports:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

1. Data#

We have two CSV files:

  • intakes.csv, one row per animal, with information recorded at intake time (species, sex, intake type, condition, age, breed, color)

  • outcomes.csv, one row per animal, with the outcome of their shelter stay (adopted, transferred, euthanized, etc.)

We’ll need to combine them to build our dataset.

intakes_df = pd.read_csv("~/COMSC-335/data/intakes.csv")
outcomes_df = pd.read_csv("~/COMSC-335/data/outcomes.csv")
outcomes_df.head()
  Animal ID     Outcome Type
0   A704961         Adoption
1   A676160         Transfer
2   A718092  Return to Owner
3   A892519         Adoption
4   A697724         Adoption
intakes_df.head()
  Animal ID Animal Type  Sex upon Intake      Intake Type Intake Condition Age upon Intake                   Breed        Color
0   A708396         Dog    Intact Female            Stray           Normal          1 year            Pit Bull Mix  Black/White
1   A735529         Dog    Spayed Female            Stray           Normal         3 years  Labrador Retriever Mix  Brown/White
2   A762369         Cat    Spayed Female  Owner Surrender           Normal         8 years  Domestic Shorthair Mix  Brown Tabby
3   A814930         Cat    Neutered Male  Owner Surrender           Normal         3 years   Domestic Longhair Mix  Brown Tabby
4   A796971         Dog          Unknown            Stray           Normal          1 week  Labrador Retriever Mix        Brown

New concept: pd.merge#

Pandas has a merge() function (also available as a DataFrame method) that works like a SQL join. Given two DataFrames and a shared key column, it lines up rows whose keys match.

# Toy example: left table has features, right table has labels.
left_df = pd.DataFrame({
    "id": ["A", "B", "C", "D"],
    "color": ["red", "blue", "green", "red"],
})

right_df = pd.DataFrame({
    "id": ["A", "B", "C"],
    "label": [1, 0, 1],
})
left_df.merge(right_df, on="id", how="left")
  id  color  label
0  A    red    1.0
1  B   blue    0.0
2  C  green    1.0
3  D    red    NaN

The how="inner" argument keeps only rows where the key appears in both tables, so that row "D" disappears from the output.

The full merge options are:

  • "inner": keep only rows where the key appears in both tables

  • "left": keep all rows from the left table, and only matching rows from the right table

  • "right": keep all rows from the right table, and only matching rows from the left table

  • "outer": keep all rows from both tables, and fill missing values with NaN

For our shelter data we want inner: we only care about animals where we know both the intake info and the final outcome. Let’s merge on Animal ID:

df = intakes_df.merge(outcomes_df, on="Animal ID", how="inner")
intakes_df.shape
(20000, 8)
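The merged df can be smaller than intakes_df, since the inner join silently drops any animal that has no row in outcomes.csv. A quick way to see how many intakes have a matching outcome (a sketch; output not shown):

# True = this intake's Animal ID also appears in outcomes.csv
intakes_df["Animal ID"].isin(outcomes_df["Animal ID"]).value_counts()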

2. Exploring the data and building our target#

The merged table has the features and the labels, but our target doesn’t exist as a clean binary column yet, so we have to build it. Look at the outcome column first:

df['Outcome Type'].value_counts(dropna=False)
Outcome Type
Adoption           9239
Transfer           5691
Return to Owner    2487
Euthanasia         1226
Died                196
Rto-Adopt           129
Disposal             90
Missing              10
NaN                   4
Relocate              2
Lost                  1
Name: count, dtype: int64

Building the binary target#

We’ll call an outcome “bad” if the animal was euthanized, died, disposed of, reported missing, or lost. Everything else (adopted, transferred to a rescue, returned to owner, relocated) we’ll call “good”.

Like other class labels we’ve seen, this can be subjective! “Transferred” could mean a no-kill rescue partner (good) or another crowded facility (less good).

Tip

Pressing shift-tab in the notebook will (sometimes) show the docstring for the object under your cursor.

Now we build the target. One subtlety: .isin() returns False for the four rows with a missing Outcome Type, so they land in class 0 rather than being flagged:

BAD_OUTCOMES = ["Euthanasia", "Died", "Missing", "Disposal", "Lost"]

df['bad_outcome'] = df["Outcome Type"].isin(BAD_OUTCOMES).astype(int)
df['bad_outcome'].value_counts(normalize=True)
bad_outcome
0    0.920157
1    0.079843
Name: proportion, dtype: float64
sns.countplot(data=df, x="bad_outcome")
<Axes: xlabel='bad_outcome', ylabel='count'>
[Figure: bar chart of bad_outcome counts; class 0 far outnumbers class 1]

Missingness in data#

In practice, most real datasets have gaps. Pandas represents missing values as NaN. Two quick helpers for spotting them:

  • df.isna() returns a boolean DataFrame the same shape as df, with True wherever a value is missing

  • df.isna().sum() gives you a per-column count of missing values (since True sums as 1)

When we build X, we also pass dummy_na=True to pd.get_dummies so missing categoricals get explicit _nan columns instead of only all-zero level dummies.

toy_df = pd.DataFrame({
    "name":  ["Bella", "Speedy", "King"],
    "age":   [3, None, 7],
    "breed": ["Lab", "Tabby", None],
})

toy_df.isna().sum()
name     0
age      1
breed    1
dtype: int64
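The intro listed .fillna() alongside .isna(), so here is a minimal sketch on the toy table, imputing the missing age with the median of the observed ages (5.0, the midpoint of 3 and 7):

# Fill the one missing age with the median of the non-missing ages
toy_df["age"].fillna(toy_df["age"].median())
0    3.0
1    5.0
2    7.0
Name: age, dtype: float64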
Back to the real data:

df.isna().sum()
Animal ID              0
Animal Type            0
Sex upon Intake        0
Intake Type            0
Intake Condition    1128
Age upon Intake        0
Breed                  0
Color                  0
Outcome Type           4
bad_outcome            0
dtype: int64

So Intake Condition is missing in about 6% of rows:

1128 / df.shape[0]
0.05913499344692005
df["Intake Condition"].unique()
array(['Normal', 'Sick', 'Injured', 'Nursing', nan, 'Neonatal', 'Medical',
       'Behavior', 'Aged', 'Feral', 'Other', 'Unknown', 'Pregnant',
       'Parvo', 'Med Attn', 'Neurologic', 'Med Urgent'], dtype=object)

One last look before we build features: is Intake Condition actually predictive of outcome?

df.groupby("Intake Condition")["bad_outcome"].mean().sort_values()
Intake Condition
Behavior      0.000000
Med Attn      0.000000
Neurologic    0.000000
Parvo         0.000000
Pregnant      0.000000
Neonatal      0.042254
Normal        0.046188
Feral         0.058824
Nursing       0.062084
Medical       0.062500
Other         0.105263
Aged          0.163265
Injured       0.313351
Sick          0.380952
Med Urgent    0.500000
Unknown       0.500000
Name: bad_outcome, dtype: float64
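Caution: the extreme rates at the top and bottom of this list come from rare conditions with only a handful of animals, so they may be noise. It is worth checking the group sizes too (a sketch; counts not shown):

# How many animals have each intake condition?
df["Intake Condition"].value_counts(dropna=False)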
df.head()
  Animal ID Animal Type  Sex upon Intake      Intake Type Intake Condition Age upon Intake                   Breed        Color     Outcome Type  bad_outcome
0   A708396         Dog    Intact Female            Stray           Normal          1 year            Pit Bull Mix  Black/White         Adoption            0
1   A735529         Dog    Spayed Female            Stray           Normal         3 years  Labrador Retriever Mix  Brown/White  Return to Owner            0
2   A762369         Cat    Spayed Female  Owner Surrender           Normal         8 years  Domestic Shorthair Mix  Brown Tabby         Adoption            0
3   A814930         Cat    Neutered Male  Owner Surrender           Normal         3 years   Domestic Longhair Mix  Brown Tabby         Adoption            0
4   A796971         Dog          Unknown            Stray           Normal          1 week  Labrador Retriever Mix        Brown         Transfer            0

3. Features and train/test split#

Now we build the feature matrix X and label vector y.

Important: we can only use information available at intake time.

We’ll leave out Breed and Color: they have far too many unique values for get_dummies to handle sensibly, as the quick check below shows.
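A sketch of the cardinality check (counts not shown):

# Distinct values per column; Breed and Color dwarf the others
df[["Animal Type", "Intake Type", "Breed", "Color"]].nunique()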

feature_cols = [
    'Animal Type',
    'Sex upon Intake',
    'Intake Type',
    'Intake Condition',
    'Age upon Intake',
]

X_df = df[feature_cols].copy()
y = df['bad_outcome'].copy()

Parsing age#

Age upon Intake is a string like "2 years" or "3 weeks", but ML models want numbers.

We’ll write a small helper that parses any of these formats into days, then use .apply() to run it over the whole column:

df['Age upon Intake'].unique()
array(['1 year', '3 years', '8 years', '1 week', '5 months', '3 weeks',
       '4 years', '1 weeks', '1 month', '2 years', '8 months', '4 weeks',
       '6 months', '9 years', '7 months', '5 years', '10 years',
       '3 months', '2 months', '4 days', '1 day', '4 months', '13 years',
       '7 years', '11 years', '2 weeks', '6 years', '10 months', '2 days',
       '3 days', '9 months', '5 weeks', '6 days', '18 years', '17 years',
       '0 years', '19 years', '11 months', '12 years', '5 days',
       '16 years', '14 years', '15 years', '22 years'], dtype=object)
unit_to_days = {
    'day': 1,
    'week': 7,
    'month': 30,   # approximation; close enough for a feature
    'year': 365,   # ignores leap years
}

print("1 year".split())
print("years".rstrip("s"))

def parse_age(age_str: str) -> int:
    # tuple unpack the number and the unit
    num, unit = age_str.split()
    unit = unit.rstrip("s")

    return int(num) * unit_to_days[unit]

# some test cases
assert parse_age("1 year") == 365
assert parse_age("5 months") == 5 * 30
['1', 'year']
year
X_df["age_days"] = X_df["Age upon Intake"].apply(parse_age)
X_df = X_df.drop(columns=["Age upon Intake"])
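parse_age assumes every value matches "&lt;number&gt; &lt;unit&gt;[s]", which holds for this column. If the format ever drifts, a more defensive variant (a sketch, not used here; it reuses the unit_to_days mapping above) returns NaN instead of raising:

def parse_age_safe(age_str) -> float:
    # Same conversion as parse_age, but NaN on anything unexpected
    try:
        num, unit = str(age_str).split()
        return int(num) * unit_to_days[unit.rstrip("s")]
    except (ValueError, KeyError):
        return float("nan")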

One-hot encoding the categorical columns#

Let’s look at the four categorical columns we’ll use:

  • Animal Type

  • Sex upon Intake

  • Intake Type

  • Intake Condition

categorical_cols = ["Animal Type", "Sex upon Intake", "Intake Type", "Intake Condition"]

Each of these columns has a small number of observed levels, so pd.get_dummies can turn them into one indicator column per level. We also set dummy_na=True so any NaN becomes an explicit missing-indicator column (for example Intake Condition_nan) rather than an all-zero row of dummies.

X_df = pd.get_dummies(X_df, columns=categorical_cols, dummy_na=True)
X_df.head()
age_days Animal Type_Bird Animal Type_Cat Animal Type_Dog Animal Type_Livestock Animal Type_Other Animal Type_nan Sex upon Intake_Intact Female Sex upon Intake_Intact Male Sex upon Intake_Neutered Male ... Intake Condition_Neonatal Intake Condition_Neurologic Intake Condition_Normal Intake Condition_Nursing Intake Condition_Other Intake Condition_Parvo Intake Condition_Pregnant Intake Condition_Sick Intake Condition_Unknown Intake Condition_nan
0 365 False False True False False False True False False ... False False True False False False False False False False
1 1095 False False True False False False False False False ... False False True False False False False False False False
2 2920 False True False False False False False False False ... False False True False False False False False False False
3 1095 False True False False False False False False True ... False False True False False False False False False False
4 7 False False True False False False False False False ... False False True False False False False False False False

5 rows × 37 columns
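As a sanity check, the new missing-indicator column should line up with the 1128 missing Intake Condition values we counted earlier (a sketch; output not shown):

# Should equal 1128, the NaN count from df.isna().sum()
X_df["Intake Condition_nan"].sum()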

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_df, y, test_size=0.2, random_state=42, stratify=y)
print(X_test.shape)
(3815, 37)
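Because we passed stratify=y, both splits should preserve the ~8% bad-outcome rate; a quick sketch to confirm (output not shown):

# Both means should be close to the overall rate of ~0.08
print(y_train.mean(), y_test.mean())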

4. Model, train, evaluate#

A first baseline: how far does one feature take us?#

Before we throw all our features at a model, let’s see how well a single feature does.

We’ll fit a logistic regression on age_days alone and measure ROC-AUC on the test set.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe_lr_age = Pipeline([
    ("scaler", StandardScaler()),
    ("lr", LogisticRegression(max_iter=1000, random_state=42))
])

pipe_lr_age.fit(X_train[["age_days"]], y_train)

lr_age_scores = pipe_lr_age.predict_proba(X_test[["age_days"]])[:, 1]
roc_auc_score(y_test, lr_age_scores)
0.6143687824015693

Adding all the intake features#

We’ll rerun the same logistic regression, but now give it every feature we built in section 3.

pipe_lr_all = Pipeline([
    ("scaler", StandardScaler()),
    ("lr", LogisticRegression(max_iter=1000, random_state=42))
])

pipe_lr_all.fit(X_train, y_train)

lr_all_scores = pipe_lr_all.predict_proba(X_test)[:, 1]
roc_auc_score(y_test, lr_all_scores)
0.906825463546775

Another model: Random Forest#

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=200, max_depth=10, random_state=42)
rf.fit(X_train, y_train)

rf_scores = rf.predict_proba(X_test)[:, 1]
roc_auc_score(y_test, rf_scores)
0.9096567185091776

Comparing all three models with ROC curves#

Let’s put them on one ROC plot to see the rankings side by side.

from sklearn.metrics import roc_curve
fig, ax = plt.subplots()

models = {
    "Logistic regression, age only": lr_age_scores,
    "Logistic regression, all features": lr_all_scores,
    "Random forest, all features": rf_scores,
}

for name, scores in models.items():
    fpr, tpr, _ = roc_curve(y_test, scores)
    auc = roc_auc_score(y_test, scores)
    sns.lineplot(x=fpr, y=tpr, ax=ax, label=f"{name}: {auc:.3f}")
[Figure: ROC curves for the three models, with each AUC shown in the legend]

5. Reproducibility check#

An important code-hygiene habit: every time you finish a notebook, restart the kernel and run all cells from top to bottom. This catches:

  1. Cell-order bugs. If you define a variable in cell 12, then scroll back up and modify it in cell 5, the notebook no longer reproduces your results when run as a linear script.

  2. Hidden randomness. If you forgot to set random_state somewhere, restart + run all lets you verify the final metric is stable across runs.

Keep random_state=42 on every step that involves randomness (train_test_split, LogisticRegression, RandomForestClassifier) so the results are reproducible.

Use Kernel -> Restart Kernel and Run All Cells to rerun the notebook from top to bottom.
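The wrap-up below mentions an assertion on the final metric; a minimal sketch (the expected value is the Random Forest AUC from the run above, and the tolerance is our choice):

# After Restart + Run All, this passes only if the result reproduces
assert abs(roc_auc_score(y_test, rf_scores) - 0.9097) < 1e-3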

Wrap-up#

  1. Data: two raw CSVs, merged with pd.merge on a shared key.

  2. Features: target engineering (collapsing outcome categories into binary), handling missingness, categorical encoding with pd.get_dummies, custom parsing for the age string.

  3. Model + Train + Evaluate: three models compared on ROC-AUC: a one-feature LR, a full-feature LR, and a Random Forest.

  4. Reproducibility: seeds, restart and run all, assertion on the final metric.