Worksheet 2#
Datasets and Probability
—TODO your name here
Collaboration Statement
TODO brief statement on the nature of your collaboration.
TODO your collaborators’ names here.
Learning Objectives#
Learn about pandas and seaborn for dataset manipulation and visualization.
Practice with probability concepts needed for the course:
Discrete random events and expectation
Contingency tables and conditional probabilities
Familiarization with broadcasting and axis operations in numpy.
1. Pandas for tabular data [2 pts]#
Pandas is the de facto Python framework for working with tabular data, and it is supported by a large ecosystem of libraries, including integration with NumPy. Pandas provides two main data structures:
DataFrame: a 2-dimensional data structure often used to represent a table with rows and named columns. We can think of a DataFrame as a 2D numpy array with named columns.
Series: a 1-dimensional, labelled array, often used to represent a single column or row in a DataFrame.
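As a minimal sketch (with made-up values), we can build a DataFrame from a dictionary of columns and see that selecting a single column gives back a Series:

```python
import pandas as pd

# A tiny DataFrame built from a dictionary of columns (made-up values)
toy_df = pd.DataFrame({
    'age': [25, 40, 67],
    'education': ['Bachelors', 'Masters', 'HS-grad'],
})

# Selecting a single column of a DataFrame returns a Series
ages = toy_df['age']
print(type(toy_df).__name__)   # DataFrame
print(type(ages).__name__)     # Series
print(ages.to_numpy())         # [25 40 67] -- the underlying NumPy array
```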
We will primarily be using pandas to load in and analyze datasets before we pass them into our machine learning models.
To familiarize ourselves with working with the pandas library, we will load US Census data provided by the folktables package and perform some fundamental pandas operations.
From the creators of folktables:
Folktables is a Python package that provides access to datasets derived from the US Census, facilitating the benchmarking of machine learning algorithms. The package includes a suite of pre-defined prediction tasks in domains including income, employment, health, transportation, and housing, and also includes tools for creating new prediction tasks of interest in the US Census data ecosystem. The package additionally enables systematic studies of the effect of distribution shift, as each prediction task can be instantiated on datasets spanning multiple years and all states within the US.
Why the name? Folktables is a neologism describing tabular data about individuals. It emphasizes that data has the power to create and shape narratives about populations and challenges us to think carefully about the data we collect and use.
We will study the context and history around machine learning research using US Census data in the upcoming weeks, but for now, let’s take a look at one particularly prominent dataset: the ACS (American Community Survey) Income dataset, which contains socioeconomic data about individuals in the US.
import numpy as np
# The standard import idiom for pandas
import pandas as pd
# Load in the ACS Income dataset
income_df = pd.read_csv('~/COMSC-335/data/adult_reconstruction.csv')
Note
To load data from files manually, pandas provides various pd.read_* functions. For example, pd.read_csv loads comma-separated values (CSV) files. See pandas’ I/O documentation for more options.
We often suffix variable names with _df to make it clear that we are working with a dataframe.
For high-level inspection of the dataframe, we can use the following functions and attributes:
df.head(): returns the first 5 rows of the dataframe
df.tail(): returns the last 5 rows of the dataframe
df.info(): returns a summary of the dataframe, including the number of rows, columns, and the data types of each column
df.columns: returns the column names of the dataframe
df.shape: returns the number of rows and columns in the dataframe
df.dtypes: returns the data types of each column
You can play around with the dataframe in the cell below:
income_df.head()
# Like numpy, pandas provides a `shape` attribute that returns a tuple of (num rows, num columns)
income_df.shape
From above, we see that the dataframe has 49,531 rows and 14 columns. Here, each row represents an individual, and each column represents a feature of the individual. The machine learning task is to predict the income of an individual based on the other features.
Taking a look at the columns, we see that there are a wide range of demographic and occupational features collected:
income_df.columns
Column selection and filtering#
To select a single column, we can use square bracket indexing with the name of the column:
# Selects the 'education' column and prints the first 5 rows
income_df['education'].head()
When initially exploring a dataset, it is often useful to see the unique values in a column:
# Get the unique values in the 'education' column
income_df['education'].unique()
The square bracket indexing can be generalized to selecting multiple columns by passing a list of column names:
# Selects multiple columns and prints the last 10 rows
cols = ['education', 'age', 'hours-per-week']
income_df[cols].tail(10)
We can also remove columns by using the drop method:
# Drop the 'relationship' column
income_df = income_df.drop(columns=['relationship'])
Tip
Operations that modify the dataframe will return a new dataframe with the changes, and the original dataframe will not be modified. So if we want to modify the dataframe in place, we need to assign the result back to the original variable:
income_df = income_df.drop(columns=['relationship'])
Just like NumPy, we can also use boolean indexing to select portions of the dataframe based on a condition:
# Selects individuals who are below the age of 30
sel_df = income_df[income_df['age'] < 30]
sel_df['age'].unique()
We can then use the value_counts function to get the frequency of each category in a column:
# Print the value counts of the 'workclass' column for individuals below the age of 30
print(sel_df['workclass'].value_counts())
# Normalize=True to get the proportion of each category
print(sel_df['workclass'].value_counts(normalize=True))
These boolean conditions can be combined using the & (AND), | (OR), and ~ (NOT) operators. Additionally, there are some special functions that can be used to select data based on a condition:
# Combining conditions: respondents who are less than 30 years old AND have non-zero capital-gain
income_df[(income_df['age'] < 30)
& (income_df['capital-gain'] > 0)]
# isin(): select rows where a column is in a list of values
income_df[income_df['workclass'].isin(['Local-gov', 'State-gov'])]
Tip
To avoid errors, always use parentheses when combining conditions:
Incorrect:
df[column1 == value1 & column2 == value2]
Correct:
df[(column1 == value1) & (column2 == value2)]
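The parentheses matter because Python’s & operator binds more tightly than comparisons like ==, so the unparenthesized version is grouped incorrectly. A small sketch with a made-up dataframe (hypothetical columns a and b):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 3]})

# Correct: parenthesize each comparison before combining with &
sel = df[(df['a'] == 1) & (df['b'] == 3)]
print(len(sel))  # 1 -- only the first row satisfies both conditions

# Incorrect: df['a'] == 1 & df['b'] == 3 is grouped as a chained
# comparison involving (1 & df['b']), which ends up asking for the
# truth value of a whole Series and raises a ValueError
try:
    df[df['a'] == 1 & df['b'] == 3]
except ValueError as e:
    print('ValueError:', e)
```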
1.1. Let’s practice combining these operations. Select respondents who:
work full-time, defined as hours-per-week >= 40, AND
have an education level in ['Bachelors', 'Masters', 'Doctorate']
Within this group, compute the counts of occupation using value_counts(). This column indicates the type of job the respondent has.
How many individuals are in this group? What is the most common occupation in this group?
Your response: TODO
# TODO your code here (make sure to use income_df for the dataframe)
full_time_at_least_bachelors_df = None
Solution (click to check once you’ve completed 1.1)
You should see that the group of respondents who work full-time and have an education level defined above has 9,463 individuals. The most common occupation in this group is Prof-specialty with 3,287 individuals, which corresponds to some professional specialization role (not an academic professor).
We can also use boolean logic to create new columns. For example, a common age cutoff used in US policy studies is 65, as it is the age that corresponds to “seniors” and is when people are eligible for Medicare health insurance:
# Create a new column 'is_senior' that is 1 if the respondent is 65 or older, and 0 otherwise
# .astype(int) converts the boolean values into 0s and 1s
income_df['is_senior'] = (income_df['age'] >= 65).astype(int)
income_df['is_senior'].value_counts()
We’ll commonly apply this transformation to a continuous variable to create a binary indicator column, such as when we want to create a \(y\) column for a binary classification task.
1.2 Complete the binarize_column function below. This function takes a pandas Series and a cutpoint as arguments, and returns a Series with 1s and 0s indicating whether each element of the input series is greater than the cutpoint:
def binarize_column(column: pd.Series, cutpoint: float) -> pd.Series:
"""
Binarizes a column based on a cutpoint.
Args:
column (pd.Series): The column to binarize.
cutpoint (float): The cutpoint to use for binarization.
Returns:
pd.Series: A column with 1s and 0s indicating whether each element of the input series is greater than the cutpoint.
"""
# TODO your code here
pass
#### Test binarize_column ####
if __name__ == "__main__":
df = pd.DataFrame({'A': [1, 2, 3, 4, 5]})
df['A_bin'] = binarize_column(df['A'], 2.5)
assert df.equals(pd.DataFrame({'A': [1, 2, 3, 4, 5], 'A_bin': [0, 0, 1, 1, 1]})), "Binarization is incorrect"
Grouping and aggregation#
The groupby() operation is a powerful tool for performing aggregations on subsets of the dataframe. We pass in one or more columns as the by argument, which then divides the original Dataframe based on the unique values of the column(s). We then often apply an aggregation function to each group, resulting in a new dataframe.
You can think of it as:
split the dataframe into groups (based on unique values of a column), then
apply an aggregation (like
mean,median,size, etc.) to each group.
We can replicate value_counts() with groupby() and the size() aggregation function:
# Count number of people in each workclass category
income_df.groupby(by='workclass').size()
We can also compute summary statistics like mean(), std(), median(), min(), max() on columns after grouping:
# First, group by the 'workclass' column
# Then, see the average number of work hours per week for each group
income_df.groupby('workclass')['hours-per-week'].mean()
To apply multiple aggregation functions, we can pass in a dictionary of columns as keys and functions as values to agg():
# we want to apply the following aggregations:
agg_dict = {
# mean, min, max for hours-per-week
'hours-per-week': ['mean', 'min', 'max'],
# median for education_num
'education-num': ['median']
}
# apply the aggregations, grouping by 'workclass'
income_df.groupby('workclass').agg(agg_dict)
1.3. Compute a summary table that groups by the is_senior column and then computes the following aggregations:
mean and median income
mean and median capital-gain
What is the mean income for seniors? Do seniors have more or less mean capital-gain than non-seniors?
Your response: TODO
# TODO your code here
Solution (click to check once you’ve completed 1.3)
The groupby aggregation table should show that seniors have a mean income of about $29,670. Seniors have a higher mean capital gain of $1,877, compared to non-seniors with a mean capital gain of $1,029.
Processing categorical variables with get_dummies#
Many machine learning models expect purely numeric or binary features, but we’ve seen so far that our dataset has a number of categorical features that are encoded as strings.
A common step taken to prepare data is one-hot encoding, where we convert a categorical column like workclass into a set of binary indicator columns: a new column is generated for each category within the column, called a dummy variable.
In pandas, there is the pd.get_dummies function that can be used to perform this transformation.
Let’s first see what the last 5 rows of the workclass column look like:
display(income_df['workclass'].tail())
Then, we’ll use pd.get_dummies to convert the workclass column into a set of binary indicator columns, one for each category in the column:
# Generate a new dataframe with binary columns for each category in 'workclass'
# dtype=int converts the boolean values into 0s and 1s
workclass_dummy_df = pd.get_dummies(
data=income_df,
columns=['workclass'],
dtype=int
)
display(workclass_dummy_df)
Notice how in the workclass_dummy_df, each row has a 1 in the workclass_category column if the original workclass column had that category, and 0 otherwise. The columns argument can also be a list of columns to encode.
1.4. Use pd.get_dummies() to one-hot encode the occupation column in the income_df dataframe.
What is the shape of the new dataframe compared to the original income_df? What does that tell us about the number of categories in the occupation column?
Your response: TODO
# TODO your code here
occupation_dummy_df = None
display(occupation_dummy_df.head())
Solution (click to check once you’ve completed 1.4)
The shape of occupation_dummy_df is (49531, 28), which has the same number of rows as income_df but 15 more columns. This tells us that there are 15 unique categories in the occupation column. We can also check this by using the nunique() function: income_df['occupation'].nunique()
More on pandas
If you’d like to learn more about pandas operations, see this quickstart guide and the associated links within it.
2. Seaborn for data visualization [1 pt]#
seaborn is one of the most popular libraries for creating visualizations in Python; it is a higher-level library built on top of the more fundamental matplotlib library.
We’ll use seaborn to visualize some relationships between variables in our income dataset.
First, let’s import seaborn using its standard import idiom, which abbreviates the name to sns:
import seaborn as sns
Many seaborn plots follow a similar argument pattern, where we often pass in the following arguments:
data: the dataframe to plot
x: the column to plot on the x-axis
y: the column to plot on the y-axis, if applicable
hue: the column to use for color-coding the points
Seaborn has tight integration with pandas, which allows us to use pandas to filter and manipulate the data before passing it to seaborn. Let’s generate a sns.histplot of income for individuals above the age of 30 who work in the private sector:
if __name__ == "__main__":
# Select out data for individuals above the age of 30 who work in the private sector
above_30_private_df = income_df[
(income_df['age'] >= 30)
& (income_df['workclass'] == 'Private')
]
# Generate the histogram with:
    # Data: the above_30_private_df dataframe
# x-axis: 'income'
# y-axis: is the count of individuals in each bin so it does not need to be specified
sns.histplot(data=above_30_private_df, x='income')
2.1. You should see that the distribution tails off to the right, with a large mass of individuals at the $100k mark. Speculate on why you think this mass is at $100k, keeping in mind that this data is generated from a survey of individuals in the US:
Your response: TODO
The most common machine learning task associated with this dataset is a binary classification task, where we predict whether the income of an individual is greater than or equal to $50k based on the given features. This dataset in particular has been extensively used historically as a benchmark for evaluating the fairness of machine learning models, such as whether the model is biased towards certain groups of individuals. We’ll examine aspects of fairness in the upcoming weeks, but first let’s get a sense of the data. Run the cell below to create a column that is a binary indicator of income >= $50k:
income_df['income_>50k'] = income_df['income'] >= 50000
Complete the cell below to generate a histplot of age with the following parameters:
data=income_df: use the income_df dataframe for the plot
x='age': plot the age column on the x-axis
if __name__ == "__main__":
# TODO your code here
pass
2.2 Now, add the following parameters to your plot above:
hue='income_>50k': colors the histogram bars by the income_>50k column
multiple='stack': stack the colored histograms on top of each other
Briefly describe what you see in the relationship between age and income_>50k in the plot above. Some questions to consider:
Are there ages where there are very few individuals with incomes > $50k?
Is there a particular age, or age range where the proportion of individuals with incomes > $50k is higher?
Your response: TODO
Observations (click to check once you’ve completed 2.2)
In this data, you should see that there are very few individuals with incomes > $50k with ages <20, as well as in the ~75-80 age range.
There is a peak in the proportion of individuals with incomes > $50k around age 47, and generally it looks like the older the individual, the more likely they are to have an income > $50k up until ~47, where it begins to slowly decline.
The top 4 most common occupations are (How would you check this using the pandas commands we saw earlier?):
Craft-repair
Prof-specialty
Exec-managerial
Adm-clerical
Let’s see how the proportion of individuals with incomes > $50k varies by these top 4 occupations with an sns.countplot, which shows counts or percentages across categories.
2.3 Complete the code below with the following parameters passed to countplot:
data=top4_occupations_df: use the top4_occupations_df dataframe for the plot
x='occupation': plot the occupation column on the x-axis
hue='income_>50k': colors the bars by the income_>50k column
stat='percent': show the percentage of individuals in each bar on the y-axis
Which occupations seem to be most “predictive” of income > $50k?
Your response: TODO
if __name__ == "__main__":
top4_occupations = [
'Craft-repair',
'Prof-specialty',
'Exec-managerial',
'Adm-clerical',
]
# TODO select the rows in income_df where the 'occupation' column is in the top4_occupations list
top4_occupations_df = None
Observations (click to check once you’ve completed 2.3)
The Exec-managerial and Prof-specialty occupations have the highest percentage of individuals with incomes > $50k, while the Adm-clerical and Craft-repair occupations have the lowest.
More on seaborn
The following official resources are good references for seaborn if you’d like to learn more:
3. Probability primer [1 pt]#
We’re now in the process of moving into machine learning classification, where the goal is to predict categories instead of continuous values. For this, we’ll need some fundamental concepts from probability.
Chance events#
Probability is the mathematical framework that allows us to reason about randomness and chance. We often want to reason about discrete events that happen, such as whether a coin flip comes up heads or tails, whether it will rain tomorrow, or whether a machine learning model trained to recognize cats identifies an image as a cat or not.
For all of these situations, we assign a probability to an event \(A\), written \(P(A)\). For example, a fair coin has a 50% chance of landing heads and a 50% chance of landing tails, so:

\[P(\text{heads}) = 0.5, \quad P(\text{tails}) = 0.5\]
If we assign heads and tails to be numeric outcomes, e.g. \(\text{heads} = 1\) and \(\text{tails} = 0\), then the coin flip can be thought of as a random variable \(Y\).
In order for something to be considered a valid (discrete) random variable, the probability of each event must be between 0 and 1, and the sum of the probabilities of all events must equal 1. In the case of a fair coin, \(P(Y=1) + P(Y=0) = 0.5 + 0.5 = 1\).
We most frequently work in situations where the random variable is binary, for example, \(Y = 1\) if an image contains a cat and \(Y = 0\) otherwise.
3.1: Suppose that we have a dataset of 1000 images, where 300 of them have cats in them. What is \(P(Y=0)\), that is, the probability that an image does not have a cat?
Your response: TODO
Expectation#
The expectation of a random variable \(Y\) is the average value of \(Y\) over all possible outcomes. It is given by:

\[E[Y] = \sum_{y \in \mathcal{Y}} y \cdot P(Y=y)\]
where \(\mathcal{Y}\) is the set of all possible outcomes for \(Y\). In the case of binary random variables, \(\mathcal{Y} = \{0, 1\}\) for the two possible outcomes.
3.2: Continuing from our cat picture example above, what is the expectation \(E[Y]\)?
Solutions (click to check once you’ve completed 3.1-3.2)
3.1: \(P(Y=0) = 1 - P(Y=1) = 1 - 0.3 = 0.7\)
3.2: \(E[Y] = 0 \cdot P(Y=0) + 1 \cdot P(Y=1) = 0 \cdot 0.7 + 1 \cdot 0.3 = 0.3\)
An identity that is often used is that \(E[Y] = P(Y=1)\) for binary random variables.
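We can sanity-check this identity numerically: for an array of 0/1 outcomes, the mean (the empirical expectation) is exactly the fraction of 1s. A quick sketch using the cat-image numbers from above:

```python
import numpy as np

# 1000 images, 300 with cats: encode Y as a 0/1 label per image
y = np.array([1] * 300 + [0] * 700)

p_y1 = np.mean(y == 1)   # empirical P(Y=1): fraction of 1s
e_y = np.mean(y)         # empirical E[Y]: average of the 0/1 outcomes
print(p_y1, e_y)         # 0.3 0.3 -- equal, as the identity says
```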
Joint and conditional probabilities#
We also often want to reason about the probability of two random variables occurring together, called a joint probability. If the two random variables \(Y\) and \(Z\) are binary, we can represent the joint probability using a 2x2 contingency table:
Each entry in the contingency table is a count of the number of times the corresponding event occurs. For example, the entry in the first row and first column is the count of the number of times \(Y=0\) and \(Z=0\) occur together.
Let \(N\) be the total count of all the entries in the contingency table. We can take the marginal probability of \(Y\) by summing over the columns of the contingency table:

\[P(Y=y) = \frac{1}{N} \sum_{z} \text{count}(Y=y, Z=z)\]

Similarly, we can take the marginal probability of \(Z\) by summing over the rows of the contingency table:

\[P(Z=z) = \frac{1}{N} \sum_{y} \text{count}(Y=y, Z=z)\]

The joint probability of two events \(Y\) and \(Z\) occurring together, \(P(Y=y, Z=z)\), is given by the entry in the contingency table for the corresponding row and column, divided by \(N\). For example, if \(Y=0\) and \(Z=1\), then:

\[P(Y=0, Z=1) = \frac{\text{count}(Y=0, Z=1)}{N}\]

The conditional probability of \(Y\) given \(Z\), \(P(Y=y \mid Z=z)\), is given by the joint probability divided by the marginal probability of \(Z\):

\[P(Y=y \mid Z=z) = \frac{P(Y=y, Z=z)}{P(Z=z)}\]
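These computations map directly onto NumPy axis operations. A sketch with a made-up 2x2 table of counts (not the cat data from this worksheet):

```python
import numpy as np

# A made-up 2x2 contingency table of counts:
# rows index Y (0, 1), columns index Z (0, 1)
counts = np.array([[40, 10],
                   [20, 30]])
N = counts.sum()                # total count: 100

# Marginals: sum over the other variable's axis, then divide by N
p_y = counts.sum(axis=1) / N   # [P(Y=0), P(Y=1)] = [0.5, 0.5]
p_z = counts.sum(axis=0) / N   # [P(Z=0), P(Z=1)] = [0.6, 0.4]

# Joint probability of one cell, e.g. P(Y=0, Z=1)
p_y0_z1 = counts[0, 1] / N     # 0.1

# Conditional: P(Y=1 | Z=1) = P(Y=1, Z=1) / P(Z=1)
p_y1_given_z1 = (counts[1, 1] / N) / p_z[1]   # 0.3 / 0.4 = 0.75
print(p_y1_given_z1)
```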
Suppose we also know that some of our 1000 images also contain a cardboard box. We can represent this as a new random variable \(Z\):
Fig. 1 A picture where \(Z=1\) and \(Y=1\). Source: Uni the cat#
We then have the following contingency table for our 1000 images:

         Z=0    Z=1
Y=0      600    100
Y=1      100    200
3.3: Compute the following probabilities:
\(P(Z=0) = TODO\)
\(P(Z=1) = TODO\)
\(P(Y=1 \mid Z=0) = TODO\)
\(P(Y=1 \mid Z=1) = TODO\)
3.4 Given your answers to 3.3, does the presence or the absence of a cardboard box seem to be more predictive of whether an image has a cat in it?
Your response: TODO
Solutions (click to check once you’ve completed 3.3 and 3.4)
\(P(Z=0) = 0.7\)
\(P(Z=1) = 0.3\)
\(P(Y=1 \mid Z=0) = 1/7 \approx 0.1429\)
\(P(Y=1 \mid Z=1) = 2/3 \approx 0.6667\)
To assess “predictiveness”, we can look at the conditional probabilities of \(Y\) given \(Z\). The presence of a cardboard box seems to be more predictive of whether an image has a cat in it, as \(P(Y=1 \mid Z=1) > P(Y=1 \mid Z=0)\). Additionally, \(P(Y=1 \mid Z=1) > P(Y=1)\), so the presence of a cardboard box seems to be a useful predictor of whether an image has a cat in it overall.
4. Broadcasting and axis operations in NumPy [1.5 pts]#
As we scale up the number of features in our machine learning models, we can lean on broadcasting and axis operations in NumPy to make calculations across 2D arrays more efficient.
Broadcasting#
Broadcasting is one of the most powerful concepts in NumPy, but it is also one that takes some getting used to. So, let’s review the idea of NumPy array shapes and what happens when using operators on differently shaped arrays.
We saw on Worksheet 1 that arithmetic operations performed on arrays with the same shape are computed element-wise. For example:
import numpy as np
a = np.array([1, 1, 1])
b = np.array([2, 4, 6])
print(a + b) # Will print [3 5 7]
Note that both arrays have the same shape (3,), which indicates that they are 1D arrays with 3 elements. This is different from a 2D array of shape (3, 1), which is a vector with 3 rows and 1 column or a 2D array of shape (1, 3), which is a vector with 1 row and 3 columns.
If NumPy encounters two arrays with different shapes, it will attempt to broadcast the arrays together. What happens is that NumPy will compare their shapes element-wise, starting from the rightmost dimension and working its way to the left. Two dimensions are compatible when:
they are equal, or
one of them is 1
If the dimensions are compatible, the arithmetic operation is performed element-wise, with any size-1 dimensions stretched to match. Below are some examples.
# a is an array of shape (2, 3)
a = np.array([[1, 2, 3],
[4, 5, 6]])
# b is an array of shape (2, 3)
b = np.array([[1, 1, 1],
[1, 1, 1]])
# Will print [[2 3 4],
# [5 6 7]]
print(a + b)
Above, working from the rightmost dimension, we see that both dimensions for a and b are equal:
a (2d array): 2 x 3
b (2d array): 2 x 3
result (2d array): 2 x 3
Therefore, the arithmetic operation is performed element-wise across the array.
# a is an array of shape (2, 3)
a = np.array([[1, 2, 3],
[4, 5, 6]])
# c is an array of shape (3,)
c = np.array([3, 2, 1])
# Will print [[ 3 4 3],
# [12 10 6]]
print(c * a)
Even though c has a different shape than a, we see that the dimension of c matches the last dimension of a, so the multiplication operation is performed element-wise across that dimension:
c (1d array): 3
a (2d array): 2 x 3
result (2d array): 2 x 3
However, if we try to broadcast a with a 1D array of shape (2,), NumPy will raise a ValueError because the dimensions do not match:
a (2d array): 2 x 3
d (1d array): 2
result: ValueError because of a dimension mismatch
# a is an array of shape (2, 3)
a = np.array([[1, 2, 3],
[4, 5, 6]])
# d is an array of shape (2,)
d = np.array([1, 2])
# Will raise an Error because the dimensions do not match
print(a + d)
4.1 If a is an array of shape (100,) and b is an array of shape (100, 10), what is the shape of a - b (or would it raise an error)?
Your response: TODO
4.2 If c is an array of shape (256, 16) and d is an array of shape (16,), what is the shape of c + d (or would it raise an error)?
Your response: TODO
4.3 Work out what the output of the following arithmetic operation will be before checking your answer by running the code:
# w is a 1D array of shape (2,)
w = np.array([1, -2])
# X is a 2D array of shape (3, 2)
X = np.array([[0, 1],
[1, 0],
[2, 2]])
print(w * X)
Your response:
[[TODO]]
Solutions (click to check once you’ve completed 4.1-4.3)
4.1: Since a is a 1D array and its shape does not match the rightmost dimension of b, this will raise a ValueError.
4.2: Since c’s rightmost dimension matches d’s shape, the addition will be successful. The resulting array will have shape (256, 16), with d being added to each row of c.
4.3: The multiplication will result in each row of X being multiplied by the corresponding element of w:
[[0, -2],
[1, 0],
[2, -4]]
Axis operations#
When working with 2D arrays, we often want to apply operations across rows or columns. NumPy provides a convenient way to do this using the axis parameter in many aggregation functions.
For example, we have used the np.sum function to sum all the elements in an array:
a = np.array([1, 2, 3])
# Will print 6
print(np.sum(a))
However, in 2D arrays we could also sum the elements across the rows or columns. By default, np.sum will sum all the elements in the array:
b = np.array([[1, 2, 3],
[4, 5, 6]])
# Will print a single number, the sum of all the elements in b: 21
print(np.sum(b))
If we wanted to sum the column values for each row, we could specify the parameter axis=1, which tells np.sum to sum across the second dimension:
# Will print [6 15]
print(np.sum(b, axis=1))
This produces a new array of shape (2,), which is the sum of each row in b.
Similarly, if we wanted the sum of each column, we could specify the parameter axis=0, which tells np.sum to sum across the first dimension:
# Will print [5 7 9]
print(np.sum(b, axis=0))
This produces a new array of shape (3,), which is the sum of each column in b.
Another way we can think about the axis parameter is that we’re telling NumPy which dimension to collapse in the resulting array:
dim: 0 1
b (2d array): 2 x 3
# dimension 0 is collapsed
np.sum(b, axis=0) (1d array): _ 3
# dimension 1 is collapsed
np.sum(b, axis=1) (1d array): 2 _
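We can confirm this collapse intuition by inspecting shapes directly. (The keepdims flag shown at the end is not covered above; it preserves the collapsed dimension with size 1, which is handy for broadcasting the result back against the original array.)

```python
import numpy as np

b = np.array([[1, 2, 3],
              [4, 5, 6]])                      # shape (2, 3)

col_sums = np.sum(b, axis=0)                   # collapses dimension 0
row_sums = np.sum(b, axis=1)                   # collapses dimension 1
print(col_sums, col_sums.shape)                # [5 7 9] (3,)
print(row_sums, row_sums.shape)                # [ 6 15] (2,)

# keepdims=True keeps the collapsed dimension as size 1
print(np.sum(b, axis=0, keepdims=True).shape)  # (1, 3)
```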
4.4. Suppose that X is a 2D array of shape (n, p). Write the line of code that computes the mean of each column of X, which results in an array of shape (p,):
Your response: TODO
Solution (click to check once you’ve completed 4.4)
We want to collapse the dimension that corresponds to the number of rows, so we specify axis=0: np.mean(X, axis=0)
Finally, let’s put broadcasting and axis operations together to more efficiently compute predictions from our linear regression model.
With one feature, our linear regression model prediction for a given example \(x_i\) has the form:

\[\hat{y}_i = w_0 + w_1 x_i\]

With two features on Homework 1, our linear regression model had the form:

\[\hat{y}_i = w_0 + w_1 x_{i,1} + w_2 x_{i,2}\]

While we could compute the weighted sum for each row of X and then add the bias term manually, we’d like to generalize this to any number of features \(p\):

\[\hat{y}_i = w_0 + \sum_{j=1}^{p} w_j x_{i,j}\]

Furthermore, we’d like to compute the predictions for all examples in X at once:

\[\hat{y}_i = w_0 + \sum_{j=1}^{p} w_j x_{i,j} \quad \text{for } i = 1, \dots, n\]
Complete the function below to compute the predictions for all examples in X at once:
Hint
This can be done using a single line of code. First, we can use broadcasting to compute the product of X and w. Then we can sum across the appropriate axis to compute \(\sum_{j=1}^{p} w_j x_{i,j}\) for each row. Finally, since \(w_0\) is a scalar, we can just add it to the result.
def linreg_predictions(X: np.ndarray, w: np.ndarray, w0: float) -> np.ndarray:
"""Efficiently compute the predictions for all examples in X at once.
Args:
X: data examples of shape (n, p)
w: weights of shape (p,)
w0: scalar intercept term
Returns:
np.ndarray of shape (n,) where each element is the prediction for the corresponding example in X
"""
# TODO your code here
return None
if __name__ == "__main__":
X = np.array([[1., 2.],
[3., 4.],
[5., 6.]])
w = np.array([0.1, -0.2])
w0 = 0.5
predictions = linreg_predictions(X, w, w0)
assert predictions.shape == (3,), "predictions should have shape (n,)"
assert np.allclose(predictions, np.array([0.2, 0.0, -0.2])), "predictions are not correct"
5. Reflection [0.5 pts]#
5.1 How much time did it take you to complete this worksheet?
Your Response: TODO
5.2 What is one thing you have a better understanding of after completing this worksheet and going through the class content this week? This could be about the concepts, the reading, or the code.
Your Response: TODO
5.3 What questions or points of confusion do you have about the material covered in the past week of class?
Your Response: TODO
Acknowledgments#
Portions of this worksheet are adapted from Bhargavi’s study notes on pandas.
Some exercises are adapted from Deisenroth 2020: Mathematics for Machine Learning.
Folktables was introduced by Ding et al. 2021: Retiring Adult: New Datasets for Fair Machine Learning