Final Project#

Learning objectives#

  • Apply the machine learning principles you have learned in an area of your choice

  • Follow the machine learning process from start to finish

  • Gain experience communicating scientific results:

    • Visually and in writing through a Myst article

    • Verbally through a brief teaching presentation to the class

Logistics and assessment breakdown#

  • Final project groups can be one to three students.

  • Students are expected to contribute equally to the project, and the expected amount of work scales with group size.

| Component                      | Points | Due Date   |
|--------------------------------|--------|------------|
| Dataset Proposal               | 3      | 3/30       |
| Checkpoint                     | 5      | 4/27       |
| Presentation and peer feedback | 5      | 4/30, 5/5  |
| Final report                   | 12     | 5/11 noon  |

Final report rubric#

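| Component                                         | Points |
|---------------------------------------------------|--------|
| Source code and Myst article file                 | 1      |
| Machine learning question                         | 1      |
| Data: datasheet                                   | 2      |
| Features: cleaning/engineering                    | 1      |
| Models: baseline, exploratory, and model of choice | 3.5    |
| Evaluation                                        | 1      |
| Model card(s)                                     | 2      |
| Reflection                                        | 0.5    |
| Total                                             | 12     |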

Evaluation Guidelines

Your final report will be evaluated on both completion and quality. Specifically, I’ll be looking for:

  • A completed project that addresses each section of the final report template

  • Clear visual and written presentation of your data, modeling choices, and results

  • Engagement with the datasheet and model card frameworks

This is your opportunity to showcase your understanding of the machine learning process and communicate your findings. I’m looking for thoughtful consideration of your approach rather than the highest possible model performance. If you have any questions, please do reach out!

Final project approach#

All projects will develop a narrative of the machine learning process from start to finish: Data \(\to\) Features \(\to\) Model \(\to\) Train \(\to\) Evaluate. Specific deliverables for certain steps are described below.

Data#

Both your proposal and your final report will include a datasheet for your dataset, adapted from the framework proposed by Gebru et al. (2021) that we saw in HW 2. You will write a first draft of the datasheet in the dataset proposal.

Features#

You will need to split your data into training and test sets and transform the features into a format the model can use. That means one-hot encoding categorical features and standardizing numerical features, along with any other necessary cleaning and preprocessing.
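The preprocessing steps above can be sketched with scikit-learn as follows. This is a minimal illustration, not a required template: the toy DataFrame and its column names (`age`, `hours`, `sector`, `label`) are hypothetical stand-ins for your own dataset. The sketch fits the scaler and encoder on the training split only and then reuses them on the test split, so that no information from the test set leaks into preprocessing.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; replace with your own dataset.
df = pd.DataFrame({
    "age": [22, 35, 47, 51, 29, 63, 40, 33],
    "hours": [40, 50, 38, 45, 20, 30, 60, 42],
    "sector": ["public", "private", "private", "public",
               "nonprofit", "private", "public", "nonprofit"],
    "label": [0, 1, 1, 1, 0, 0, 1, 0],
})
X = df.drop(columns="label")
y = df["label"]

# Hold out 25% of the rows as a test set, keeping class proportions similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Standardize numerical columns and one-hot encode categorical columns.
# sparse_threshold=0 forces a dense NumPy array as output.
preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), ["age", "hours"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["sector"]),
    ],
    sparse_threshold=0,
)

# Fit on the training split only, then apply the same transform to the test split.
X_train_prepared = preprocess.fit_transform(X_train)
X_test_prepared = preprocess.transform(X_test)

print(X_train_prepared.shape, X_test_prepared.shape)
```

Setting `handle_unknown="ignore"` means a category that appears only in the test split is encoded as all zeros instead of raising an error; whether that is the right behavior depends on your data.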

Model and training#

You will build and evaluate multiple models as part of your project. Specifically, you will fit and evaluate the following:

  • Baseline model: logistic regression (for classification) or linear regression (for regression) with L2 regularization

  • Exploratory model(s): one scikit-learn model per group member that we did not cover in class. You will not need to understand every detail of the model, but you will need to be able to use its API and describe its relevant hyperparameters.

  • Model of your choice: one model of your choice that you think may perform well on your dataset. This could include random forests, gradient boosting, neural networks, etc.
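As a rough sketch of how a baseline and a model of choice might be compared, the example below uses synthetic data from `make_classification` in place of a real dataset, and a random forest as an arbitrary stand-in for the model of choice. Both are assumptions made for illustration. In scikit-learn, `LogisticRegression` applies an L2 penalty by default, with `C` as the inverse regularization strength; for a regression task, `Ridge` plays the analogous role.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a real dataset (assumption for illustration).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

models = {
    # L2 regularization is the default penalty; smaller C means stronger regularization.
    "baseline": make_pipeline(
        StandardScaler(), LogisticRegression(C=1.0, max_iter=1000)
    ),
    # One possible "model of your choice".
    "model_of_choice": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Compare models with 5-fold cross-validation on the training set only,
# leaving the test set untouched for the final evaluation.
cv_scores = {}
for name, model in models.items():
    cv_scores[name] = cross_val_score(model, X_train, y_train, cv=5)
    print(f"{name}: mean CV accuracy = {cv_scores[name].mean():.3f}")
```

Comparing models through cross-validation on the training data, and not on the test set, keeps the test set available for a single, final performance estimate.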

Evaluate#

As part of your evaluation, you will write a variation of a model card by Mitchell et al. (2018) that documents your model’s characteristics, performance, and intended use. One model card will be created per group member. You may select any of the models you built for your project to write the model cards for – not necessarily the best performing models! We will cover the model card reading after Spring Break.
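Model cards typically report performance broken down by relevant groups and not only as a single overall number. The sketch below shows one way such a breakdown could be computed. The labels, predictions, and group assignments are made up for illustration, and `metrics_by_group` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Made-up test-set labels, predictions, and group membership (illustration only).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 1])
group = np.array(["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"])


def metrics_by_group(y_true, y_pred, group):
    """Hypothetical helper: accuracy and recall computed separately per group."""
    results = {}
    for g in np.unique(group):
        mask = group == g
        results[str(g)] = {
            "n": int(mask.sum()),
            "accuracy": float(accuracy_score(y_true[mask], y_pred[mask])),
            "recall": float(recall_score(y_true[mask], y_pred[mask])),
        }
    return results


report = metrics_by_group(y_true, y_pred, group)
for g, row in report.items():
    print(g, row)
```

In this toy example the model is more accurate on group A (0.8) than on group B (0.6), even though recall is higher on group B. Gaps like these are the kind of finding a model card is meant to surface.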

Report format#

The final report will be submitted in the form of a Myst website, which is the same engine that formats the course website. Modern scientific communication is moving beyond static papers towards the “executable book” format, which combines text, math, visualizations, and executable code in a single document. Your report can even be hosted online and showcased as part of your project portfolio. HW 3 will give you initial experience with the Myst website format, which you can then build on for your final project.

Example resources#

Dataset repositories#

The following repositories are suggestions for finding datasets. Feel free to use other repositories as well!

  • folktables: provides access to datasets derived from the US Census. We have used two of the datasets in class assignments but feel free to explore other datasets in the library.

  • TidyTuesday: collection of real-world datasets intended for learning data science and visualization

  • ICPSR: Inter-university Consortium for Political and Social Research, a repository of social science data covering:

    • health

    • population health

    • education

    • aging

    • criminal justice

    • substance abuse

    • arts and culture

  • Opportunity Insights: a research organization that studies the impact of social programs on economic mobility, and also maintains a public data repository

  • Our World in Data: cleaned datasets and visualizations on global development, health, energy, environment, and inequality

  • Harvard Dataverse: a repository of research datasets connected to academic studies – good for replication studies

  • Google Dataset Search: a search engine for datasets if you have a general topic in mind

    • A word of caution about Kaggle: Kaggle is popular and maintains a large collection of datasets, so it will show up in many search results. However, the quality, context, and documentation of its datasets vary widely, which may make filling out the datasheet challenging.

Jupyter Book and Myst articles#

These are a few examples of how Jupyter Book + Myst Markdown can be used to produce nicely typeset articles with interactive elements: