Final Project Template#

The overall structure of your final project should read like a technical blog post, where you motivate your machine learning question to the readers, introduce the relevant background, give an overview of your machine learning approach, and then present your models’ performance along with the model card(s).

You should address the points outlined under each section; beyond that, feel free to format your final project however you like to make it more narrative or expository. In particular, you can add or modify subsections as needed, as well as delete the outlined bullet points and templated admonition sections. You can refer to the template on the course website as you work on your final project to make sure you’re addressing all the points.

Tip

For those interested in how to structure academic writing, I suggest reading Mensh and Kording 2017: Ten simple rules for structuring papers.

Note that this is completely optional; you are not required to adhere to these writing rules.

Tip

If you want to hide code cells to make your final project look cleaner, you can use Jupyter functionality for hiding/removing content (e.g., the `hide-input` or `remove-input` cell tags) to toggle the visibility of code cells in your rendered article.

ML question [1 pt]#

  • Give an appropriate title for your project, replacing “Final Project Template” above.

  • Motivation: Motivate the overall question you are trying to answer to a general audience.

  • Teaser figure: Provide at least one visual or media “teaser figure” to help set up your project. See the MyST markdown guide for how to add figures.

  • Prediction use cases: Explain what real-world decision or impact a model answering this question could support, being specific about what someone or some entity could do with the prediction.

  • Deployment impact: Explain who might be affected and the potential impacts (positive or negative) if this model were to be deployed.

Note

This section can use your responses to the dataset proposal as an outline.

Dataset Datasheet [2 pts]#

Following the Datasheets for Datasets framework (Gebru et al. 2021) that we studied in HW 2, complete the datasheet below for your project’s dataset to the best of your knowledge. Provide ~1–3 sentences per question.

Note

If a question is not applicable to your dataset or you cannot find the information, that is okay – make a note and briefly explain why.

Motivation#

  1. Purpose: For what purpose was the dataset created? Was there a specific task in mind?

  2. Creators: Who created the dataset and on behalf of what entity?

Composition#

  1. Instances: What do the instances in the dataset represent (e.g., people, images, transactions)? How many instances are there in total?

  2. Features and target: What features does the dataset contain, and what is the prediction target? Are there any features that could be considered sensitive or protected attributes?

  3. Missing data: Is any information missing from individual instances? If so, describe the nature and extent of the missingness.

  4. Confidentiality: Does the dataset contain data that might be considered confidential or that relates to identifiable individuals?

Collection Process and Distribution#

  1. Collection mechanism: How was the data collected (e.g., survey, web scraping, sensors, administrative records)?

  2. Time frame: Over what time frame was the data collected?

  3. Consent and ethics: Were the individuals whose data is included notified or given a chance to consent?

  4. Availability: Where is the dataset hosted, and under what license is it distributed?

Features [1 pt]#

Tip

Your preprocessing code can live in this notebook or in a separate notebook file. If you use a separate file, describe what it does here and make sure to include it in your submission.

Preprocessing and feature engineering#

  • Describe how you split your data into training and test sets, including the split ratio.

  • Describe any feature preprocessing steps you applied, such as the following (a minimal sketch covering the split and preprocessing follows this list):

    • One-hot encoding of categorical features

    • Standard scaling of numerical features

    • Handling of missing values

    • Any feature engineering or new features created from existing ones

    • Any feature selection performed
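Putting these together, here is a minimal sketch of one way to set up the split and preprocessing. The DataFrame `df` and its columns (`age`, `income`, `city`, `label`) are hypothetical stand-ins for your own data, and the specific choices (80/20 split, median imputation) are illustrative, not required.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for your own DataFrame: two numerical
# features, one categorical feature, and a binary label.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(18, 80, n).astype(float),
    "income": rng.normal(60_000, 15_000, n),
    "city": rng.choice(["NYC", "LA", "SF"], n),
})
df["label"] = (df["income"] > 60_000).astype(int)
df.loc[df.sample(frac=0.05, random_state=0).index, "age"] = np.nan  # missing values

X, y = df.drop(columns=["label"]), df["label"]

# 80/20 train/test split; stratify keeps class proportions similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Impute + scale numerical features; impute + one-hot encode the categorical one.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["age", "income"]),
    ("cat", categorical, ["city"]),
])

# Fit preprocessing on the training split only to avoid test-set leakage.
X_train_t = preprocess.fit_transform(X_train)
X_test_t = preprocess.transform(X_test)
```

One advantage of wrapping the transformers in a `ColumnTransformer` is that you can later bundle preprocessing and model into a single `Pipeline`, so cross-validation re-fits the preprocessing inside each fold.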

Data exploration and visualization#

  • Provide a summary statistics table for at least two features you find notable in your dataset (more if you’d like!), e.g., mean, std, min, max for numerical features; value counts for categorical features.

  • Provide relevant visualizations of those two features, such as a histogram, scatter plot, or bar plot. Refer to Worksheet 2 for a refresher on seaborn and plotting concepts; a minimal plotting sketch follows this list.

  • Briefly comment on what you observe.
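A minimal sketch, reusing the hypothetical `df` from the preprocessing sketch above (`age` is numerical, `city` is categorical):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Summary statistics: describe() for numerical, value_counts() for categorical.
print(df["age"].describe())
print(df["city"].value_counts())

# Plot each feature on its own axes.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.histplot(data=df, x="age", ax=axes[0])    # numerical: distribution
sns.countplot(data=df, x="city", ax=axes[1])  # categorical: frequencies
plt.tight_layout()
plt.show()
```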

Models and Training#

Baseline model [1 pt]#

  • Fit a baseline model: logistic regression (for classification) or linear regression (for regression) with L2 regularization, with $\lambda$ selected using a grid search with cross-validation (a minimal sketch follows this list).

  • Report the baseline model’s performance on your test set using an appropriate metric (e.g., accuracy, AUC, RMSE).

  • Briefly discuss: does the baseline model do better than a naive strategy (e.g., always predicting the majority class or predicting the mean value)?
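Here is a minimal sketch, reusing the hypothetical `preprocess`, `X_train`, etc. from the preprocessing sketch above. Note that scikit-learn’s `LogisticRegression` parameterizes L2 strength as `C` = 1/$\lambda$, so a grid over `C` is a grid over $\lambda$; for regression, `Ridge` takes the strength directly as `alpha`.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# LogisticRegression applies L2 regularization by default; its C is 1/lambda.
baseline = Pipeline([
    ("preprocess", preprocess),  # ColumnTransformer from the earlier sketch
    ("clf", LogisticRegression(max_iter=1000)),
])
grid = GridSearchCV(
    baseline,
    param_grid={"clf__C": [0.01, 0.1, 1.0, 10.0, 100.0]},
    cv=5,
    scoring="accuracy",
)
grid.fit(X_train, y_train)

print("Best C (= 1/lambda):", grid.best_params_)
print("Test accuracy:", grid.score(X_test, y_test))

# A naive majority-class strategy for comparison.
print("Majority-class accuracy:", y_test.value_counts(normalize=True).max())
```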

Exploratory model(s) [1.5 pts]#

Each group member should fit and evaluate one scikit-learn model that we did not cover in class. For each model:

  • Briefly describe the core idea of the model (~1 paragraph). Feel free to include equations or pseudocode if helpful (but not required), and cite any references you used to learn about the model.

  • Identify at least one relevant hyperparameter and explain what it controls.

  • Report the model’s performance on your test set using the same metric as the baseline. Note that you do not need to tune the hyperparameters for the exploratory model(s), unless you select one as your “model of choice”.

Tip

Some models to consider: Lasso, Elastic Net, Support Vector Machines, Naive Bayes, AdaBoost, Gaussian Processes. Check the sklearn user guide for descriptions and API documentation.
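As an illustration, here is a minimal sketch of fitting one exploratory model, a support vector machine (one of the suggestions above), reusing the hypothetical `preprocess` and data split from the earlier sketches. The choice of model and of `kernel` is illustrative, not required.

```python
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# SVC's key hyperparameters: C (soft-margin penalty) and kernel (shape of the
# decision boundary). Defaults are used here since tuning is not required.
svm = Pipeline([
    ("preprocess", preprocess),
    ("clf", SVC(kernel="rbf", C=1.0)),
])
svm.fit(X_train, y_train)

# Report the same metric as the baseline so models are directly comparable.
print("SVM test accuracy:", svm.score(X_test, y_test))
```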

Model of choice [1.5 pts]#

  • Select one model you think will perform best on your dataset. This could be one of your exploratory models with tuned hyperparameters, a random forest, gradient boosting, a neural network, or any other approach.

  • Describe how you selected hyperparameters and which combination of hyperparameters you optimized for (a tuning sketch follows this list).

  • Report the model’s test performance.
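A minimal tuning sketch, again reusing the hypothetical pipeline pieces from the earlier sketches; the random forest and the two-parameter grid are illustrative choices, not requirements.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Example: tune a random forest over two hyperparameters, optimizing CV accuracy.
rf = Pipeline([
    ("preprocess", preprocess),
    ("clf", RandomForestClassifier(random_state=42)),
])
search = GridSearchCV(
    rf,
    param_grid={
        "clf__n_estimators": [100, 300],
        "clf__max_depth": [None, 5, 10],
    },
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print("Best hyperparameters:", search.best_params_)
print("Test accuracy:", search.score(X_test, y_test))
```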

Evaluation [1 pt]#

  • Provide a markdown comparison table summarizing the performance of all your models (baseline, exploratory, and model of choice) on the test set.

  • Provide at least one evaluation visualization comparing models (a plotting sketch follows this list), such as:

    • A confusion matrix heatmap (for multi-class classification)

    • An ROC curve (for binary classification)

    • A residual plot (for regression)

    • A predicted vs. actual scatter plot (for regression)

  • Briefly discuss your results: Which model performed best, and why do you think that is the case? Could the model(s) be overfitting?
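A minimal sketch of two such visualizations, assuming the fitted `grid` (baseline) and `search` (model of choice) objects from the earlier sketches:

```python
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay

# Confusion matrix heatmap for the tuned model of choice.
ConfusionMatrixDisplay.from_estimator(search.best_estimator_, X_test, y_test)

# ROC curves (binary classification): draw both models on the same axes.
ax = RocCurveDisplay.from_estimator(grid.best_estimator_, X_test, y_test).ax_
RocCurveDisplay.from_estimator(search.best_estimator_, X_test, y_test, ax=ax)
```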

Model Card(s) [2 pts]#

Following the Model Cards for Model Reporting framework (Mitchell et al. 2019) that we studied in HW 3, complete one model card per group member. You may choose any of the models you built, not necessarily the best performing one.

Model Card: [TODO model name]#

Model Details

  • Model type: TODO (e.g., random forest, logistic regression, neural network)

  • Key hyperparameters: TODO

  • Training data size: TODO examples, TODO features

  • Software: TODO (e.g., scikit-learn, PyTorch)

Intended Use

  • Primary intended use: TODO describe the prediction task and who would use this model.

  • Out-of-scope uses: TODO identify at least one use case this model should not be used for, and briefly explain why.

Factors

  • Relevant factors: TODO describe demographic, environmental, or instrumentation factors that could affect model performance. Think about the three factor categories from Mitchell et al. 2019 Section 4.3: groups, instrumentation, and environment.

  • Evaluation groups: TODO if applicable, describe the groups you evaluated disaggregated performance across.

Metrics

Report an evaluation metric you find most relevant to your model’s prediction task. Include disaggregated metrics for at least two subgroups in your dataset. Subgroups can be demographic (such as sex or age bracket) if your dataset contains such attributes, or they can be meaningful non-demographic groups (e.g., image classes with fewer training examples, geographic regions, or categories where you suspect the model may perform differently). The goal is to check whether your model works equally well across different parts of your data. A sketch for computing disaggregated metrics follows the table below.

| Metric | Overall | Group 1 | Group 2 |
|--------|---------|---------|---------|
| TODO   | TODO    | TODO    | TODO    |
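A minimal sketch, assuming the fitted `search` object and the hypothetical `city` column from the earlier sketches; substitute your own subgroup variable (demographic or otherwise) and your own metric.

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Predictions as a Series indexed like X_test so we can slice by subgroup.
y_pred = pd.Series(search.predict(X_test), index=X_test.index)
print(f"Overall accuracy: {accuracy_score(y_test, y_pred):.3f}")

# "city" is the hypothetical grouping column; replace with your own.
for group, idx in X_test.groupby("city").groups.items():
    acc = accuracy_score(y_test.loc[idx], y_pred.loc[idx])
    print(f"Accuracy for {group}: {acc:.3f}")
```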

Ethical Considerations

  • TODO discuss any potential harms or biases. Consider: who benefits and who could be harmed by this model’s predictions? Are there disparities in performance across groups? What steps could be taken to mitigate these issues?

Caveats and Recommendations

  • TODO note any limitations of the model, gaps in the evaluation, or avenues for future work.

Reflection [0.5 pts]#

  • Summary: Return to the ML question you posed at the beginning of your report. Based on your evaluation results, would you recommend deploying this model for the use case you described?

  • Looking back: Reflect on the ML process you followed in this project. What is something you would have liked to pursue further if you had more or different data? What is one thing you understand better now than when you started the project?

What to submit [1 pt]#

You should submit a final_project.zip file containing your typeset MyST website. You can follow the same process as HW 3 to build and launch your website. Note: if you encounter an “address already in use” error when running the http server, you can change the port number to anything in the range 8000-8999.

To create a zip of your final project, run the following commands in a terminal:

```bash
# change directory to the final project folder
cd ~/comsc335.github.io/final_project/

# package the site into a zip file
zip -r final_project.zip _build/
```

Please upload the zip file to Google Drive and paste the link to the file below:

  • TODO zip Google Drive link here

You should also include, as part of your Gradescope submission, all source code (.ipynb or .py) files used for data cleaning/preprocessing so that your results can be reproduced.

Tip

The published HTML article should still include source code cells. However, you may want to put any data cleaning/preprocessing in a separate file so that the focus of your rendered article is on the analysis and results.