Final project dataset proposal
– Names of all group members: [TODO list all members here]
Note
This component of the final project is due Monday 3/30 at 11:59pm. It will be graded on best-effort completion.
1. Question and motivation [1 pt]
1.1 State the overall question you are trying to answer with machine learning, as well as your motivation for why you find this question interesting.
1.2 What real-world decision or action could a model answering this question support? Be specific about what someone or some entity could do with the prediction.
1.3 Who might be affected and what are the potential impacts (positive and negative) if this ML system were to be deployed?
1.1
Overall question: TODO
Motivation: TODO
1.2
TODO
1.3
TODO
2. Dataset choice and ML task formulation [2 pts]
Find one dataset per group member (e.g., a group of 2 should find 2 datasets), each relevant to the ML task you formulated in the previous section.
Note
Please choose datasets that are non-trivial in size and complexity. As a rule of thumb, aim for at least ~500 rows and at least 5 meaningful features (i.e., not one-hot-encoded expansions of the same original variable).
This is a guideline rather than a strict rule. If your dataset is smaller than this but is still well documented and clearly connected to your question, that is totally okay – please feel free to reach out to me to discuss if you are unsure!
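A quick way to compare a candidate dataset against this rule of thumb is to count its rows and non-target columns. The following is only a minimal sketch using Python's standard `csv` module: the function name `dataset_size`, the thresholds as constants, and the toy CSV with its column names are all made up for illustration. In a notebook, pandas' `DataFrame.shape` reports the same row and column counts.

```python
import csv
import io

# Thresholds taken from the rule of thumb in the note (a guideline, not a strict rule).
MIN_ROWS = 500
MIN_FEATURES = 5


def dataset_size(csv_file, target):
    """Return (n, p): the number of data rows and of non-target columns."""
    reader = csv.DictReader(csv_file)
    n = sum(1 for _ in reader)  # each remaining line is one example
    columns = reader.fieldnames or []  # header row; empty if the file is empty
    p = len([c for c in columns if c != target])
    return n, p


# Hypothetical toy data standing in for a real dataset file.
toy_csv = io.StringIO(
    "age,income,owns_home,defaulted\n"
    "34,52000,yes,0\n"
    "51,87000,no,1\n"
    "29,41000,no,0\n"
)

n, p = dataset_size(toy_csv, target="defaulted")
meets = n >= MIN_ROWS and p >= MIN_FEATURES
print(n, p, meets)
```

For a real file, pass `open("your_file.csv", newline="")` instead of the in-memory string. Note that this count cannot tell whether features are "meaningful": one-hot-encoded expansions of a single variable still need to be spotted by hand.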
For each dataset, provide a link to the dataset and answer the following questions.
2.1 Describe the data in your own words, and how it relates to the question you are trying to answer. Note that the dataset may not exactly align with your question – that’s okay!
2.2 What is the target \(y\) (what are we trying to predict)? Is it a regression or classification problem? If it is classification, what are the labels?
2.3 What are the example units \(i = 1, \ldots, n\) (what does one row represent)? How many features \(p\) are there?
2.4 Choose two features that you think would be most useful for predicting the target variable, and briefly describe why you think they are important.
2.5 Who might have collected the training data? What choices or incentives might have shaped what was measured (or left out)?
2.6 What is one limitation of this dataset that you suspect could matter for your project? For example: missing groups or features, outdated data, measurement error, etc.
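A first look at the data can help with questions 2.2 and 2.6. The sketch below, again using only the standard library, tallies the distinct values of a candidate target column (a handful of repeated values suggests classification, many distinct numeric values suggests regression) and counts missing entries per column. The helper names, the list of tokens treated as missing, and the toy data are assumptions for illustration, not part of the assignment.

```python
import csv
import io
from collections import Counter


def summarize_target(rows, target):
    """Count how often each distinct target value occurs."""
    return Counter(row[target] for row in rows)


def missing_counts(rows, missing_tokens=("", "NA", "NaN")):
    """Count, per column, the rows whose value looks missing.

    The tokens treated as missing are an assumption; check how your
    dataset's documentation says missing values are encoded.
    """
    counts = Counter()
    for row in rows:
        for col, value in row.items():
            is_blank = isinstance(value, str) and value.strip() in missing_tokens
            if value is None or is_blank:
                counts[col] += 1
    return dict(counts)


# Hypothetical toy data standing in for a real dataset file.
toy_csv = io.StringIO(
    "age,income,defaulted\n"
    "34,52000,0\n"
    "51,,1\n"
    "29,41000,0\n"
    "NA,60000,1\n"
)
rows = list(csv.DictReader(toy_csv))

labels = summarize_target(rows, "defaulted")
missing = missing_counts(rows)
print(labels)   # two distinct values here, so this toy target looks like binary classification
print(missing)  # columns with at least one missing entry
```

Counts like these are only a starting point: a column with no missing entries can still suffer from measurement error or leave out entire groups, which is the kind of limitation question 2.6 asks about.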
Note
Again, you do not need to find the “perfect” dataset for your project. These questions are meant to help you think through and acknowledge the limitations of the data and how well it aligns with your question.
If working in a group, feel free to copy this markdown cell for each group member.
Dataset link: TODO
2.1
Brief description of the data: TODO
Why it seems relevant to our question: TODO
2.2
Target variable \(y\): TODO
Task type: TODO
Labels if classification: TODO
2.3
What one row represents: TODO
How many examples \(n\) and features \(p\) there are: TODO
2.4
Feature 1: TODO
Feature 2: TODO
2.5
TODO
2.6
TODO
Note
Groups of more than one person do not need to use multiple datasets for the final project; finding one dataset per member is primarily for idea generation and exploration. I suggest that group members independently search for datasets relevant to the question the group is trying to answer, then discuss and agree on the best dataset to use for the final project.
How to submit
The Gradescope assignment for the proposal allows for group submissions. Please submit a single .ipynb file for your group, and make sure to include all members in the submission.