Unlike the previous projects, we are not providing you with any topic/dataset options. Topic selection is part of the challenge of the Capstone project!
When choosing a topic, think through these questions:
- What would I be motivated to work on?
- What data could I use?
- How could an individual or organization use my product or findings?
- What will I be able to accomplish in the time I have available?
- What challenges do I foresee with this project?
Sourcing new data is a valuable skill for data scientists, but it requires a great deal of care. An inappropriate dataset or an unclear business problem can lead you to spend a lot of time on a project that delivers underwhelming results. The guidelines below will help you complete a project that demonstrates your ability to engage in the full data science process.
Your data must be...
- Appropriate for supervised learning models. You may use unsupervised learning methods in your project (e.g. to generate cluster assignment labels), but there must be a substantial supervised learning component.
- Usable to solve a specific business problem. This solution must rely on your model(s).
- Somewhat complex. It should contain thousands of rows and features that require creativity to use. You can use a pre-existing clean dataset, but you should consider combining it with other datasets and/or engineering your own features.
- Unfamiliar. It can't be one we've already worked with during the course or that is commonly used for demonstration purposes (e.g. MNIST).
- Manageable. Stick to data that you can model with the knowledge and computational resources you have.
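Some of these criteria can be screened quickly before you commit to a dataset. The snippet below is a minimal sketch using only the Python standard library; the function name `screen_dataset`, its `target` and `min_rows` parameters, and the sample data are all hypothetical, and the threshold of 1000 rows is just one illustrative reading of "thousands of rows". It cannot judge whether a dataset is unfamiliar or tied to a real business problem.

```python
import csv
import io


def screen_dataset(csv_text, target, min_rows=1000):
    """Return a few quick facts about a candidate dataset.

    `target` is the column you would predict (needed for supervised learning).
    `min_rows` is an illustrative threshold, not an official requirement.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []
    rows = list(reader)
    has_target = target in columns
    # Count rows whose target value is present and non-empty.
    n_labeled = sum(
        1 for r in rows if has_target and (r[target] or "").strip()
    )
    return {
        "n_rows": len(rows),
        "n_features": len(columns) - (1 if has_target else 0),
        "has_target": has_target,
        "n_labeled": n_labeled,
        "enough_rows": len(rows) >= min_rows,
    }


# Tiny made-up sample, only to show the shape of the report.
sample = "sqft,beds,price\n900,2,1500\n1200,3,\n700,1,1100\n"
report = screen_dataset(sample, target="price")
print(report)
```

For a real file, you could read its contents into a string and pass that in. A report with `has_target` false or very few labeled rows is a sign that the data may not support a substantial supervised learning component.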
Once you've sourced your own data and identified the business problem you want to solve with it, you must run them by your instructor for approval.
There are two ways that you can source your own dataset: Problem First or Data First. The less time you have to complete the project, the more strongly we recommend a Data First approach to this project.
Problem First: Start with a problem that you are interested in that you could potentially solve using one of the four project models. Then look for data that you could use to solve that problem. This approach is high-risk, high-reward: very rewarding if you are able to solve a problem you are invested in, but frustrating if you sink a lot of time into the search without finding appropriate data. To mitigate the risk, set a firm limit for the amount of time you will allow yourself to look for data before moving on to the Data First approach.
Data First: Take a look at some of the most popular internet repositories of cool data sets we've listed below. If you find a data set that's particularly interesting to you, then it's totally okay to build your problem around that data set.
There are plenty of amazing places that you can get your data from. We recommend you start looking at data sets in some of these resources first:
- UCI Machine Learning Datasets Repository
- Kaggle Datasets
- Awesome Datasets Repo on GitHub
- Local data portals for state and local government resources
  - Examples: NYC, Houston, Seattle, California
- Inside Airbnb
- FiveThirtyEight’s data portal
- Data is Plural’s Archive Spreadsheet
- Datasets Subreddit
- TensorFlow Datasets