Section 2: Description of the data
rubric={reasoning:8,writing:2}
You may select any dataset you want for this project, as long as you have the license to use it publicly. Warning: finding a good dataset can take a lot of time and effort. We therefore recommend that you select one that you have already worked with in a previous MDS lab and are familiar with, for example the Gapminder, movie, or language datasets from 531 (all are on OneDrive).
A few datasets that have been popular in previous years:
https://www.kaggle.com/zynicide/wine-reviews/data
https://www.kaggle.com/osmi/mental-health-in-tech-survey
https://github.com/themarshallproject/city-crime
Good general resources for finding interesting datasets:
https://github.com/fivethirtyeight/data
https://github.com/the-pudding/data
https://www.kaggle.com/datasets
In your proposal, briefly describe the dataset and the variables that you will visualize. If you are planning to visualize a lot of columns, provide a high-level description of the variable types rather than listing every single column. For example, indicate that the dataset contains a variety of categorical variables for demographics and give a brief list of examples rather than describing every single variable. You may also want to consider visualizing a smaller set of variables given the short duration of this project. Choosing them might involve a brief exploratory data analysis (EDA) to identify which aspects of your data are interesting to look at. We will not be grading the EDA, but feel free to include your EDA notebooks in the public GitHub repo so that you have everything in one place.
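If it helps, a high-level summary of variable types can be drafted with a few lines of pandas. The following is only a minimal sketch: the `appointments` data frame is a tiny made-up stand-in for a real dataset (its values are invented for illustration), and `summarize_variable_types` is a hypothetical helper, not part of pandas or of this assignment.

```python
import pandas as pd

# Hypothetical toy data; replace with your own dataset, e.g. pd.read_csv(...).
# The values below are invented purely for illustration.
appointments = pd.DataFrame(
    {
        "patient_id": [101, 102, 103, 104],
        "gender": ["F", "M", "F", "M"],
        "age": [34, 61, 8, 47],
        "appointment_date": pd.to_datetime(
            ["2016-04-29", "2016-04-29", "2016-05-03", "2016-05-04"]
        ),
        "sms_sent": [True, False, False, True],
        "status": ["show", "no-show", "show", "show"],
    }
)


def summarize_variable_types(df):
    """Group column names by a coarse variable type (hypothetical helper)."""
    groups = {"numeric": [], "boolean": [], "datetime": [], "categorical/text": []}
    for name in df.columns:
        col = df[name]
        # Check booleans first: pandas also treats bool columns as numeric.
        if pd.api.types.is_bool_dtype(col):
            groups["boolean"].append(name)
        elif pd.api.types.is_datetime64_any_dtype(col):
            groups["datetime"].append(name)
        elif pd.api.types.is_numeric_dtype(col):
            # Note: identifiers such as patient_id land here even though
            # they are not meaningful quantities; adjust by hand if needed.
            groups["numeric"].append(name)
        else:
            groups["categorical/text"].append(name)
    return groups


summary = summarize_variable_types(appointments)
for kind, columns in summary.items():
    # Print a count and up to three example columns per variable type.
    print(f"{kind}: {len(columns)} column(s), e.g. {', '.join(columns[:3])}")
```

The grouping is based on storage type only, so treat the output as a starting point for your description and regroup columns by meaning (demographics, health status, and so on) when you write the proposal.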
Example writeup:
We will be visualizing a dataset of approximately 300,000 missed patient appointments. Each appointment has 15 associated variables that describe the patient who made the appointment (patient_id, gender, age), the health status (health_status) of the patient (Hypertension, Diabetes, Alcohol intake, physical disabilities), information about the appointment itself (appointment_id, appointment_date), whether the patient showed up (status), and whether a text message was sent to the patient about the appointment (sms_sent). Using this data, we will also derive a new variable, which is the predicted probability that a patient will show up for their appointment (prob_show).
Remember: if your dataset has a lot of columns, stick to summaries and avoid listing every single column. The example also distinguishes columns that come with the dataset (e.g. age) from new variables that you might derive for your visualizations (e.g. prob_show); you should make a similar distinction in your write-up if you can. Another example of a good dataset description is the Kaggle world happiness report.