Various interesting datasets, mostly data from The University of Illinois

wadefagen's Useful Datasets

This repository contains a collection of datasets I've found useful. Many are cleaned versions of public datasets, provided in a consistent format for use in data science projects.

Available Datasets

GPAs of Courses at The University of Illinois, gpa/uiuc-gpa-dataset.csv
Teachers Ranked as Excellent by their Students at UIUC, teachers-ranked-as-excellent/uiuc-tre-dataset.csv
UIUC Courses by their General Education category, geneds/uiuc-geneds-dataset.csv
Students at The University of Illinois by their home state, students-by-state/uiuc-students-by-state.csv
UIUC Course Catalog, course-catalog/uiuc-course-catalog.csv
Fighting Illini Historical Football Scores, illini-football/illini-football-scores.csv

General Format

Unless otherwise noted, all datasets are CSV files where the first row contains column headers.

Common column names across multiple datasets include:

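Year, a four-digit year (ex: 2018, 2017, etc)
Term, one of Spring, Summer, Fall, or Winter
YearTerm, a four-digit year followed by -sp, -su, -fa, or -wi. For example: 2018-sp. Because this format sorts as a string, filtering on YearTerm >= "2016-fa" returns all data available from Fall 2016 to the present. Note that within a single year the comparison is alphabetical (-fa < -sp < -su < -wi), so 2016-sp and 2016-su also satisfy that filter.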
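A minimal sketch of how the YearTerm comparison behaves in Python, using a made-up list of YearTerm values. The helper names and the `TERM_ORDER` mapping are illustrative assumptions, not part of the datasets; in particular, placing Winter after Fall within a year is an assumption you may need to adjust.

```python
# Hypothetical YearTerm values, in the format described above.
sample = ["2015-fa", "2016-sp", "2016-su", "2016-fa", "2017-sp", "2018-fa"]

def from_term_onward(year_terms, start):
    """Plain string comparison, as in YearTerm >= "2016-fa"."""
    return [yt for yt in year_terms if yt >= start]

# Assumed chronological order of terms within one year (not from the README).
TERM_ORDER = {"sp": 0, "su": 1, "fa": 2, "wi": 3}

def chronological_key(year_term):
    """Turn "2016-fa" into (2016, 2) so that tuples sort chronologically."""
    year, term = year_term.split("-")
    return (int(year), TERM_ORDER[term])

def from_term_onward_strict(year_terms, start):
    """Keep only the terms at or after `start` in chronological order."""
    start_key = chronological_key(start)
    return [yt for yt in year_terms if chronological_key(yt) >= start_key]

by_string = from_term_onward(sample, "2016-fa")
by_date = from_term_onward_strict(sample, "2016-fa")
print(by_string)  # ['2016-sp', '2016-su', '2016-fa', '2017-sp', '2018-fa']
print(by_date)    # ['2016-fa', '2017-sp', '2018-fa']
```

The string filter never drops a term from Fall 2016 onward, but it also keeps the Spring and Summer terms of the starting year. The tuple key is one way to exclude them if that matters for your analysis.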

Useful Scripts

If you're working with these datasets, the following snippets may be helpful to load the data. Each example assumes you have cloned this repo inside of your project's working directory (as datasets, the default name).

Python (pandas)

import pandas as pd

df = pd.read_csv('datasets/gpa/uiuc-gpa-dataset.csv')
# `df` is a DataFrame of the CSV file

Python (dictionary)

import csv

with open("datasets/gpa/uiuc-gpa-dataset.csv", "r", newline="") as f:
  reader = csv.DictReader(f)
  for row in reader:
    # Each `row` is a row from the CSV as a Python dict keyed by column header.

    # Example usage:
    term = row["Term"]
    year = int(row["Year"])    # The csv module reads every value as a string; convert the year to an `int` if needed

JavaScript (node.js)

With the csv-parse package (npm install --save csv-parse):

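const fs = require('fs');
// This path matches csv-parse v4; in v5 and later the sync API is imported as
// `const { parse } = require('csv-parse/sync');`
const parse = require('csv-parse/lib/sync');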

var rows = parse( fs.readFileSync("datasets/gpa/uiuc-gpa-dataset.csv"), {columns: true} );
rows.forEach(function (row) {
  // Each `row` is a row from the CSV as a dictionary indexed with column headers.

  // Example usage:
  var term = row["Term"];
  var year = row["Year"];
});

datasets's Issues

Credit to original FOIA author

I thought it might be nice to credit the person responsible for getting this data released:

https://austingwalters.com/university-of-illinois-urbana-champaign-grade-distributions/

Maybe include it in the readme?

Clarify what Semester the Data is from

When adding data to the Students by State dataset, I noticed that the README does not state which semester the data is from. I had to double-check the original data source to make sure I used the correct semester for my additions. Maybe the README could be updated to say that the data is from the Fall semester.

Raw Data Accuracy & Consistency

In fa2018.csv, I noticed that most of the STAT400+ courses only contain GPA data for GR (graduate) sections, but according to Course Explorer, our university also offered several UG (undergraduate) sections for each STAT400+ course.

From the README, I know that "Based on analysis, courses with 20 or fewer students were excluded (the smallest course in the dataset has 21 students)." However, the number of students in UG sections is usually higher than in GR sections, so the exclusion rule should have removed the GR sections rather than the UG sections.

Take FA18's STAT432 as an example. The actual enrollment of section 2GR is 19 and that of section 2UG is 53. However, adding up the grade counts from A+ to F for CRN 70222 (section 2GR) gives 76 students. I was therefore wondering whether an error during data cleaning or integration caused this inconsistency in the raw data.
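The "addition from A+ to F" check described above could be sketched roughly as follows. The grade column names in `GRADE_COLUMNS` are an assumption about the CSV header and should be checked against the actual file; the example row is made up and does not reproduce the STAT432 data.

```python
# Assumed grade column headers; verify them against the CSV's header row.
GRADE_COLUMNS = ["A+", "A", "A-", "B+", "B", "B-",
                 "C+", "C", "C-", "D+", "D", "D-", "F"]

def graded_student_count(row):
    """Sum the letter-grade counts (A+ through F) for one CSV row."""
    return sum(int(row[col]) for col in GRADE_COLUMNS)

def enrollment_gap(row, expected_enrollment):
    """Difference between the summed grade counts and an enrollment figure
    looked up elsewhere (for example, in the class schedule)."""
    return graded_student_count(row) - expected_enrollment

# Hypothetical row: values are strings, as csv.DictReader would produce them.
example_row = {col: "2" for col in GRADE_COLUMNS}

total = graded_student_count(example_row)
gap = enrollment_gap(example_row, 19)
print(total)  # 13 columns x 2 students each = 26
print(gap)    # 26 - 19 = 7
```

A nonzero gap does not say which figure is wrong; it only marks the section as worth comparing against the original source.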

Extraneous character line 776

Line 776 contains the following:
2019,Spring,2019-sp,ANTH,364,"Performing ""America""
The extraneous quotation mark before America causes a bug when parsing the CSV, so the title for this row is read as
Performing "America"\n2019,Spring,2019-sp,ANTH,368,\'America\' in the World"
As a result, ANTH 368 is omitted as well.
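The effect can be reproduced with Python's csv module on a small in-memory sample. This is a sketch under assumptions: the header names, the ANTH 368 line, and the final placeholder row are reconstructed or invented for illustration rather than copied from the dataset. The helper `rows_with_embedded_newlines` is hypothetical and simply flags any row whose values contain a line break, which is one symptom of unbalanced quoting.

```python
import csv
import io

# In-memory sample; only the ANTH 364 line is taken from the report above.
SAMPLE = (
    "Year,Term,YearTerm,Subject,Number,Course Title\n"
    '2019,Spring,2019-sp,ANTH,364,"Performing ""America""\n'
    "2019,Spring,2019-sp,ANTH,368,\"'America' in the World\"\n"
    "2019,Spring,2019-sp,ANTH,399,Placeholder Course\n"
)

def rows_with_embedded_newlines(rows):
    """Return rows in which any string value contains a line break."""
    return [row for row in rows
            if any(isinstance(v, str) and "\n" in v for v in row.values())]

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
suspects = rows_with_embedded_newlines(rows)

print(len(rows))                         # 2 (three data lines, two records)
print([row["Number"] for row in rows])   # ['364', '399']
print(len(suspects))                     # 1
```

A field with a legitimate line break would be flagged too, so the result is a list of rows to inspect by hand rather than a list of confirmed errors.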

Possible issue with Summer 2011 data

I noticed that Summer 2011 has 2762 records, but Summer 2010 has 193 and Summer 2012 has 177.

Certain course subjects are more affected than others. ACCY, BADM, ECON, MATH, MCB and PSYC have many more records than I'd expect based on the surrounding years.

I checked ECON 102, and the records may be from a different term. The dataset has multiple entries per instructor for SU11, and the dataset shows many instructors not listed in the SU11 class schedule for that course.
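A rough sketch of how such a per-term record count could be computed and compared, assuming rows loaded with `csv.DictReader` and the `YearTerm` column described in the README. The function names and the `factor=5` threshold are arbitrary choices for illustration; the three counts are the ones quoted in this issue.

```python
from collections import Counter

def records_per_term(rows):
    """Count how many rows fall under each YearTerm value."""
    return Counter(row["YearTerm"] for row in rows)

def flag_outliers(counts, terms, factor=5):
    """Return the terms whose count exceeds `factor` times the smallest
    count among `terms` (an arbitrary heuristic, not a statistical test)."""
    baseline = min(counts[t] for t in terms)
    return [t for t in terms if counts[t] > factor * baseline]

# Counts as reported in this issue.
reported = Counter({"2010-su": 193, "2011-su": 2762, "2012-su": 177})
suspicious = flag_outliers(reported, ["2010-su", "2011-su", "2012-su"])
print(suspicious)  # ['2011-su']

# Tiny made-up rows showing the counting step on DictReader-style dicts.
demo_counts = records_per_term([
    {"YearTerm": "2011-su"},
    {"YearTerm": "2011-su"},
    {"YearTerm": "2010-su"},
])
print(demo_counts["2011-su"])  # 2
```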

Students by state - update data

The most recent data in the CSV is from 2017:
https://github.com/wadefagen/datasets/blob/9f5181ce20292642c13634d39ce6caca3ccdbfbe/students-by-state/uiuc-students-by-state.csv

I think someone should copy the data from 2018-2023 over here (see "Data Source" in the README). I'm willing to help if nobody else wants to. Thanks!

Winter 2014 Semester Existence

The Winter 2014 semester doesn't exist on Course Explorer, but there are 8 records for it in the GPA dataset.