wadefagen's Useful Datasets

This repository contains a collection of datasets I've found useful. Many are cleaned versions of public datasets, provided in a consistent format for use in data science projects.

Available Datasets

General Format

Unless otherwise noted, all datasets are CSV files where the first row contains column headers.

Common column names across multiple datasets include:

  • Year, a four-digit year (ex: 2018, 2017, etc.)
  • Term, one of Spring, Summer, Fall, or Winter
  • YearTerm, a four-digit year followed by -sp, -su, -fa, or -wi. For example: 2018-sp. This format ensures that filtering on YearTerm >= "2016-fa" returns all data available from Fall 2016 to present.
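The YearTerm filter described above can be sketched in plain Python. This is a minimal illustration, not part of the repository: the sample rows and the helper name `since` are hypothetical. Because the comparison is an ordinary string comparison, the same-year codes -sp, -su, and -wi also compare greater than -fa, so rows from those terms of the starting year pass the filter as well; check whether that matters for your analysis.

```python
# Hypothetical sample rows; only the YearTerm column matters for this sketch.
rows = [
    {"YearTerm": "2015-fa"},
    {"YearTerm": "2016-sp"},
    {"YearTerm": "2016-fa"},
    {"YearTerm": "2017-sp"},
    {"YearTerm": "2018-wi"},
]


def since(rows, start):
    """Keep rows whose YearTerm compares >= start as a plain string."""
    return [row for row in rows if row["YearTerm"] >= start]


recent = since(rows, "2016-fa")
print([row["YearTerm"] for row in recent])
# "2015-fa" is dropped; "2016-sp" is kept because "sp" sorts after "fa".
```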

Useful Scripts

If you're working with these datasets, the following snippets may be helpful to load the data. Each example assumes you have cloned this repo inside of your project's working directory (as datasets, the default name).

Python (pandas)

import pandas as pd

df = pd.read_csv('datasets/gpa/uiuc-gpa-dataset.csv')
# `df` is a DataFrame of the CSV file

Python (dictionary)

import csv

with open("datasets/gpa/uiuc-gpa-dataset.csv", "r", newline="") as f:
  reader = csv.DictReader(f)
  for row in reader:
    # Each `row` is a row from the CSV as a Python dict keyed by column header.

    # Example usage:
    term = row["Term"]
    year = int(row["Year"])    # The csv module reads every value as a string; convert the year to an `int` if you need a number

JavaScript (node.js)

With the csv-parse package (npm install --save csv-parse):

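const fs = require('fs');
// This path matches csv-parse 4.x; in csv-parse 5 and later, use:
//   const { parse } = require('csv-parse/sync');
const parse = require('csv-parse/lib/sync');

const rows = parse( fs.readFileSync("datasets/gpa/uiuc-gpa-dataset.csv"), {columns: true} );
rows.forEach(function (row) {
  // Each `row` is a row from the CSV as an object keyed by column header.

  // Example usage:
  const term = row["Term"];
  const year = parseInt(row["Year"], 10);  // Values are parsed as strings; convert the year if you need a number
});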

datasets's People

Contributors

chin123, dependabot[bot], elliewix, sahilkamesh, sileod, tinaabraham17, wadefagen, will1982

datasets's Issues

Clarify what Semester the Data is from

When adding data to the Student by State Dataset, I noticed that it is not stated what semester the data is from. I had to double-check with the original data source to ensure that I used the correct semester for my additions. Maybe the README could be updated to say that the data is from the Fall.

Raw Data Accuracy & Consistency

In fa2018.csv, I noticed that most of the STAT400+ courses only contain GPA data for GR (graduate) sections but, according to Course Explorer, our university also offered several UG (undergrad) sections for each STAT400+ course.

From the README, I know that "Based on analysis, courses with 20 or fewer students were excluded (the smallest course in the dataset has 21 students)." However, the number of students in UG sections is usually higher than the number of students in GR sections, so what should be excluded is the GPA of the GR sections rather than that of the UG sections.

Take FA18's STAT432 as an example. The actual enrollment of section 2GR is 19 and that of section 2UG is 53. However, if you add up the counts from A+ to F for CRN 70222 (section 2GR), the number of students is 76. Therefore, I was wondering whether errors during data cleaning/integration caused this inconsistency in the raw data.
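The check described above (adding up the A+ through F counts for one section and comparing the total with a known enrollment) could be sketched like this. The grade column names are an assumption about the CSV header, `graded_total` is a hypothetical helper, and the sample row is invented for illustration; it is not data from the repository.

```python
# Assumed grade column names; adjust to match the header of the CSV you load.
GRADE_COLUMNS = ["A+", "A", "A-", "B+", "B", "B-",
                 "C+", "C", "C-", "D+", "D", "D-", "F"]


def graded_total(row):
    """Sum the per-grade counts in one CSV row (values arrive as strings)."""
    # Missing or empty cells count as zero.
    return sum(int(row.get(column) or 0) for column in GRADE_COLUMNS)


# Invented sample row, not real data.
sample = {"CRN": "00000", "A+": "2", "A": "5", "A-": "3", "B+": "1", "F": "0"}
total = graded_total(sample)
print(total)
```

Comparing `graded_total(row)` with the enrollment listed in the course schedule for the same CRN would surface mismatches like the one reported here.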

Extraneous character line 776

On line 776 there is the following:
2019,Spring,2019-sp,ANTH,364,"Performing ""America""
The extraneous quotation mark before America causes a bug when parsing the CSV, so the title for this row is read as
Performing "America"\n2019,Spring,2019-sp,ANTH,368,\'America\' in the World"
This results in ANTH 368 being omitted as well.
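One way to spot this kind of problem is to look for parsed fields that contain a line break, since an unbalanced quote makes the parser swallow the following line. The sketch below uses Python's `csv` module. The first data line is quoted from this issue; the header row and the second data line are assumptions made up so the example is self-contained, and `rows_with_embedded_newlines` is a hypothetical helper name.

```python
import csv
import io

# Header and second data line are assumed, not copied from the dataset.
raw = (
    "Year,Term,YearTerm,Subject,Number,Course Title\n"
    '2019,Spring,2019-sp,ANTH,364,"Performing ""America""\n'
    '2019,Spring,2019-sp,ANTH,368,"\'America\' in the World"\n'
)


def rows_with_embedded_newlines(text):
    """Return (row_number, column, value) for each field containing a line break."""
    hits = []
    reader = csv.DictReader(io.StringIO(text))
    for number, row in enumerate(reader, start=1):
        for column, value in row.items():
            # Skip None (missing fields) and lists (overflow fields).
            if isinstance(value, str) and "\n" in value:
                hits.append((number, column, value))
    return hits


rows = list(csv.DictReader(io.StringIO(raw)))
hits = rows_with_embedded_newlines(raw)
print(len(rows), [(number, column) for number, column, _ in hits])
```

With this input the two data lines collapse into a single parsed row, and the check reports the title column of that row. A legitimate multi-line field would be flagged too, so treat hits as candidates to inspect rather than confirmed errors.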

Possible issue with Summer 2011 data

I noticed that Summer 2011 has 2762 records, but Summer 2010 has 193 and 2012 has 177.

Certain course subjects are more affected than others. ACCY, BADM, ECON, MATH, MCB and PSYC have many more records than I'd expect based on the surrounding years.

I checked ECON 102, and the records may be from a different term. The dataset has multiple entries per instructor for SU11, and the dataset shows many instructors not listed in the SU11 class schedule for that course.
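The comparison above can be automated by counting records per YearTerm and flagging terms that are far larger than the same term in other years. This is a rough sketch: the helper names are hypothetical, the factor of 5 is an arbitrary threshold chosen for illustration, and the three counts are the ones quoted in this issue.

```python
from collections import Counter
from statistics import median


def records_per_term(rows):
    """Count rows for each YearTerm value (rows are dicts, e.g. from csv.DictReader)."""
    return Counter(row["YearTerm"] for row in rows)


def flag_unusual_terms(counts, suffix, factor=5):
    """Return YearTerm keys ending in `suffix` whose count exceeds factor x the median."""
    same_term = {key: n for key, n in counts.items() if key.endswith(suffix)}
    if not same_term:
        return []
    typical = median(same_term.values())
    return sorted(key for key, n in same_term.items() if n > factor * typical)


# Record counts quoted in the issue text.
summer_counts = Counter({"2010-su": 193, "2011-su": 2762, "2012-su": 177})
flagged = flag_unusual_terms(summer_counts, "-su")
print(flagged)
```

On a full file, `records_per_term` would produce the counts from the parsed rows, and the same check could be repeated per subject to see which ones are most affected.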
