corgis-edu / corgis

The new version of the CORGIS Project, featuring Tabular datasets

License: GNU General Public License v3.0

Python 0.09% Jupyter Notebook 0.01% HTML 0.02% Ruby 0.01% CSS 0.01% Makefile 0.01% JavaScript 99.89% SCSS 0.01%

corgis's Introduction

corgis2

The new version of the CORGIS Project, featuring Tabular datasets.

This repository contains the tool that builds datasets, in addition to the raw and built datasets themselves.

Building Datasets

$ pip install -r requirements.txt

Build datasets with Python 3.7. The built datasets are compatible with a wide range of formats and language versions, but the builder itself is best run on Python 3.7. This is particularly important for the Python destination format, because it ensures that the pickled dictionaries are stored in a consistent order, which keeps spurious changes out of the commit history.

The Directory Format

Place each new dataset in the source/ directory, in its own folder. Every dataset must have a unique name, which is used as the prefix for the files that belong in that folder:

<name>-metadata.csv: Metadata file
<name>-corgis.csv: Dataset file

Additionally, you are encouraged to use the following conventions for additional files:

<name>-icon.png: The icon that will be associated with this dataset.
<name>-splash.png: The splash page image that will be associated with this dataset (a bigger, more fun picture).
<name>-script.py: The primary program used to synthesize the Dataset file.
raw/: A folder for storing raw data files used to generate the Dataset file.
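The layout above can be checked with a short script before a build. This is a minimal sketch, not part of the CORGIS builder: the function name `missing_required_files` is hypothetical, and it assumes the folder name is the same as the dataset's `<name>`.

```python
from pathlib import Path

# The two files the directory format requires for every dataset.
REQUIRED_SUFFIXES = ("-metadata.csv", "-corgis.csv")


def missing_required_files(dataset_dir):
    """Return the required file names that are absent from a dataset folder.

    Assumption: the folder name doubles as the dataset's <name>.
    """
    dataset_dir = Path(dataset_dir)
    name = dataset_dir.name
    expected = [f"{name}{suffix}" for suffix in REQUIRED_SUFFIXES]
    return [filename for filename in expected
            if not (dataset_dir / filename).is_file()]
```

For example, `missing_required_files("source/weather")` returns an empty list when both `weather-metadata.csv` and `weather-corgis.csv` are present.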

Metadata Format

Name,Weather,,,
Version,0.0.1,,,
Author,Dennis Kafura ([email protected]),,,
...

Dataset Format

State,Date.Day,Date.Month,Temperature.Minimum,Temperature.Maximum
Delaware,12,2,45,67
Delaware,13,2,42,64
Virginia,12,2,24,35
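The dotted column names suggest a hierarchy (for example, Date.Day and Date.Month both sit under Date). The sketch below shows one plausible way to read such a file into nested dictionaries. Treating the dot as a nesting separator is an assumption here, and `nest_row` is a hypothetical helper, not a function from the builder. Values are left as strings, as `csv.DictReader` returns them.

```python
import csv
import io

# The example rows from the Dataset Format section.
SAMPLE = """State,Date.Day,Date.Month,Temperature.Minimum,Temperature.Maximum
Delaware,12,2,45,67
Delaware,13,2,42,64
Virginia,12,2,24,35
"""


def nest_row(flat_row):
    """Turn {'Date.Day': '12'} into {'Date': {'Day': '12'}}."""
    nested = {}
    for column, value in flat_row.items():
        *parents, leaf = column.split(".")
        target = nested
        for key in parents:
            # Create intermediate dictionaries on demand.
            target = target.setdefault(key, {})
        target[leaf] = value
    return nested


records = [nest_row(row) for row in csv.DictReader(io.StringIO(SAMPLE))]
```

With this reading, `records[0]["Temperature"]["Maximum"]` is the string `"67"`.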

Automatically Running Builder

Dataset versions are tracked via the index.json file in the configured build directory.

There are three things that trigger a build of a dataset:

Updating the version of a builder
Creating a new folder in source/, with the appropriate files and structure.
Incrementing the version number in a Metadata file, following Semver rules (X.Y.Z):
- Patch (Z): When making bug fixes, not changing layout or structure.
- Minor (Y): When new, backwards compatible functionality is introduced.
- Major (X): When any backwards incompatible changes are introduced to the public API.

Whenever you push to the repo, a Travis script will scan the source/ directory and compare the information there with the current information in the index.json. Any out-of-date libraries are rebuilt and placed in the configured build directory.
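The version comparison can be sketched as follows. This is an illustration only: the layout of index.json is not described here, so the sketch assumes both sides have already been reduced to plain dictionaries mapping dataset names to X.Y.Z strings, and the function names are hypothetical. It covers new folders and incremented versions, not changes to the builder's own version.

```python
def parse_version(text):
    """Split 'X.Y.Z' into a tuple of ints so versions compare numerically."""
    major, minor, patch = (int(part) for part in text.strip().split("."))
    return (major, minor, patch)


def datasets_to_rebuild(source_versions, index_versions):
    """Return names that are new or whose source version is ahead of the index.

    Both arguments are assumed to be {dataset_name: 'X.Y.Z'} dictionaries.
    """
    stale = []
    for name, version in sorted(source_versions.items()):
        built = index_versions.get(name)
        if built is None or parse_version(version) > parse_version(built):
            stale.append(name)
    return stale
```

Comparing tuples of integers rather than raw strings matters once a component reaches two digits: as strings, "0.10.0" sorts before "0.9.0".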

Manually Running Builder

$ python corgis.py police-shootings python

corgis's Issues

Typo in County Demographics

One of the UTAs told me that there is a misspelling of a key in the
County Demographics data set. It is:

"Income"-->"Median Houseold Income"

Add BlockPy-style .get method for the Python version

This is a feature request. Can you add the get method that's available in the BlockPy version of CORGIS to the Python version?

If I use CORGIS in BlockPy, there's this handy .get method that takes in an attribute, filter column, and filter value, and returns a list of data values with the correct type.

If I use the Python version, I only have the get_XXX method, e.g. get_report in the AIDS dataset. This returns a complex Python object, but it doesn't have the .get method (unless I'm somehow missing it - it just seems to be a dictionary).

Why this matters: My students often start their projects in BlockPy, but some of them realize they want functionality (e.g. writing a file) outside of BlockPy's abilities, so they download their script. When they do, it suddenly doesn't work anymore. Even if they download the CORGIS .data and .py files and put them in the same directory, the API is different. And the .get method is a nice utility function even for advanced programmers, so why not include it in each .py file?

Thanks!
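For illustration, a helper along the lines described in this request might look like the sketch below. It is not the BlockPy implementation: the signature, the dotted-key lookup, and the sample records are all assumptions made for the example.

```python
def lookup(record, dotted_key):
    """Follow a dotted key such as 'Temperature.Maximum' into nested dicts."""
    for part in dotted_key.split("."):
        record = record[part]
    return record


def get(records, attribute, filter_column=None, filter_value=None):
    """Return the attribute from every record, optionally filtered.

    Hypothetical signature: when filter_column is given, only records whose
    filter_column equals filter_value contribute to the result.
    """
    if filter_column is None:
        return [lookup(record, attribute) for record in records]
    return [lookup(record, attribute) for record in records
            if lookup(record, filter_column) == filter_value]


# Made-up records shaped like the weather example in the README.
weather = [
    {"State": "Delaware", "Temperature": {"Maximum": 67}},
    {"State": "Delaware", "Temperature": {"Maximum": 64}},
    {"State": "Virginia", "Temperature": {"Maximum": 35}},
]
delaware_highs = get(weather, "Temperature.Maximum", "State", "Delaware")
```

A module-level function like this could sit next to the existing get_XXX function in each generated .py file, taking the list that function returns as its first argument.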

Cars Dataset

https://rpubs.com/neros/61800

Dataset issues

Medal of Honor: data is fine. It might be good to note in the description that a value of -1 means a missing value, since people who have never coded before may not know that this is somewhat standard.

Music: Data is not fine. There is a field under artist called “hotttnesss”, which is a kind of weird way to spell it. In this dataset, unknown int values seem to default to 0, which should probably be noted in the description (especially if there is no universal int value for unknown). The description under the dropdown menu for Artist.similar is “unknown”, and a message pops up saying “Given this filter, there was no variation in the data - visualization will be empty”. Artist.terms_freq is all 0 values. Release.name’s description is “unknown”, and I can’t seem to make a graph that makes proper use of the field. Song.artist_mbtags’ description is “unknown”, and all values are 0. For song.hotttnesss (again, a weird way to spell it), there are a lot of -1 values and 0 values. Does this mean both ints were used as unknown, or is there something else I don’t understand about that? Song.mode values are all 0. Song.title values are all 0, which makes sense because they would be strings, but I can’t find a way to use the song.title field because the only bar chart “group by” fields are artist.name and song.year. For song.year, it should be noted somewhere that unknown values are 0.

Opioids: data is all ok. Might be a good idea to note that the x axis for line plots is years since 1999.

Police shootings: dataset is pretty sparse but the data is ok.
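Until the descriptions document these placeholder values, a user can filter them out before plotting or averaging. The sketch below is a hypothetical helper with made-up sample values; which values count as "unknown" differs per field (the reports above mention -1 and 0), so the caller has to pass them in.

```python
def drop_sentinels(values, sentinels=(-1,)):
    """Return the values with placeholder 'unknown' markers removed."""
    return [value for value in values if value not in sentinels]


# Made-up song.year values, where 0 is reported to mean "unknown".
years = [1999, 0, 2005, 0]
known_years = drop_sentinels(years, sentinels=(0,))
```

Here `known_years` keeps only 1999 and 2005, so a minimum or average over it is no longer dragged toward 0.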

CSV, JSON and Java links seem to be broken

These aren't working anymore since (I think) today:

https://think.cs.vt.edu/corgis/csv
https://think.cs.vt.edu/corgis/json
https://think.cs.vt.edu/corgis/java

School Shootings dataset

https://github.com/washingtonpost/data-school-shootings

This seems like a rough option for Virginia Tech. I mean, we do have some pretty dark datasets, but the local context is sensitive here. But is that a reason not to address something in a college setting? Need to have more conversation about this.

Make CORGIS Pip installable

Does this mean it installs all datasets?

Or perhaps installing Corgis allows you to launch an interactive installer util?

Or perhaps we can send parameters to Pip?

Tate Dataset artists' gender

Currently, unknown genders are empty strings. We should make them "unknown" or something more descriptive.
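One way to apply this change while generating the dataset file is sketched below. The function name and the choice of "unknown" as the placeholder are assumptions for illustration, not the project's decision.

```python
def normalize_gender(value, placeholder="unknown"):
    """Replace empty or whitespace-only gender values with a placeholder."""
    cleaned = (value or "").strip()
    return cleaned if cleaned else placeholder
```

Non-empty values pass through unchanged apart from surrounding whitespace, so existing entries are not rewritten.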

New User or Instructions

I've been looking at the old repo trying to figure out how to use it. The 'contributor.md' was helpful, and the issues list led me here. I'd like to become a contributor and start by helping with the CSV file I found that needs commas.

My main question is how do I as a new user access the billionaires list or run it as a data set for just that one database?

How often is that database updated or can I access just one?

Would like to contribute where I can.

All cars in cars dataset are marked as hybrid

All of the values in the Engine Information.Hybrid column are "True", even though some (many/most?) of the cars are not hybrids.
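A column that never varies is easy to detect by counting its distinct values. The sketch below uses a few made-up rows rather than the real cars file; the Identification.ID column and the car-a/b/c identifiers are invented for the example, and `value_counts` is a hypothetical helper.

```python
import csv
import io
from collections import Counter

# Made-up rows that reproduce the reported symptom.
SAMPLE = """Identification.ID,Engine Information.Hybrid
car-a,True
car-b,True
car-c,True
"""


def value_counts(csv_text, column):
    """Count how often each distinct value appears in one CSV column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[column] for row in reader)


counts = value_counts(SAMPLE, "Engine Information.Hybrid")
# A single distinct value means the column carries no information.
is_constant = len(counts) == 1
```

Running the same count over every column of a built dataset would flag this kind of problem before release.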

Health Dataset Documentation

Somewhere over the years, we lost the docs on the Disease dataset. It's now called the Health dataset. Here are my notes on it from before:

https://www.tycho.pitt.edu/
This library provides data about reported occurrences of diseases in American history over time. Some diseases (e.g., Measles) have more information and stretch back farther than others (e.g., "AIDS"). Some places seem to have missing information (either that, or the population of Virginia was immune to Measles until the 1920s).

Cancer Dataset has issues

The Cancer Dataset explorer has a problem: report["Rates"]["Age and Sex"]["45 - 64"] (or any of the other 3 categories) does not display an example value. (The value is a dict with two keys, “Female” and “Male”, each holding a float value.)

I also think this dataset might just be hot garbage and needs to be reworked from the ground up. I see this was an issue dating back to the old version.

Tag datasets by difficulty/quality

Video Games Python page has some window visible without clicking Explore Datasets

food.csv - nutritional information does not mention portion size

Both the documentation and the .csv file lack mentions of portion size, which makes it impossible to interpret nutritional content figures.
(e.g. raw blueberries are stated to contain 9.7mg of Vitamin C, but how much fruit are we talking about exactly?)

A cursory examination of the documentation accompanying the USDA FNDDS database suggests the raw figures are based on 100g portions. It appears that the csv file uses the same (but I don't know for sure).
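If the figures really are per 100 g, converting to another portion size is a simple proportion. The sketch below rests entirely on that unconfirmed assumption, and the function name is hypothetical.

```python
def scale_nutrient(amount_per_100g, portion_grams):
    """Scale a per-100 g nutrient amount to a portion of the given weight.

    Assumption: the source figure is expressed per 100 g of food.
    """
    return amount_per_100g * portion_grams / 100.0
```

Under that assumption, a 150 g serving of raw blueberries would contain `scale_nutrient(9.7, 150)` mg of Vitamin C, which is one and a half times the listed figure.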

Visualizer is slow to load big datasets

This isn't a network issue, it's client side. Very big datasets can take several seconds to render. Since this is happening on page load, it feels like the system is lagging. We should do a repaint before actually loading the dataset, show a loading symbol during dataset switches, and probably find some way to let more of it happen in the background (suspensions? promises? web workers?).

Hospitals

User report:

I wanted to let you know that the descriptions for *.Value and *.Quality in this data set are incorrect. Specifically, *.Quality should be Worse, Average, Better and *.Value seems to be Lower, Average, Higher.
https://corgis-edu.github.io/corgis/csv/hospitals/