corgis2

The new version of the CORGIS Project, featuring Tabular datasets.

This repository contains the tool that builds datasets, in addition to the raw and built datasets themselves.

Building Datasets

$ pip install -r requirements.txt

Datasets should be built using Python 3.7. Although the built datasets are compatible with a wide range of formats and language versions, the builder itself is best run with Python 3.7. This matters most for the Python destination format: it ensures that the pickled dictionaries are stored in a consistent order, which prevents unnecessary churn in the commit history.

The Directory Format

A new dataset should be placed into the source/ directory in a folder with a unique name. All datasets must be given a unique name, which will influence the names of the files that belong in that directory:

  • <name>-metadata.csv: Metadata file
  • <name>-corgis.csv: Dataset file

Additionally, you are encouraged to use the following conventions for additional files:

  • <name>-icon.png: The icon that will be associated with this dataset.
  • <name>-splash.png: The splash page image that will be associated with this dataset (a bigger, more fun picture).
  • <name>-script.py: The primary program used to synthesize the Dataset file.
  • raw/: A folder for storing raw data files used to generate the Dataset file.
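The required-file convention above can be checked mechanically. The following is a minimal sketch, not part of the builder: `missing_required_files` is a hypothetical helper, and it assumes that the folder name is the same as the dataset's `<name>`.

```python
from pathlib import Path

# The two files every dataset folder must contain, per the list above.
REQUIRED_SUFFIXES = ("-metadata.csv", "-corgis.csv")


def missing_required_files(dataset_dir):
    """Return the required file names that are absent from a dataset folder.

    Assumption: the folder name doubles as the dataset's <name>.
    """
    folder = Path(dataset_dir)
    name = folder.name
    expected = [f"{name}{suffix}" for suffix in REQUIRED_SUFFIXES]
    return [fname for fname in expected if not (folder / fname).is_file()]
```

For a folder `source/weather/`, an empty result would mean that both `weather-metadata.csv` and `weather-corgis.csv` are present.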

Metadata Format

Name,Weather,,,
Version,0.0.1,,,
Author,Dennis Kafura ([email protected]),,,
...
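As a rough illustration, the rows shown above can be read as key/value pairs taken from the first two columns. This is only a sketch: `read_top_level_metadata` is a hypothetical helper, and the rows elided by `...` may use the remaining columns in ways this does not handle.

```python
import csv
import io

# Only the first rows shown in the example above; the rest of the file is elided.
SAMPLE = """Name,Weather,,,
Version,0.0.1,,,
"""


def read_top_level_metadata(text):
    """Collect rows of the form `Key,Value,,,` into a dictionary.

    Assumption: the first column is a key and the second is its value.
    """
    fields = {}
    for row in csv.reader(io.StringIO(text)):
        if len(row) >= 2 and row[0]:
            fields[row[0]] = row[1]
    return fields


metadata = read_top_level_metadata(SAMPLE)
print(metadata["Version"])  # prints 0.0.1
```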

Dataset Format

State,Date.Day,Date.Month,Temperature.Minimum,Temperature.Maximum
Delaware,12,2,45,67
Delaware,13,2,42,64
Virginia,12,2,24,35
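The dotted column names suggest a nested structure (for example, `Date.Day` and `Date.Month` grouped under `Date`). The sketch below shows one way such a row could be turned into a nested dictionary. It assumes that a dot separates nesting levels; `nest_row` is a hypothetical helper, not the builder's own code, and values are left as strings.

```python
import csv
import io

SAMPLE = """State,Date.Day,Date.Month,Temperature.Minimum,Temperature.Maximum
Delaware,12,2,45,67
Delaware,13,2,42,64
Virginia,12,2,24,35
"""


def nest_row(flat_row):
    """Turn {'Date.Day': '12', ...} into {'Date': {'Day': '12'}, ...}."""
    nested = {}
    for column, value in flat_row.items():
        *parents, leaf = column.split(".")
        target = nested
        for key in parents:
            # Create intermediate dictionaries on demand.
            target = target.setdefault(key, {})
        target[leaf] = value
    return nested


records = [nest_row(row) for row in csv.DictReader(io.StringIO(SAMPLE))]
print(records[0]["Temperature"]["Maximum"])  # prints 67
```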

Automatically Running Builder

Dataset versions are tracked via the index.json file in the configured build directory.

There are three things that trigger a build of a dataset:

  • Updating the version of a builder
  • Creating a new folder in source/, with the appropriate files and structure.
  • Incrementing the version number in a Metadata file, following Semver rules (X.Y.Z):
    • Patch (Z): When making bug fixes, not changing layout or structure.
    • Minor (Y): When new, backwards compatible functionality is introduced.
    • Major (X): When any backwards incompatible changes are introduced to the public API.
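The Semver rules above can be illustrated with a small helper. This is a sketch only: `bump_version` is a hypothetical function, and it follows the usual Semver convention of resetting the lower-order numbers when a higher one is incremented.

```python
def bump_version(version, part):
    """Return a new X.Y.Z string with the requested part incremented."""
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown version part: {part!r}")


print(bump_version("0.0.1", "patch"))  # prints 0.0.2
print(bump_version("0.0.1", "minor"))  # prints 0.1.0
```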

Whenever you push to the repo, a Travis script will scan the source/ directory and compare the information there with the current information in the index.json. Any out-of-date libraries are rebuilt and placed in the configured build directory.
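The comparison step could look roughly like the sketch below. The layout of index.json is not described here, so the code assumes a plain mapping from dataset name to the last built version; `out_of_date` and `parse_version` are hypothetical names, and the builder-version trigger is not modeled.

```python
def parse_version(text):
    """Convert 'X.Y.Z' into a tuple of integers so versions compare numerically."""
    return tuple(int(piece) for piece in text.split("."))


def out_of_date(source_versions, index_versions):
    """Return the dataset names that are new or newer than the recorded build.

    Both arguments are assumed to map dataset name -> 'X.Y.Z' version string.
    """
    stale = []
    for name, version in sorted(source_versions.items()):
        built = index_versions.get(name)
        if built is None or parse_version(version) > parse_version(built):
            stale.append(name)
    return stale


source = {"weather": "0.0.2", "police-shootings": "0.0.1", "music": "0.0.10"}
index = {"weather": "0.0.1", "music": "0.0.9"}
print(out_of_date(source, index))  # prints ['music', 'police-shootings', 'weather']
```

Comparing integer tuples rather than raw strings matters here: as strings, "0.0.10" would sort before "0.0.9".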

Manually Running Builder

$ python corgis.py police-shootings python

corgis's People

Contributors

acbart, githubbot, joungmin-choi, lukesg08, notsamdonald, samdonaldvt

corgis's Issues

Typo in County Demographics

One of the UTAs told me that there is a misspelling of a key in the
County Demographics data set. It is:

"Income"-->"Median Houseold Income"

Add BlockPy-style .get method for the Python version

This is a feature request. Can you add the get method that's available in the BlockPy version of CORGIS to the Python version?

If I use CORGIS in BlockPy, there's this handy .get method that takes in an attribute, filter column, and filter value, and returns a list of data values with the correct type.

If I use the Python version, I only have the get_XXX method, e.g. get_report in the AIDS dataset. This returns a complex Python object, but it doesn't have the .get method (unless I'm somehow missing it - it just seems to be a dictionary).

Why this matters: My students often start their projects in BlockPy, but some of them realize they want functionality (e.g. writing a file) outside of BlockPy's abilities, so they download their script. When they do, it suddenly doesn't work anymore. Even if they download the CORGIS .data and .py files and put them in the same directory, the API is different. And the .get method is a nice utility function even for advanced programmers, so why not include it in each .py file?

Thanks!
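For illustration, a helper along the lines requested might look like the sketch below. It is not BlockPy's implementation: the argument order, the dotted-path lookup, and the function name `get` are assumptions based only on the description above. The sample records reuse the weather rows from the README.

```python
RECORDS = [
    {"State": "Delaware", "Temperature": {"Minimum": 45, "Maximum": 67}},
    {"State": "Delaware", "Temperature": {"Minimum": 42, "Maximum": 64}},
    {"State": "Virginia", "Temperature": {"Minimum": 24, "Maximum": 35}},
]


def get(records, attribute, filter_column=None, filter_value=None):
    """Return the values of `attribute`, optionally keeping only matching records.

    Nested fields are addressed with dotted paths such as 'Temperature.Maximum'
    (an assumption made for this sketch).
    """
    def lookup(record, dotted_path):
        value = record
        for key in dotted_path.split("."):
            value = value[key]
        return value

    results = []
    for record in records:
        if filter_column is None or lookup(record, filter_column) == filter_value:
            results.append(lookup(record, attribute))
    return results


print(get(RECORDS, "Temperature.Maximum", "State", "Delaware"))  # prints [67, 64]
```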

Dataset issues

Medal of Honor: data is fine, it might be good to note in the description that a value of -1 means a missing value, since people who have never coded before may not know that is somewhat standard.

Music: Data is not fine. There is a field under artist called “hotttnesss” which is a kind of weird way to spell it. It seems like in this dataset, unknown int values default to 0, which should probably be noted in the description (especially if there is no universal int value for unknown). The description under the dropdown menu for Artist.similar is “unknown” and a message pops up saying “Given this filter, there was no variation in the data - visualization will be empty”. Artist.terms_freq is all 0 values. Release.name’s description is “unknown”, and I can’t seem to make a graph that makes proper use of the field. Song.artist_mbtags’ description is “unknown”, and all values are 0. For song.hotttnesss (again, a weird way to spell it), there are a lot of -1 values and 0 values. Does this mean both ints were used as unknown, or is there something else I don’t understand about that? Song.mode values are all 0. Song.title values are all 0, which makes sense because it would be strings, but I can’t find a way to use the song.title field because the only bar chart “group by” fields are artist.name and song.year. For song.year it should be noted somewhere that unknown values are 0.

Opioids: data is all ok. Might be a good idea to note that the x axis for line plots is years since 1999.

Police shootings: dataset is pretty sparse but the data is ok.
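Until the descriptions document the placeholder values, users can filter them out themselves. The sketch below assumes, as the report above suggests, that -1 (and in some fields 0) stands for a missing value; `drop_missing` is a hypothetical helper and the sample numbers are illustrative, not taken from any dataset.

```python
def drop_missing(values, sentinels=(-1,)):
    """Return the values with placeholder (sentinel) entries removed."""
    return [value for value in values if value not in sentinels]


# Illustrative numbers only.
print(drop_missing([3, -1, 5]))                     # prints [3, 5]
print(drop_missing([0, 1999, 2005], sentinels=(0,)))  # prints [1999, 2005]
```

Passing the sentinels explicitly keeps the choice visible, since a 0 can be a real measurement in one field and a placeholder in another.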

Make CORGIS Pip installable

Does this mean it installs all datasets?

Or perhaps installing Corgis allows you to launch an interactive installer util?

Or perhaps we can send parameters to Pip?

New User or Instructions

I've been looking at the old repo trying to figure out how to use it. The 'contributor.md' was helpful, and the issues list led me here. I'd like to become a contributor and start by helping with the csv file I found that needs commas.

My main question is how do I as a new user access the billionaires list or run it as a data set for just that one database?

How often is that database updated or can I access just one?

Would like to contribute where I can.

Health Dataset Documentation

Somewhere over the years, we lost the docs on the Disease dataset. It's now called the Health dataset. Here's my notes on it from before:

https://www.tycho.pitt.edu/
This library provides data about reported occurrences of diseases in American history over time. Some diseases (e.g., Measles) have more information and stretch back farther than others (e.g., "AIDS"). Some places seem to have missing information (either that, or the population of Virginia was immune to Measles until the 1920s).

Cancer Dataset has issues

Cancer Dataset explorer has a problem: report["Rates"]["Age and Sex"]["45 - 64"] (or any of the other 3 categories) does not display an example value (it is a dict with two keys, “Female” and “Male”, each holding a float value).

I also think this dataset might just be hot garbage and needs to be reworked from the ground up. I see this was an issue dating back to the old version.

food.csv - nutritional information does not mention portion size

Both the documentation and the .csv file lack mentions of portion size, which makes it impossible to interpret nutritional content figures.
(e.g. raw blueberries are stated to contain 9.7mg of Vitamin C, but how much fruit are we talking about exactly?)

A cursory examination of the documentation accompanying the USDA FNDDS database suggests the raw figures are based on 100g portions. It appears that the csv file uses the same (but I don't know for sure).
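If the figures are indeed per 100 g (an unconfirmed assumption, as noted above), converting to another portion size is a simple proportion. The helper name `nutrient_for_portion` and the 50 g portion are hypothetical, chosen only for illustration.

```python
def nutrient_for_portion(amount_per_100g, portion_grams):
    """Scale a per-100 g nutrient figure to a portion of the given weight."""
    return amount_per_100g * portion_grams / 100.0


# 9.7 mg of Vitamin C per 100 g of raw blueberries, scaled to a 50 g portion.
vitamin_c_mg = nutrient_for_portion(9.7, 50)
print(round(vitamin_c_mg, 2))
```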

Visualizer is slow to load big datasets

This isn't a network issue, it's client side. Very big datasets can take several seconds to render. Since this is happening on page load, it feels like the system is lagging. We should do a repaint before actually loading the dataset, have a loading symbol during dataset switches, and probably find some way to let it happen in the background more (suspensions? promises? web workers?).
