
lcd's Introduction

LCD

A study on the Long tail of Controversial Datasets

Planning

The study has roughly two parts. The first is defining the term "controversial" and what makes a dataset controversial.

The second is using that definition to understand how usage of controversial datasets changes over time. In outline:

  1. Define -- What makes a dataset controversial?

    • By features?

      • "Would you use a {release method} {content type} dataset from {data source} for {example work}?"
      • Ex: "Would you use a leaked (release method) human faces (images) dataset from a dating website (data source) for training ... (example work)?"
    • By usage?

      • Ex: "For learning / study?"
      • Ex: "In production?"
      • Ex: "Is the dataset readily available / easy to use?"
    • What are the most well-known "controversial" datasets? In plain text, what makes them controversial?

    • Target IT related fields

  2. Understand -- How does usage change over time?

    • Find a few datasets that have been deemed controversial and that have common import patterns, e.g. scikit-learn's housing dataset. Scrape GitHub for such patterns and look at the time of usage and the topic of the content.

    • This part should be specific to a few datasets to limit the number of variables.

    • We want to try to understand the context around the usage. Did the dataset move from "production" to "an example for students on why it was bad" or similar? Was there an immediate drop-off? How long a tail of active usage does the dataset have?
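The scraping step in part 2 could start from something like the sketch below. It is only an illustration under stated assumptions: the pattern table is hypothetical (it keys on the scikit-learn loader names `load_boston` and `load_iris`), and the input is assumed to be already-downloaded file contents paired with an ISO timestamp, since no collection method has been chosen yet.

```python
from __future__ import annotations

import re
from collections import Counter
from datetime import datetime

# Hypothetical pattern table: one regex per dataset of interest.
# Real patterns would need to be validated against actual repositories.
IMPORT_PATTERNS = {
    "boston_housing": re.compile(r"\bload_boston\b"),
    "iris": re.compile(r"\bload_iris\b"),
}


def detect_datasets(source: str) -> set[str]:
    """Return the names of all datasets whose pattern appears in the source text."""
    return {name for name, pattern in IMPORT_PATTERNS.items() if pattern.search(source)}


def usage_by_year(files) -> Counter:
    """Count files per (dataset, year).

    `files` is an iterable of (iso_timestamp, source_text) pairs, e.g. the
    last-commit time and contents of each scraped file (an assumed input shape).
    A file counts once per dataset, however many times the pattern matches.
    """
    counts = Counter()
    for timestamp, source in files:
        year = datetime.fromisoformat(timestamp).year
        for name in detect_datasets(source):
            counts[(name, year)] += 1
    return counts
```

Counting files per year gives a first view of drop-off and tail length. It says nothing yet about context (production versus teaching example), which would need topic or repository metadata on top.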

Datasets

Examples from Adam Harvey: https://twitter.com/adamhrv/status/1278604672408997889

Additional examples:

  • OKCupid Dataset
  • IBM Diversity in Faces (2019)
  • ENRON Emails
  • Ashley Madison Dataset

lcd's People

Watchers

Eva Maxfield Brown

lcd's Issues

General notes

We can try to separate a few datasets that are strictly educational from those potentially used in prototypes or production (Iris and Boston Housing are typically educational; MegaFace is typically in production).

Use Student / Academic account labels for GitHub accounts as the scraping basis for detecting dataset use.

Add GitHub username to the demographic info as a way to get GitHub info for scraping.

Describe in three paragraphs:

  • Trying to recruit GitHub profiles via survey
  • Mine GitHub for use of datasets agnostic of inclusion criteria (as a method for comparing our subset to the larger development community)
  • Creating a conceptual model of controversy via encoding and generalizing the datasets on exposing.ai and their reasoning.
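One possible starting point for the encoding step is to reuse the feature slots from the survey question template in the Planning notes (release method, content type, data source, example work). The sketch below is an assumption about how that encoding might look, not a settled schema. The class name `DatasetProfile`, its fields, and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetProfile:
    """Hypothetical encoding of a dataset by the features in the question template."""

    name: str
    release_method: str  # e.g. "leaked", "scraped", "published"
    content_type: str    # e.g. "human faces", "emails"
    data_source: str     # e.g. "a dating website"


def survey_question(profile: DatasetProfile, example_work: str) -> str:
    """Fill the survey template with one encoded dataset and one kind of work."""
    return (
        f"Would you use a {profile.release_method} {profile.content_type} "
        f"dataset from {profile.data_source} for {example_work}?"
    )


# Placeholder values only; they do not describe any real dataset.
example = DatasetProfile(
    name="example-dataset",
    release_method="leaked",
    content_type="human faces",
    data_source="a dating website",
)
print(survey_question(example, "learning or study"))
```

Keeping the encoding and the question template in one place means the same records could drive both the survey wording and any later grouping of datasets by feature.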

Week 1

Proposal:

  • A set of questions, why each is worth answering, and how each could be answered
    • Can be somewhat general (e.g. "what is the rate of decay for a dataset that has been decommissioned")
  • Should state a hypothesis or two (e.g. "what do we expect the rate of decay to be given XYZ conditions")
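To make "rate of decay" concrete, one option is to assume exponential decay, count(t) = A * exp(-k * t), and estimate k with a least-squares line through the log of yearly usage counts. The notes do not specify a decay model, so the exponential form, the function names, and the input shape below are all assumptions for illustration.

```python
import math


def estimate_decay_rate(counts_by_year: dict) -> float:
    """Estimate k in count(t) = A * exp(-k * t) from yearly usage counts.

    Fits a least-squares line to (year, log(count)) and returns the negated
    slope, so a positive k means usage is shrinking. Years with a zero count
    are dropped because log(0) is undefined, which biases the fit toward
    slower decay.
    """
    points = [
        (year, math.log(count))
        for year, count in sorted(counts_by_year.items())
        if count > 0
    ]
    if len(points) < 2:
        raise ValueError("need at least two years with non-zero counts")
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return -(sxy / sxx)


def half_life(k: float) -> float:
    """Years for usage to halve under the exponential model; requires k > 0."""
    if k <= 0:
        raise ValueError("usage is not decaying under this model")
    return math.log(2) / k
```

A hypothesis could then be stated in terms of k or the half-life, for example by comparing the estimate for a decommissioned dataset against one that is still distributed.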

Outline:

  • Question
  • Justification for the question
  • Hypothesis
  • Test Protocol
    • What data source
    • What data collection method
    • What data analysis method

Some resources:
