snorkel's Introduction

v0.5.0

Acknowledgements

Sponsored in part by DARPA as part of the SIMPLEX program under contract number N66001-15-C-4043 and also by the NIH through the Mobilize Center under grant number U54EB020405.

Getting Started

  • Installation instructions below
  • Get started with the tutorials below
  • Documentation here

Motivation

Snorkel is a system for rapidly creating, modeling, and managing training data, currently focused on accelerating the development of structured or "dark" data extraction applications for domains in which large labeled training sets are not available or easy to obtain.

Today's state-of-the-art machine learning models require massive labeled training sets, which usually do not exist for real-world applications. Snorkel is instead based on the new data programming paradigm, in which the developer focuses on writing a set of labeling functions: scripts that programmatically label data. The resulting labels are noisy, but Snorkel automatically models the labeling process, learning, in essence, which labeling functions are more accurate than others, and then uses that model to train an end model (for example, a deep neural network in TensorFlow).

Surprisingly, by modeling a noisy training set creation process in this way, we can take potentially low-quality labeling functions from the user, and use these to train high-quality end models. We see Snorkel as providing a general framework for many weak supervision techniques, and as defining a new programming model for weakly-supervised machine learning systems.
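To make the idea of a labeling function concrete, here is a minimal sketch in plain Python. It does not use Snorkel's API: the function names, the keyword lists, and the label convention (1 = positive, -1 = negative, 0 = abstain) are all assumptions made for illustration. The votes are combined with a simple majority vote, which is only a stand-in for the model Snorkel learns over labeling function accuracies.

```python
# Illustrative sketch only: plain Python, NOT Snorkel's API.
# Assumed label convention: 1 = positive, -1 = negative, 0 = abstain.
POSITIVE, NEGATIVE, ABSTAIN = 1, -1, 0


def lf_marriage_words(sentence):
    """Vote positive if a marriage-related word appears in the sentence."""
    text = sentence.lower()
    return POSITIVE if any(k in text for k in ("married", "wedding")) else ABSTAIN


def lf_spouse_words(sentence):
    """Vote positive if the sentence mentions a wife or husband."""
    text = sentence.lower()
    return POSITIVE if any(k in text for k in ("wife", "husband")) else ABSTAIN


def lf_other_family(sentence):
    """Vote negative if a non-spouse family relation is mentioned."""
    text = sentence.lower()
    return NEGATIVE if any(k in text for k in ("brother", "sister", "daughter")) else ABSTAIN


# Hypothetical registry of labeling functions for this sketch.
LABELING_FUNCTIONS = [lf_marriage_words, lf_spouse_words, lf_other_family]


def majority_vote(votes):
    """Combine +1/-1/0 votes: the sign of the sum is the majority label."""
    total = sum(votes)
    if total > 0:
        return POSITIVE
    if total < 0:
        return NEGATIVE
    return ABSTAIN  # no votes, or a tie


def weak_label(sentence):
    """Apply every labeling function and combine the noisy votes."""
    return majority_vote([lf(sentence) for lf in LABELING_FUNCTIONS])


if __name__ == "__main__":
    examples = [
        "Alice married Bob in 2010.",
        "Alice and her brother Bob visited Paris.",
        "Alice met Bob at a conference.",
    ]
    for sentence in examples:
        print(weak_label(sentence), sentence)
```

Each function is cheap to write and individually unreliable (for instance, the substring match on "wife" also fires on "midwife"). A majority vote treats every function as equally trustworthy, whereas the approach described above estimates which functions are more accurate and weights them accordingly.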

Users

We're lucky to have some amazing collaborators who are currently using Snorkel!

However, Snorkel is very much a work in progress, so we're eager for any and all feedback... let us know what you think and how we can improve Snorkel in the Issues section!

References

  • Data Programming: Creating Large Training Sets, Quickly (NIPS 2016)
  • Data Programming with DDLite: Putting Humans in a Different Part of the Loop (HILDA @ SIGMOD 2016)
  • Snorkel: A System for Lightweight Extraction (CIDR 2017)
  • Data Programming: ML with Weak Supervision (blog)
  • Learning the Structure of Generative Models without Labeled Data (preprint)
  • Fonduer: Knowledge Base Construction from Richly Formatted Data (preprint, blog)

Installation / dependencies

Snorkel uses Python 2.7 and requires a few Python packages, which can be installed using pip:

pip install --requirement python-package-requirement.txt

If a package installation fails, then all of the packages below it in python-package-requirement.txt will fail to install as well. This can be avoided by running the following command instead of the above:

cat python-package-requirement.txt | xargs -n 1 pip install

Note that you may have to run pip2 if you have Python 3 installed on your system, and that sudo can be prepended to install dependencies system-wide if this is an option and the above does not work. For some pointers on difficulties in using source in shell, see Issue 506.

Finally, enable ipywidgets:

jupyter nbextension enable --py widgetsnbextension --sys-prefix

Note: Currently the Viewer is supported on the following versions:

  • jupyter: 4.1
  • jupyter notebook: 4.2

By default (e.g. in the tutorials, etc.) we also use Stanford CoreNLP for pre-processing text; you will be prompted to install this when you run run.sh.

Working with Conda

One great option, which can make installation and use easier, is to use conda. If you are running multiple versions of Python, you might need to run:

conda create -n py2Env python=2.7 anaconda

And then run the correct environment:

source activate py2Env

Installing Numbskull + NUMBA

Snorkel currently relies on numbskull and numba, which occasionally require a bit more work to install! One option is to use conda as above. If installing manually, you may just need to make sure the right versions of llvmlite and LLVM are installed and used; for example on Ubuntu, run:

apt-get install llvm-3.8
LLVM_CONFIG=/usr/bin/llvm-config-3.8 pip install llvmlite
LLVM_CONFIG=/usr/bin/llvm-config-3.8 pip install numba

and on Mac OS X, one option is to use homebrew as follows:

brew install llvm38 --with-rtti
LLVM_CONFIG=/usr/local/Cellar/llvm\@3.8/3.8.1/bin/llvm-config-3.8 pip install llvmlite
LLVM_CONFIG=/usr/local/Cellar/llvm\@3.8/3.8.1/bin/llvm-config-3.8 pip install numba

Finally, once numba is installed, re-run the numbskull install from python-package-requirement.txt:

pip install git+https://github.com/HazyResearch/numbskull@master
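After these steps, it can help to confirm that the packages actually import in the environment you plan to use. The following is a small sketch, not part of Snorkel: the helper name installed_versions is hypothetical, and the import names llvmlite, numba, and numbskull are assumed to match the package names.

```python
# Hypothetical helper, not part of Snorkel: report which modules import cleanly.
import importlib


def installed_versions(module_names):
    """Map each module name to its version string, or None if it cannot be imported."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None
        else:
            # Not every module exposes __version__, so fall back to a placeholder.
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


if __name__ == "__main__":
    # Import names below are assumed to match the pip package names.
    report = installed_versions(["llvmlite", "numba", "numbskull"])
    for name in sorted(report):
        version = report[name]
        print("%s: %s" % (name, version if version is not None else "NOT INSTALLED"))
```

A module reported as NOT INSTALLED here was not importable by the interpreter that ran the check, which is a common symptom of installing into a different Python environment than the one in use.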

Using virtualenv

Alternatively, virtualenv can be used by starting with:

virtualenv -p python2.7 .virtualenv
source .virtualenv/bin/activate

If you have issues using Jupyter notebooks with virtualenv, see this tutorial.

Running

After installing (see above), just run:

./run.sh

Learning how to use Snorkel

The introductory tutorial covers the entire Snorkel workflow, showing how to extract spouse relations from news articles. The tutorial is available in the following directory:

tutorials/intro

Issues

We like issues as a place to put bugs, questions, feature requests, etc. Don't be shy! If submitting an issue about a bug, however, please provide a pointer to a notebook (and relevant data) to reproduce it.

Note: if you have an issue with the matplotlib install related to the module freetype, see this post; if you have an issue installing ipython, try upgrading setuptools.

Jupyter Notebook Best Practices

Snorkel is built specifically with usage in Jupyter/IPython notebooks in mind; an incomplete set of best practices for the notebooks:

It's usually most convenient to write most code in an external .py file, and load as a module that's automatically reloaded; use:

%load_ext autoreload
%autoreload 2

A more convenient option is to add these lines to your IPython config file, in ~/.ipython/profile_default/ipython_config.py:

c.InteractiveShellApp.extensions = ['autoreload']
c.InteractiveShellApp.exec_lines = ['%autoreload 2']
