snorkel's Introduction

v0.7.0-beta

Acknowledgements

Sponsored in part by DARPA as part of the D3M program under contract No. FA8750-17-2-0095 and the SIMPLEX program under contract number N66001-15-C-4043, and also by the NIH through the Mobilize Center under grant number U54EB020405.

Getting Started

Motivation

Check out a recent one-pager about Snorkel and the Software 2.0 vision!

Snorkel is a system for rapidly creating, modeling, and managing training data, currently focused on accelerating the development of structured or "dark" data extraction applications for domains in which large labeled training sets are not available or easy to obtain.

Today's state-of-the-art machine learning models require massive labeled training sets, which usually do not exist for real-world applications. Snorkel is instead based on the data programming paradigm, in which the developer focuses on writing a set of labeling functions: scripts that programmatically label data. The resulting labels are noisy, but Snorkel automatically models this labeling process, essentially learning which labeling functions are more accurate than others, and then uses the result to train an end model (for example, a deep neural network in TensorFlow).

Surprisingly, by modeling a noisy training set creation process in this way, we can take potentially low-quality labeling functions from the user, and use these to train high-quality end models. We see Snorkel as providing a general framework for many weak supervision techniques, and as defining a new programming model for weakly-supervised machine learning systems.
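To make the idea of labeling functions concrete, here is a minimal sketch in plain Python. It does not use Snorkel's API: the function names, the label convention (1, -1, 0 for abstain), and the example sentences are all assumptions made for illustration. The final step uses a simple majority vote as a stand-in; Snorkel instead learns a model that weights each labeling function by its estimated accuracy.

```python
# Label convention assumed for this sketch (not taken from Snorkel's API).
POSITIVE, NEGATIVE, ABSTAIN = 1, -1, 0


def lf_married(sentence):
    """Vote POSITIVE when the sentence mentions marriage, else abstain."""
    return POSITIVE if "married" in sentence.lower() else ABSTAIN


def lf_spouse_words(sentence):
    """Vote POSITIVE when the sentence contains 'wife' or 'husband'."""
    text = sentence.lower()
    return POSITIVE if ("wife" in text or "husband" in text) else ABSTAIN


def lf_sibling(sentence):
    """Vote NEGATIVE when the sentence suggests a sibling relation."""
    text = sentence.lower()
    return NEGATIVE if ("brother" in text or "sister" in text) else ABSTAIN


LABELING_FUNCTIONS = [lf_married, lf_spouse_words, lf_sibling]


def apply_lfs(sentences):
    """Build a label matrix: one row per sentence, one column per function."""
    return [[lf(s) for lf in LABELING_FUNCTIONS] for s in sentences]


def majority_vote(row):
    """Combine noisy votes by their sign; ties and all-abstain rows abstain."""
    total = sum(row)
    if total > 0:
        return POSITIVE
    if total < 0:
        return NEGATIVE
    return ABSTAIN


# Hypothetical candidate sentences.
sentences = [
    "Alice married Bob in 2010.",
    "Carol and her brother Dan run a bakery.",
    "Eve met Frank at a conference.",
]

label_matrix = apply_lfs(sentences)
noisy_labels = [majority_vote(row) for row in label_matrix]
print(label_matrix)
print(noisy_labels)
```

Each labeling function is cheap to write and individually unreliable; the value comes from combining many of them and letting the system estimate which ones to trust.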

Users & Sponsors

We're lucky to have some amazing collaborators who are currently using Snorkel!

However, Snorkel is very much a work in progress, so we're eager for any and all feedback... let us know what you think and how we can improve Snorkel in the Issues section!

Quick Start

This section has the commands to quickly get started running Snorkel. For more detailed installation instructions, see the Installation section below. These instructions assume that you already have conda installed.

First, download and extract a copy of the Snorkel directory from a GitHub release (version 0.7.0 or greater). Then navigate to the root of the snorkel directory in a terminal and run the following:

# Install the environment
conda env create --file=environment.yml

# Activate the environment
source activate snorkel

# Install snorkel in the environment
pip install .

# Optionally: You may need to explicitly set the Jupyter Notebook kernel
python -m ipykernel install --user --name snorkel --display-name "Python (snorkel)"

# Activate jupyter widgets
jupyter nbextension enable --py widgetsnbextension

# Initiate a jupyter notebook server
jupyter notebook

Then a Jupyter notebook tab will open in your browser. From here you can run existing Snorkel notebooks or create your own.

Note: This will install the default version of Python on your system; to specify a specific version, change the python version in the dependencies list in environment.yml, e.g. to python=2.7.

Tutorials

From within the Jupyter browser, navigate to the tutorials directory and try out one of the existing notebooks!

The introductory tutorial in tutorials/intro covers the entire Snorkel workflow, showing how to extract spouse relations from news articles. You can also check out all the great materials from the recent Mobilize Center-hosted Snorkel workshop!

Release Notes

Major changes in v0.7:

  • PyTorch classifiers
  • Installation now via Conda and pip
  • Now spaCy is the default parser (v1), with support for v2
  • And many more fixes, additions, and new material!

Older versions

Major changes in v0.6:

  • Support for categorical classification, including "dynamically-scoped" or blocked categoricals (see tutorial)
  • Support for structure learning (see tutorial, ICML 2017 paper)
  • Support for labeled data in generative model
  • Refactor of TensorFlow bindings; fixes grid search and model saving / reloading issues (see snorkel/learning)
  • New, simplified Intro tutorial
  • Refactored parser class and support for spaCy as new parser
  • Support for easy use of the BRAT annotation tool (see tutorial)
  • Initial Spark integration, for scale out of LF application (see tutorial)
  • Tutorial on using crowdsourced data
  • Integration with Apache Tika via the Tika Python binding.
  • And many more fixes, additions, and new material!

Installation

Starting with version 0.7.0, Snorkel should be installed as a Python package using pip. However, installing Snorkel via pip will not install dependencies, which are required for Snorkel to run. To manage its dependencies, Snorkel uses conda, which allows specifying an environment via an environment.yml file.

This documentation covers two common cases (usage and development) for setting up conda environments for Snorkel. In both cases, the environment can be activated using conda activate snorkel and deactivated using conda deactivate (for versions of conda prior to 4.4, replace conda with source in these commands). Users just looking to try out a Snorkel tutorial notebook should see the quick-start instructions above.

Using Snorkel as a Package

This setup is intended for users who would like to use Snorkel in their own applications by importing the package. In such cases, users should define a custom environment.yml to manage their project's dependencies. We recommend starting with the environment.yml in this repository. The following modifications can help customize it for your needs:

  1. Specifying versions for the listed packages, such as changing python to python=3.6.5. Versioned specification of your environment is critical to reproducibility and ensuring dependency updates do not break your pipeline. When first setting your package versions, you likely want to start with the latest versions available on the conda-forge channel, unless you have a reason to do otherwise.
  2. Adding other packages to your environment as required by your use case. Consider keeping the packages in environment.yml sorted alphabetically to assist with maintainability. In addition, we recommend installing packages via pip only if they are not available in the conda-forge channel.
  3. Adding the snorkel package installation to your environment.yml, under the - pip section. We suggest versioning snorkel, which you can do via a release tag or a commit hash (to access more bleeding-edge functionality):
  # Versioned via release tag
  - git+https://github.com/HazyResearch/snorkel@v0.7.0-beta
  # Versioned via commit hash (commit hash below is fake to ensure you change it)
  - git+https://github.com/HazyResearch/snorkel@7eb7076f70078c06bef9752f22acf92fd86e616a

Finally, consider versioning the numbskull and treedlib pip dependencies by changing master to their latest commit hash on GitHub.
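As a rough aid for the pinning advice above, the following sketch lists conda dependencies that have no version specifier. It is an illustration only: the inlined environment.yml contents (including the numpy version) are hypothetical, and the parser is a deliberately simple line-based check that assumes two-space indentation rather than a full YAML parser.

```python
import re

# Hypothetical environment.yml contents, inlined so the sketch is self-contained.
ENVIRONMENT_YML = """\
name: snorkel
channels:
  - conda-forge
dependencies:
  - numpy=1.14.3
  - python
  - pip:
    - git+https://github.com/HazyResearch/snorkel@master
"""


def unpinned_conda_packages(text):
    """Return top-level conda dependencies that lack a version specifier.

    Assumes entries look like '  - name' or '  - name=version'; nested pip
    entries (indented further) and other top-level sections are ignored.
    """
    unpinned = []
    in_dependencies = False
    for line in text.splitlines():
        if re.match(r"^\S", line):
            # A new top-level key starts; track whether it is 'dependencies'.
            in_dependencies = line.startswith("dependencies:")
            continue
        # A bare package name with no '=', '<', '>' or ':' after it.
        match = re.match(r"^  - ([A-Za-z0-9_.\-]+)\s*$", line)
        if in_dependencies and match:
            unpinned.append(match.group(1))
    return unpinned


print(unpinned_conda_packages(ENVIRONMENT_YML))
```

In this hypothetical file only python is reported, since numpy already carries a version. Pip entries that point at a branch such as master are not inspected by this sketch and still need to be reviewed by hand.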

Development Environment

This setup is intended for users who have cloned this repository and would like to access the environment for development. This approach installs the snorkel package in development mode, meaning that changes you make to the source code will automatically be applied to the snorkel package in the environment.

# From the root directory of this repo, run the following command.
conda env create --file=environment.yml

# Activate the conda environment (if using a version of conda below 4.4, use "source" instead of "conda")
conda activate snorkel

# Install snorkel in development mode
pip install --editable .

Additional installation notes

Snorkel can be installed directly from its GitHub repository via:

# WARNING: Read the Installation section before running this command! This command
# does not install any dependencies. It installs the latest master version, but
# you can change master to a tag or commit hash.
pip install git+https://github.com/HazyResearch/snorkel@master

Note: Currently the Viewer is supported on the following versions:

  • jupyter: 4.1
  • jupyter notebook: 4.2

Q & A

Many questions about Snorkel are answered in the Issues section, along with general discussions and conversations of interest. We tag all of these as "Q&A".

Issues

We like issues as a place to put bugs, questions, feature requests, etc., so don't be shy! If you are submitting an issue about a bug, however, please provide a pointer to a notebook (and relevant data) that reproduces it.

Note: if you have an issue with the matplotlib install related to the module freetype, see this post; if you have an issue installing ipython, try upgrading setuptools

Jupyter Notebook Best Practices

Snorkel is built specifically with usage in Jupyter/IPython notebooks in mind; an incomplete set of best practices for the notebooks:

It's usually most convenient to write most code in an external .py file, and load as a module that's automatically reloaded; use:

%load_ext autoreload
%autoreload 2

A more convenient option is to add these lines to your IPython config file, in ~/.ipython/profile_default/ipython_config.py:

c.InteractiveShellApp.extensions = ['autoreload']
c.InteractiveShellApp.exec_lines = ['%autoreload 2']
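For intuition about what autoreload saves you from doing by hand, the sketch below edits a module on disk and picks up the change with the standard library's importlib.reload. This is only a rough manual analogue: autoreload performs the reload automatically before running each cell and also updates existing objects, which this sketch does not do. The module name my_lfs_demo and its contents are hypothetical.

```python
import importlib
import os
import sys
import tempfile

# Avoid writing .pyc files so the demo always reads the current source.
sys.dont_write_bytecode = True

# Create a throwaway directory containing a small module (hypothetical name).
workdir = tempfile.mkdtemp()
module_path = os.path.join(workdir, "my_lfs_demo.py")
with open(module_path, "w") as f:
    f.write("THRESHOLD = 1\n")

# Make the directory importable and load the module for the first time.
sys.path.insert(0, workdir)
importlib.invalidate_caches()
my_lfs_demo = importlib.import_module("my_lfs_demo")
first_value = my_lfs_demo.THRESHOLD

# Simulate editing the external .py file in an editor.
with open(module_path, "w") as f:
    f.write("THRESHOLD = 22\n")

# Without a reload the old value would remain; reload re-executes the source.
importlib.reload(my_lfs_demo)
second_value = my_lfs_demo.THRESHOLD

print(first_value, second_value)
```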
