Giter VIP home page Giter VIP logo

mycroft's Introduction

Mycroft

Mycroft is a toolkit for doing text classification with word embeddings.

Out of the box Mycroft provides a command line interface for training and evaluating different kinds of neural network classifiers. It also provides a programmatic interface that help you write you own models. This interface allows you to focus on writing the model while getting data munging processing, text processing and embedding, prediction and evaluation code which remains the same regardless of model, and a command line interface for free.

Installation

Install Mycroft by running python setup.py install in this directory.

Mycroft uses the spaCy text natural language processing toolkit to process text. The built-in models just use spaCy for tokenization, but models built on Mycroft's programmatic interface may use spaCy's more advanced features like part-of-speech tagging and syntactic parsing. By default it installs spaCy's English language text model, though you may specify other language models from the command line. See spaCy's models documentation for more information.

Running from the Command Line

Run Mycroft with the command mycroft. Subcommands enable you to train models and use them to make predictions on unlabeled data sets. Run mycroft --help for details about specific commands.

The training data is a comma- or tab-delimited file with column of text and a column of labels. The test data is in the same format without the labels.

Mycroft implements two kinds of word-embedding models.

  • Recurrent neural network

    300-dimensional GloVe vectors are used to embed the text into matrices of size sequence length ร— 300, clipping or padding the first dimension for each individual text as needed. A stack of recurrent neural networks (either GRUs or LSTMs) converts these embeddings to a single vector which a softmax layer then uses to make a label prediction.

  • Convolutional network

    This works the same as the recurrent neural network, except that it summarizes the sentence embedding matrices with a 1-dimensional convolutional/max-pooling network instead of a stack of recurrent neural networks.

  • Bag of words

    The same GloVe vectors are used to embed the tokens in the text. A softmax layer uses the average of the token embeddings to make a label prediction.

The hyper-parameters of these models are specified by command line parameters. Command line parameters can also be passed in as a text file, one parameter per line, with the text file name prefixed with an @ sign, e.g. mycroft @my-args.

Evaluation on training and validation sets returns the classification accuracy and the cross entropy loss.

Run mycroft demo to see a quick example of the command line syntax and data formats.

Programmatic Interface

You can write your own Keras-based text-embedding classifiers by extending the mycroft.model.TextEmbeddingClassifier base class in and using subclasses of mycroft.text.Embedder to handle text processing and word embedding.

See convolution_net.py in the examples for detailed instructions on how to do this.

mycroft's People

Contributors

wpm avatar

Watchers

James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.