Giter VIP home page Giter VIP logo

wmd-relax's Introduction

Fast Word Mover's Distance Build Status PyPI codecov

Calculates Word Mover's Distance as described in From Word Embeddings To Document Distances by Matt Kusner, Yu Sun, Nicholas Kolkin and Kilian Weinberger.

Word Mover's Distance

The high level logic is written in Python, the low level functions related to linear programming are offloaded to the bundled native extension. The native extension can be built as a generic shared library not related to Python at all. Python 2.7 and older are not supported. The heavy-lifting is done by google/or-tools.

Installation

pip3 install wmd

Tested on Linux and macOS.

Usage

You should have the embeddings numpy array and the nbow model - that is, every sample is a weighted set of items, and every item is embedded.

import numpy
from wmd import WMD

embeddings = numpy.array([[0.1, 1], [1, 0.1]], dtype=numpy.float32)
nbow = {"first":  ("#1", [0, 1], numpy.array([1.5, 0.5], dtype=numpy.float32)),
        "second": ("#2", [0, 1], numpy.array([0.75, 0.15], dtype=numpy.float32))}
calc = WMD(embeddings, nbow, vocabulary_min=2)
print(calc.nearest_neighbors("first"))
[('second', 0.10606599599123001)]

embeddings must support __getitem__ which returns an item by it's identifier; particularly, numpy.ndarray matches that interface. nbow must be iterable - returns sample identifiers - and support __getitem__ by those identifiers which returns tuples of length 3. The first element is the human-readable name of the sample, the second is an iterable with item identifiers and the third is numpy.ndarray with the corresponding weights. All numpy arrays must be float32. The return format is the list of tuples with sample identifiers and relevancy indices (lower the better).

It is possible to use this package with spaCy:

import spacy
import wmd

nlp = spacy.load('en_core_web_md')
nlp.add_pipe(wmd.WMD.SpacySimilarityHook(nlp), last=True)
doc1 = nlp("Politician speaks to the media in Illinois.")
doc2 = nlp("The president greets the press in Chicago.")
print(doc1.similarity(doc2))

Besides, see another example which finds similar Wikipedia pages.

Building from source

Either build it as a Python package:

pip3 install git+https://github.com/src-d/wmd-relax

or use CMake:

git clone --recursive https://github.com/src-d/wmd-relax
cmake -D CMAKE_BUILD_TYPE=Release .
make -j

Please note the --recursive flag for git clone. This project uses source{d}'s fork of google/or-tools as the git submodule.

Tests

Tests are in test.py and use the stock unittest package.

Documentation

cd doc
make html

The files are in doc/doxyhtml and doc/html directories.

Contributions

...are welcome! See CONTRIBUTING and code of conduct.

License

Apache 2.0

README {#ignore_this_doxygen_anchor}

wmd-relax's People

Contributors

vmarkovtsev avatar juanjux avatar zurk avatar quinor avatar mcuadros avatar maybemkl avatar ilyasyoy avatar strikerrus avatar wbecker avatar cfreksen avatar rafaeljcdarce avatar

Watchers

James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.