explosion / spacy

💫 Industrial-strength Natural Language Processing (NLP) in Python

Home Page: https://spacy.io

License: MIT License

Shell 0.01% Python 53.46% HTML 0.09% JavaScript 2.65% Makefile 0.02% Sass 0.78% Jinja 0.22% Cython 10.75% Dockerfile 0.01% C++ 0.01% C 0.13% TypeScript 0.36% MDX 31.52%
natural-language-processing data-science machine-learning python cython nlp artificial-intelligence ai spacy nlp-library

spacy's Introduction

spaCy: Industrial-strength NLP

spaCy is a library for advanced Natural Language Processing in Python and Cython. It's built on the very latest research, and was designed from day one to be used in real products.

spaCy comes with pretrained pipelines and currently supports tokenization and training for 70+ languages. It features state-of-the-art speed and neural network models for tagging, parsing, named entity recognition, text classification and more, multi-task learning with pretrained transformers like BERT, as well as a production-ready training system and easy model packaging, deployment and workflow management. spaCy is commercial open-source software, released under the MIT license.

💫 Version 3.7 out now! Check out the release notes here.


📖 Documentation

  • ⭐️ spaCy 101: New to spaCy? Here's everything you need to know!
  • 📚 Usage Guides: How to use spaCy and its features.
  • 🚀 New in v3.0: New features, backwards incompatibilities and migration guide.
  • 🪐 Project Templates: End-to-end workflows you can clone, modify and run.
  • 🎛 API Reference: The detailed reference for spaCy's API.
  • ⏩ GPU Processing: Use spaCy with CUDA-compatible GPU processing.
  • 📦 Models: Download trained pipelines for spaCy.
  • 🦙 Large Language Models: Integrate LLMs into spaCy pipelines.
  • 🌌 Universe: Plugins, extensions, demos and books from the spaCy ecosystem.
  • ⚙️ spaCy VS Code Extension: Additional tooling and features for working with spaCy's config files.
  • 👩‍🏫 Online Course: Learn spaCy in this free and interactive online course.
  • 📰 Blog: Read about current spaCy and Prodigy development, releases, talks and more from Explosion.
  • 📺 Videos: Our YouTube channel with video tutorials, talks and more.
  • 🛠 Changelog: Changes and version history.
  • 💝 Contribute: How to contribute to the spaCy project and code base.
  • 👕 Swag: Support us and our work with unique, custom-designed swag!
  • Tailored Solutions: Custom NLP consulting, implementation and strategic advice by spaCy's core development team. Streamlined, production-ready, predictable and maintainable. Send us an email or take our 5-minute questionnaire, and we'll be in touch! Learn more →

💬 Where to ask questions

The spaCy project is maintained by the spaCy team. Please understand that we won't be able to provide individual support via email. We also believe that help is much more valuable if it's shared publicly, so that more people can benefit from it.

  • 🚨 Bug Reports: GitHub Issue Tracker
  • 🎁 Feature Requests & Ideas: GitHub Discussions
  • 👩‍💻 Usage Questions: GitHub Discussions · Stack Overflow
  • 🗯 General Discussion: GitHub Discussions

Features

  • Support for 70+ languages
  • Trained pipelines for different languages and tasks
  • Multi-task learning with pretrained transformers like BERT
  • Support for pretrained word vectors and embeddings
  • State-of-the-art speed
  • Production-ready training system
  • Linguistically-motivated tokenization
  • Components for named entity recognition, part-of-speech tagging, dependency parsing, sentence segmentation, text classification, lemmatization, morphological analysis, entity linking and more
  • Easily extensible with custom components and attributes
  • Support for custom models in PyTorch, TensorFlow and other frameworks
  • Built-in visualizers for syntax and NER
  • Easy model packaging, deployment and workflow management
  • Robust, rigorously evaluated accuracy

📖 For more details, see the facts, figures and benchmarks.

⏳ Install spaCy

For detailed installation instructions, see the documentation.

  • Operating system: macOS / OS X Β· Linux Β· Windows (Cygwin, MinGW, Visual Studio)
  • Python version: Python 3.7+ (only 64 bit)
  • Package managers: pip Β· conda (via conda-forge)

pip

Using pip, spaCy releases are available as source packages and binary wheels. Before you install spaCy and its dependencies, make sure that your pip, setuptools and wheel are up to date.

pip install -U pip setuptools wheel
pip install spacy

To install additional data tables for lemmatization and normalization you can run pip install spacy[lookups] or install spacy-lookups-data separately. The lookups package is needed to create blank models with lemmatization data, and to lemmatize in languages that don't yet come with pretrained models and aren't powered by third-party libraries.
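For example, here is a minimal sketch of that use case, assuming spacy[lookups] is installed (spaCy v3 API; the exact lemmas depend on the lookup tables):

import spacy

# create a blank English pipeline and add a lookup-based lemmatizer,
# which pulls its tables from the spacy-lookups-data package
nlp = spacy.blank("en")
nlp.add_pipe("lemmatizer", config={"mode": "lookup"})
nlp.initialize()

doc = nlp("The cats were running")
print([token.lemma_ for token in doc])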

When using pip it is generally recommended to install packages in a virtual environment to avoid modifying system state:

python -m venv .env
source .env/bin/activate
pip install -U pip setuptools wheel
pip install spacy

conda

You can also install spaCy from conda via the conda-forge channel. For the feedstock including the build recipe and configuration, check out this repository.

conda install -c conda-forge spacy

Updating spaCy

Some updates to spaCy may require downloading new statistical models. If you're running spaCy v2.0 or higher, you can use the validate command to check if your installed models are compatible and if not, print details on how to update them:

pip install -U spacy
python -m spacy validate

If you've trained your own models, keep in mind that your training and runtime inputs must match. After updating spaCy, we recommend retraining your models with the new version.

📖 For details on upgrading from spaCy 2.x to spaCy 3.x, see the migration guide.

📦 Download model packages

Trained pipelines for spaCy can be installed as Python packages. This means that they're a component of your application, just like any other module. Models can be installed using spaCy's download command, or manually by pointing pip to a path or URL.

Documentation
Available Pipelines Detailed pipeline descriptions, accuracy figures and benchmarks.
Models Documentation Detailed usage and installation instructions.
Training How to train your own pipelines on your data.
# Download best-matching version of specific model for your spaCy installation
python -m spacy download en_core_web_sm

# pip install .tar.gz archive or .whl from path or URL
pip install /Users/you/en_core_web_sm-3.0.0.tar.gz
pip install /Users/you/en_core_web_sm-3.0.0-py3-none-any.whl
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0.tar.gz

Loading and using models

To load a model, use spacy.load() with the model name or a path to the model data directory.

import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence.")

You can also import a model directly via its full name and then call its load() method with no arguments.

import spacy
import en_core_web_sm

nlp = en_core_web_sm.load()
doc = nlp("This is a sentence.")

📖 For more info and examples, check out the models documentation.

⚒ Compile from source

The other way to install spaCy is to clone its GitHub repository and build it from source. That is the common way if you want to make changes to the code base. You'll need to make sure that you have a development environment consisting of a Python distribution including header files, a compiler, pip, virtualenv and git installed. The compiler part is the trickiest. How to do that depends on your system.

  • Ubuntu: Install system-level dependencies via apt-get: sudo apt-get install build-essential python-dev git
  • Mac: Install a recent version of XCode, including the so-called "Command Line Tools". macOS and OS X ship with Python and git preinstalled.
  • Windows: Install a version of the Visual C++ Build Tools or Visual Studio Express that matches the version that was used to compile your Python interpreter.

For more details and instructions, see the documentation on compiling spaCy from source and the quickstart widget to get the right commands for your platform and Python version.

git clone https://github.com/explosion/spaCy
cd spaCy

python -m venv .env
source .env/bin/activate

# make sure you are using the latest pip
python -m pip install -U pip setuptools wheel

pip install -r requirements.txt
pip install --no-build-isolation --editable .

To install with extras:

pip install --no-build-isolation --editable .[lookups,cuda102]

🚦 Run tests

spaCy comes with an extensive test suite. In order to run the tests, you'll usually want to clone the repository and build spaCy from source. This will also install the required development dependencies and test utilities defined in the requirements.txt.

Alternatively, you can run pytest on the tests from within the installed spacy package. Don't forget to also install the test utilities via spaCy's requirements.txt:

pip install -r requirements.txt
python -m pytest --pyargs spacy

spacy's People

Contributors

2u62w4n6, adrianeboyd, danieldk, duygua, essenmitsosse, explosion-bot, geovedi, github-actions[bot], gregory-howard, henningpeters, honnibal, ines, jimregan, kadarakos, lfiedler, ljvmiranda921, maxirmx, oroszgy, pmbaumgartner, polm, raphael0202, rdmrcv, richardpaulhudson, rmitsch, shademe, sorenlind, svlandeg, syllog1sm, thomashacker, wannaphong


spacy's Issues

ValueError: max() arg is an empty sequence

I was running spacy through some sentences and saw that the sentence below is throwing a ValueError.

from spacy.en import English
spacy_nlp = English()
text = u'Talks given by women had a slightly higher number of questions asked (3.2$\pm$0.2) than talks given by men (2.6$\pm$0.1).'
tokens = spacy_nlp(text)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/spacy/en/__init__.py", line 195, in __call__
    self.mwe_merger(tokens)
  File "/usr/local/lib/python2.7/site-packages/spacy/multi_words.py", line 8, in __call__
    tokens.merge(m.start(), m.end(), tag, m.group(), entity_type)
  File "spacy/tokens.pyx", line 329, in spacy.tokens.Tokens.merge (spacy/tokens.cpp:6701)
ValueError: max() arg is an empty sequence

I have the newest spacy installed and up to date requirements.

Installing on a system with py2 and py3

I am building up a Dockerfile which includes spacy based on the docker/scipyserver.

The ipython developers support both python 2 and 3, so I figured I'd follow along and install spacy into both.

However, I have found that only the install that goes last actually sticks:

# install requirements
RUN pip install spacy && pip3 install spacy

# downloads a bunch of data
RUN python -m spacy.en.download && python3 -m spacy.en.download

This will always fail, as the Python 2 install is no longer importable. Or more simply:

# install requirements
RUN pip install spacy && pip3 install spacy
RUN python -c "import spacy"

For now, I'm getting away with just supporting python3, but this might be something worth looking into.

Thanks for all the hard work!

Slang

Is there a method you would prefer for individuals to add slang to the parts-of-speech library?
Would you allow for this / how would I go about doing it?
i.e. do contractions like 'sup?' get reduced to a verb or a noun?

Error while installing using pip

I'm getting a failure to install using pip install spacy AND using pip to install the cloned git, with the following reports during install:

Requirement already satisfied (use --upgrade to upgrade): cython in /usr/local/lib/python2.7/dist-packages (from -r requirements.txt (line 1))
Downloading/unpacking cymem>=1.11 (from -r requirements.txt (line 2))
  Downloading cymem-1.11.tar.gz
  Running setup.py (path:/tmp/pip_build_root/cymem/setup.py) egg_info for package cymem
    Traceback (most recent call last):
      File "<string>", line 17, in <module>
      File "/tmp/pip_build_root/cymem/setup.py", line 7, in <module>
        exts = cythonize([Extension("cymem.cymem", ["cymem/cymem.pyx"])])
      File "/usr/local/lib/python2.7/dist-packages/Cython/Distutils/extension.py", line 87, in __init__
        **kw)
    TypeError: unbound method __init__() must be called with Extension instance as first argument (got Extension instance instead)
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):

  File "<string>", line 17, in <module>

  File "/tmp/pip_build_root/cymem/setup.py", line 7, in <module>

    exts = cythonize([Extension("cymem.cymem", ["cymem/cymem.pyx"])])

  File "/usr/local/lib/python2.7/dist-packages/Cython/Distutils/extension.py", line 87, in __init__

    **kw)

TypeError: unbound method __init__() must be called with Extension instance as first argument (got Extension instance instead)

----------------------------------------
Cleaning up...
Command python setup.py egg_info failed with error code 1 in /tmp/pip_build_root/cymem

token boolean features

The example of boolean features (is_alpha, is_digit, is_lower, is_title, is_upper, is_ascii, is_punct, like_url, like_num) shown on the quickstart page doesn't work for me:

>>> lexeme = nlp.vocab[u'Apple']
>>> lexeme.is_alpha, is_upper
True, False
>>> tokens = nlp('Apple computers')
>>> tokens[0].is_alpha, tokens[0].is_upper
>>> True, False
>>> from spacy.en.attrs import IS_ALPHA, IS_UPPER
>>> tokens.to_array((IS_ALPHA, IS_UPPER))[0]
array([1, 0])

For example, I'm getting: AttributeError: 'spacy.tokens.Token' object has no attribute 'is_alpha'

Instead, I'm calling from spacy.orth import * and then calling, for example, is_punct(token). There is also the inconsistency that spacy.orth.like_number is listed as like_num under boolean features on the quickstart page.

AssertionError when parsing empty string

I happened to stumble over this while parsing a large dataset: spaCy throws an AssertionError when trying to parse an empty string.

Minimal Example:

from spacy.en import English
nlp = English()
nlp(u"")

results in

Traceback (most recent call last):
  File "test.py", line 3, in <module>
    nlp(u"")
  File "/usr/local/lib/python2.7/dist-packages/spacy/en/__init__.py", line 149, in __call__
    self.parser(tokens)
  File "spacy/syntax/parser.pyx", line 77, in spacy.syntax.parser.GreedyParser.__call__ (spacy/syntax/parser.cpp:4122)
  File "spacy/syntax/_state.pyx", line 128, in spacy.syntax._state.init_state (spacy/syntax/_state.cpp:2715)
  File "spacy/syntax/_state.pyx", line 33, in spacy.syntax._state.push_stack (spacy/syntax/_state.cpp:1855)
AssertionError

Error when downloading data a second time

Steps to replicate:

  1. Create a virtual environment.
  2. Install spaCy.
  3. Download the data.
  4. Create a new virtual environment.
  5. Download the data.

The new data is written into the same temporary folder as the old data, leading to a shutil error. It should probably either (1) check if the right data is already available and use it instead of redownloading it, or (2) overwrite the existing data. Maybe you can track versions using their MD5 hash or the spaCy version or something like that (e.g., downloading data into /tmp/spaCy/v0.7/).

I'm on OS X 10.10.3, spaCy 0.70, Python 2.7.9.

failed on build

Hi Matthew,

Tried the master branch earlier today; requirements.txt is missing thinc, and setup.py couldn't find the murmurhash headers, so I had to copy the murmurhash import instruction from thinc.

Also missing humanize, unidecode, ujson modules for testing.

IndexError on printing token.orth_ / incorrect token printing token.lower_ ...but only in some cases

System specs: Linux Mint 16, 64-bit, Python 3.4, using Anaconda

stack trace:

Traceback (most recent call last):
  File "/blah/blah/test.py", line 324, in <module>
    induceFail()
  File "/blah/blah/test.py", line 319, in induceFail
    print(l.orth_)
  File "spacy/tokens.pyx", line 427, in spacy.tokens.Token.orth_.__get__ (spacy/tokens.cpp:8080)
  File "spacy/strings.pyx", line 71, in spacy.strings.StringStore.__getitem__ (spacy/strings.cpp:1631)
IndexError: 3481370184

Alright, this is a weird one. It involves passing tokens around in different lists, and only certain configurations will reliably cause the error. This error cropped up in a dataset I am actually using, where I wanted to pass the token objects around but only in certain cases. I got around this error by begrudgingly just passing the string representation of the token object. Anyway, I've created a small set of code that reliably induces the error on my machine so you can hopefully debug it:

from spacy.en import English

def blah(toks):
    ''' Take the tokens from nlp(), append them to a list, return the list '''
    lst = []
    for tok in toks:
        lst.append(tok)
        # printing the tok.orth_ at this point works just fine
        print(tok.orth_)
    return lst

def induceFail():
    nlp = English()
    samples = ["a", "test blah wat okay"]
    lst = []
    for sample in samples:
        # Go through all the samples, call nlp() on each to get tokens,
        # pass those tokens to the blah() function, get a list back and put all results in another list
        lst.extend(blah(nlp(sample)))
    # go through the list of all tokens and try to print orth_
    for l in lst:
        # here is where the error is thrown
        print(l.orth_)

induceFail()

Now, if you replace samples with the following sample, it works just fine!

samples = ["will this break", "test blah wat okay"]

And note that printing other attributes like pos_, dep_, and lower_ work without causing an IndexError, BUT don't print the correct thing, which leads me to believe some funny pointer dereferencing bug is causing this (the address is incorrect for certain symbols or something). It seems to only throw an error on the orth_ attribute. For example, I changed to printing lower_ and got the following output:

a
test
blah
wat
okay
a
test
blah
neighbour       <-------------- Wait...what? Neighbour isn't in any of the samples...
okay

Notice neighbour? Where the heck did it get that? This is in an entirely new Python process. I'm not doing any multiprocessing / multithreading. I'm not using multiple instances of the nlp() object. So somehow an address got messed up, it's getting into memory it's not supposed to, but it is a valid address and it prints what is there? I have no idea.

So then I ran it again. And this time, instead of "neighbour" it was "Milk".

I hope the error is reproducible on your end. I apologize if it isn't. I swear I'm not crazy, you might just need to change the values in the samples list.

Unicode trouble with lemma_

from spacy.en import English
nlp = English()
nlp(u'me…')[0].lemma_

results in an exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "spacy/tokens.pyx", line 439, in spacy.tokens.Token.lemma_.__get__ (spacy/tokens.cpp:8854)
  File "spacy/strings.pyx", line 73, in spacy.strings.StringStore.__getitem__ (spacy/strings.cpp:1652)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe2 in position 2: unexpected end of data

Unicode trouble with lemma_ still not fixed

As discussed in this issue: #32, printing lemmas of unicode words causes spaCy to crash. It was closed because I believe rsomeon did not want to contribute their patch, and instead preferred if you made your own changes. I don't believe the issue has been fixed, as I just tested it in v0.81 and I still get a crash.

from spacy.en import English
s = "FiancΓ©"
nlp = English()
tok = nlp(s)
print(tok[0].lemma_)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "spacy/tokens.pyx", line 585, in spacy.tokens.Token.lemma_.__get__ (spacy/tokens.cpp:10941)
  File "spacy/strings.pyx", line 73, in spacy.strings.StringStore.__getitem__ (spacy/strings.cpp:1671)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 5: unexpected end of data

Runtime error (max recursion depth exceeded) when printing tokens

Under python 3.4, spaCy 0.33, printing tokens from the spaCy example:

import spacy.en
nlp = spacy.en.English()
tokens = nlp("‘Give it back,’ he pleaded abjectly, ‘it’s mine.’", tag=True, parse=False)
print(tokens) 

Fails with:

Traceback (most recent call last):
  File "t.py", line 7, in <module>
    print(tokens)
  File "spacy/tokens.pyx", line 140, in spacy.tokens.Tokens.__str__ (spacy/tokens.cpp:4222)
  File "spacy/tokens.pyx", line 140, in spacy.tokens.Tokens.__str__ (spacy/tokens.cpp:4222)
  File "spacy/tokens.pyx", line 140, in spacy.tokens.Tokens.__str__ (spacy/tokens.cpp:4222)
  File "spacy/tokens.pyx", line 140, in spacy.tokens.Tokens.__str__ (spacy/tokens.cpp:4222)
... above repeats until ...
RuntimeError: maximum recursion depth exceeded while calling a Python object

Minor tokenization issue with lowercase 'i' + contraction

spaCy correctly tokenizes capital "I" + contraction ('d, 'm, 'll, 've) e.g.:

from spacy.en import English
nlp = English()
tok = nlp("I'm")
print([x.lower_ for x in tok])

>>> ['i', "'m"]

but when the "I" is a lowercase ("i") it does not tokenize into two tokens:

from spacy.en import English
nlp = English()
tok = nlp("i'm")
print([x.lower_ for x in tok])

>>> ["i'm"]

Not a big deal, and this may be the intent, since we don't know if the user meant capital "I", but I can't think of any problems that would happen if it tokenized the lowercase version into two.

Import/build issue

Both pip install and installing from source end "successfully," but

python -m spacy.en.download

as well as

from spacy.en import English
throw an ImportError
      5 from .. import orth
----> 6 from ..vocab import Vocab
      7 from ..tokenizer import Tokenizer
      8 from ..syntax.parser import GreedyParser

ImportError: dlopen(/Users/dk/anaconda/lib/python2.7/site-packages/spacy/vocab.so, 2): Symbol not found: __ZSt20__throw_length_errorPKc
  Referenced from: /Users/dk/anaconda/lib/python2.7/site-packages/spacy/vocab.so
  Expected in: dynamic lookup

I am using the conda dist of python on Mac 10.10

Python 2.7.8 |Anaconda 2.1.0 (x86_64)| (default, Aug 21 2014, 15:21:46)

and installed spaCy 0.40

This seems like a build issue, but no errors occur during the build... Also, I was able to install everything properly on an Ubuntu box with the same conda Python dist.

If you have any ideas, that would be much appreciated.

Wrong repl output in quickstart doc

I just followed the quickstart and the last two lines of the following are wrong:

>>> from __future__ import unicode_literals # If Python 2
>>> from spacy.en import English
>>> nlp = English()
>>> tokens = nlp(u'I ate the pizza with anchovies.')
>>> pizza = tokens[3]
>>> (pizza.orth, pizza.orth_, pizza.head.lemma, pizza.head.lemma_)
... (14702, 'pizza', 14702, 'ate')

Current version outputs:

In [13]: (pizza.orth, pizza.orth_, pizza.head.lemma, pizza.head.lemma_)
Out[13]: (14702, u'pizza', 669, u'eat')

Thanks!

Question: NER with Spacy

Are there any plans of adding NER capabilities to Spacy soon? Any recommendations on the most modern techniques to do so, if not? (E.g., perhaps using the word vector representation?)

Problem with word frequency data?

When I follow the quick-start install steps, lexeme data exists at /usr/local/lib/python2.7/dist-packages/spacy/en/data/vocab/lexemes.bin, but somehow it isn't being loaded.

import spacy.en
nlp = spacy.en.English()
nlp.vocab[u'the'].prob  #=> 0.0
nlp.vocab[u'not'].prob  #=> 0.0

I've also tried loading manually with:

nlp.vocab.load_lexemes("/usr/local/lib/python2.7/dist-packages/spacy/en/data/vocab/lexemes.bin")

MurmurHash error when pip installing in $HOME

When trying to install spaCy into my home directory with

pip install --user spacy

I run into a problem with MurmurHash. Are there perhaps some hard coded paths in that package? Here's what I get from pip:

  Downloading spacy-0.40.tar.gz (24.3MB): 24.3MB downloaded
  Running setup.py (path:/tmp/pip_build_patrick/spacy/setup.py) egg_info for package spacy
    zip_safe flag not set; analyzing archive contents...
    headers_workaround.__init__: module references __file__

    Installed /tmp/pip_build_patrick/spacy/headers_workaround-0.17-py2.7.egg

    Traceback (most recent call last):
      File "<string>", line 17, in <module>
      File "/tmp/pip_build_patrick/spacy/setup.py", line 138, in <module>
        main(MOD_NAMES, use_cython)
      File "/tmp/pip_build_patrick/spacy/setup.py", line 125, in main
        run_setup(exts)
      File "/tmp/pip_build_patrick/spacy/setup.py", line 113, in run_setup
        headers_workaround.install_headers('murmurhash')
      File "/tmp/pip_build_patrick/spacy/headers_workaround-0.17-py2.7.egg/headers_workaround/__init__.py", line 31, in install_headers
        shutil.copy(path.join(src_dir, filename), path.join(dest_dir, filename))
      File "/usr/lib/python2.7/shutil.py", line 119, in copy
        copyfile(src, dst)
      File "/usr/lib/python2.7/shutil.py", line 83, in copyfile
        with open(dst, 'wb') as fdst:
    IOError: [Errno 13] Permission denied: '/usr/include/murmurhash/MurmurHash3.h'
    Complete output from command python setup.py egg_info:
    zip_safe flag not set; analyzing archive contents...

headers_workaround.__init__: module references __file__



Installed /tmp/pip_build_patrick/spacy/headers_workaround-0.17-py2.7.egg

running egg_info

creating pip-egg-info/spacy.egg-info

writing requirements to pip-egg-info/spacy.egg-info/requires.txt

writing pip-egg-info/spacy.egg-info/PKG-INFO

writing top-level names to pip-egg-info/spacy.egg-info/top_level.txt

writing dependency_links to pip-egg-info/spacy.egg-info/dependency_links.txt

writing manifest file 'pip-egg-info/spacy.egg-info/SOURCES.txt'

warning: manifest_maker: standard file '-c' not found



reading manifest file 'pip-egg-info/spacy.egg-info/SOURCES.txt'

reading manifest template 'MANIFEST.in'

writing manifest file 'pip-egg-info/spacy.egg-info/SOURCES.txt'

Traceback (most recent call last):

  File "<string>", line 17, in <module>

  File "/tmp/pip_build_patrick/spacy/setup.py", line 138, in <module>

    main(MOD_NAMES, use_cython)

  File "/tmp/pip_build_patrick/spacy/setup.py", line 125, in main

    run_setup(exts)

  File "/tmp/pip_build_patrick/spacy/setup.py", line 113, in run_setup

    headers_workaround.install_headers('murmurhash')

  File "/tmp/pip_build_patrick/spacy/headers_workaround-0.17-py2.7.egg/headers_workaround/__init__.py", line 31, in install_headers

    shutil.copy(path.join(src_dir, filename), path.join(dest_dir, filename))

  File "/usr/lib/python2.7/shutil.py", line 119, in copy

    copyfile(src, dst)

  File "/usr/lib/python2.7/shutil.py", line 83, in copyfile

    with open(dst, 'wb') as fdst:

IOError: [Errno 13] Permission denied: '/usr/include/murmurhash/MurmurHash3.h'

----------------------------------------
Cleaning up...
Command python setup.py egg_info failed with error code 1 in /tmp/pip_build_patrick/spacy
Storing debug log for failure in /home/patrick/.pip/pip.log

Collapsed Dependencies

Might you allow for collapsed dependencies to easily be obtained from the output of the dependency parser via some option?

spaCy v0.80 python3 incompatibility

In the new multi_words.py RegexMerger, there is a reference to unicode(tokens). In Python 3, there is no "unicode()" function, so it causes spaCy to crash.

from spacy.en import English
nlp = English()
tok = nlp("test test test")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/spacy/en/__init__.py", line 195, in __call__
    self.mwe_merger(tokens)
  File "/spacy/multi_words.py", line 7, in __call__
    for m in regex.finditer(unicode(tokens)):
NameError: name 'unicode' is not defined

spaCy 0.8.2 Attribute Error: 'Config' object has no attribute 'labels'

Invocation of spacy.en.English tokenizer results in an AttributeError. Fresh install of spaCy 0.8.2 from PyPi.

Abbreviated stack trace:

  File "/python2.7/site-packages/spacy/en/__init__.py", line 191, in __call__
    self.parser(tokens)
  File "/python2.7/site-packages/spacy/en/__init__.py", line 191, in __call__
  File "/python2.7/site-packages/spacy/en/__init__.py", line 117, in parser
    self.parser(tokens)
    self.ParserTransitionSystem)
  File "/python2.7/site-packages/spacy/en/__init__.py", line 117, in parser
  File "spacy/syntax/parser.pyx", line 74, in spacy.syntax.parser.GreedyParser.__init__ (spacy/syntax/parser.cpp:4120)
    self.ParserTransitionSystem)
AttributeError: 'Config' object has no attribute 'labels'
  File "spacy/syntax/parser.pyx", line 74, in spacy.syntax.parser.GreedyParser.__init__ (spacy/syntax/parser.cpp:4120)
AttributeError: 'Config' object has no attribute 'labels'

Dependency Relation Types

I've noticed that the dependency relation types that I get from the dependency parser in spaCy are a bit different than the Stanford dependencies. Is there some kind of mapping between these dependency types and the Stanford dependencies?

Idea: Switch orth / orth_, tag / tag_, etc from int / unicode to unicode / int (please debate)

SpaCy maintains a global mapping of strings to integers. Currently the integer-value is named "foo", and the string value is named "foo_".

I now think I prefer to have the string value named "foo", and the integer value named "foo_". I had convinced myself that it should be possible to avoid using the string attributes almost entirely, but in my own use, I find myself needing these attributes a lot.

What do you think? Should I keep token.orth as the integer ID, or should I move the integer IDs to the underscored attributes?
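For reference, a rough sketch of the convention as it currently stands (nlp is assumed to be a loaded pipeline; the integer value shown is only illustrative):

doc = nlp("I ate the pizza")
token = doc[3]
print(token.orth)   # integer ID into the shared string store
print(token.orth_)  # the corresponding unicode string, "pizza"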

Tokenization fails on a single word

An example.

toks = nlp(u"foobar")
print [i.string for i in toks]
>> [u'']
toks = nlp(u"foo bar")
print [i.string for i in toks]
>> [u'foo ', u'bar']

NER never recognizes any entities. Is the NER model file provided?

Trying to run the given NER test case, I get no entities in the list. Additionally, if I try making up a test case I always get 0 entities in the returned list. I have downloaded the latest files using

python -m spacy.en.download

and am using v0.82 on Python 3.4

def test_simple_types():
    tokens = nlp(u'Mr. Best flew to New York on Saturday morning.')
    ents = list(tokens.ents)
    assert ents[0].start == 1
    assert ents[0].end == 2
    assert ents[0].label_ == 'PERSON'
    assert ents[1].start == 4
    assert ents[1].end == 6
    assert ents[1].label_ == 'GPE'
    assert ents[2].start == 7
    assert ents[2].end == 8
    assert ents[2].label_ == 'DATE'
    assert ents[3].start == 8
    assert ents[3].end == 9
    assert ents[3].label_ == 'TIME'
assert ents[0].start == 1
IndexError: list index out of range

value error on minimal tagging example

I created a minimal script:

import spacy.en
nlp = spacy.en.English()
tokens = nlp(u"The cow jumped over the moon.", tag=True, parse=False)

And then ran it under Python 2.7.9, spaCy 0.70, OS X 10.10.3.

It crashes, giving the following error:

❯ python test_spacy.py
Traceback (most recent call last):
  File "test_spacy.py", line 2, in <module>
    nlp = spacy.en.English()
  File "/Users/jordansuchow/.virtualenvs/spacy/lib/python2.7/site-packages/spacy/en/__init__.py", line 76, in __init__
    tok_rules, prefix_re, suffix_re, infix_re = read_lang_data(tok_data_dir)
  File "/Users/jordansuchow/.virtualenvs/spacy/lib/python2.7/site-packages/spacy/util.py", line 16, in read_lang_data
    tokenization = json.load(file_)
  File "/usr/local/Cellar/python/2.7.9/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 290, in load
    **kw)
  File "/usr/local/Cellar/python/2.7.9/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "/usr/local/Cellar/python/2.7.9/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/local/Cellar/python/2.7.9/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 382, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Expecting property name: line 567 column 1 (char 15406)

token.idx for punctuation characters is sometimes incorrect

If you parse a sentence like "to walk, do foo", the .idx for tokens[2] is 4 rather than the expected 7. Strangely this only seems to happen when the word before the punctuation mark is more than three characters long, and when that word is not the first word in the sentence (so e.g. "hello, world" works fine).
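A rough reproduction sketch based on the description above (nlp is assumed to be a loaded pipeline; the reported value is taken from the report):

doc = nlp("to walk, do foo")
comma = doc[2]
# character offset of the comma; the report says this prints 4 instead of the expected 7
print(comma.text, comma.idx)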

to_array + spacy.en.attrs.TAG = zero array

I've had some problems converting certain token attributes to numpy arrays. From what I understand, extracting TAG attributes should work just like extracting POS attributes, but it doesn't. The following is a minimal demonstration.

Also, it would be great if dependency types (and perhaps even token.head indices) could be extracted using the same API.

import spacy.en
import numpy as np

nlp = spacy.en.English()

toks = nlp(u"This is a simple sentence.", True, True)

print "Extracting google POS"
print np.array(toks.to_array([spacy.en.attrs.POS]))

print "Extracting detailed TAG doesn't"
print np.array(toks.to_array([spacy.en.attrs.TAG]))

print "Even though the detailed TAG is detected"
print [t.tag for t in toks]

Tokenizer splitting

I have a question/suggestion regarding the tokenizer class and custom tokenizers.
I think it would be great to have the ability to split on custom characters besides spaces, and also to include new lines. Here are a couple of examples where this is an issue:

In [19]: tokens = nlp("I like green,blue and purple:)")

In [20]: for t in tokens:
    print('|'+t.string+'|', t.pos_)
   ....:     
|I | PRON
|like | VERB
|green,blue | ADJ
|and | CONJ
|purple| ADJ
|:| PUNCT
|)| PUNCT

and

In [21]: tokens = nlp("I like:\ngreen\nblue\npurple\n:)")
In [22]: for t in tokens:
    print('|'+t.string+'|', t.pos_)
   ....:     
|I | PRON
|like| VERB
|:| PUNCT
|
| ADV
|green| ADJ
|
| ADJ
|blue| ADJ
|
| ADJ
|purple| ADJ
|
| NOUN
|:)| PUNCT

Ideally we would retrain on a dataset that has new lines in it without stripping those and then label the new lines as such. Also, in "online writing" in many cases people tend to skip spaces when using punctuation. I am not sure if there are already any pre-tagged datasets where this is the case, but it would help a lot.

So the question is: what would be the easiest way to integrate these into existing code? So far i'm doing a workaround where I insert spaces if there isn't one already after a comma, but it feels dirty, and I'm not sure if I should be replacing new lines by spaces, because a good amount of information is lost in foregoing the distinction.
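For what it's worth, a rough sketch of the space-insertion workaround described above (the regex is my own assumption, not something spaCy provides):

import re

def add_space_after_commas(text):
    # insert a space after any comma that is not already followed by whitespace
    return re.sub(r",(?=\S)", ", ", text)

print(add_space_after_commas("I like green,blue and purple:)"))
# "I like green, blue and purple:)"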

Thanks for the library by the way.

spacy/attrs.pxd missing

Hey - the compile (in fact, the Cython translation) is failing due to an uncommitted file (that should have been in the repo for 9 days). I think since this is now being advertised, you should probably move to off-site CI and tagged releases...

ValueError spacy.en.English() instantiation, version 0.70

I just upgraded to 0.70, and when I try creating an instance of spacy.en.English, like this:

import spacy.en
nlp = spacy.en.English()

this happens:

ValueError Traceback (most recent call last)
in ()
----> 1 nlp = spacy.en.English()

//anaconda/lib/python2.7/site-packages/spacy/en/__init__.pyc in __init__(self, data_dir)
74 else:
75 tok_data_dir = path.join(data_dir, 'tokenizer')
---> 76 tok_rules, prefix_re, suffix_re, infix_re = read_lang_data(tok_data_dir)
77 prefix_re = re.compile(prefix_re)
78 suffix_re = re.compile(suffix_re)

//anaconda/lib/python2.7/site-packages/spacy/util.pyc in read_lang_data(data_dir)
14 def read_lang_data(data_dir):
15 with open(path.join(data_dir, 'specials.json')) as file_:
---> 16 tokenization = json.load(file_)
17 prefix = read_prefix(data_dir)
18 suffix = read_suffix(data_dir)

//anaconda/lib/python2.7/json/__init__.pyc in load(fp, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
288 parse_float=parse_float, parse_int=parse_int,
289 parse_constant=parse_constant, object_pairs_hook=object_pairs_hook,
--> 290 **kw)
291
292

//anaconda/lib/python2.7/json/__init__.pyc in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
336 parse_int is None and parse_float is None and
337 parse_constant is None and object_pairs_hook is None and not kw):
--> 338 return _default_decoder.decode(s)
339 if cls is None:
340 cls = JSONDecoder

//anaconda/lib/python2.7/json/decoder.pyc in decode(self, s, _w)
364
365 """
--> 366 obj, end = self.raw_decode(s, idx=_w(s, 0).end())
367 end = _w(s, end).end()
368 if end != len(s):

//anaconda/lib/python2.7/json/decoder.pyc in raw_decode(self, s, idx)
380 """
381 try:
--> 382 obj, end = self.scan_once(s, idx)
383 except StopIteration:
384 raise ValueError("No JSON object could be decoded")

ValueError: Expecting property name: line 567 column 1 (char 15406)

Pronoun Detection

It seems that spaCy counts all pronouns as normal nouns.

import spacy.en
nlp = spacy.en.English()

In [34]: for tok in nlp(u"You and I make us"):
   ....:     print tok.string, tok.pos
   ....:     
You  6
and  4
I  6
make  10
us 6

In [35]: from spacy.parts_of_speech import PRON, NOUN

In [36]: NOUN, PRON
Out[36]: (6, 8)

Import Build issue, OSX

Spacy installed fine with pip. Run the download command and...

➜  text-processing  python -m spacy.en.download
Traceback (most recent call last):
  File "/usr/local/Cellar/python/2.7.9/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 151, in _run_module_as_main
    mod_name, loader, code, fname = _get_module_details(mod_name)
  File "/usr/local/Cellar/python/2.7.9/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 101, in _get_module_details
    loader = get_loader(mod_name)
  File "/usr/local/Cellar/python/2.7.9/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pkgutil.py", line 464, in get_loader
    return find_loader(fullname)
  File "/usr/local/Cellar/python/2.7.9/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pkgutil.py", line 474, in find_loader
    for importer in iter_importers(fullname):
  File "/usr/local/Cellar/python/2.7.9/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pkgutil.py", line 430, in iter_importers
    __import__(pkg)
  File "/usr/local/lib/python2.7/site-packages/spacy/en/__init__.py", line 6, in <module>
    from ..vocab import Vocab
  File ".env/lib/python2.7/site-packages/Cython/Includes/numpy/__init__.pxd", line 861, in init spacy.vocab (spacy/vocab.cpp:9066)
ValueError: numpy.ufunc has the wrong size, try recompiling

TypeError: unsupported operand type(s) for *

I'm getting

TypeError: unsupported operand type(s) for *: 'spacy.lexeme.Lexeme' and 'spacy.tokens.Token'

I'm running example code in iPython notebook.
I suspect it has something to do with a multiplication by 0/null etc.(?) from the empty vector array, which should have values:

In [6]:
pleaded.repvec[:5]

Out[6]:
array([ 0.,  0.,  0.,  0.,  0.], dtype=float32)
  • I upgraded to 0.4

full code and errors:

In [1]:
import spacy.en
from spacy.parts_of_speech import ADV
nlp = spacy.en.English()
In [2]:
# Load the pipeline, and call it with some text.

s = "'Give it back,' he pleaded abjectly, 'it’s mine.'"
s1 = s.decode('utf-8')

probs = [lex.prob for lex in nlp.vocab]
probs.sort()
is_adverb = lambda tok: tok.pos == ADV and tok.prob < probs[-1000]
tokens = nlp(s1)
print(''.join(tok.string.upper() if is_adverb(tok) else tok.string for tok in tokens))
'Give it back,' he pleaded ABJECTLY, 'it’s mine.'

In [3]:
b = 'back'
s2 = b.decode('utf-8')
nlp.vocab[s2].prob
Out[3]:
-7.403977394104004
In [4]:
pleaded = tokens[8]
In [5]:
pleaded.repvec.shape
Out[5]:
(300,)
In [6]:
pleaded.repvec[:5]
Out[6]:
array([ 0.,  0.,  0.,  0.,  0.], dtype=float32)
In [8]:
from numpy import dot
from numpy.linalg import norm
cosine = lambda v1, v2: dot(v1, v2) / (norm(v1), norm(v2))
words = [w for w in nlp.vocab if w.lower]
words.sort(key=lambda w: cosine(w, pleaded))
words.reverse()

#print('1-20', ', '.join(w.orth_ for w in words[0:20]))
#print('50-60', ', '.join(w.orth_ for w in words[50:60]))
#print('100-110', ', '.join(w.orth_ for w in words[100:110]))
#print('1000-1010', ', '.join(w.orth_ for w in words[1000:1010]))
#print('50000-50010', ', '.join(w.orth_ for w in words[50000:50010]))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-8-3dfcfec488f6> in <module>()
      3 cosine = lambda v1, v2: dot(v1, v2) / (norm(v1), norm(v2))
      4 words = [w for w in nlp.vocab if w.lower]
----> 5 words.sort(key=lambda w: cosine(w, pleaded))
      6 words.reverse()
      7 

<ipython-input-8-3dfcfec488f6> in <lambda>(w)
      3 cosine = lambda v1, v2: dot(v1, v2) / (norm(v1), norm(v2))
      4 words = [w for w in nlp.vocab if w.lower]
----> 5 words.sort(key=lambda w: cosine(w, pleaded))
      6 words.reverse()
      7 

<ipython-input-8-3dfcfec488f6> in <lambda>(v1, v2)
      1 from numpy import dot
      2 from numpy.linalg import norm
----> 3 cosine = lambda v1, v2: dot(v1, v2) / (norm(v1), norm(v2))
      4 words = [w for w in nlp.vocab if w.lower]
      5 words.sort(key=lambda w: cosine(w, pleaded))

TypeError: unsupported operand type(s) for *: 'spacy.lexeme.Lexeme' and 'spacy.tokens.Token'

Support for collapsed dependencies

It would be great to have support for collapsed dependencies similar to Stanford CoreNLP. For example in the sentence:

"I am moving to Florida"

"Florida" and "moving" aren't directly related because of the "to" particle(florida.head is to). I think the API can work like this:

>>> florida.dependencies(moving)
['prep_to']

Then one could do:

if 'prep_to' in florida.dependencies(moving):
   ...
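As a rough illustration, here is one way such a relation could be derived from the standard parse today (nlp is assumed to be a loaded English pipeline; the helper name and dependency labels are assumptions, not part of spaCy's API):

def collapsed_relations(token):
    # collapse a preposition into a single "prep_X" relation between
    # its object and the word the preposition attaches to
    rels = []
    if token.dep_ == "pobj" and token.head.dep_ == "prep":
        rels.append(("prep_" + token.head.text.lower(), token.head.head))
    return rels

doc = nlp("I am moving to Florida")
florida = doc[4]
print(collapsed_relations(florida))  # e.g. [('prep_to', moving)]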

token.lemma_ with the token "didn't" does not exist

The token "didn't" is correctly separated into ["did", "n't"], but in this case the lemma for "did" does not correctly register as "do", instead it is an empty string.
tokens = nlp("didn't")
print(tokens[0].lemma_)
>>> empty string

However it works when the token is just "did"
tokens = nlp("did")
print(tokens[0].lemma_)
>>> do

And "isn't" works perfectly, correctly being split as ["is", "n't"] with the lemma "be" for token[0]
tokens = nlp("isn't")
print(tokens[0].lemma_)
>>> be

Dependency parser missing dependencies / incorrectly parsing

The parser used to correctly parse several example sentences that I have, but it is now incorrectly parsing them. I'm not sure when it stopped working, since I hadn't checked its output in a while. I am on python 3.4, spacy v0.83, fully updated data with "python -m spacy.en.download all"

Examples of errors:
This one is from your blog (https://honnibal.wordpress.com/2013/12/18/a-simple-fast-algorithm-for-natural-language-dependency-parsing/)

def printDeps(toks):
    for tok in toks:
        print(tok.orth_, tok.dep_, tok.pos_, [t.orth_ for t in tok.lefts], [t.orth_ for t in tok.rights])

toks = nlp("They ate the pizza with anchovies.")
printDeps(toks)
They SUB PRON [] []
ate  VERB ['They'] ['pizza', 'with', '.']      <------- error "with" is connected to "ate"
the NMOD DET [] []
pizza OBJ NOUN ['the'] []
with VMOD ADP [] ['anchovies']            <------- error "with" is categorized as verb modifier
anchovies PMOD NOUN [] []
. P PUNCT [] []
toks = nlp("i don't have other assistance")
printDeps(toks)
i SUB NOUN [] []
do  VERB ['i'] ["n't", 'have']    <---- Error "do"'s dep_ = "" and dep = 0
n't VMOD ADV [] []
have VC VERB [] ['assistance']
other NMOD ADJ [] []
assistance OBJ NOUN ['other'] []
toks = nlp("I have no other financial assistance available and he certainly won't provide support.")
printDeps(toks)
# add a comma and it works
toks = nlp("I have no other financial assistance available, and he certainly won't provide support.")
printDeps(toks)
I SUB PRON [] []
have VMOD VERB ['I'] ['available']    <------- Error, should have ['assistance'] in right deps
no NMOD DET [] []
other NMOD ADJ [] []
financial NMOD ADJ [] []
assistance SUB NOUN ['no', 'other', 'financial'] []   <----- Error, labeled as SUB not OBJ
available VMOD ADJ ['assistance'] []   <---- Error, labeled as VMOD rather than NMOD
and VMOD CONJ [] []
he SUB PRON [] []
certainly VMOD ADV [] []
wo  VERB ['have', 'and', 'he', 'certainly'] ["n't", 'provide', '.']  <---- Error, missing dep_
n't VMOD ADV [] []
provide VC VERB [] ['support']
support OBJ NOUN [] []
. P PUNCT [] []

I SUB PRON [] []
have VMOD VERB ['I'] ['assistance']
no NMOD DET [] []
other NMOD ADJ [] []
financial NMOD ADJ [] []
assistance OBJ NOUN ['no', 'other', 'financial'] ['available']
available NMOD ADJ [] []
, P PUNCT [] []
and VMOD CONJ [] []
he SUB PRON [] []
certainly VMOD ADV [] []
wo  VERB ['have', ',', 'and', 'he', 'certainly'] ["n't", 'provide', '.']   <--- Error, missing dep_
n't VMOD ADV [] []
provide VC VERB [] ['support']
support OBJ NOUN [] []
. P PUNCT [] []


Initialization fails (Python 3.4)

In [1]: import spacy.en

In [2]: nlp = spacy.en.English()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-2-c68587acddc5> in <module>()
----> 1 nlp = spacy.en.English()

/usr/local/lib/python3.4/site-packages/spacy/en/__init__.py in __init__(self, data_dir)
     74         else:
     75             tok_data_dir = path.join(data_dir, 'tokenizer')
---> 76             tok_rules, prefix_re, suffix_re, infix_re = read_lang_data(tok_data_dir)
     77             prefix_re = re.compile(prefix_re)
     78             suffix_re = re.compile(suffix_re)

/usr/local/lib/python3.4/site-packages/spacy/util.py in read_lang_data(data_dir)
     14 def read_lang_data(data_dir):
     15     with open(path.join(data_dir, 'specials.json')) as file_:
---> 16         tokenization = json.load(file_)
     17     prefix = read_prefix(data_dir)
     18     suffix = read_suffix(data_dir)

/usr/local/Cellar/python3/3.4.2_1/Frameworks/Python.framework/Versions/3.4/lib/python3.4/json/__init__.py in load(fp, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    266         cls=cls, object_hook=object_hook,
    267         parse_float=parse_float, parse_int=parse_int,
--> 268         parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
    269 
    270 

/usr/local/Cellar/python3/3.4.2_1/Frameworks/Python.framework/Versions/3.4/lib/python3.4/json/__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    316             parse_int is None and parse_float is None and
    317             parse_constant is None and object_pairs_hook is None and not kw):
--> 318         return _default_decoder.decode(s)
    319     if cls is None:
    320         cls = JSONDecoder

/usr/local/Cellar/python3/3.4.2_1/Frameworks/Python.framework/Versions/3.4/lib/python3.4/json/decoder.py in decode(self, s, _w)
    341 
    342         """
--> 343         obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    344         end = _w(s, end).end()
    345         if end != len(s):

/usr/local/Cellar/python3/3.4.2_1/Frameworks/Python.framework/Versions/3.4/lib/python3.4/json/decoder.py in raw_decode(self, s, idx)
    357         """
    358         try:
--> 359             obj, end = self.scan_once(s, idx)
    360         except StopIteration as err:
    361             raise ValueError(errmsg("Expecting value", s, err.value)) from None

ValueError: Expecting property name enclosed in double quotes: line 567 column 1 (char 15406)

Parsing API

I am finding it a little difficult to traverse the dependency parser output. I am using English(u"some text", True, True) to do the parsing. From the tokens output, there is no sibling() method on the tokens as described in the documents. From a token I can get the head, but pulling the children seems a little buggy. It doesn't always have all the children if I compute what they are from the heads alone. If I parse "the increasing levels of acidity bleached the coral", "of" has the head "levels", but "levels" doesn't have "of" in its children. Also, when enumerating child(0) by incrementing the index value, when you've run out of children it keeps outputting the last child rather than null, per se. Great repo overall, looking forward to using it more.
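A short sketch of walking the tree with the head and children attributes, for comparison (nlp is assumed to be a loaded English pipeline; output depends on the model):

doc = nlp("the increasing levels of acidity bleached the coral")
for token in doc:
    # each token exposes its head and an iterator over its children
    print(token.text, token.dep_, token.head.text, [child.text for child in token.children])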

Questions RE your NER

I see spaCy has an NER now. Very nice. I'm curious about how it compares to other NER systems. Have you benchmarked it on a standard dataset? What algorithm are you using? How does it compare to MITIE and Stanford?

Training custom data

Could you provide some insights on training custom data sets for spaCy? Looking at the code in vocab.pyx, it makes me think that, if it isn't already supported, it should be easy to load the Google word2vec data through spaCy? (I'm not very familiar with that format just yet).

..
string_id = self.strings[chars[:word_len]]
while string_id >= vectors.size():
vectors.push_back(EMPTY_VEC)
assert vec != NULL
vectors[string_id] = vec
..

Discrepancy between sentence segmentation and parse trees

I've bumped into an issue where sentence segmentation (as given in Tokens.sents) doesn't match parse trees in that Tokens.sents do not separate independent parse trees. This behavior can be observed on the following text:

"It was a mere past six when we left Baker Street, and it still wanted ten minutes to the hour when we found ourselves in Serpentine Avenue."

I assume this is due to sentence segmentation being done by a separate classifier, so I wouldn't call it a bug, but it can be a usage problem, so I am reporting it. A few examples that I've checked manually indicate that parse trees give better sentence segmentation than whatever Tokens.sents is based on.

My current workaround idea is to follow each token's dependency tree path all the way up to the root, and then use the obtained root-node array as sentence labels. This is, however, crude, ugly and inefficient; it would be nice to have a better solution.
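A rough sketch of that workaround (nlp is assumed to be a loaded pipeline; this relies on the root token being its own head):

def tree_labels(doc):
    # walk each token up to the root of its parse tree; tokens that share
    # a root index are treated as belonging to the same sentence
    labels = []
    for token in doc:
        node = token
        while node.head is not node:
            node = node.head
        labels.append(node.i)
    return labels

doc = nlp("It was a mere past six when we left Baker Street, and it still wanted ten minutes to the hour when we found ourselves in Serpentine Avenue.")
print(tree_labels(doc))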

Advertise Python 2/3 compatibility

I had to look to .travis.yml to get that information. It's a great selling point, easy to mention ("NLP with Python 2/3 and Cython") and one of the first things we look at when considering Python libraries. Also, please don't only bury that in the docs. :)

Make spacy.en.English() 10× faster

I'd like to use spaCy as part of a command line utility that will run an analysis over a single document. The parsing and tagging is blazingly fast, which is great. But calling spacy.en.English() takes over a second on my system, which is 10× too long for my purposes. Is there any hope for me?

Lemmatizer is converting all PRP tokens to the lemma -PRON-

No sure if this is a bug or a feature, but just thought I would note it:

>>>tokens = nlp(u"I see you.", tag=True, parse=False)
>>>tokens[0].lemma_
'-PRON-'

>>>tokens[2].lemma_
'-PRON-'

Shouldn't the lemmas be "I" and "you"?

Thanks for the nice work on spaCy!

-Cyrus
