explosion / spacy

💫 Industrial-strength Natural Language Processing (NLP) in Python

Home Page: https://spacy.io

License: MIT License

Shell 0.01% Python 53.50% HTML 0.09% JavaScript 2.65% Makefile 0.02% Sass 0.78% Jinja 0.22% Cython 10.73% Dockerfile 0.01% C++ 0.01% C 0.13% TypeScript 0.36% MDX 31.50%
natural-language-processing data-science machine-learning python cython nlp artificial-intelligence ai spacy nlp-library

spacy's Issues

Error while installing using pip

Installation fails both with pip install spacy and when using pip to install from a clone of the git repository, with the following output during the install:

Requirement already satisfied (use --upgrade to upgrade): cython in /usr/local/lib/python2.7/dist-packages (from -r requirements.txt (line 1))
Downloading/unpacking cymem>=1.11 (from -r requirements.txt (line 2))
  Downloading cymem-1.11.tar.gz
  Running setup.py (path:/tmp/pip_build_root/cymem/setup.py) egg_info for package cymem
    Traceback (most recent call last):
      File "<string>", line 17, in <module>
      File "/tmp/pip_build_root/cymem/setup.py", line 7, in <module>
        exts = cythonize([Extension("cymem.cymem", ["cymem/cymem.pyx"])])
      File "/usr/local/lib/python2.7/dist-packages/Cython/Distutils/extension.py", line 87, in __init__
        **kw)
    TypeError: unbound method __init__() must be called with Extension instance as first argument (got Extension instance instead)
----------------------------------------
Cleaning up...
Command python setup.py egg_info failed with error code 1 in /tmp/pip_build_root/cymem

to_array + spacy.en.attrs.TAG = zero array

I've had some problems converting certain token attributes to numpy arrays. From what I understand, extracting TAG attributes should work just like extracting POS attributes, but it doesn't. A minimal demonstration follows.

Also, it would be great if dependency types (and perhaps even token.head indices) could be extracted using the same API.

import spacy.en
import numpy as np

nlp = spacy.en.English()

toks = nlp(u"This is a simple sentence.", True, True)

print "Extracting google POS"
print np.array(toks.to_array([spacy.en.attrs.POS]))

print "Extracting detailed TAG doesn't"
print np.array(toks.to_array([spacy.en.attrs.TAG]))

print "Even though the detailed TAG is detected"
print [t.tag for t in toks]
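A hypothetical sketch of the requested extension, continuing the snippet above (DEP and HEAD are illustrative attribute names here, not confirmed members of spacy.en.attrs):

# Hypothetical only: extract dependency labels and head offsets through the same API.
# DEP / HEAD are placeholder attribute IDs, not confirmed to exist in spacy.en.attrs.
print "Proposed: dependency types and head indices through to_array"
print np.array(toks.to_array([spacy.en.attrs.DEP, spacy.en.attrs.HEAD]))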

Advertise Python 2/3 compatibility

I had to look at .travis.yml to get that information. It's a great selling point, easy to mention ("NLP with Python 2/3 and Cython"), and one of the first things we look at when considering Python libraries. Also, please don't bury it only in the docs. :)

IndexError on printing token.orth_ / incorrect token printing token.lower_ ...but only in some cases

System specs: Linux Mint 16, 64-bit, Python 3.4, using Anaconda

stack trace:

Traceback (most recent call last):
  File "/blah/blah/test.py", line 324, in <module>
    induceFail()
  File "/blah/blah/test.py", line 319, in induceFail
    print(l.orth_)
  File "spacy/tokens.pyx", line 427, in spacy.tokens.Token.orth_.__get__ (spacy/tokens.cpp:8080)
  File "spacy/strings.pyx", line 71, in spacy.strings.StringStore.__getitem__ (spacy/strings.cpp:1631)
IndexError: 3481370184

Alright, this is a weird one. It involves passing tokens around in different lists, and only certain configurations will reliably cause the error. This error cropped up in a dataset I am actually using, where I wanted to pass the token objects around but only in certain cases. I got around this error by begrudgingly just passing the string representation of the token object. Anyway, I've created a small set of code that reliably induces the error on my machine so you can hopefully debug it:

from spacy.en import English

def blah(toks):
    ''' Take the tokens from nlp(), append them to a list, return the list '''
    lst = []
    for tok in toks:
        lst.append(tok)
        # printing the tok.orth_ at this point works just fine
        print(tok.orth_)
    return lst

def induceFail():
    nlp = English()
    samples = ["a", "test blah wat okay"]
    lst = []
    for sample in samples:
        # Go through all the samples, call nlp() on each to get tokens,
        # pass those tokens to the blah() function, get a list back and put all results in another list
        lst.extend(blah(nlp(sample)))
    # go through the list of all tokens and try to print orth_
    for l in lst:
        # here is where the error is thrown
        print(l.orth_)

induceFail()

Now, if you replace samples with the following sample, it works just fine!

samples = ["will this break", "test blah wat okay"]

And note that printing other attributes like pos_, dep_, and lower_ works without causing an IndexError, BUT doesn't print the correct thing, which leads me to believe some funny pointer-dereferencing bug is causing this (the address is incorrect for certain symbols or something). It seems to only throw an error on the orth_ attribute. For example, I changed to printing lower_ and got the following output:

a
test
blah
wat
okay
a
test
blah
neighbour       <-------------- Wait...what? Neighbour isn't in any of the samples...
okay

Notice neighbour? Where the heck did it get that? This is in an entirely new Python process. I'm not doing any multiprocessing / multithreading. I'm not using multiple instances of the nlp() object. So somehow an address got messed up, it's reaching memory it's not supposed to, but it's a valid address and it prints whatever is there? I have no idea.

So then I ran it again. And this time, instead of "neighbour" it was "Milk".

I hope the error is reproducible on your end. I apologize if it isn't. I swear I'm not crazy, you might just need to change the values in the samples list.

Unicode trouble with lemma_ still not fixed

As discussed in issue #32, printing lemmas of unicode words causes spaCy to crash. It was closed because, I believe, rsomeon did not want to contribute their patch and instead preferred that you make your own changes. I don't believe the issue has been fixed, as I just tested it in v0.81 and I still get a crash.

from spacy.en import English
s = "Fiancé"
nlp = English()
tok = nlp(s)
print(tok[0].lemma_)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "spacy/tokens.pyx", line 585, in spacy.tokens.Token.lemma_.__get__ (spacy/tokens.cpp:10941)
  File "spacy/strings.pyx", line 73, in spacy.strings.StringStore.__getitem__ (spacy/strings.cpp:1671)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 5: unexpected end of data

Import Build issue, OSX

spaCy installed fine with pip. Then I ran the download command and...

➜  text-processing  python -m spacy.en.download
Traceback (most recent call last):
  File "/usr/local/Cellar/python/2.7.9/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 151, in _run_module_as_main
    mod_name, loader, code, fname = _get_module_details(mod_name)
  File "/usr/local/Cellar/python/2.7.9/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 101, in _get_module_details
    loader = get_loader(mod_name)
  File "/usr/local/Cellar/python/2.7.9/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pkgutil.py", line 464, in get_loader
    return find_loader(fullname)
  File "/usr/local/Cellar/python/2.7.9/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pkgutil.py", line 474, in find_loader
    for importer in iter_importers(fullname):
  File "/usr/local/Cellar/python/2.7.9/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pkgutil.py", line 430, in iter_importers
    __import__(pkg)
  File "/usr/local/lib/python2.7/site-packages/spacy/en/__init__.py", line 6, in <module>
    from ..vocab import Vocab
  File ".env/lib/python2.7/site-packages/Cython/Includes/numpy/__init__.pxd", line 861, in init spacy.vocab (spacy/vocab.cpp:9066)
ValueError: numpy.ufunc has the wrong size, try recompiling

spacy/attrs.pxd missing

Hey, the compile (in fact, the Cython translation) is failing due to an uncommitted file (one that should have been in the repo for the past 9 days). Since this is now being advertised, I think you should probably move to off-site CI and tagged releases...

Dependency parser missing dependencies / incorrectly parsing

The parser used to correctly parse several example sentences that I have, but it is now incorrectly parsing them. I'm not sure when it stopped working, since I hadn't checked its output in a while. I am on python 3.4, spacy v0.83, fully updated data with "python -m spacy.en.download all"

Examples of errors:
This one is from your blog (https://honnibal.wordpress.com/2013/12/18/a-simple-fast-algorithm-for-natural-language-dependency-parsing/)

def printDeps(toks):
    for tok in toks:
        print(tok.orth_, tok.dep_, tok.pos_, [t.orth_ for t in tok.lefts], [t.orth_ for t in tok.rights])

toks = nlp("They ate the pizza with anchovies.")
printDeps(toks)
They SUB PRON [] []
ate  VERB ['They'] ['pizza', 'with', '.']      <------- error "with" is connected to "ate"
the NMOD DET [] []
pizza OBJ NOUN ['the'] []
with VMOD ADP [] ['anchovies']            <------- error "with" is categorized as verb modifier
anchovies PMOD NOUN [] []
. P PUNCT [] []
toks = nlp("i don't have other assistance")
printDeps(toks)
i SUB NOUN [] []
do  VERB ['i'] ["n't", 'have']    <---- Error "do"'s dep_ = "" and dep = 0
n't VMOD ADV [] []
have VC VERB [] ['assistance']
other NMOD ADJ [] []
assistance OBJ NOUN ['other'] []
toks = nlp("I have no other financial assistance available and he certainly won't provide support.")
printDeps(toks)
# add a comma and it works
toks = nlp("I have no other financial assistance available, and he certainly won't provide support.")
printDeps(toks)
I SUB PRON [] []
have VMOD VERB ['I'] ['available']    <------- Error, should have ['assistance'] in right deps
no NMOD DET [] []
other NMOD ADJ [] []
financial NMOD ADJ [] []
assistance SUB NOUN ['no', 'other', 'financial'] []   <----- Error, labeled as SUB not OBJ
available VMOD ADJ ['assistance'] []   <---- Error, labeled as VMOD rather than NMOD
and VMOD CONJ [] []
he SUB PRON [] []
certainly VMOD ADV [] []
wo  VERB ['have', 'and', 'he', 'certainly'] ["n't", 'provide', '.']  <---- Error, missing dep_
n't VMOD ADV [] []
provide VC VERB [] ['support']
support OBJ NOUN [] []
. P PUNCT [] []

I SUB PRON [] []
have VMOD VERB ['I'] ['assistance']
no NMOD DET [] []
other NMOD ADJ [] []
financial NMOD ADJ [] []
assistance OBJ NOUN ['no', 'other', 'financial'] ['available']
available NMOD ADJ [] []
, P PUNCT [] []
and VMOD CONJ [] []
he SUB PRON [] []
certainly VMOD ADV [] []
wo  VERB ['have', ',', 'and', 'he', 'certainly'] ["n't", 'provide', '.']   <--- Error, missing dep_
n't VMOD ADV [] []
provide VC VERB [] ['support']
support OBJ NOUN [] []
. P PUNCT [] []


Question: NER with Spacy

Are there any plans to add NER capabilities to spaCy soon? If not, any recommendations on the most modern techniques for doing so? (E.g., perhaps using the word vector representations?)

Error when downloading data a second time

Steps to replicate:

  1. Create a virtual environment.
  2. Install spaCy.
  3. Download the data.
  4. Create a new virtual environment.
  5. Download the data.

The new data is written into the same temporary folder as the old data, leading to a shutil error. It should probably either (1) check whether the right data is already available and use it instead of re-downloading, or (2) overwrite the existing data. Maybe you can track versions using their MD5 hash or the spaCy version or something like that (e.g., downloading the data into /tmp/spaCy/v0.7/).

I'm on OS X 10.10.3, spaCy 0.70, Python 2.7.9.
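A rough illustration of the version-keyed layout idea (a sketch only, not spaCy's actual download code; the paths and version string are placeholders):

import os

def data_dir(version, base="/tmp/spaCy"):
    # Keep each version's data in its own folder, e.g. /tmp/spaCy/v0.7,
    # and report whether a fresh download is actually needed.
    target = os.path.join(base, "v%s" % version)
    if os.path.isdir(target):
        return target, False   # data already present: reuse instead of re-downloading
    os.makedirs(target)
    return target, True        # caller downloads into the fresh folder

path, needs_download = data_dir("0.7")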

Wrong repl output in quickstart doc

I just followed the quickstart, and the last two lines of the following are wrong:

>>> from __future__ import unicode_literals # If Python 2
>>> from spacy.en import English
>>> nlp = English()
>>> tokens = nlp(u'I ate the pizza with anchovies.')
>>> pizza = tokens[3]
>>> (pizza.orth, pizza.orth_, pizza.head.lemma, pizza.head.lemma_)
... (14702, 'pizza', 14702, 'ate')

Current version outputs:

In [13]: (pizza.orth, pizza.orth_, pizza.head.lemma, pizza.head.lemma_)
Out[13]: (14702, u'pizza', 669, u'eat')

Thanks!

Collapsed Dependencies

Might you allow collapsed dependencies to be obtained easily from the output of the dependency parser via some option?

Lemmatizer is converting all PRP tokens to the lemma -PRON-

Not sure if this is a bug or a feature, but I just thought I would note it:

>>>tokens = nlp(u"I see you.", tag=True, parse=False)
>>>tokens[0].lemma_
'-PRON-'

>>>tokens[2].lemma_
'-PRON-'

Shouldn't the lemmas be "I" and "you"?

Thanks for the nice work on spaCy!

-Cyrus

Runtime error (max recursion depth exceeded) when printing tokens

Under python 3.4, spaCy 0.33, printing tokens from the spaCy example:

import spacy.en
nlp = spacy.en.English()
tokens = nlp("‘Give it back,’ he pleaded abjectly, ‘it’s mine.’", tag=True, parse=False)                          
print(tokens) 

Fails with:

Traceback (most recent call last):
  File "t.py", line 7, in <module>
    print(tokens)
  File "spacy/tokens.pyx", line 140, in spacy.tokens.Tokens.__str__ (spacy/tokens.cpp:4222)
  File "spacy/tokens.pyx", line 140, in spacy.tokens.Tokens.__str__ (spacy/tokens.cpp:4222)
  File "spacy/tokens.pyx", line 140, in spacy.tokens.Tokens.__str__ (spacy/tokens.cpp:4222)
  File "spacy/tokens.pyx", line 140, in spacy.tokens.Tokens.__str__ (spacy/tokens.cpp:4222)
... above repeats until ...
RuntimeError: maximum recursion depth exceeded while calling a Python object

Make spacy.en.English() 10× faster

I'd like to use spaCy as part of a command-line utility that will run an analysis over a single document. The parsing and tagging is blazingly fast, which is great. But calling spacy.en.English() takes over a second on my system, which is 10× too long for my purposes. Is there any hope for me?
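For reference, a simple way to see the cost being described (illustrative timing only; numbers will vary by machine):

import time
import spacy.en

start = time.time()
nlp = spacy.en.English()
print("pipeline load: %.2fs" % (time.time() - start))   # reportedly over a second

start = time.time()
tokens = nlp(u"A single short document to analyse.")
print("parse: %.4fs" % (time.time() - start))            # the parsing itself is fast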

MurmurHash error when pip installing in $HOME

When trying to install spaCy into my home directory with

pip install --user spacy

I run into a problem with MurmurHash. Are there perhaps some hard-coded paths in that package? Here's what I get from pip:

  Downloading spacy-0.40.tar.gz (24.3MB): 24.3MB downloaded
  Running setup.py (path:/tmp/pip_build_patrick/spacy/setup.py) egg_info for package spacy
    zip_safe flag not set; analyzing archive contents...
    headers_workaround.__init__: module references __file__

    Installed /tmp/pip_build_patrick/spacy/headers_workaround-0.17-py2.7.egg

    Traceback (most recent call last):
      File "<string>", line 17, in <module>
      File "/tmp/pip_build_patrick/spacy/setup.py", line 138, in <module>
        main(MOD_NAMES, use_cython)
      File "/tmp/pip_build_patrick/spacy/setup.py", line 125, in main
        run_setup(exts)
      File "/tmp/pip_build_patrick/spacy/setup.py", line 113, in run_setup
        headers_workaround.install_headers('murmurhash')
      File "/tmp/pip_build_patrick/spacy/headers_workaround-0.17-py2.7.egg/headers_workaround/__init__.py", line 31, in install_headers
        shutil.copy(path.join(src_dir, filename), path.join(dest_dir, filename))
      File "/usr/lib/python2.7/shutil.py", line 119, in copy
        copyfile(src, dst)
      File "/usr/lib/python2.7/shutil.py", line 83, in copyfile
        with open(dst, 'wb') as fdst:
    IOError: [Errno 13] Permission denied: '/usr/include/murmurhash/MurmurHash3.h'
    Complete output from command python setup.py egg_info:
    zip_safe flag not set; analyzing archive contents...
    headers_workaround.__init__: module references __file__

    Installed /tmp/pip_build_patrick/spacy/headers_workaround-0.17-py2.7.egg

    running egg_info
    creating pip-egg-info/spacy.egg-info
    writing requirements to pip-egg-info/spacy.egg-info/requires.txt
    writing pip-egg-info/spacy.egg-info/PKG-INFO
    writing top-level names to pip-egg-info/spacy.egg-info/top_level.txt
    writing dependency_links to pip-egg-info/spacy.egg-info/dependency_links.txt
    writing manifest file 'pip-egg-info/spacy.egg-info/SOURCES.txt'
    warning: manifest_maker: standard file '-c' not found
    reading manifest file 'pip-egg-info/spacy.egg-info/SOURCES.txt'
    reading manifest template 'MANIFEST.in'
    writing manifest file 'pip-egg-info/spacy.egg-info/SOURCES.txt'

----------------------------------------
Cleaning up...
Command python setup.py egg_info failed with error code 1 in /tmp/pip_build_patrick/spacy
Storing debug log for failure in /home/patrick/.pip/pip.log

Training custom data

Could you provide some insight into training spaCy on custom data sets? Looking at the code in vocab.pyx, it seems like it should be easy (if it isn't already possible) to load the Google word2vec data through spaCy? (I'm not very familiar with that format just yet.)

..
string_id = self.strings[chars[:word_len]]
while string_id >= vectors.size():
    vectors.push_back(EMPTY_VEC)
assert vec != NULL
vectors[string_id] = vec
..

Idea: Switch orth / orth_, tag / tag_, etc from int / unicode to unicode / int (please debate)

spaCy maintains a global mapping of strings to integers. Currently the integer value is named "foo", and the string value is named "foo_".

I now think I'd prefer to have the string value named "foo" and the integer value named "foo_". I had convinced myself that it should be possible to avoid using the string attributes almost entirely, but in my own use I find myself needing them a lot.

What do you think? Should I keep token.orth as the integer ID, or should I move the integer IDs to the underscored attributes?
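For reference, here is the current convention, using the quickstart example quoted earlier in this list:

from spacy.en import English

nlp = English()
tokens = nlp(u'I ate the pizza with anchovies.')
pizza = tokens[3]
print(pizza.orth, pizza.orth_)   # integer ID first, then the string, e.g. 14702 'pizza'
# The proposal would swap these: pizza.orth -> u'pizza', pizza.orth_ -> the integer ID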

Dependency Relation Types

I've noticed that the dependency relation types I get from the dependency parser in spaCy are a bit different from the Stanford dependencies. Is there some kind of mapping between these dependency types and the Stanford dependencies?

Installing on a system with py2 and py3

I am building a Dockerfile that includes spaCy, based on docker/scipyserver.

The ipython developers support both python 2 and 3, so I figured I'd follow along and install spacy into both.

However, I have found that only the install that goes last actually sticks:

# install requirements
RUN pip install spacy && pip3 install spacy

# downloads a bunch of data
RUN python -m spacy.en.download && python3 -m spacy.en.download

This will always fail, as the Python 2 install is no longer importable. Or, more simply:

# install requirements
RUN pip install spacy && pip3 install spacy
RUN python -c "import spacy"

For now, I'm getting away with just supporting python3, but this might be something worth looking into.

Thanks for all the hard work!

Problem with word frequency data?

When I follow the quick-start install steps, lexeme data exists at /usr/local/lib/python2.7/dist-packages/spacy/en/data/vocab/lexemes.bin, but somehow it isn't being loaded.

import spacy.en
nlp = spacy.en.English()
nlp.vocab[u'the'].prob  #=> 0.0
nlp.vocab[u'not'].prob  #=> 0.0

I've also tried loading manually with:

nlp.vocab.load_lexemes("/usr/local/lib/python2.7/dist-packages/spacy/en/data/vocab/lexemes.bin")

ValueError spacy.en.English() instantiation, version 0.70

I just upgraded to 0.70, and when I try creating an instance of spacy.en.English, like this:

import spacy.en
nlp = spacy.en.English()

this happens:

ValueError                                Traceback (most recent call last)
<ipython-input-...> in <module>()
----> 1 nlp = spacy.en.English()

//anaconda/lib/python2.7/site-packages/spacy/en/__init__.pyc in __init__(self, data_dir)
     74         else:
     75             tok_data_dir = path.join(data_dir, 'tokenizer')
---> 76             tok_rules, prefix_re, suffix_re, infix_re = read_lang_data(tok_data_dir)
     77             prefix_re = re.compile(prefix_re)
     78             suffix_re = re.compile(suffix_re)

//anaconda/lib/python2.7/site-packages/spacy/util.pyc in read_lang_data(data_dir)
     14 def read_lang_data(data_dir):
     15     with open(path.join(data_dir, 'specials.json')) as file_:
---> 16         tokenization = json.load(file_)
     17     prefix = read_prefix(data_dir)
     18     suffix = read_suffix(data_dir)

//anaconda/lib/python2.7/json/__init__.pyc in load(fp, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    288         parse_float=parse_float, parse_int=parse_int,
    289         parse_constant=parse_constant, object_pairs_hook=object_pairs_hook,
--> 290         **kw)
    291
    292

//anaconda/lib/python2.7/json/__init__.pyc in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    336             parse_int is None and parse_float is None and
    337             parse_constant is None and object_pairs_hook is None and not kw):
--> 338         return _default_decoder.decode(s)
    339     if cls is None:
    340         cls = JSONDecoder

//anaconda/lib/python2.7/json/decoder.pyc in decode(self, s, _w)
    364
    365         """
--> 366         obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    367         end = _w(s, end).end()
    368         if end != len(s):

//anaconda/lib/python2.7/json/decoder.pyc in raw_decode(self, s, idx)
    380         """
    381         try:
--> 382             obj, end = self.scan_once(s, idx)
    383         except StopIteration:
    384             raise ValueError("No JSON object could be decoded")

ValueError: Expecting property name: line 567 column 1 (char 15406)

Minor tokenization issue with lowercase 'i' + contraction

spaCy correctly tokenizes capital "I" + contraction ('d, 'm, 'll, 've) e.g.:

from spacy.en import English
nlp = English()
tok = nlp("I'm")
print([x.lower_ for x in tok])

>>> ['i', "'m"]

but when the "I" is a lowercase ("i") it does not tokenize into two tokens:

from spacy.en import English
nlp = English()
tok = nlp("i'm")
print([x.lower_ for x in tok])

>>> ["i'm"]

Not a big deal, and this may be the intent, since we don't know if the user meant capital "I", but I can't think of any problems that would happen if it tokenized the lowercase version into two.

Tokenizer splitting

I have a question/suggestion about the tokenizer class and custom tokenizers.
I think it would be great to be able to split on more than just spaces, and also to handle newlines. Here are a couple of examples where this is an issue:

In [19]: tokens = nlp("I like green,blue and purple:)")

In [20]: for t in tokens:
    print('|'+t.string+'|', t.pos_)
   ....:     
|I | PRON
|like | VERB
|green,blue | ADJ
|and | CONJ
|purple| ADJ
|:| PUNCT
|)| PUNCT

and

In [21]: tokens = nlp("I like:\ngreen\nblue\npurple\n:)")
In [22]: for t in tokens:
    print('|'+t.string+'|', t.pos_)
   ....:     
|I | PRON
|like| VERB
|:| PUNCT
|
| ADV
|green| ADJ
|
| ADJ
|blue| ADJ
|
| ADJ
|purple| ADJ
|
| NOUN
|:)| PUNCT

Ideally we would retrain on a dataset that has newlines in it without stripping them, and then label the newlines as such. Also, in "online writing" people often skip spaces when using punctuation. I am not sure if there are already any pre-tagged datasets where this is the case, but it would help a lot.

So the question is: what would be the easiest way to integrate these into the existing code? So far I'm doing a workaround where I insert spaces after commas if there isn't one already (see the sketch below), but it feels dirty, and I'm not sure if I should be replacing newlines with spaces, because a good amount of information is lost by foregoing the distinction.
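The stop-gap I mentioned looks roughly like this (a sketch only; it just forces a space after any comma that is glued to the next word before handing the text to nlp()):

import re

def pre_tokenize(text):
    # Workaround sketch: "green,blue" becomes "green, blue" before tokenization.
    return re.sub(r',(?=\S)', ', ', text)

tokens = nlp(pre_tokenize(u"I like green,blue and purple:)"))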

Thanks for the library by the way.

Unicode trouble with lemma_

from spacy.en import English
nlp = English()
nlp(u'me…')[0].lemma_

results in an exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "spacy/tokens.pyx", line 439, in spacy.tokens.Token.lemma_.__get__ (spacy/tokens.cpp:8854)
  File "spacy/strings.pyx", line 73, in spacy.strings.StringStore.__getitem__ (spacy/strings.cpp:1652)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe2 in position 2: unexpected end of data

token boolean features

The example of boolean features (is_alpha, is_digit, is_lower, is_title, is_upper, is_ascii, is_punct, like_url, like_num) shown on the quickstart page doesn't work for me:

>>> lexeme = nlp.vocab[u'Apple']
>>> lexeme.is_alpha, is_upper
True, False
>>> tokens = nlp('Apple computers')
>>> tokens[0].is_alpha, tokens[0].is_upper
>>> True, False
>>> from spacy.en.attrs import IS_ALPHA, IS_UPPER
>>> tokens.to_array((IS_ALPHA, IS_UPPER))[0]
array([1, 0])

For example, I'm getting: AttributeError: 'spacy.tokens.Token' object has no attribute 'is_alpha'

Instead, I'm calling from spacy.orth import * and then calling, for example, is_punct(token). There is also the inconsistency that spacy.orth.like_number is listed as like_num under boolean features on the quickstart page.

spaCy 0.8.2 Attribute Error: 'Config' object has no attribute 'labels'

Invocation of the spacy.en.English tokenizer results in an AttributeError. Fresh install of spaCy 0.8.2 from PyPI.

Abbreviated stack trace:

  File "/python2.7/site-packages/spacy/en/__init__.py", line 191, in __call__
    self.parser(tokens)
  File "/python2.7/site-packages/spacy/en/__init__.py", line 191, in __call__
  File "/python2.7/site-packages/spacy/en/__init__.py", line 117, in parser
    self.parser(tokens)
    self.ParserTransitionSystem)
  File "/python2.7/site-packages/spacy/en/__init__.py", line 117, in parser
  File "spacy/syntax/parser.pyx", line 74, in spacy.syntax.parser.GreedyParser.__init__ (spacy/syntax/parser.cpp:4120)
    self.ParserTransitionSystem)
AttributeError: 'Config' object has no attribute 'labels'
  File "spacy/syntax/parser.pyx", line 74, in spacy.syntax.parser.GreedyParser.__init__ (spacy/syntax/parser.cpp:4120)
AttributeError: 'Config' object has no attribute 'labels'

token.idx for punctuation characters is sometimes incorrect

If you parse a sentence like "to walk, do foo", the .idx for tokens[2] is 4 rather than the expected 7. Strangely this only seems to happen when the word before the punctuation mark is more than three characters long, and when that word is not the first word in the sentence (so e.g. "hello, world" works fine).
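A minimal reproduction of what I mean (the comma sits at character offset 7 in this string):

from spacy.en import English

nlp = English()
tokens = nlp(u"to walk, do foo")
# tokens[2] is the comma; its character offset in the input is 7
print(tokens[2].orth_, tokens[2].idx)   # expected 7, but reportedly gives 4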

spaCy v0.80 python3 incompatibility

In the new multi_words.py RegexMerger, there is a reference to unicode(tokens). In Python 3, there is no "unicode()" function, so it causes spaCy to crash.

from spacy.en import English
nlp = English()
tok = nlp("test test test")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/spacy/en/__init__.py", line 195, in __call__
    self.mwe_merger(tokens)
  File "/spacy/multi_words.py", line 7, in __call__
    for m in regex.finditer(unicode(tokens)):
NameError: name 'unicode' is not defined

NER never recognizes any entities. Is the NER model file provided?

Trying to run the given NER test case, I get no entities in the list. Additionally, if I try making up a test case I always get 0 entities in the returned list. I have downloaded the latest files using

python -m spacy.en.download

and am using v0.82 on Python 3.4

def test_simple_types():
    tokens = nlp(u'Mr. Best flew to New York on Saturday morning.')
    ents = list(tokens.ents)
    assert ents[0].start == 1
    assert ents[0].end == 2
    assert ents[0].label_ == 'PERSON'
    assert ents[1].start == 4
    assert ents[1].end == 6
    assert ents[1].label_ == 'GPE'
    assert ents[2].start == 7
    assert ents[2].end == 8
    assert ents[2].label_ == 'DATE'
    assert ents[3].start == 8
    assert ents[3].end == 9
    assert ents[3].label_ == 'TIME'
Running it fails on the first assertion, because the entity list is empty:

assert ents[0].start == 1
IndexError: list index out of range

Parsing API

I am finding it a little difficult to traverse the dependency parser output. I am using English(u"some text", True, True) to do the parsing. In the tokens output, there is no sibling() method on the tokens as described in the documents. From a token I can get the head, but pulling the children seems a little buggy: it doesn't always have all the children that I can compute from the heads alone. If I parse "the increasing levels of acidity bleached the coral", "of" has the head "levels", but "levels" doesn't have "of" in its children. Also, when enumerating child(0) by incrementing the index value, once you've run out of children it keeps returning the last child rather than null. Great repo overall, looking forward to using it more.
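To make the "of"/"levels" example concrete, this is roughly what I'm looking at (using the lefts/rights iterators, since I'm not sure what the intended children API is):

from spacy.en import English

nlp = English()
tokens = nlp(u"the increasing levels of acidity bleached the coral")

of = tokens[3]
levels = of.head
print(of.orth_, '->', levels.orth_)   # "of" reports "levels" as its head
# ...but "of" does not show up among the children of "levels":
print([t.orth_ for t in levels.lefts], [t.orth_ for t in levels.rights])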

Slang

Is there a method you would prefer for adding slang to the part-of-speech library?
Would you allow this, and how would I go about doing it?
For example, do contractions like 'sup?' get reduced to a verb or a noun?

Import/build issue

Both pip install and installing from source end "successfully," but

python -m spacy.en.download

as well as

from spacy.en import English
throw an ImportError
      5 from .. import orth
----> 6 from ..vocab import Vocab
      7 from ..tokenizer import Tokenizer
      8 from ..syntax.parser import GreedyParser

ImportError: dlopen(/Users/dk/anaconda/lib/python2.7/site-packages/spacy/vocab.so, 2): Symbol not found: __ZSt20__throw_length_errorPKc
  Referenced from: /Users/dk/anaconda/lib/python2.7/site-packages/spacy/vocab.so
  Expected in: dynamic lookup

I am using the conda dist of python on Mac 10.10

Python 2.7.8 |Anaconda 2.1.0 (x86_64)| (default, Aug 21 2014, 15:21:46)

and installed spaCy 0.40

This seems like a build issue, but no errors occur during the build... Also, I was able to install everything properly on an Ubuntu box with the same conda Python dist.

If you have any ideas, that would be much appreciated.

Initialization fails (Python 3.4)

In [1]: import spacy.en

In [2]: nlp = spacy.en.English()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-2-c68587acddc5> in <module>()
----> 1 nlp = spacy.en.English()

/usr/local/lib/python3.4/site-packages/spacy/en/__init__.py in __init__(self, data_dir)
     74         else:
     75             tok_data_dir = path.join(data_dir, 'tokenizer')
---> 76             tok_rules, prefix_re, suffix_re, infix_re = read_lang_data(tok_data_dir)
     77             prefix_re = re.compile(prefix_re)
     78             suffix_re = re.compile(suffix_re)

/usr/local/lib/python3.4/site-packages/spacy/util.py in read_lang_data(data_dir)
     14 def read_lang_data(data_dir):
     15     with open(path.join(data_dir, 'specials.json')) as file_:
---> 16         tokenization = json.load(file_)
     17     prefix = read_prefix(data_dir)
     18     suffix = read_suffix(data_dir)

/usr/local/Cellar/python3/3.4.2_1/Frameworks/Python.framework/Versions/3.4/lib/python3.4/json/__init__.py in load(fp, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    266         cls=cls, object_hook=object_hook,
    267         parse_float=parse_float, parse_int=parse_int,
--> 268         parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
    269 
    270 

/usr/local/Cellar/python3/3.4.2_1/Frameworks/Python.framework/Versions/3.4/lib/python3.4/json/__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    316             parse_int is None and parse_float is None and
    317             parse_constant is None and object_pairs_hook is None and not kw):
--> 318         return _default_decoder.decode(s)
    319     if cls is None:
    320         cls = JSONDecoder

/usr/local/Cellar/python3/3.4.2_1/Frameworks/Python.framework/Versions/3.4/lib/python3.4/json/decoder.py in decode(self, s, _w)
    341 
    342         """
--> 343         obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    344         end = _w(s, end).end()
    345         if end != len(s):

/usr/local/Cellar/python3/3.4.2_1/Frameworks/Python.framework/Versions/3.4/lib/python3.4/json/decoder.py in raw_decode(self, s, idx)
    357         """
    358         try:
--> 359             obj, end = self.scan_once(s, idx)
    360         except StopIteration as err:
    361             raise ValueError(errmsg("Expecting value", s, err.value)) from None

ValueError: Expecting property name enclosed in double quotes: line 567 column 1 (char 15406)

AssertionError when parsing empty string

I happened to stumble over this while parsing a large dataset: spaCy throws an AssertionError when trying to parse an empty string.

Minimal Example:

from spacy.en import English
nlp = English()
nlp(u"")

results in

Traceback (most recent call last):
  File "test.py", line 3, in <module>
    nlp(u"")
  File "/usr/local/lib/python2.7/dist-packages/spacy/en/__init__.py", line 149, in __call__
    self.parser(tokens)
  File "spacy/syntax/parser.pyx", line 77, in spacy.syntax.parser.GreedyParser.__call__ (spacy/syntax/parser.cpp:4122)
  File "spacy/syntax/_state.pyx", line 128, in spacy.syntax._state.init_state (spacy/syntax/_state.cpp:2715)
  File "spacy/syntax/_state.pyx", line 33, in spacy.syntax._state.push_stack (spacy/syntax/_state.cpp:1855)
AssertionError

token.lemma_ for the token "didn't" does not exist

The token "didn't" is correctly separated into ["did", "n't"], but in this case the lemma for "did" does not correctly register as "do", instead it is an empty string.
tokens = nlp("didn't")
print(tokens[0].lemma_)
>>> empty string

However it works when the token is just "did"
tokens = nlp("did")
print(tokens[0].lemma_)
>>> do

And "isn't" works perfectly, correctly being split as ["is", "n't"] with the lemma "be" for token[0]
tokens = nlp("isn't")
print(tokens[0].lemma_)
>>> be

ValueError: max() arg is an empty sequence

I was running spaCy through some sentences and saw that the sentence below throws a ValueError.

from spacy.en import English
spacy_nlp = English()
text = u'Talks given by women had a slightly higher number of questions asked (3.2$\pm$0.2) than talks given by men (2.6$\pm$0.1).'
tokens = spacy_nlp(text)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/spacy/en/__init__.py", line 195, in __call__
    self.mwe_merger(tokens)
  File "/usr/local/lib/python2.7/site-packages/spacy/multi_words.py", line 8, in __call__
    tokens.merge(m.start(), m.end(), tag, m.group(), entity_type)
  File "spacy/tokens.pyx", line 329, in spacy.tokens.Tokens.merge (spacy/tokens.cpp:6701)
ValueError: max() arg is an empty sequence

I have the newest spacy installed and up to date requirements.

TypeError: unsupported operand type(s) for *

I'm getting

TypeError: unsupported operand type(s) for *: 'spacy.lexeme.Lexeme' and 'spacy.tokens.Token'

I'm running the example code in an IPython notebook.
I suspect it has something to do with a multiplication by 0/null (?) from the empty vector array, which should have values:

In [6]:
pleaded.repvec[:5]

Out[6]:
array([ 0.,  0.,  0.,  0.,  0.], dtype=float32)
  • I upgraded to 0.4

full code and errors:

In [1]:
import spacy.en
from spacy.parts_of_speech import ADV
nlp = spacy.en.English()
In [2]:
# Load the pipeline, and call it with some text.

s = "'Give it back,' he pleaded abjectly, 'it’s mine.'"
s1 = s.decode('utf-8')

probs = [lex.prob for lex in nlp.vocab]
probs.sort()
is_adverb = lambda tok: tok.pos == ADV and tok.prob < probs[-1000]
tokens = nlp(s1)
print(''.join(tok.string.upper() if is_adverb(tok) else tok.string for tok in tokens))
'Give it back,' he pleaded ABJECTLY, 'it’s mine.'

In [3]:
b = 'back'
s2 = b.decode('utf-8')
nlp.vocab[s2].prob
Out[3]:
-7.403977394104004
In [4]:
pleaded = tokens[8]
In [5]:
pleaded.repvec.shape
Out[5]:
(300,)
In [6]:
pleaded.repvec[:5]
Out[6]:
array([ 0.,  0.,  0.,  0.,  0.], dtype=float32)
In [8]:
from numpy import dot
from numpy.linalg import norm
cosine = lambda v1, v2: dot(v1, v2) / (norm(v1), norm(v2))
words = [w for w in nlp.vocab if w.lower]
words.sort(key=lambda w: cosine(w, pleaded))
words.reverse()

#print('1-20', ', '.join(w.orth_ for w in words[0:20]))
#print('50-60', ', '.join(w.orth_ for w in words[50:60]))
#print('100-110', ', '.join(w.orth_ for w in words[100:110]))
#print('1000-1010', ', '.join(w.orth_ for w in words[1000:1010]))
#print('50000-50010', ', '.join(w.orth_ for w in words[50000:50010]))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-8-3dfcfec488f6> in <module>()
      3 cosine = lambda v1, v2: dot(v1, v2) / (norm(v1), norm(v2))
      4 words = [w for w in nlp.vocab if w.lower]
----> 5 words.sort(key=lambda w: cosine(w, pleaded))
      6 words.reverse()
      7 

<ipython-input-8-3dfcfec488f6> in <lambda>(w)
      3 cosine = lambda v1, v2: dot(v1, v2) / (norm(v1), norm(v2))
      4 words = [w for w in nlp.vocab if w.lower]
----> 5 words.sort(key=lambda w: cosine(w, pleaded))
      6 words.reverse()
      7 

<ipython-input-8-3dfcfec488f6> in <lambda>(v1, v2)
      1 from numpy import dot
      2 from numpy.linalg import norm
----> 3 cosine = lambda v1, v2: dot(v1, v2) / (norm(v1), norm(v2))
      4 words = [w for w in nlp.vocab if w.lower]
      5 words.sort(key=lambda w: cosine(w, pleaded))

TypeError: unsupported operand type(s) for *: 'spacy.lexeme.Lexeme' and 'spacy.tokens.Token'
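A further suspicion (just a guess on my part): the cosine lambda is being applied to the Lexeme/Token objects themselves rather than to their repvec arrays, and the denominator builds a tuple of the norms instead of their product. Something like the following would avoid both problems:

# Compare the vectors (repvec), not the objects, and divide by the product
# of the norms rather than a tuple of them.
cosine = lambda v1, v2: dot(v1, v2) / (norm(v1) * norm(v2))
words.sort(key=lambda w: cosine(w.repvec, pleaded.repvec))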

Pronoun Detection

It seems that spaCy counts all pronouns as normal nouns.

import spacy.en
nlp = spacy.en.English()

In [34]: for tok in nlp(u"You and I make us"):
   ....:     print tok.string, tok.pos
   ....:     
You  6
and  4
I  6
make  10
us 6

In [35]: from spacy.parts_of_speech import PRON, NOUN

In [36]: NOUN, PRON
Out[36]: (6, 8)

value error on minimal tagging example

I created a minimal script:

import spacy.en
nlp = spacy.en.English()
tokens = nlp(u"The cow jumped over the moon.", tag=True, parse=False)

And then ran it under Python 2.7.9, spaCy 0.70, OS X 10.10.3.

It crashes, giving the following error:

❯ python test_spacy.py
Traceback (most recent call last):
  File "test_spacy.py", line 2, in <module>
    nlp = spacy.en.English()
  File "/Users/jordansuchow/.virtualenvs/spacy/lib/python2.7/site-packages/spacy/en/__init__.py", line 76, in __init__
    tok_rules, prefix_re, suffix_re, infix_re = read_lang_data(tok_data_dir)
  File "/Users/jordansuchow/.virtualenvs/spacy/lib/python2.7/site-packages/spacy/util.py", line 16, in read_lang_data
    tokenization = json.load(file_)
  File "/usr/local/Cellar/python/2.7.9/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 290, in load
    **kw)
  File "/usr/local/Cellar/python/2.7.9/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "/usr/local/Cellar/python/2.7.9/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/local/Cellar/python/2.7.9/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 382, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Expecting property name: line 567 column 1 (char 15406)

Tokenization fails on a single word

An example:

toks = nlp(u"foobar")
print [i.string for i in toks]
>> [u'']
toks = nlp(u"foo bar")
print [i.string for i in toks]
>> [u'foo ', u'bar']

failed on build

Hi Matthew,

Tried the master branch earlier today; requirements.txt is missing thinc, and setup.py couldn't find the murmurhash headers, so I had to copy the murmurhash include instructions from thinc.

The humanize, unidecode, and ujson modules are also missing for testing.

Support for collapsed dependencies

It would be great to have support for collapsed dependencies similar to Stanford CoreNLP. For example in the sentence:

"I am moving to Florida"

"Florida" and "moving" aren't directly related because of the "to" particle(florida.head is to). I think the API can work like this:

>>> florida.dependencies(moving)
['prep_to']

Then one could do:

if 'prep_to' in florida.dependencies(moving):
   ...

Discrepancy between sentence segmentation and parse trees

I've bumped into an issue where sentence segmentation (as given by Tokens.sents) doesn't match the parse trees, in that Tokens.sents does not separate independent parse trees. This behavior can be observed on the following text:

"It was a mere past six when we left Baker Street, and it still wanted ten minutes to the hour when we found ourselves in Serpentine Avenue."

I assume this is due to sentence segmentation being done by a separate classifier, so I wouldn't call it a bug, but it can be a usage problem, so I am reporting it. A few examples that I've checked manually indicate that the parse trees give better sentence segmentation than whatever Tokens.sents is based on.

My current workaround idea is to follow each token's dependency-tree path all the way up to the root, and then use the obtained root-node array as sentence labels. This is, however, crude, ugly, and inefficient; it would be nice to have a better solution.
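The workaround sketched out, for reference (crude, and it assumes the root token of each parse tree is its own head; .idx is compared to avoid relying on object identity):

def tree_root(tok):
    # Walk up the dependency path until we hit a token that is its own head (the root).
    while tok.head.idx != tok.idx:
        tok = tok.head
    return tok

# Tokens that share a root belong to the same parse tree / "sentence".
labels = [tree_root(tok).idx for tok in tokens]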

Questions RE your NER

I see spaCy has an NER now. Very nice. I'm curious about how it compares to other NER systems. Have you benchmarked it on a standard dataset? What algorithm are you using? How does it compare to MITIE and Stanford?
