explosion / spacy
💫 Industrial-strength Natural Language Processing (NLP) in Python
Home Page: https://spacy.io
License: MIT License
I'm getting a failure to install, both using pip install spacy and using pip to install the cloned git repository, with the following output during the install:
Requirement already satisfied (use --upgrade to upgrade): cython in /usr/local/lib/python2.7/dist-packages (from -r requirements.txt (line 1))
Downloading/unpacking cymem>=1.11 (from -r requirements.txt (line 2))
Downloading cymem-1.11.tar.gz
Running setup.py (path:/tmp/pip_build_root/cymem/setup.py) egg_info for package cymem
Traceback (most recent call last):
File "<string>", line 17, in <module>
File "/tmp/pip_build_root/cymem/setup.py", line 7, in <module>
exts = cythonize([Extension("cymem.cymem", ["cymem/cymem.pyx"])])
File "/usr/local/lib/python2.7/dist-packages/Cython/Distutils/extension.py", line 87, in __init__
**kw)
TypeError: unbound method __init__() must be called with Extension instance as first argument (got Extension instance instead)
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 17, in <module>
File "/tmp/pip_build_root/cymem/setup.py", line 7, in <module>
exts = cythonize([Extension("cymem.cymem", ["cymem/cymem.pyx"])])
File "/usr/local/lib/python2.7/dist-packages/Cython/Distutils/extension.py", line 87, in __init__
**kw)
TypeError: unbound method __init__() must be called with Extension instance as first argument (got Extension instance instead)
----------------------------------------
Cleaning up...
Command python setup.py egg_info failed with error code 1 in /tmp/pip_build_root/cymem
I've had some problems converting certain token attributes to numpy arrays. From what I understand, extracting TAG attributes should work just as extracting POS attributes does, but it doesn't. A minimal demonstration follows.
Also, it would be great if dependency types (and perhaps even token.head indices) could be extracted using the same API.
import spacy.en
import numpy as np
nlp = spacy.en.English()
toks = nlp(u"This is a simple sentence.", True, True)
print "Extracting google POS"
print np.array(toks.to_array([spacy.en.attrs.POS]))
print "Extracting detailed TAG doesn't"
print np.array(toks.to_array([spacy.en.attrs.TAG]))
print "Even though the detailed TAG is detected"
print [t.tag for t in toks]
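For what it's worth, here is the shape the requested API might take, reusing toks from above. This is purely illustrative: DEP and HEAD are hypothetical attribute constants here, not ones I've confirmed exist in the current release.
print "Hypothetical: dependency labels and head offsets via the same API"
print np.array(toks.to_array([spacy.en.attrs.DEP]))   # DEP is a hypothetical constant
print np.array(toks.to_array([spacy.en.attrs.HEAD]))  # HEAD is a hypothetical constant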
I had to look at .travis.yml to get that information. It's a great selling point, easy to mention ("NLP with Python 2/3 and Cython"), and one of the first things we look at when considering Python libraries. Please don't bury it only in the docs. :)
System specs: Linux Mint 16, 64-bit, Python 3.4, using Anaconda
stack trace:
Traceback (most recent call last):
File "/blah/blah/test.py", line 324, in <module>
induceFail()
File "/blah/blah/test.py", line 319, in induceFail
print(l.orth_)
File "spacy/tokens.pyx", line 427, in spacy.tokens.Token.orth_.__get__ (spacy/tokens.cpp:8080)
File "spacy/strings.pyx", line 71, in spacy.strings.StringStore.__getitem__ (spacy/strings.cpp:1631)
IndexError: 3481370184
Alright, this is a weird one. It involves passing tokens around in different lists, and only certain configurations will reliably cause the error. This error cropped up in a dataset I am actually using, where I wanted to pass the token objects around but only in certain cases. I got around this error by begrudgingly just passing the string representation of the token object. Anyway, I've created a small set of code that reliably induces the error on my machine so you can hopefully debug it:
from spacy.en import English
def blah(toks):
    ''' Take the tokens from nlp(), append them to a list, return the list '''
    lst = []
    for tok in toks:
        lst.append(tok)
        # printing the tok.orth_ at this point works just fine
        print(tok.orth_)
    return lst

def induceFail():
    nlp = English()
    samples = ["a", "test blah wat okay"]
    lst = []
    for sample in samples:
        # Go through all the samples, call nlp() on each to get tokens,
        # pass those tokens to the blah() function, get a list back and put all results in another list
        lst.extend(blah(nlp(sample)))
    # go through the list of all tokens and try to print orth_
    for l in lst:
        # here is where the error is thrown
        print(l.orth_)
induceFail()
Now, if you replace samples with the following sample, it works just fine!
samples = ["will this break", "test blah wat okay"]
And note that printing other attributes like pos_, dep_, and lower_ works without causing an IndexError, but they don't print the correct thing, which leads me to believe some funny pointer-dereferencing bug is causing this (the address is incorrect for certain symbols or something). It seems to only throw an error on the orth_ attribute. For example, I changed to printing lower_ and got the following output:
a
test
blah
wat
okay
a
test
blah
neighbour <-------------- Wait...what? Neighbour isn't in any of the samples...
okay
Notice neighbour? Where the heck did it get that? This is in an entirely new Python process. I'm not doing any multiprocessing / multithreading. I'm not using multiple instances of the nlp() object. So somehow an address got messed up; it's getting into memory it's not supposed to, but it is a valid address and it prints what is there? I have no idea.
So then I ran it again. And this time, instead of "neighbour" it was "Milk".
I hope the error is reproducible on your end. I apologize if it isn't. I swear I'm not crazy, you might just need to change the values in the samples list.
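My best guess, and it is only a guess, is that the Token objects are lightweight views into the Tokens container returned by nlp(), so once that container is garbage collected the tokens point at freed memory. Under that assumption, a sketch of a workaround is to keep the containers alive alongside the tokens:
from spacy.en import English

nlp = English()
samples = ["a", "test blah wat okay"]
docs = []      # keep each Tokens container alive
tokens = []
for sample in samples:
    doc = nlp(sample)
    docs.append(doc)                    # without this reference, the container may be freed
    tokens.extend(tok for tok in doc)

for tok in tokens:
    print(tok.orth_)                    # no dangling lookups if the assumption holds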
As discussed in issue #32, printing lemmas of unicode words causes spaCy to crash. It was closed because, I believe, rsomeon did not want to contribute their patch and preferred that you make your own changes. I don't believe the issue has been fixed, as I just tested it in v0.81 and I still get a crash.
from spacy.en import English
s = "Fiancé"
nlp = English()
tok = nlp(s)
print(tok[0].lemma_)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "spacy/tokens.pyx", line 585, in spacy.tokens.Token.lemma_.__get__ (spacy/tokens.cpp:10941)
File "spacy/strings.pyx", line 73, in spacy.strings.StringStore.__getitem__ (spacy/strings.cpp:1671)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 5: unexpected end of data
spaCy installed fine with pip. Then I ran the download command and...
➜ text-processing python -m spacy.en.download
Traceback (most recent call last):
File "/usr/local/Cellar/python/2.7.9/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 151, in _run_module_as_main
mod_name, loader, code, fname = _get_module_details(mod_name)
File "/usr/local/Cellar/python/2.7.9/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 101, in _get_module_details
loader = get_loader(mod_name)
File "/usr/local/Cellar/python/2.7.9/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pkgutil.py", line 464, in get_loader
return find_loader(fullname)
File "/usr/local/Cellar/python/2.7.9/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pkgutil.py", line 474, in find_loader
for importer in iter_importers(fullname):
File "/usr/local/Cellar/python/2.7.9/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pkgutil.py", line 430, in iter_importers
__import__(pkg)
File "/usr/local/lib/python2.7/site-packages/spacy/en/__init__.py", line 6, in <module>
from ..vocab import Vocab
File ".env/lib/python2.7/site-packages/Cython/Includes/numpy/__init__.pxd", line 861, in init spacy.vocab (spacy/vocab.cpp:9066)
ValueError: numpy.ufunc has the wrong size, try recompiling
Hey, the compile (in fact, the Cython translation) is failing due to an uncommitted file (one that should have been in the repo for 9 days). Since this is now being advertised, I think you should probably move to off-site CI and tagged releases...
The parser used to correctly parse several example sentences that I have, but it is now incorrectly parsing them. I'm not sure when it stopped working, since I hadn't checked its output in a while. I am on python 3.4, spacy v0.83, fully updated data with "python -m spacy.en.download all"
Examples of errors:
This one is from your blog (https://honnibal.wordpress.com/2013/12/18/a-simple-fast-algorithm-for-natural-language-dependency-parsing/)
def printDeps(toks):
    for tok in toks:
        print(tok.orth_, tok.dep_, tok.pos_, [t.orth_ for t in tok.lefts], [t.orth_ for t in tok.rights])
toks = nlp("They ate the pizza with anchovies.")
printDeps(toks)
They SUB PRON [] []
ate VERB ['They'] ['pizza', 'with', '.'] <------- error "with" is connected to "ate"
the NMOD DET [] []
pizza OBJ NOUN ['the'] []
with VMOD ADP [] ['anchovies'] <------- error "with" is categorized as verb modifier
anchovies PMOD NOUN [] []
. P PUNCT [] []
toks = nlp("i don't have other assistance")
printDeps(toks)
i SUB NOUN [] []
do VERB ['i'] ["n't", 'have'] <---- Error "do"'s dep_ = "" and dep = 0
n't VMOD ADV [] []
have VC VERB [] ['assistance']
other NMOD ADJ [] []
assistance OBJ NOUN ['other'] []
toks = nlp("I have no other financial assistance available and he certainly won't provide support.")
printDeps(toks)
# add a comma and it works
toks = nlp("I have no other financial assistance available, and he certainly won't provide support.")
printDeps(toks)
I SUB PRON [] []
have VMOD VERB ['I'] ['available'] <------- Error, should have ['assistance'] in right deps
no NMOD DET [] []
other NMOD ADJ [] []
financial NMOD ADJ [] []
assistance SUB NOUN ['no', 'other', 'financial'] [] <----- Error, labeled as SUB not OBJ
available VMOD ADJ ['assistance'] [] <---- Error, labeled as VMOD rather than NMOD
and VMOD CONJ [] []
he SUB PRON [] []
certainly VMOD ADV [] []
wo VERB ['have', 'and', 'he', 'certainly'] ["n't", 'provide', '.'] <---- Error, missing dep_
n't VMOD ADV [] []
provide VC VERB [] ['support']
support OBJ NOUN [] []
. P PUNCT [] []
I SUB PRON [] []
have VMOD VERB ['I'] ['assistance']
no NMOD DET [] []
other NMOD ADJ [] []
financial NMOD ADJ [] []
assistance OBJ NOUN ['no', 'other', 'financial'] ['available']
available NMOD ADJ [] []
, P PUNCT [] []
and VMOD CONJ [] []
he SUB PRON [] []
certainly VMOD ADV [] []
wo VERB ['have', ',', 'and', 'he', 'certainly'] ["n't", 'provide', '.'] <--- Error, missing dep_
n't VMOD ADV [] []
provide VC VERB [] ['support']
support OBJ NOUN [] []
. P PUNCT [] []
Are there any plans to add NER capabilities to spaCy soon? If not, any recommendations on the most modern techniques for doing so? (E.g., perhaps using the word vector representations?)
Steps to replicate:
The new data is written into the same temporary folder as the old data, leading to a shutil error. It should probably either (1) check whether the right data is already available and use it instead of redownloading it, or (2) overwrite the existing data. Maybe you can track versions using an MD5 hash or the spaCy version or something like that (e.g., downloading data into /tmp/spaCy/v0.7/).
I'm on OS X 10.10.3, spaCy 0.70, Python 2.7.9.
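A minimal sketch of idea (1), assuming nothing about spaCy's internals; fetch_data is a hypothetical stand-in for the actual download routine:
import os

def ensure_data(version, base='/tmp/spaCy'):
    # Give each release its own folder, e.g. /tmp/spaCy/v0.7/, so a new
    # version never collides with stale files from an older download.
    target = os.path.join(base, version)
    if os.path.isdir(target):
        return target                  # data for this version is already on disk
    os.makedirs(target)
    fetch_data(version, target)        # hypothetical download routine
    return target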
I just followed the quickstart, and the last two lines of the following are wrong:
>>> from __future__ import unicode_literals # If Python 2
>>> from spacy.en import English
>>> nlp = English()
>>> tokens = nlp(u'I ate the pizza with anchovies.')
>>> pizza = tokens[3]
>>> (pizza.orth, pizza.orth_, pizza.head.lemma, pizza.head.lemma_)
... (14702, 'pizza', 14702, 'ate')
Current version outputs:
In [13]: (pizza.orth, pizza.orth_, pizza.head.lemma, pizza.head.lemma_)
Out[13]: (14702, u'pizza', 669, u'eat')
Thanks!
Might you allow collapsed dependencies to be easily obtained from the output of the dependency parser via some option?
Not sure if this is a bug or a feature, but I just thought I would note it:
>>>tokens = nlp(u"I see you.", tag=True, parse=False)
>>>tokens[0].lemma_
'-PRON-'
>>>tokens[2].lemma_
'-PRON-'
Shouldn't the lemmas be "I" and "you"?
Thanks for the nice work on spaCy!
-Cyrus
Under python 3.4, spaCy 0.33, printing tokens from the spaCy example:
import spacy.en
nlp = spacy.en.English()
tokens = nlp("‘Give it back,’ he pleaded abjectly, ‘it’s mine.’", tag=True, parse=False)
print(tokens)
Fails with:
Traceback (most recent call last):
File "t.py", line 7, in <module>
print(tokens)
File "spacy/tokens.pyx", line 140, in spacy.tokens.Tokens.__str__ (spacy/tokens.cpp:4222)
File "spacy/tokens.pyx", line 140, in spacy.tokens.Tokens.__str__ (spacy/tokens.cpp:4222)
File "spacy/tokens.pyx", line 140, in spacy.tokens.Tokens.__str__ (spacy/tokens.cpp:4222)
File "spacy/tokens.pyx", line 140, in spacy.tokens.Tokens.__str__ (spacy/tokens.cpp:4222)
... above repeats until ...
RuntimeError: maximum recursion depth exceeded while calling a Python object
I'd like to use spaCy as part of a command line utility that will run an analysis over a single document. The parsing and tagging is blazingly fast, which is great. But calling spacy.en.English() takes over a second on my system, which is 10× too long for my purposes. Is there any hope for me?
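In case it's useful to others, the workaround I'm considering is to pay the load cost once and keep the process alive, reading documents from stdin rather than re-launching per document. A sketch:
import sys
import spacy.en

nlp = spacy.en.English()              # slow load, but paid once per process
for line in sys.stdin:
    tokens = nlp(line.strip().decode('utf-8'))   # on Python 3, drop the decode
    # ... run the analysis and emit the results for this document ...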
When trying to install spaCy into my home directory with
pip install --user spacy
I run into a problem with MurmurHash. Are there perhaps some hard coded paths in that package? Here's what I get from pip:
Downloading spacy-0.40.tar.gz (24.3MB): 24.3MB downloaded
Running setup.py (path:/tmp/pip_build_patrick/spacy/setup.py) egg_info for package spacy
zip_safe flag not set; analyzing archive contents...
headers_workaround.__init__: module references __file__
Installed /tmp/pip_build_patrick/spacy/headers_workaround-0.17-py2.7.egg
Traceback (most recent call last):
File "<string>", line 17, in <module>
File "/tmp/pip_build_patrick/spacy/setup.py", line 138, in <module>
main(MOD_NAMES, use_cython)
File "/tmp/pip_build_patrick/spacy/setup.py", line 125, in main
run_setup(exts)
File "/tmp/pip_build_patrick/spacy/setup.py", line 113, in run_setup
headers_workaround.install_headers('murmurhash')
File "/tmp/pip_build_patrick/spacy/headers_workaround-0.17-py2.7.egg/headers_workaround/__init__.py", line 31, in install_headers
shutil.copy(path.join(src_dir, filename), path.join(dest_dir, filename))
File "/usr/lib/python2.7/shutil.py", line 119, in copy
copyfile(src, dst)
File "/usr/lib/python2.7/shutil.py", line 83, in copyfile
with open(dst, 'wb') as fdst:
IOError: [Errno 13] Permission denied: '/usr/include/murmurhash/MurmurHash3.h'
Complete output from command python setup.py egg_info:
zip_safe flag not set; analyzing archive contents...
headers_workaround.__init__: module references __file__
Installed /tmp/pip_build_patrick/spacy/headers_workaround-0.17-py2.7.egg
running egg_info
creating pip-egg-info/spacy.egg-info
writing requirements to pip-egg-info/spacy.egg-info/requires.txt
writing pip-egg-info/spacy.egg-info/PKG-INFO
writing top-level names to pip-egg-info/spacy.egg-info/top_level.txt
writing dependency_links to pip-egg-info/spacy.egg-info/dependency_links.txt
writing manifest file 'pip-egg-info/spacy.egg-info/SOURCES.txt'
warning: manifest_maker: standard file '-c' not found
reading manifest file 'pip-egg-info/spacy.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'pip-egg-info/spacy.egg-info/SOURCES.txt'
Traceback (most recent call last):
File "<string>", line 17, in <module>
File "/tmp/pip_build_patrick/spacy/setup.py", line 138, in <module>
main(MOD_NAMES, use_cython)
File "/tmp/pip_build_patrick/spacy/setup.py", line 125, in main
run_setup(exts)
File "/tmp/pip_build_patrick/spacy/setup.py", line 113, in run_setup
headers_workaround.install_headers('murmurhash')
File "/tmp/pip_build_patrick/spacy/headers_workaround-0.17-py2.7.egg/headers_workaround/__init__.py", line 31, in install_headers
shutil.copy(path.join(src_dir, filename), path.join(dest_dir, filename))
File "/usr/lib/python2.7/shutil.py", line 119, in copy
copyfile(src, dst)
File "/usr/lib/python2.7/shutil.py", line 83, in copyfile
with open(dst, 'wb') as fdst:
IOError: [Errno 13] Permission denied: '/usr/include/murmurhash/MurmurHash3.h'
----------------------------------------
Cleaning up...
Command python setup.py egg_info failed with error code 1 in /tmp/pip_build_patrick/spacy
Storing debug log for failure in /home/patrick/.pip/pip.log
Could you provide some insights on training custom datasets for spaCy? Looking at the code in vocab.pyx makes me think that, if it isn't already possible, it should be easy to load the Google word2vec data through spaCy. (I'm not very familiar with that format just yet.)
...
string_id = self.strings[chars[:word_len]]
while string_id >= vectors.size():
    vectors.push_back(EMPTY_VEC)
assert vec != NULL
vectors[string_id] = vec
...
Tried compiling this using pip install from the Visual Studio 2010 and 2013 command prompts, but it fails due to a redefinition error from MurmurHash3.h:
https://msdn.microsoft.com/en-us/library/x7t11wke.aspx
spaCy maintains a global mapping of strings to integers. Currently the integer value is named "foo", and the string value is named "foo_".
I now think I prefer to have the string value named "foo", and the integer value named "foo_". I had convinced myself that it should be possible to avoid using the string attributes almost entirely, but in my own use, I find myself needing these attributes a lot.
What do you think? Should I keep token.orth as the integer ID, or should I move the integer IDs to the underscored attributes?
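To make the trade-off concrete, here is the current convention, with illustrative values following the quickstart example:
tokens = nlp(u'I ate the pizza with anchovies.')
pizza = tokens[3]
pizza.orth     # integer ID, e.g. 14702
pizza.orth_    # unicode string, u'pizza'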
I've noticed that the dependency relation types that I get from the dependency parser in spaCy are a bit different from the Stanford dependencies. Is there some kind of mapping between these dependency types and the Stanford dependencies?
I am building up a Dockerfile which includes spacy based on the docker/scipyserver.
The ipython developers support both python 2 and 3, so I figured I'd follow along and install spacy into both.
However, I have found that only the install that goes last actually sticks:
# install requirements
RUN pip install spacy && pip3 install spacy
# downloads a bunch of data
RUN python -m spacy.en.download && python3 -m spacy.en.download
This will always fail, as the Python 2 install is no longer importable. Or more simply:
# install requirements
RUN pip install spacy && pip3 install spacy
RUN python -c "import spacy"
For now, I'm getting away with just supporting python3, but this might be something worth looking into.
Thanks for all the hard work!
When I follow the quick-start install steps, lexeme data exists at /usr/local/lib/python2.7/dist-packages/spacy/en/data/vocab/lexemes.bin, but somehow it isn't being loaded.
import spacy.en
nlp = spacy.en.English()
nlp.vocab[u'the'].prob #=> 0.0
nlp.vocab[u'not'].prob #=> 0.0
I've also tried loading manually with:
nlp.vocab.load_lexemes("/usr/local/lib/python2.7/dist-packages/spacy/en/data/vocab/lexemes.bin")
I just upgraded to 0.70, and when I try creating an instance of spacy.en.English, like this:
import spacy.en
nlp = spacy.en.English()
ValueError Traceback (most recent call last)
in ()
----> 1 nlp = spacy.en.English()
//anaconda/lib/python2.7/site-packages/spacy/en/__init__.pyc in __init__(self, data_dir)
74 else:
75 tok_data_dir = path.join(data_dir, 'tokenizer')
---> 76 tok_rules, prefix_re, suffix_re, infix_re = read_lang_data(tok_data_dir)
77 prefix_re = re.compile(prefix_re)
78 suffix_re = re.compile(suffix_re)
//anaconda/lib/python2.7/site-packages/spacy/util.pyc in read_lang_data(data_dir)
14 def read_lang_data(data_dir):
15 with open(path.join(data_dir, 'specials.json')) as file_:
---> 16 tokenization = json.load(file_)
17 prefix = read_prefix(data_dir)
18 suffix = read_suffix(data_dir)
//anaconda/lib/python2.7/json/__init__.pyc in load(fp, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
288 parse_float=parse_float, parse_int=parse_int,
289 parse_constant=parse_constant, object_pairs_hook=object_pairs_hook,
--> 290 **kw)
291
292
//anaconda/lib/python2.7/json/__init__.pyc in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
336 parse_int is None and parse_float is None and
337 parse_constant is None and object_pairs_hook is None and not kw):
--> 338 return _default_decoder.decode(s)
339 if cls is None:
340 cls = JSONDecoder
//anaconda/lib/python2.7/json/decoder.pyc in decode(self, s, _w)
364
365 """
--> 366 obj, end = self.raw_decode(s, idx=_w(s, 0).end())
367 end = _w(s, end).end()
368 if end != len(s):
//anaconda/lib/python2.7/json/decoder.pyc in raw_decode(self, s, idx)
380 """
381 try:
--> 382 obj, end = self.scan_once(s, idx)
383 except StopIteration:
384 raise ValueError("No JSON object could be decoded")
ValueError: Expecting property name: line 567 column 1 (char 15406)
spaCy correctly tokenizes capital "I" + contraction ('d, 'm, 'll, 've) e.g.:
from spacy.en import English
nlp = English()
tok = nlp("I'm")
print([x.lower_ for x in tok])
>>> ['i', "'m"]
but when the "I" is a lowercase ("i") it does not tokenize into two tokens:
from spacy.en import English
nlp = English()
tok = nlp("i'm")
print([x.lower_ for x in tok])
>>> ["i'm"]
Not a big deal, and this may be the intent, since we don't know if the user meant capital "I", but I can't think of any problems that would happen if it tokenized the lowercase version into two.
I have a question/suggestion re tokenizer class and custom tokenizers.
I think it would be great to have the ability to split on custom delimiters besides spaces, and also to handle newlines. Here are a couple of examples where this is an issue:
In [19]: tokens = nlp("I like green,blue and purple:)")
In [20]: for t in tokens:
print('|'+t.string+'|', t.pos_)
....:
|I | PRON
|like | VERB
|green,blue | ADJ
|and | CONJ
|purple| ADJ
|:| PUNCT
|)| PUNCT
and
In [21]: tokens = nlp("I like:\ngreen\nblue\npurple\n:)")
In [22]: for t in tokens:
print('|'+t.string+'|', t.pos_)
....:
|I | PRON
|like| VERB
|:| PUNCT
|
| ADV
|green| ADJ
|
| ADJ
|blue| ADJ
|
| ADJ
|purple| ADJ
|
| NOUN
|:)| PUNCT
Ideally we would retrain on a dataset that has newlines in it without stripping them, and then label the newlines as such. Also, in "online writing" people often skip spaces when using punctuation. I am not sure if there are already any pre-tagged datasets where this is the case, but it would help a lot.
So the question is: what would be the easiest way to integrate these into the existing code? So far I'm doing a workaround where I insert a space after a comma if there isn't one already (sketched below), but it feels dirty, and I'm not sure if I should be replacing newlines with spaces, because a good amount of information is lost in foregoing the distinction.
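For reference, the comma workaround is roughly this (it papers over the symptom and loses the original spacing):
import re

def add_comma_spaces(text):
    # Insert a space after any comma glued to a non-space character,
    # e.g. "green,blue" -> "green, blue", before handing the text to nlp().
    return re.sub(r',(?=\S)', ', ', text)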
Thanks for the library by the way.
from spacy.en import English
nlp = English()
nlp(u'me…')[0].lemma_
results in an exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "spacy/tokens.pyx", line 439, in spacy.tokens.Token.lemma_.__get__ (spacy/tokens.cpp:8854)
File "spacy/strings.pyx", line 73, in spacy.strings.StringStore.__getitem__ (spacy/strings.cpp:1652)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe2 in position 2: unexpected end of data
The example of boolean features (is_alpha, is_digit, is_lower, is_title, is_upper, is_ascii, is_punct, like_url, like_num)
shown on the quickstart page doesn't work for me:
>>> lexeme = nlp.vocab[u'Apple']
>>> lexeme.is_alpha, is_upper
True, False
>>> tokens = nlp('Apple computers')
>>> tokens[0].is_alpha, tokens[0].is_upper
>>> True, False
>>> from spacy.en.attrs import IS_ALPHA, IS_UPPER
>>> tokens.to_array((IS_ALPHA, IS_UPPER))[0]
array([1, 0])
For example, I'm getting: AttributeError: 'spacy.tokens.Token' object has no attribute 'is_alpha'
Instead, I'm calling from spacy.orth import * and then calling, for example, is_punct(token). There is also the inconsistency of spacy.orth.like_number being listed as like_num under boolean features on the quickstart page.
Invocation of the spacy.en.English tokenizer results in an AttributeError. Fresh install of spaCy 0.8.2 from PyPI.
Abbreviated stack trace:
File "/python2.7/site-packages/spacy/en/__init__.py", line 191, in __call__
self.parser(tokens)
File "/python2.7/site-packages/spacy/en/__init__.py", line 191, in __call__
File "/python2.7/site-packages/spacy/en/__init__.py", line 117, in parser
self.parser(tokens)
self.ParserTransitionSystem)
File "/python2.7/site-packages/spacy/en/__init__.py", line 117, in parser
File "spacy/syntax/parser.pyx", line 74, in spacy.syntax.parser.GreedyParser.__init__ (spacy/syntax/parser.cpp:4120)
self.ParserTransitionSystem)
AttributeError: 'Config' object has no attribute 'labels'
File "spacy/syntax/parser.pyx", line 74, in spacy.syntax.parser.GreedyParser.__init__ (spacy/syntax/parser.cpp:4120)
AttributeError: 'Config' object has no attribute 'labels'
If you parse a sentence like "to walk, do foo", the .idx for tokens[2] is 4 rather than the expected 7. Strangely this only seems to happen when the word before the punctuation mark is more than three characters long, and when that word is not the first word in the sentence (so e.g. "hello, world" works fine).
In the new multi_words.py RegexMerger, there is a reference to unicode(tokens). In Python 3, there is no "unicode()" function, so it causes spaCy to crash.
from spacy.en import English
nlp = English()
tok = nlp("test test test")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/spacy/en/__init__.py", line 195, in __call__
self.mwe_merger(tokens)
File "/spacy/multi_words.py", line 7, in __call__
for m in regex.finditer(unicode(tokens)):
NameError: name 'unicode' is not defined
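The usual fix, sketched under the assumption that nothing else in multi_words.py relies on bytes, is to alias unicode to str on Python 3:
import sys

if sys.version_info[0] >= 3:
    unicode = str   # on Python 3, str is already unicode text

# regex.finditer(unicode(tokens)) then works on both Python 2 and 3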
Trying to run the given NER test case, I get no entities in the list. Additionally, if I try making up a test case I always get 0 entities in the returned list. I have downloaded the latest files using
python -m spacy.en.download
and am using v0.82 on Python 3.4
def test_simple_types():
    tokens = nlp(u'Mr. Best flew to New York on Saturday morning.')
    ents = list(tokens.ents)
    assert ents[0].start == 1
    assert ents[0].end == 2
    assert ents[0].label_ == 'PERSON'
    assert ents[1].start == 4
    assert ents[1].end == 6
    assert ents[1].label_ == 'GPE'
    assert ents[2].start == 7
    assert ents[2].end == 8
    assert ents[2].label_ == 'DATE'
    assert ents[3].start == 8
    assert ents[3].end == 9
    assert ents[3].label_ == 'TIME'
assert ents[0].start == 1
IndexError: list index out of range
I am finding it a little difficult to traverse the dependency parser output. I am using English(u"some text", True, True) to do the parsing. In the tokens output, there is no sibling() method on the tokens as described in the documents. From a token I can get the head, but pulling the children seems a little buggy: a token doesn't always have all the children I compute for it from the heads alone. If I parse "the increasing levels of acidity bleached the coral", "of" has the head "levels", but "levels" doesn't have "of" in its children. Also, when enumerating child(0) by incrementing the index value, once you've run out of children it keeps returning the last child rather than null. Great repo overall, looking forward to using it more.
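A workaround sketch for the missing children, recomputing them from the head pointers rather than trusting the child accessors (this assumes head lookups return stable, identity-comparable objects):
def children_of(tok, toks):
    # Scan the whole parse and collect every token whose head is tok.
    return [t for t in toks if t.head is tok and t is not tok]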
Is there a method you would prefer for individuals to add slang to the part-of-speech library? Would you allow this, and how would I go about doing it? I.e., do contractions like 'sup?' get reduced to a verb or a noun?
Both pip install and installing from source end "successfully," but
python -m spacy.en.download
as well as
from spacy.en import English
throw an ImportError
5 from .. import orth
----> 6 from ..vocab import Vocab
7 from ..tokenizer import Tokenizer
8 from ..syntax.parser import GreedyParser
ImportError: dlopen(/Users/dk/anaconda/lib/python2.7/site-packages/spacy/vocab.so, 2): Symbol not found: __ZSt20__throw_length_errorPKc
Referenced from: /Users/dk/anaconda/lib/python2.7/site-packages/spacy/vocab.so
Expected in: dynamic lookup
I am using the conda dist of python on Mac 10.10
Python 2.7.8 |Anaconda 2.1.0 (x86_64)| (default, Aug 21 2014, 15:21:46)
and installed spaCy 0.40
This seems like a build issue, but no errors occur during the build... Also, I was able to install everything properly on an Ubuntu box with the same conda Python dist. If you have any ideas, that would be much appreciated.
In [1]: import spacy.en
In [2]: nlp = spacy.en.English()
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-2-c68587acddc5> in <module>()
----> 1 nlp = spacy.en.English()
/usr/local/lib/python3.4/site-packages/spacy/en/__init__.py in __init__(self, data_dir)
74 else:
75 tok_data_dir = path.join(data_dir, 'tokenizer')
---> 76 tok_rules, prefix_re, suffix_re, infix_re = read_lang_data(tok_data_dir)
77 prefix_re = re.compile(prefix_re)
78 suffix_re = re.compile(suffix_re)
/usr/local/lib/python3.4/site-packages/spacy/util.py in read_lang_data(data_dir)
14 def read_lang_data(data_dir):
15 with open(path.join(data_dir, 'specials.json')) as file_:
---> 16 tokenization = json.load(file_)
17 prefix = read_prefix(data_dir)
18 suffix = read_suffix(data_dir)
/usr/local/Cellar/python3/3.4.2_1/Frameworks/Python.framework/Versions/3.4/lib/python3.4/json/__init__.py in load(fp, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
266 cls=cls, object_hook=object_hook,
267 parse_float=parse_float, parse_int=parse_int,
--> 268 parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
269
270
/usr/local/Cellar/python3/3.4.2_1/Frameworks/Python.framework/Versions/3.4/lib/python3.4/json/__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
316 parse_int is None and parse_float is None and
317 parse_constant is None and object_pairs_hook is None and not kw):
--> 318 return _default_decoder.decode(s)
319 if cls is None:
320 cls = JSONDecoder
/usr/local/Cellar/python3/3.4.2_1/Frameworks/Python.framework/Versions/3.4/lib/python3.4/json/decoder.py in decode(self, s, _w)
341
342 """
--> 343 obj, end = self.raw_decode(s, idx=_w(s, 0).end())
344 end = _w(s, end).end()
345 if end != len(s):
/usr/local/Cellar/python3/3.4.2_1/Frameworks/Python.framework/Versions/3.4/lib/python3.4/json/decoder.py in raw_decode(self, s, idx)
357 """
358 try:
--> 359 obj, end = self.scan_once(s, idx)
360 except StopIteration as err:
361 raise ValueError(errmsg("Expecting value", s, err.value)) from None
ValueError: Expecting property name enclosed in double quotes: line 567 column 1 (char 15406)
I happened to stumble over this while parsing a large dataset: spaCy throws an AssertionError when trying to parse an empty string.
Minimal Example:
from spacy.en import English
nlp = English()
nlp(u"")
results in
Traceback (most recent call last):
File "test.py", line 3, in <module>
nlp(u"")
File "/usr/local/lib/python2.7/dist-packages/spacy/en/__init__.py", line 149, in __call__
self.parser(tokens)
File "spacy/syntax/parser.pyx", line 77, in spacy.syntax.parser.GreedyParser.__call__ (spacy/syntax/parser.cpp:4122)
File "spacy/syntax/_state.pyx", line 128, in spacy.syntax._state.init_state (spacy/syntax/_state.cpp:2715)
File "spacy/syntax/_state.pyx", line 33, in spacy.syntax._state.push_stack (spacy/syntax/_state.cpp:1855)
AssertionError
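My stopgap is to guard the call site and skip empty or whitespace-only documents; a sketch:
text = u""                  # whatever the dataset yields
if text.strip():
    tokens = nlp(text)
else:
    tokens = None           # or however empty documents should be represented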
The token "didn't" is correctly separated into ["did", "n't"], but in this case the lemma for "did" does not correctly register as "do", instead it is an empty string.
tokens = nlp("didn't")
print(tokens[0].lemma_)
>>> empty string
However it works when the token is just "did"
tokens = nlp("did")
print(tokens[0].lemma_)
>>> do
And "isn't" works perfectly, correctly being split as ["is", "n't"] with the lemma "be" for token[0]
tokens = nlp("isn't")
print(tokens[0].lemma_)
>>> be
I was running spaCy through some sentences and saw that the sentence below throws a ValueError.
from spacy.en import English
spacy_nlp = English()
text = u'Talks given by women had a slightly higher number of questions asked (3.2$\pm$0.2) than talks given by men (2.6$\pm$0.1).'
tokens = spacy_nlp(text)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/site-packages/spacy/en/__init__.py", line 195, in __call__
self.mwe_merger(tokens)
File "/usr/local/lib/python2.7/site-packages/spacy/multi_words.py", line 8, in __call__
tokens.merge(m.start(), m.end(), tag, m.group(), entity_type)
File "spacy/tokens.pyx", line 329, in spacy.tokens.Tokens.merge (spacy/tokens.cpp:6701)
ValueError: max() arg is an empty sequence
I have the newest spacy installed and up to date requirements.
I'm getting
TypeError: unsupported operand type(s) for *: 'spacy.lexeme.Lexeme' and 'spacy.tokens.Token'
I'm running the example code in an IPython notebook. I suspect it has something to do with a multiplication by 0/null or similar from the empty vector array, which should have values:
In [6]:
pleaded.repvec[:5]
Out[6]:
array([ 0., 0., 0., 0., 0.], dtype=float32)
full code and errors:
In [1]:
import spacy.en
from spacy.parts_of_speech import ADV
nlp = spacy.en.English()
In [2]:
# Load the pipeline, and call it with some text.
s = "'Give it back,' he pleaded abjectly, 'it’s mine.'"
s1 = s.decode('utf-8')
probs = [lex.prob for lex in nlp.vocab]
probs.sort()
is_adverb = lambda tok: tok.pos == ADV and tok.prob < probs[-1000]
tokens = nlp(s1)
print(''.join(tok.string.upper() if is_adverb(tok) else tok.string for tok in tokens))
'Give it back,' he pleaded ABJECTLY, 'it’s mine.'
In [3]:
b = 'back'
s2 = b.decode('utf-8')
nlp.vocab[s2].prob
Out[3]:
-7.403977394104004
In [4]:
pleaded = tokens[8]
In [5]:
pleaded.repvec.shape
Out[5]:
(300,)
In [6]:
pleaded.repvec[:5]
Out[6]:
array([ 0., 0., 0., 0., 0.], dtype=float32)
In [8]:
from numpy import dot
from numpy.linalg import norm
cosine = lambda v1, v2: dot(v1, v2) / (norm(v1), norm(v2))
words = [w for w in nlp.vocab if w.lower]
words.sort(key=lambda w: cosine(w, pleaded))
words.reverse()
#print('1-20', ', '.join(w.orth_ for w in words[0:20]))
#print('50-60', ', '.join(w.orth_ for w in words[50:60]))
#print('100-110', ', '.join(w.orth_ for w in words[100:110]))
#print('1000-1010', ', '.join(w.orth_ for w in words[1000:1010]))
#print('50000-50010', ', '.join(w.orth_ for w in words[50000:50010]))
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-8-3dfcfec488f6> in <module>()
3 cosine = lambda v1, v2: dot(v1, v2) / (norm(v1), norm(v2))
4 words = [w for w in nlp.vocab if w.lower]
----> 5 words.sort(key=lambda w: cosine(w, pleaded))
6 words.reverse()
7
<ipython-input-8-3dfcfec488f6> in <lambda>(w)
3 cosine = lambda v1, v2: dot(v1, v2) / (norm(v1), norm(v2))
4 words = [w for w in nlp.vocab if w.lower]
----> 5 words.sort(key=lambda w: cosine(w, pleaded))
6 words.reverse()
7
<ipython-input-8-3dfcfec488f6> in <lambda>(v1, v2)
1 from numpy import dot
2 from numpy.linalg import norm
----> 3 cosine = lambda v1, v2: dot(v1, v2) / (norm(v1), norm(v2))
4 words = [w for w in nlp.vocab if w.lower]
5 words.sort(key=lambda w: cosine(w, pleaded))
TypeError: unsupported operand type(s) for *: 'spacy.lexeme.Lexeme' and 'spacy.tokens.Token'
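Looking at it again, I think the snippet itself has two bugs, separate from the empty-repvec question: the lambda divides by a tuple rather than a product, and it passes the Lexeme/Token objects to dot() instead of their vectors, which is exactly where the reported TypeError comes from. A corrected sketch using the repvec attribute from the cells above (a zero-norm guard would still be needed, since pleaded.repvec is all zeros here):
cosine = lambda v1, v2: dot(v1, v2) / (norm(v1) * norm(v2))
words = [w for w in nlp.vocab if w.lower]
words.sort(key=lambda w: cosine(w.repvec, pleaded.repvec))
words.reverse()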
It seems that spaCy counts all pronouns as normal nouns.
import spacy.en
nlp = spacy.en.English()
In [34]: for tok in nlp(u"You and I make us"):
....: print tok.string, tok.pos
....:
You 6
and 4
I 6
make 10
us 6
In [35]: from spacy.parts_of_speech import PRON, NOUN
In [36]: NOUN, PRON
Out[36]: (6, 8)
I created a minimal script:
import spacy.en
nlp = spacy.en.English()
tokens = nlp(u"The cow jumped over the moon.", tag=True, parse=False)
And then ran it under Python 2.7.9, spaCy 0.70, OS X 10.10.3.
It crashes, giving the following error:
❯ python test_spacy.py
Traceback (most recent call last):
File "test_spacy.py", line 2, in <module>
nlp = spacy.en.English()
File "/Users/jordansuchow/.virtualenvs/spacy/lib/python2.7/site-packages/spacy/en/__init__.py", line 76, in __init__
tok_rules, prefix_re, suffix_re, infix_re = read_lang_data(tok_data_dir)
File "/Users/jordansuchow/.virtualenvs/spacy/lib/python2.7/site-packages/spacy/util.py", line 16, in read_lang_data
tokenization = json.load(file_)
File "/usr/local/Cellar/python/2.7.9/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 290, in load
**kw)
File "/usr/local/Cellar/python/2.7.9/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 338, in loads
return _default_decoder.decode(s)
File "/usr/local/Cellar/python/2.7.9/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 366, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/local/Cellar/python/2.7.9/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 382, in raw_decode
obj, end = self.scan_once(s, idx)
ValueError: Expecting property name: line 567 column 1 (char 15406)
An example:
toks = nlp(u"foobar")
print [i.string for i in toks]
>> [u'']
toks = nlp(u"foo bar")
print [i.string for i in toks]
>> [u'foo ', u'bar']
Hi Matthew,
Tried the master branch earlier today. requirements.txt is missing thinc, and setup.py couldn't find the murmurhash headers, so I had to copy the murmurhash import instructions from thinc. The humanize, unidecode, and ujson modules are also missing for testing.
How can I add another language?
It would be great to have support for collapsed dependencies similar to Stanford CoreNLP. For example in the sentence:
"I am moving to Florida"
"Florida" and "moving" aren't directly related because of the "to" particle(florida.head
is to
). I think the API can work like this:
>>> florida.dependencies(moving)
['prep_to']
Then one could do:
if 'prep_to' in florida.dependencies(moving):
    ...
I've bumped into an issue where sentence segmentation (as given in Tokens.sents) doesn't match parse trees in that Tokens.sents do not separate independent parse trees. This behavior can be observed on the following text:
"It was a mere past six when we left Baker Street, and it still wanted ten minutes to the hour when we found ourselves in Serpentine Avenue."
I assume this is due to sentence segmentation being done by a separate classifier, so I wouldn't call it a bug, but it can be a usage problem, so I am reporting it. A few examples that I've checked manually indicate that parse trees give better sentence segmentation than whatever Tokens.sents is based on.
My current workaround idea is to follow each token's dependency-tree path all the way up to the root, and then use the obtained root-node array as sentence labels (sketched below). This is, however, crude, ugly, and inefficient; it would be nice to have a better solution.
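The sketch, assuming (as the API seems to imply) that a root token is its own head and that head lookups return identity-comparable objects:
def get_root(tok):
    # Walk head pointers upward; tokens sharing a root belong to
    # the same parse tree, which then serves as the sentence label.
    while tok.head is not tok:
        tok = tok.head
    return tok

labels = [get_root(tok) for tok in tokens]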
I see spaCy has an NER now. Very nice. I'm curious about how it compares to other NER systems. Have you benchmarked it on a standard dataset? What algorithm are you using? How does it compare to MITIE and Stanford?
It would be great to be able to load pretrained word representations, at least the ones from Glove[1] and word2vec [2].
In the same fashion, it would be useful to have a most_similar method able to efficiently retrieve the top n similar words.
[1] http://nlp.stanford.edu/projects/glove/
[2] https://code.google.com/p/word2vec/
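For the most_similar part, a brute-force sketch over the existing vocab and repvec attributes; a real implementation would precompute a normalized vector matrix:
from numpy import dot
from numpy.linalg import norm

def most_similar(nlp, word, n=10):
    target = nlp.vocab[word].repvec
    def score(lex):
        denom = norm(lex.repvec) * norm(target)
        return dot(lex.repvec, target) / denom if denom else 0.0
    # Rank every lexeme in the vocabulary by cosine similarity to the query.
    return sorted(nlp.vocab, key=score, reverse=True)[:n]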
Can you provide an example of how to use spaCy for sentence segmentation? Thanks so much!
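Until there's an official example, something like the following should work, assuming Tokens.sents (mentioned in an earlier issue above) yields sentence spans:
from spacy.en import English

nlp = English()
tokens = nlp(u"This is one sentence. Here is another one.")
for sent in tokens.sents:
    print(sent)    # each item should cover one sentence, if sents behaves as assumed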