Comments (13)
Hmm.
I've never tested this without a virtualenv. I'd assumed it should load the same. I'll try to reproduce this now.
from spacy.
Yeah, it's working when I run it with the pyenv version, but not with the system Python. I've checked and the data's the same, and it's definitely getting loaded.
I suspect the most likely thing is that the way Python and spaCy are being compiled is making a difference. The code to load the binary data is rather brittle; I read the whole struct at once.
Can you run:
python -c "import distutils.sysconfig; print(distutils.sysconfig.get_config_var('CONFIG_ARGS'))"
And tell me what comes back?
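For readers on a Python without `distutils`, the same query can be made through the standard-library `sysconfig` module. This is a minimal sketch rather than anything from the thread; the variable name `config_args` is chosen for illustration, and on some platforms (Windows, for example) the value may simply be `None`.

```python
import sysconfig

# CONFIG_ARGS holds the arguments passed to ./configure when this
# interpreter was built; it may be None where that concept does not apply.
config_args = sysconfig.get_config_var("CONFIG_ARGS")
print(config_args)
```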
from spacy.
Sure, thanks!
'--enable-shared' '--prefix=/usr' '--enable-ipv6' '--enable-unicode=ucs4' '--with-dbmliborder=bdb:gdbm' '--with-system-expat' '--with-system-ffi' '--with-fpectl' 'CC=x86_64-linux-gnu-gcc' 'CFLAGS=-D_FORTIFY_SOURCE=2 -g -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security ' 'LDFLAGS=-Wl,-Bsymbolic-functions -Wl,-z,relro'
from spacy.
Right.
Yeah, I'm getting the same error with these flags:
'--enable-shared' '--prefix=/usr' '--enable-ipv6' '--enable-unicode=ucs4' '--with-dbmliborder=bdb:gdbm' '--with-system-expat' '--with-system-ffi' '--with-fpectl' 'CC=x86_64-linux-gnu-gcc' 'CFLAGS=-D_FORTIFY_SOURCE=2 -g -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security ' 'LDFLAGS=-Wl,-Bsymbolic-functions -Wl,-z,relro'
And it's working with these flags, via pyenv:
'--prefix=/home/pixie/.pyenv/versions/2.7.8' '--libdir=/home/pixie/.pyenv/versions/2.7.8/lib' 'LDFLAGS=-L/home/pixie/.pyenv/versions/2.7.8/lib ' 'CPPFLAGS=-I/home/pixie/.pyenv/versions/2.7.8/include '
I'll look into this, thanks for the report.
from spacy.
Okay, the issue is with the --enable-unicode=ucs4 compile flag.
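A quick way to tell which kind of build an interpreter is, without reading the configure flags, is to inspect `sys.maxunicode`. The sketch below is an illustration, not part of the original exchange; the helper name `unicode_build` is hypothetical. On Python 2, narrow builds report 0xFFFF and wide (`--enable-unicode=ucs4`) builds report 0x10FFFF; from Python 3.3 onward the value is always 0x10FFFF.

```python
import sys


def unicode_build(max_code_point=sys.maxunicode):
    """Classify a CPython build by its largest supported code point."""
    # Narrow builds cap out at 0xFFFF; wide builds go up to 0x10FFFF.
    if max_code_point > 0xFFFF:
        return "wide (UCS-4)"
    return "narrow (UCS-2)"


print(unicode_build())
```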
The lexemes.bin file currently contains the output of hashing unicode strings. Wide unicode builds produce different hash values for the same strings, so every string lookup fails.
The simple fix is to have the lexemes.bin file refer to indices into the strings.txt file. This way the hashes can be recomputed on the target platform. I should be able to roll this out fairly quickly.
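The idea of storing string indices instead of precomputed hashes can be sketched roughly as follows. This is not spaCy's actual loader: the function `build_lexeme_table`, the record layout, and the payload values are all assumptions made for illustration. The point is only that the hash is computed at load time, on whatever platform reads the data.

```python
def build_lexeme_table(strings, records, hash_fn=hash):
    """Key each record by a hash computed locally, at load time.

    strings: list of unicode strings (stand-in for strings.txt).
    records: (string_index, payload) pairs (stand-in for lexemes.bin).
    hash_fn: whatever hash the local platform uses.
    """
    table = {}
    for string_index, payload in records:
        text = strings[string_index]
        table[hash_fn(text)] = payload
    return table


# Hypothetical data: the payloads are illustrative numbers only.
strings = [u"the", u"not", u"Darwin"]
records = [(0, -3.08), (1, -5.41), (2, -12.80)]

table = build_lexeme_table(strings, records)
print(table[hash(u"Darwin")])
```

Because the file carries only indices, a platform with a different hash function rebuilds a table that is consistent with its own lookups.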
from spacy.
Now fixed.
from spacy.
Awesome, thanks!
from spacy.
Are words missing frequency data likely to be related to the same issue?
import spacy.en
nlp = spacy.en.English()
nlp.vocab[u'the'].prob #=> -3.078474521636963
nlp.vocab[u'not'].prob #=> -5.407193660736084
nlp.vocab[u'Darwin'].prob #=> -12.803827285766602
nlp.vocab[u'allele'].prob #=> 0.0
Is there something I can do about this?
from spacy.
Hi, sorry about the delay replying to this.
You're right that this part of the data is a bit broken. There were actually a few problems here:
- Out-of-vocabulary words get the probability 0. This is the most frustrating, as it's exactly wrong: the rest of the distribution is in log space, where 0 corresponds to a probability of 1, so these words were sorting higher than every word actually in the vocabulary.
- The counts were based on newswire text (Gigaword corpus) which isn't great for most users.
- There was a bug in my smoothing code.
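The first problem in the list above is easy to see with the values reported earlier in the thread. The sketch below is illustrative only: the floor value `OOV_LOG_PROB` and the helper `safe_log_prob` are assumptions, not how spaCy handles unseen words.

```python
import math

# Log probabilities as reported in the thread; 0.0 is the placeholder
# that out-of-vocabulary words received.
log_probs = {
    u"the": -3.078474521636963,
    u"not": -5.407193660736084,
    u"Darwin": -12.803827285766602,
    u"allele": 0.0,
}

# In log space 0.0 means probability exp(0.0) == 1.0, so the unseen
# word ranks above every word that was actually observed.
ranked = sorted(log_probs, key=log_probs.get, reverse=True)
print(ranked[0], math.exp(log_probs[ranked[0]]))

# Hypothetical repair: treat the placeholder as a very low log probability.
OOV_LOG_PROB = -20.0  # assumed floor, chosen only for this example


def safe_log_prob(word):
    value = log_probs.get(word, OOV_LOG_PROB)
    return OOV_LOG_PROB if value == 0.0 else value


ranked_fixed = sorted(log_probs, key=safe_log_prob, reverse=True)
print(ranked_fixed)
```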
The upcoming release uses counts derived from the Reddit comment corpus (60 billion tokens). I've also taken care to fix the other associated bugs.
I hope to have a release candidate out soon. I'll send it your way for testing when I do.
from spacy.
That's great, thanks; I'll look forward to the RC.
from spacy.
Okay, v0.89 is now up on PyPI: http://spacy.io/updates.html . Give it a go!
from spacy.
Looks great, thanks!
import spacy.en
nlp = spacy.en.English()
nlp.vocab[u'the'].prob #=> -3.528766632080078
nlp.vocab[u'not'].prob #=> -5.332601070404053
nlp.vocab[u'Darwin'].prob #=> -12.72145938873291
nlp.vocab[u'allele'].prob #=> -14.982345581054688
from spacy.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
from spacy.