Comments (13)

honnibal commented on April 28, 2024

Hmm.

I've never tested this without a virtualenv. I'd assumed it should load the same. I'll try to reproduce this now.

from spacy.

honnibal commented on April 28, 2024

Yeah, it's working when I run it with the pyenv version, but not with the system Python. I've checked and the data's the same, and it's definitely getting loaded.

My best guess is that the way Python and spaCy are compiled is making a difference. The code that loads the binary data is rather brittle; I read the whole struct at once.

Can you run:

python -c "import distutils.sysconfig; print(distutils.sysconfig.get_config_var('CONFIG_ARGS'))"

And tell me what comes back?

Ejhfast commented on April 28, 2024

Sure, thanks!

'--enable-shared' '--prefix=/usr' '--enable-ipv6' '--enable-unicode=ucs4' '--with-dbmliborder=bdb:gdbm' '--with-system-expat' '--with-system-ffi' '--with-fpectl' 'CC=x86_64-linux-gnu-gcc' 'CFLAGS=-D_FORTIFY_SOURCE=2 -g -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security ' 'LDFLAGS=-Wl,-Bsymbolic-functions -Wl,-z,relro'

honnibal commented on April 28, 2024

Right.

Yeah, I'm getting the same error with these flags:

'--enable-shared' '--prefix=/usr' '--enable-ipv6' '--enable-unicode=ucs4' '--with-dbmliborder=bdb:gdbm' '--with-system-expat' '--with-system-ffi' '--with-fpectl' 'CC=x86_64-linux-gnu-gcc' 'CFLAGS=-D_FORTIFY_SOURCE=2 -g -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security ' 'LDFLAGS=-Wl,-Bsymbolic-functions -Wl,-z,relro'

And it's working with these flags, via pyenv:

'--prefix=/home/pixie/.pyenv/versions/2.7.8' '--libdir=/home/pixie/.pyenv/versions/2.7.8/lib' 'LDFLAGS=-L/home/pixie/.pyenv/versions/2.7.8/lib ' 'CPPFLAGS=-I/home/pixie/.pyenv/versions/2.7.8/include '

I'll look into this, thanks for the report.

honnibal commented on April 28, 2024

Okay, the issue is with the --enable-unicode=ucs4 compile flag.

The lexemes.bin file currently contains the output of hashing unicode strings. On wide unicode builds the same strings hash to different values, so all string lookups are failing.
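
To illustrate why the build flag can matter, here is a minimal sketch. It assumes the hash is computed over the raw character buffer, which the thread does not state explicitly, and it uses zlib.crc32 purely as a stand-in hash function (the helper name buffer_hash is hypothetical). A narrow build stores 2 bytes per character and a wide (ucs4) build stores 4, so the same string presents different bytes to the hash.

```python
import sys
import zlib


def buffer_hash(text, encoding):
    """Hash the raw bytes of `text` in a fixed-width layout.

    crc32 is only a stand-in; the hash function spaCy used is not
    specified in this thread.
    """
    return zlib.crc32(text.encode(encoding)) & 0xFFFFFFFF


narrow_bytes = u"the".encode("utf-16-le")  # 2 bytes per character (narrow build layout)
wide_bytes = u"the".encode("utf-32-le")    # 4 bytes per character (wide build layout)

print(len(narrow_bytes), len(wide_bytes))  # 6 12
print(buffer_hash(u"the", "utf-16-le"), buffer_hash(u"the", "utf-32-le"))

# On Python 2 (and Python 3 before 3.3), this distinguishes wide from narrow builds.
# From Python 3.3 onward the distinction no longer exists and this is always True.
print(sys.maxunicode > 0xFFFF)
```

Because the two buffers differ, a hash stored by one build would generally not match the hash recomputed by the other.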

The simple fix is to have the lexemes.bin file refer to indices into the strings.txt file. This way the hashes can be recomputed on the target platform. I should be able to roll this out fairly quickly.
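
The idea can be sketched roughly as follows. This is not the actual spaCy loader: the function names, the in-memory lists standing in for strings.txt and lexemes.bin, and the payload values are all hypothetical. The point is only that the stored records carry a string index, and the hash is computed at load time on whatever platform is running.

```python
def load_lexicon(strings, records, hash_fn=hash):
    """Build a hash-keyed table on the target platform.

    strings: list of unicode strings (stand-in for strings.txt)
    records: list of (string_index, payload) pairs (stand-in for lexemes.bin)
    """
    table = {}
    for string_index, payload in records:
        # The hash is recomputed here, so it always matches this build.
        table[hash_fn(strings[string_index])] = payload
    return table


def lookup(table, word, hash_fn=hash):
    """Return the payload stored for `word`, or None if it is absent."""
    return table.get(hash_fn(word))


strings = [u"the", u"not", u"Darwin"]
records = [(0, "lexeme-0"), (1, "lexeme-1"), (2, "lexeme-2")]

table = load_lexicon(strings, records)
print(lookup(table, u"Darwin"))  # lexeme-2
```

Since no hash value is ever written to disk, the data file stays valid regardless of how the interpreter was compiled.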

honnibal commented on April 28, 2024

Now fixed.

Ejhfast commented on April 28, 2024

Awesome, thanks!

simonwiles commented on April 28, 2024

Are words missing frequency data likely to be related to the same issue?

import spacy.en
nlp = spacy.en.English()
nlp.vocab[u'the'].prob  #=> -3.078474521636963
nlp.vocab[u'not'].prob  #=> -5.407193660736084
nlp.vocab[u'Darwin'].prob  #=> -12.803827285766602
nlp.vocab[u'allele'].prob  #=> 0.0

Is there something I can do about this?

honnibal commented on April 28, 2024

Hi, sorry about the delay replying to this.

You're right that this part of the data is a bit broken. There were actually a few problems here:

  1. Out-of-vocabulary words got the probability 0. This is the most frustrating, as it's exactly wrong: the rest of the distribution is in log space, where 0 corresponds to a probability of 1, so these words were sorting higher than every in-vocabulary word.
  2. The counts were based on newswire text (the Gigaword corpus), which isn't great for most users.
  3. There was a bug in my smoothing code.
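
As a small illustration of the first point, using the values reported earlier in this thread: sorting by log probability puts the out-of-vocabulary word first. The "patched" step below is only one possible convention, not necessarily what the release does; treating 0.0 as a missing-value marker and the name oov_floor are assumptions made for this sketch.

```python
log_probs = {
    u"the": -3.078474521636963,
    u"not": -5.407193660736084,
    u"Darwin": -12.803827285766602,
    u"allele": 0.0,  # out-of-vocabulary word, stored as 0.0
}


def rank(probs):
    """Return words sorted from most to least probable."""
    return sorted(probs, key=probs.get, reverse=True)


print(rank(log_probs))  # ['allele', 'the', 'not', 'Darwin']

# Hypothetical repair: give unseen words a value below every observed one.
oov_floor = min(p for p in log_probs.values() if p < 0.0) - 1.0
patched = {w: (p if p < 0.0 else oov_floor) for w, p in log_probs.items()}

print(rank(patched))  # ['the', 'not', 'Darwin', 'allele']
```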

The upcoming release uses counts derived from the Reddit comment corpus (60 billion tokens). I've also taken care to fix the other associated bugs.

I hope to have a release candidate out soon. I'll send it your way for testing when I do.

simonwiles commented on April 28, 2024

That's great, thanks; I'll look forward to the RC.

honnibal commented on April 28, 2024

Okay, v0.89 is now up on PyPI: http://spacy.io/updates.html . Give it a go!

simonwiles commented on April 28, 2024

Looks great, thanks!

import spacy.en
nlp = spacy.en.English()
nlp.vocab[u'the'].prob  #=> -3.528766632080078
nlp.vocab[u'not'].prob  #=> -5.332601070404053
nlp.vocab[u'Darwin'].prob  #=> -12.72145938873291
nlp.vocab[u'allele'].prob  #=> -14.982345581054688

lock commented on April 28, 2024

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
