
Comments (10)

honnibal commented on April 28, 2024

The next round of changes makes this much easier, but there are a number of ways you could achieve this currently.

The key method is tokenizer.tokens_from_list, which lets you specify the tokenization. The problem with this is that the pipeline is inherently coupled, because of the statistical models. (NLP pipelines always are. "Swappable components" is a lie, unless you retrain everything.)

Changing things from how the models were trained degrades performance.

If you have the training data, you can tokenize however you like. spaCy is able to take raw text, and compute a Levenshtein alignment to the tokenization in a treebank. Misalignments between its tokenizer and the tokens in the gold-standard are treated as ambiguous examples (multiple answers could be correct). See bin/parser/train.py for details of this.

If you're just changing a few things, it probably won't hurt the model much. For now I'd try something like this:

def replace_tokenizer(nlp, my_split_function):
    # Keep a reference to the original tokenizer, then swap in a wrapper
    # that builds the Doc from your own list of token strings.
    old_tokenizer = nlp.tokenizer
    nlp.tokenizer = lambda string: old_tokenizer.tokens_from_list(my_split_function(string))

This wraps the tokenizer with a little function that steps in and provides your list of strings.
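For example, usage might look roughly like this (a sketch against the spacy.en API used elsewhere in this thread; my_split_function is just a placeholder whitespace splitter):

from spacy.en import English

def my_split_function(string):
    # Placeholder: a naive whitespace split standing in for your own rules.
    return string.split()

nlp = English()
replace_tokenizer(nlp, my_split_function)
doc = nlp("A breath-taking view")  # now tokenized by my_split_function
print([t.orth_ for t in doc])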


honnibal commented on April 28, 2024

I agree that it'd be better to be able to customize the tokenizer in various ways. Currently the tokenizer needs some refactoring.

I'll get back to you if I think of a neat way to handle this.


lqdc commented on April 28, 2024

Thanks.


NSchrading commented on April 28, 2024

I didn't want to open an entirely new issue for this, as it's somewhat related. The tokenizer currently doesn't tokenize "i'm" (lowercased) into "i" and "'m" as it does with "I'm" ("I" and "'m").

from spacy.en import English
nlp = English()
s = "I'm a test. i'm a test."
toks = nlp(s)
for t in toks:
    print(t.orth_, t.lower_)

Output:

I i
'm 'm
a a
test test
. .
i'm i'm           <------------- inconsistent
a a
test test
. .

Edit: Actually this was an issue before (#26). It came back? (I'm using v0.88 with Python 3)
Edit2: And there's an issue with "im". The tokenization is "m" and then "'m"...

s = "im hungry"
toks = nlp(s)
for t in toks:
    print(t.orth_, t.lower_)

Output:

m m
'm 'm
hungry hungry


honnibal commented on April 28, 2024

Yeah, unfortunately this was a regression. I've fixed it again, thanks. It came from copying data files around.

The data file is here: https://github.com/honnibal/spaCy/blob/master/lang_data/en/specials.json

When the tokenizer sees the key string (e.g. im) it splits it into the specified list. Each list item can specify an orthographic form, lemma and POS tag (keyed F, L and P). Orthographic forms should jointly match the original string (so together we should get "im" from the forms of the two tokens, "i" and "m".) The lemma can be more arbitrary. In general the statistical models use the lemma for features; I try to avoid using the orth as much as possible.

The data file doesn't support anything like a regular expression. This was by design: my idea was that it's actually simpler and better to generate the file with an expression offline, and then use the plain list in the code. But I never ended up writing the piece that would ensure different capitalizations were automatically included in specials.json.

A good solution to this would be a welcome contribution.
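A rough sketch of what such a generator might look like, under the F/L/P scheme described above (the example entries, lemmas and tags here are purely illustrative, not the real contents of specials.json):

import json

# Hand-written base cases: each surface string maps to a list of tokens.
# F is the orthographic form, L the lemma, P the POS tag (all illustrative).
BASE_SPECIALS = {
    "I'm": [{"F": "I"}, {"F": "'m", "L": "be", "P": "VBP"}],
    "don't": [{"F": "do", "L": "do"}, {"F": "n't", "L": "not", "P": "RB"}],
}

def expand_specials(base):
    expanded = {}
    for key, tokens in base.items():
        expanded[key] = tokens
        # Also emit a lowercased variant, so "i'm" splits the same way as "I'm".
        lower_key = key.lower()
        if lower_key != key and lower_key not in base:
            lower_tokens = []
            for tok in tokens:
                lower_tok = dict(tok)
                # Lowercase only the orthographic form, so the forms still
                # jointly match the lowercased original string.
                lower_tok["F"] = tok["F"].lower()
                lower_tokens.append(lower_tok)
            expanded[lower_key] = lower_tokens
    return expanded

with open("specials.json", "w") as f:
    json.dump(expand_specials(BASE_SPECIALS), f, indent=2)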


NSchrading commented on April 28, 2024

So to be clear, you eventually want a script that can be run to generate the "specials.json" file, and that takes care of always including things like lowercase versions, versions without the apostrophe, etc.? That sounds relatively easy (although I'm sure lots of special cases would have to be handled). If I get some free time I'll look into implementing that. Do you mind if it is written in Python? I have no experience with Cython.

And heads up: the tokens "i'm", "i'd", "i'll", "i've" still aren't split correctly using the most recent version of the data file you linked. If it is uppercase it works, but not lowercase.


honnibal commented on April 28, 2024

Yeah the script should be in Python. Thanks for the note on those words. I'll add some tests, too.


magnusnissel commented on April 28, 2024

The issue also seems to affect hyphenation (unless it's my wrapper code messing things up), so that

A breath-taking view
turns into
A_DT breath_NN -_HYPH taking_VBG view_NN

while

A breathtaking view
turns into
A_DT breathtaking_JJ view_NN

So far, I really like spaCy and the speed is impressive. Custom word tokenization would be really helpful for linguists. I usually use re.split(r"([^0-9A-Za-z-'_])", str) for that purpose, and it would be very cool if I could simply feed the resulting list of strings to spaCy for part-of-speech tagging.
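For instance, something along these lines might work with the tokens_from_list approach mentioned above (a sketch only; the token attribute names may differ slightly between versions):

import re
from spacy.en import English

def my_split_function(string):
    # Split on anything that isn't alphanumeric, hyphen, apostrophe or
    # underscore, keep the delimiters, and drop whitespace-only pieces.
    parts = re.split(r"([^0-9A-Za-z-'_])", string)
    return [p for p in parts if p.strip()]

nlp = English()
old_tokenizer = nlp.tokenizer
nlp.tokenizer = lambda s: old_tokenizer.tokens_from_list(my_split_function(s))

doc = nlp("A breath-taking view")
# tag_ is the fine-grained part-of-speech tag; older builds may expose pos_ instead.
print([(t.orth_, t.tag_) for t in doc])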


magnusnissel commented on April 28, 2024

Ah, that makes a lot of sense. Thank you.


lock commented on April 28, 2024

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

