Comments (10)
The next round of changes makes this much easier, but there are a number of ways you could achieve this currently.
The key method is tokenizer.tokens_from_list, which lets you specify the tokenization. The problem with this is that the pipeline is inherently coupled, because of the statistical models. (NLP pipelines always are. "Swappable components" is a lie, unless you retrain everything.)
Changing things from how the models were trained degrades performance.
If you have the training data, you can tokenize however you like. spaCy can take raw text and compute a Levenshtein alignment to the tokenization in a treebank. Misalignments between its tokenizer and the gold-standard tokens are treated as ambiguous examples (multiple answers could be correct). See bin/parser/train.py for details.
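As a rough illustration of the alignment idea (not spaCy's actual implementation), here is a sketch using Python's difflib to line up two tokenizations of the same sentence; spans where they disagree are the "ambiguous" regions:

```python
from difflib import SequenceMatcher

def align_tokenizations(predicted, gold):
    """Align two tokenizations of the same text.

    Returns (matched, ambiguous): token pairs that line up exactly,
    and regions where the two tokenizations disagree.  This only
    sketches the idea of treating misaligned spans as ambiguous;
    spaCy's own alignment is more involved.
    """
    matcher = SequenceMatcher(a=predicted, b=gold, autojunk=False)
    matched, ambiguous = [], []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            matched.extend(zip(predicted[i1:i2], gold[j1:j2]))
        else:
            ambiguous.append((predicted[i1:i2], gold[j1:j2]))
    return matched, ambiguous

pred = ["I", "'m", "a", "test", "."]
gold = ["I'm", "a", "test", "."]
matched, ambiguous = align_tokenizations(pred, gold)
# "I" + "'m" vs. "I'm" ends up in the ambiguous list;
# the rest of the tokens pair off one-to-one.
```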
If you're just changing a few things, it probably won't hurt the model much. For now I'd try something like this:
def replace_tokenizer(nlp, my_split_function):
    old_tokenizer = nlp.tokenizer
    nlp.tokenizer = lambda string: old_tokenizer.tokens_from_list(my_split_function(string))
This wraps the tokenizer with a little function that steps in and provides your list of strings.
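Since running the real thing needs a spaCy install, here is a minimal sketch of the same wrapping pattern with stand-in classes (FakeTokenizer and FakeNLP are made up for illustration; the real tokens_from_list returns a Tokens object, not a plain list):

```python
class FakeTokenizer:
    """Stand-in for spaCy's tokenizer, just to show the pattern."""
    def tokens_from_list(self, strings):
        # The real method builds token objects from pre-split strings;
        # here we simply return the list.
        return strings

class FakeNLP:
    def __init__(self):
        self.tokenizer = FakeTokenizer()

def replace_tokenizer(nlp, my_split_function):
    # Capture the old tokenizer in a closure, then swap in a function
    # that hands a pre-split list of strings to tokens_from_list.
    old_tokenizer = nlp.tokenizer
    nlp.tokenizer = lambda string: old_tokenizer.tokens_from_list(
        my_split_function(string))

nlp = FakeNLP()
replace_tokenizer(nlp, str.split)
print(nlp.tokenizer("custom tokens here"))
```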
I agree that it'd be better to be able to customize the tokenizer in various ways. Currently the tokenizer needs some refactoring.
I'll get back to you if I think of a neat way to handle this.
Thanks.
I didn't want to open an entirely new issue for this, as it's somewhat related. The tokenizer currently doesn't tokenize "i'm" (lowercased) into "i" and "'m" as it does with "I'm" ("I" and "'m").
from spacy.en import English
nlp = English()
s = "I'm a test. i'm a test."
toks = nlp(s)
for t in toks:
    print(t.orth_, t.lower_)
I i
'm 'm
a a
test test
. .
i'm i'm <------------- inconsistent
a a
test test
. .
Edit: Actually this was an issue before (#26). It came back? (I'm using v0.88 with Python 3)
Edit2: And there's an issue with "im". The tokenization is "m" and then "'m"...
s = "im hungry"
toks = nlp(s)
for t in toks:
    print(t.orth_, t.lower_)
m m
'm 'm
hungry hungry
Yeah, unfortunately this was a regression. I've fixed it again, thanks. It came from copying data files around.
The data file is here: https://github.com/honnibal/spaCy/blob/master/lang_data/en/specials.json
When the tokenizer sees the key string (e.g. im) it splits it into the specified list. Each list item can specify an orthographic form, lemma and POS tag (keyed F, L and P). Orthographic forms should jointly match the original string (so together we should get "im" from the forms of the two tokens, "i" and "m".) The lemma can be more arbitrary. In general the statistical models use the lemma for features; I try to avoid using the orth as much as possible.
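As a sketch of the invariant described above (the tag and lemma values here are made up for illustration): the "F" (orthographic form) fields of a special-case analysis must jointly reproduce the key string, while the "L" (lemma) can differ.

```python
import json

def check_special(key, analysis):
    # The orth forms, concatenated in order, must equal the key string.
    return "".join(token["F"] for token in analysis) == key

# Illustrative entry in the specials.json shape: "im" splits into
# "i" (lemma "i") and "m" (lemma "be"); tag values are assumptions.
entry = {"im": [{"F": "i", "L": "i", "P": "PRP"},
                {"F": "m", "L": "be", "P": "VBP"}]}
print(json.dumps(entry, indent=2))

for key, analysis in entry.items():
    assert check_special(key, analysis)
```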
The data file doesn't support anything like a regular expression. This was by design --- my idea was that it's simpler and better to generate the file with an expression, and then use the plain list in the code. But I never got around to writing the part that ensures different capitalizations are automatically included in specials.json.
A good solution to this would be a welcome contribution.
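A minimal sketch of such a generator, assuming the F/L/P format above (the seed entries and the expand_variants helper are illustrative, not the real script):

```python
import json

def expand_variants(specials):
    """Add lowercase variants to hand-written special cases."""
    out = {}
    for key, analysis in specials.items():
        out[key] = analysis
        lower = key.lower()
        if lower != key and lower not in specials:
            # Lowercase the orth forms too, so they still
            # jointly match the new key string.
            out[lower] = [dict(tok, F=tok["F"].lower())
                          for tok in analysis]
    return out

# Illustrative seed entry; tag values are assumptions.
seed = {"I'm": [{"F": "I", "L": "i", "P": "PRP"},
                {"F": "'m", "L": "be", "P": "VBP"}]}
expanded = expand_variants(seed)
print(json.dumps(expanded, indent=2))
```

Apostrophe-free variants ("im", "id", ...) would need extra care, since the orth forms have to be re-split to keep matching the key.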
So to be clear, you eventually want to have a script that can be run that generates the "specials.json" file, and takes care of always including things like lowercase versions, versions without the apostrophe, etc? That sounds relatively easy (although I'm sure lots of special cases would have to be taken care of). If I get some free time I'll look into implementing that. Do you mind if it is written in Python? I have no experience with Cython.
And heads up: the tokens "i'm", "i'd", "i'll", "i've" still aren't split correctly using the most recent version of the data file you linked. If it is uppercase it works, but not lowercase.
Yeah the script should be in Python. Thanks for the note on those words. I'll add some tests, too.
The issue also seems to affect hyphenation (unless it's my wrapper code messing things up) so that a
A breath-taking view
turns into
A_DT breath_NN -_HYPH taking_VBG view_NN
while
A breathtaking view
turns into
A_DT breathtaking_JJ view_NN
So far, I really like spaCy, and the speed is impressive. Custom word tokenization would be really helpful for linguists. I usually use re.split(r"([^0-9A-Za-z-'_])", str) for that purpose, and it would be very cool if I could simply feed the resulting list of strings to spaCy for part-of-speech tagging.
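For what it's worth, that regex can be wrapped into a small splitter like this (simple_tokenize is just a name for illustration); the capturing group makes re.split keep the delimiters, so the punctuation survives as tokens:

```python
import re

def simple_tokenize(text):
    # Split on any char outside [0-9A-Za-z-'_]; the capturing group
    # keeps the delimiters, then we drop empty strings and whitespace.
    parts = re.split(r"([^0-9A-Za-z-'_])", text)
    return [p for p in parts if p and not p.isspace()]

print(simple_tokenize("A breath-taking view."))
# Hyphens and apostrophes stay inside tokens: "breath-taking", "i'm".
```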
Ah, that makes a lot of sense. Thank you.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.