Comments (10)
The next round of changes makes this much easier, but there are a number of ways you could achieve this currently.
The key method is tokenizer.tokens_from_list, which lets you specify the tokenization. The problem with this is that the pipeline is inherently coupled, because of the statistical models. (NLP pipelines always are. "Swappable components" is a lie, unless you retrain everything.)
Changing things from how the models were trained degrades performance.
If you have the training data, you can tokenize however you like. spaCy can take raw text and compute a Levenshtein alignment to the tokenization in a treebank. Misalignments between its tokenizer and the gold-standard tokens are treated as ambiguous examples (multiple answers could be correct). See bin/parser/train.py for details.
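As a rough illustration of the alignment idea (not spaCy's actual implementation), here is a sketch using Python's difflib to line up two tokenizations of the same sentence; spans where they disagree are the "ambiguous" regions:

```python
from difflib import SequenceMatcher

def align_tokenizations(predicted, gold):
    """Align two tokenizations of the same text.

    Returns (matched, ambiguous): token pairs that line up exactly,
    and regions where the two tokenizations disagree.  This only
    sketches the idea of treating misaligned spans as ambiguous;
    spaCy's own alignment is more involved.
    """
    matcher = SequenceMatcher(a=predicted, b=gold, autojunk=False)
    matched, ambiguous = [], []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            matched.extend(zip(predicted[i1:i2], gold[j1:j2]))
        else:
            ambiguous.append((predicted[i1:i2], gold[j1:j2]))
    return matched, ambiguous

pred = ["I", "'m", "a", "test", "."]
gold = ["I'm", "a", "test", "."]
matched, ambiguous = align_tokenizations(pred, gold)
# "I" + "'m" vs. "I'm" ends up in the ambiguous list;
# the rest of the tokens pair off one-to-one.
```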
If you're just changing a few things, it probably won't hurt the model much. For now I'd try something like this:
def replace_tokenizer(nlp, my_split_function):
    old_tokenizer = nlp.tokenizer
    nlp.tokenizer = lambda string: old_tokenizer.tokens_from_list(my_split_function(string))
This wraps the tokenizer with a little function that steps in and provides your list of strings.
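Since running the real thing needs a spaCy install, here is a minimal sketch of the same wrapping pattern with stand-in classes (FakeTokenizer and FakeNLP are made up for illustration; the real tokens_from_list returns a Tokens object, not a plain list):

```python
class FakeTokenizer:
    """Stand-in for spaCy's tokenizer, just to show the pattern."""
    def tokens_from_list(self, strings):
        # The real method builds token objects from pre-split strings;
        # here we simply return the list.
        return strings

class FakeNLP:
    def __init__(self):
        self.tokenizer = FakeTokenizer()

def replace_tokenizer(nlp, my_split_function):
    # Capture the old tokenizer in a closure, then swap in a function
    # that hands a pre-split list of strings to tokens_from_list.
    old_tokenizer = nlp.tokenizer
    nlp.tokenizer = lambda string: old_tokenizer.tokens_from_list(
        my_split_function(string))

nlp = FakeNLP()
replace_tokenizer(nlp, str.split)
print(nlp.tokenizer("custom tokens here"))
```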
I agree that it'd be better to be able to customize the tokenizer in various ways. Currently the tokenizer needs some refactoring.
I'll get back to you if I think of a neat way to handle this.
Thanks.
I didn't want to open an entirely new issue for this, as it's somewhat related. The tokenizer currently doesn't tokenize "i'm" (lowercased) into "i" and "'m" as it does with "I'm" ("I" and "'m").
from spacy.en import English
nlp = English()
s = "I'm a test. i'm a test."
toks = nlp(s)
for t in toks:
    print(t.orth_, t.lower_)
I i
'm 'm
a a
test test
. .
i'm i'm <------------- inconsistent
a a
test test
. .
Edit: Actually this was an issue before (#26). It came back? (I'm using v0.88 with Python 3)
Edit2: And there's an issue with "im". The tokenization is "m" and then "'m"...
s = "im hungry"
toks = nlp(s)
for t in toks:
    print(t.orth_, t.lower_)
m m
'm 'm
hungry hungry
Yeah, unfortunately this was a regression. I've fixed it again, thanks. It came from copying data files around.
The data file is here: https://github.com/honnibal/spaCy/blob/master/lang_data/en/specials.json
When the tokenizer sees the key string (e.g. im) it splits it into the specified list. Each list item can specify an orthographic form, lemma and POS tag (keyed F, L and P). Orthographic forms should jointly match the original string (so together we should get "im" from the forms of the two tokens, "i" and "m".) The lemma can be more arbitrary. In general the statistical models use the lemma for features; I try to avoid using the orth as much as possible.
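As a sketch of the invariant described above (the tag and lemma values here are made up for illustration): the "F" (orthographic form) fields of a special-case analysis must jointly reproduce the key string, while the "L" (lemma) can differ.

```python
import json

def check_special(key, analysis):
    # The orth forms, concatenated in order, must equal the key string.
    return "".join(token["F"] for token in analysis) == key

# Illustrative entry in the specials.json shape: "im" splits into
# "i" (lemma "i") and "m" (lemma "be"); tag values are assumptions.
entry = {"im": [{"F": "i", "L": "i", "P": "PRP"},
                {"F": "m", "L": "be", "P": "VBP"}]}
print(json.dumps(entry, indent=2))

for key, analysis in entry.items():
    assert check_special(key, analysis)
```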
The data file doesn't support anything like a regular expression. This was by design --- my idea was that it's simpler and better to generate the file with an expression, and then use the plain list in the code. But I never got around to writing the part that ensures different capitalizations are automatically included in specials.json.
A good solution to this would be a welcome contribution.
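A minimal sketch of such a generator, assuming the F/L/P format above (the seed entries and the expand_variants helper are illustrative, not the real script):

```python
import json

def expand_variants(specials):
    """Add lowercase variants to hand-written special cases."""
    out = {}
    for key, analysis in specials.items():
        out[key] = analysis
        lower = key.lower()
        if lower != key and lower not in specials:
            # Lowercase the orth forms too, so they still
            # jointly match the new key string.
            out[lower] = [dict(tok, F=tok["F"].lower())
                          for tok in analysis]
    return out

# Illustrative seed entry; tag values are assumptions.
seed = {"I'm": [{"F": "I", "L": "i", "P": "PRP"},
                {"F": "'m", "L": "be", "P": "VBP"}]}
expanded = expand_variants(seed)
print(json.dumps(expanded, indent=2))
```

Apostrophe-free variants ("im", "id", ...) would need extra care, since the orth forms have to be re-split to keep matching the key.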
So to be clear, you eventually want to have a script that can be run that generates the "specials.json" file, and takes care of always including things like lowercase versions, versions without the apostrophe, etc? That sounds relatively easy (although I'm sure lots of special cases would have to be taken care of). If I get some free time I'll look into implementing that. Do you mind if it is written in Python? I have no experience with Cython.
And heads up: the tokens "i'm", "i'd", "i'll", "i've" still aren't split correctly using the most recent version of the data file you linked. If it is uppercase it works, but not lowercase.
Yeah the script should be in Python. Thanks for the note on those words. I'll add some tests, too.
The issue also seems to affect hyphenation (unless it's my wrapper code messing things up) so that a
A breath-taking view
turns into
A_DT breath_NN -_HYPH taking_VBG view_NN
while
A breathtaking view
turns into
A_DT breathtaking_JJ view_NN
So far, I really like spaCy, and the speed is impressive. Custom word tokenization would be really helpful for linguists. I usually use re.split(r"([^0-9A-Za-z-'_])", str) for that purpose, and it would be very cool if I could simply feed the resulting list of strings to spaCy for part-of-speech tagging.
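For what it's worth, that regex can be wrapped into a small splitter like this (simple_tokenize is just a name for illustration); the capturing group makes re.split keep the delimiters, so the punctuation survives as tokens:

```python
import re

def simple_tokenize(text):
    # Split on any char outside [0-9A-Za-z-'_]; the capturing group
    # keeps the delimiters, then we drop empty strings and whitespace.
    parts = re.split(r"([^0-9A-Za-z-'_])", text)
    return [p for p in parts if p and not p.isspace()]

print(simple_tokenize("A breath-taking view."))
# Hyphens and apostrophes stay inside tokens: "breath-taking", "i'm".
```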
Ah, that makes a lot of sense. Thank you.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.