Comments (23)
Hey,
It's still about 3-4% less accurate than Stanford and MITIE on OntoNotes
and CoNLL '03. It's not an exact replication of anything, but I guess you
could say it's using something like SEARN.
The plan is to switch to using DBPedia as a gazetteer, for entity linking.
On Wednesday, April 29, 2015, Andrew Nystrom wrote:
I see spaCy has an NER now. Very nice. I'm curious about how it compares
to other NER systems. Have you benchmarked it on a standard dataset? What
algorithm are you using? How does it compare to MITIE and Stanford?
from spacy.
In more detail:
- Uses the same code as the shift/reduce parser, with a transition system based on the BILOU tagging scheme. Instead of coding the problem as a tagging task, we maintain a stack, with the transitions B, I, L, U and O, and constrain the actions so that {I, L} are only valid while an entity is on the stack.
- At the moment, no syntactic features are being used. But this scheme will make it easy to do joint parsing and NER.
- Ships model trained on OntoNotes 5
- On OntoNotes WSJ, scores ~82% evaluating all entity types; Stanford reportedly gets 85%
- On CoNLL '03, scores ~86%; Stanford reportedly gets 90% there, and MITIE is reporting 88%.
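As a rough illustration of the transition constraints described in the list above, here is a minimal sketch. It is not spaCy's actual code: `valid_actions` and `apply_moves` are hypothetical names, and the extra rules (B, U and O disallowed while an entity is open, and an open entity forced to close on the last token) are assumptions that follow from BILOU rather than anything stated in the thread.

```python
from typing import List, Optional, Tuple

Move = Tuple[str, Optional[str]]  # (action, entity label or None)
Span = Tuple[int, int, str]       # (start, end exclusive, label)


def valid_actions(entity_open: bool, is_last_token: bool) -> Tuple[str, ...]:
    """Return the BILOU actions allowed in the current state."""
    if entity_open:
        # I and L are only valid while an entity is on the stack.
        # Assumption: on the last token the entity has to be closed.
        return ("L",) if is_last_token else ("I", "L")
    # Assumption: B is not allowed on the last token, since no L could follow.
    return ("U", "O") if is_last_token else ("B", "U", "O")


def apply_moves(n_tokens: int, moves: List[Move]) -> List[Span]:
    """Apply one move per token and collect the entity spans they define."""
    if len(moves) != n_tokens:
        raise ValueError("expected exactly one move per token")
    stack: List[Tuple[int, str]] = []  # (start index, label) of the open entity
    spans: List[Span] = []
    for i, (action, label) in enumerate(moves):
        allowed = valid_actions(bool(stack), i == n_tokens - 1)
        if action not in allowed:
            raise ValueError(f"{action} is not valid at token {i}; allowed: {allowed}")
        if action == "B":
            stack.append((i, label or ""))
        elif action == "L":
            start, open_label = stack.pop()
            spans.append((start, i + 1, open_label))
        elif action == "U":
            spans.append((i, i + 1, label or ""))
        # I and O leave the stack and the span list unchanged.
    return spans


# Tokens: Matthew / Honnibal / visited / London
moves = [("B", "PERSON"), ("L", None), ("O", None), ("U", "GPE")]
spans = apply_moves(4, moves)
```

In a real parser the model would score the actions and the constraint would mask out the invalid ones; here the constraint is only used to reject an invalid sequence.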
I believe that the addition of syntactic features and gazetteer will raise performance to around the state-of-the-art on these evaluations. However, I don't really buy these data sets as good benchmarks.
I believe these evaluations underestimate the importance of gazetteers for real-world performance. My plan is to hash the DBPedia entities and hold them in memory. By designing the data structures carefully, and doing a bit of pruning, I think I can do entity linking entirely in-memory. Current systems rely on DB queries, which impose a lot of extra complexity.
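To make the "hash the entities and hold them in memory" idea concrete, here is a small sketch under stated assumptions. The class `HashedGazetteer`, the helper `hash64`, the choice of BLAKE2b as the hash function and the example entity ID are all illustrative, not anything the thread specifies.

```python
import hashlib
from typing import Dict, Optional


def hash64(text: str) -> int:
    """Stable 64-bit hash of a string (Python's built-in hash() is salted per process)."""
    digest = hashlib.blake2b(text.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


class HashedGazetteer:
    """Hypothetical in-memory table from hashed alias to entity ID."""

    def __init__(self) -> None:
        self._table: Dict[int, str] = {}

    def add(self, alias: str, entity_id: str) -> None:
        # Only the 64-bit key is stored, not the alias string itself.
        self._table[hash64(alias)] = entity_id

    def lookup(self, mention: str) -> Optional[str]:
        return self._table.get(hash64(mention))


gazetteer = HashedGazetteer()
gazetteer.add("Amazon", "Amazon.com")  # made-up entity ID for illustration
linked = gazetteer.lookup("Amazon")
missing = gazetteer.lookup("Nile")
```

A lookup is a single dictionary access, so no database round trip is involved. A production version would presumably use a more compact structure than a Python `dict`.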
from spacy.
Thank you very much for the fantastic response. I've found MITIE often does some unexpected things, like usually tagging Amazon as a location, or McDonald's as a person. Would the gazetteer help with this?
from spacy.
Yes, definitely. These are good examples of the problem, thanks! "Amazon"
occurs a handful of times in the CoNLL data, always as a reference to the
Amazon rainforest. McDonald's doesn't occur at all. If the system is only
trained on this data, it's going to get these really easy cases wrong.
The current model I'm shipping in spaCy will make this kind of mistake.
Clever use of the DBPedia data should correct this. I want to be able to
match whole entity spans, and I want to include some sort of prior about
how "prominent" the entity is, e.g. by using Wikipedia page view stats, or
link counts, or something.
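One way to picture whole-span matching with a prominence prior is the sketch below. Everything in it is an assumption made for illustration: the candidate table, the entity IDs, the counts (invented toy numbers, not real page-view or link statistics), and the greedy longest-match strategy.

```python
from typing import Dict, List, Tuple

Candidate = Tuple[str, str, int]          # (entity ID, label, prominence count)
Match = Tuple[int, int, str, str, float]  # (start, end, entity ID, label, prior)

# Toy table keyed by lower-cased token tuples. The counts are made up.
TABLE: Dict[Tuple[str, ...], List[Candidate]] = {
    ("amazon",): [("Amazon_(company)", "ORG", 3), ("Amazon_rainforest", "LOC", 1)],
    ("new", "york"): [("New_York_City", "GPE", 1)],
}


def best_candidate(candidates: List[Candidate]) -> Tuple[str, str, float]:
    """Pick the most prominent candidate and return its share of the total count."""
    total = sum(count for _, _, count in candidates)
    entity_id, label, count = max(candidates, key=lambda c: c[2])
    return entity_id, label, count / total


def match_spans(tokens: List[str], table: Dict[Tuple[str, ...], List[Candidate]],
                max_len: int = 3) -> List[Match]:
    """Greedy left-to-right matching that prefers the longest span at each position."""
    matches: List[Match] = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(token.lower() for token in tokens[i:i + n])
            if key in table:
                entity_id, label, prior = best_candidate(table[key])
                matches.append((i, i + n, entity_id, label, prior))
                i += n
                break
        else:
            i += 1  # no span starts at this token
    return matches


matches = match_spans("I ordered it from Amazon in New York".split(), TABLE)
```

Here the prior simply picks a winner. Another option would be to pass it to the statistical model as a feature, so that context can still override it.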
from spacy.
Sounds awesome. Are you thinking Bloom filters and count-min sketches?
from spacy.
The DBPedia gazetteer will be a great addition.
from spacy.
I'm thinking that won't be necessary.
DBPedia has about 4 million entities. We can probably prune away 25-50% of them, as there's probably a long tail, and we probably want to store 2-3 aliases per entity.
So, we want to store around 12 million 64-bit hashes, which is only about 100 MB (12 million × 8 bytes = 96 MB)! No need to do anything special.
We'll also want a per-word gazetteer, that asks "Does this word begin an entity of category X?". But this is a boolean value, and there's still plenty of room in the lexicon's bit vector --- I think I have about 30 values free. So, this won't take any extra memory at all.
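The arithmetic and the per-word flag can be sketched as follows. The flag names, the dictionary-based lexicon and the helper functions are hypothetical. spaCy's real lexicon stores its flags differently, and the byte count covers only the raw hashes, not any container overhead.

```python
from enum import IntFlag
from typing import Dict


def raw_hash_bytes(n_entities: int, aliases_per_entity: int, bytes_per_hash: int = 8) -> int:
    """Raw size of the hashed alias table, ignoring container overhead."""
    return n_entities * aliases_per_entity * bytes_per_hash


# Upper bound from the comment: 4 million entities, 3 aliases each, 64-bit hashes.
upper_bound_mb = raw_hash_bytes(4_000_000, 3) / 1_000_000


class GazFlag(IntFlag):
    """Hypothetical 'this word begins an entity of category X' flags."""
    BEGINS_PERSON = 1 << 0
    BEGINS_ORG = 1 << 1
    BEGINS_LOC = 1 << 2


# Toy lexicon: word -> integer bit vector. Each flag costs one bit per word.
lexicon_flags: Dict[str, int] = {}


def set_flag(word: str, flag: GazFlag) -> None:
    lexicon_flags[word] = lexicon_flags.get(word, 0) | int(flag)


def check_flag(word: str, flag: GazFlag) -> bool:
    return bool(lexicon_flags.get(word, 0) & int(flag))


set_flag("Amazon", GazFlag.BEGINS_ORG)
set_flag("Amazon", GazFlag.BEGINS_LOC)
```

Because every flag is a single bit in an integer that each word already carries, adding a new category does not grow the per-word storage as long as free bits remain.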
from spacy.
Do you think a gazetteer is better than just manually adding modern training data to the mix?
from spacy.
Hi. I'm seeing quite a few situations where the NER will not tag properly when an entire sentence or phrase does not have any capital letters. Admittedly, this is a problem across the board with most state-of-the-art taggers last I checked. However, Alan Ritter did some work on this topic about a year ago:
https://github.com/aritter/twitter_nlp
https://aritter.github.io/twitter_ner.pdf
tl;dr He created a tagger that performed reasonably well on noisy Twitter text.
Do you have any plans to add support for robust NER in noisy text? Alternatively, do you plan to add the ability to slot in other NER modules when necessary?
from spacy.
I am also seeing situations where a sentence gets wrongly tagged (and parsed) because of wrong capitalization and would also be interested in this use case (Twitter, noisy text).
from spacy.
Actually, I wrote a fairly naive NER for Tweets a few months back, and wouldn't mind rewriting and updating it for this project. Of course, this assumes you are A.) looking for contributors, and B.) willing to wait a month or so while I finish off a couple other projects.
from spacy.
Hi,
I should be rolling out a new model with more robust training within a
week. The new model still lacks a gazetteer, but at least it's trained on
better data. A gazetteer is the next step after that.
from spacy.
Just pushed version 0.85. The NER should be a bit more robust, although it's still not great.
I'm working on various fixes. One idea is to add corruption to the training data, e.g. by swapping casing. I've noticed this is an effective trick in ASR and OCR, and thought it'd be good to put it in an NLP model. Initial results are promising.
Still working on the gazetteer.
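As an illustration of the casing-corruption idea, here is a minimal sketch. The function name, the three transforms and the probability `p` are assumptions; the thread does not say how the corruption was actually implemented.

```python
import random
from typing import List


def corrupt_casing(tokens: List[str], rng: random.Random, p: float = 0.3) -> List[str]:
    """With probability p, rewrite the casing of every token in the sentence.

    The number of tokens never changes, so per-token entity labels stay aligned.
    """
    if rng.random() >= p:
        return list(tokens)
    transform = rng.choice([str.lower, str.upper, str.capitalize])
    return [transform(token) for token in tokens]


rng = random.Random(0)  # fixed seed so the augmentation is reproducible
sentence = ["Amazon", "opened", "an", "office", "in", "London"]
augmented = corrupt_casing(sentence, rng, p=1.0)
```

During training, each sentence could be passed through such a function before features are extracted, so the model sees lower-cased and upper-cased variants of the same gold entities.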
from spacy.
wrt the gazetteer, I think it's great that it's based on DBpedia, and that's a feature we're really looking forward to, as we already use this dataset. However, will it be possible to easily extend the gazetteer with our own lists? For example, we would like to link to restaurant names.
from spacy.
Yes, definitely.
I want to have a black / grey / white list system, where the grey list is
used as a feature, and the black and white lists are deterministic.
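A possible reading of the black / grey / white list idea is sketched below. The class, its methods and the example phrases are hypothetical, and the interpretation (white list forces a label, black list forbids one, grey list only contributes a feature) is an assumption based on the one-line description above.

```python
from typing import Dict, Optional, Set


class ListGazetteer:
    """Hypothetical three-list gazetteer that users can extend with their own entries."""

    def __init__(self) -> None:
        self.white: Dict[str, str] = {}  # phrase -> label, always applied
        self.black: Set[str] = set()     # phrases that are never tagged
        self.grey: Dict[str, str] = {}   # phrase -> label, used only as a feature

    def features(self, phrase: str) -> Dict[str, bool]:
        """Features the statistical model may use or ignore."""
        label = self.grey.get(phrase)
        return {f"grey_list={label}": True} if label else {}

    def decide(self, phrase: str, model_label: Optional[str]) -> Optional[str]:
        """Deterministic overrides applied on top of the model's prediction."""
        if phrase in self.black:
            return None
        if phrase in self.white:
            return self.white[phrase]
        return model_label


lists = ListGazetteer()
lists.white["Chez Panisse"] = "ORG"  # e.g. a user-supplied restaurant name
lists.black.add("Monday")
lists.grey["Amazon"] = "ORG"
```

With this split, a custom list of restaurant names could go on the white list for guaranteed tagging, or on the grey list if the model should still be allowed to disagree.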
from spacy.
This sounds fantastic. Can't wait to see how it performs.
from spacy.
It would also be nice if you could provide a case-insensitive model. The current model is basically useless for social media data, such as tweets, where people often write in all lower case.
from spacy.
@honnibal Really excited about the planned addition of DBPedia. Have you made any progress towards that?
from spacy.
@ma2rten I'm looking forward to the lowercase NER feature too.
from spacy.
@honnibal is the lower case NER in progress?
If not, would you mind giving some instructions on how to train that?
Thanks. :)
from spacy.
@honnibal Any progress on the lowercase NER?
from spacy.
@honnibal And the gazetteer?
from spacy.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
from spacy.