Comments (23)

honnibal avatar honnibal commented on April 28, 2024

Hey,

It's still about 3-4% less accurate than Stanford and MITIE on OntoNotes
and CoNLL '03. It's not an exact replication of anything, but I guess you
could say it's using something like SEARN.

The plan is to switch to using DBPedia as a gazetteer, for entity linking.

On Wednesday, April 29, 2015, Andrew Nystrom wrote:

I see spaCy has an NER now. Very nice. I'm curious about how it compares
to other NER systems. Have you benchmarked it on a standard dataset? What
algorithm are you using? How does it compare to MITIE and Stanford?

from spacy.

honnibal avatar honnibal commented on April 28, 2024

In more detail:

  • Uses same code as the shift/reduce parser, with a transition system based on the BILOU tagging scheme. Instead of coding the problem as a tagging task, we maintain a stack, with the transitions B, I, L, U and O, and constrain the actions so that {I, L} are only valid while an entity is on the stack.
  • At the moment, no syntactic features are being used. But this scheme will make it easy to do joint parsing and NER.
  • Ships model trained on OntoNotes 5
  • On OntoNotes WSJ, scores ~82% evaluating all entity types; reportedly Stanford gets 85%
  • On CoNLL '03, scores ~86%; reportedly Stanford gets 90% there. MITIE is reporting 88%.
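
To make the transition system in the first bullet concrete, here is a minimal sketch in plain Python. It is not spaCy's code: the function names are hypothetical, entity labels are left out, and the rule that B, U and O are disallowed while an entity is open is an assumption that follows the usual BILOU convention (the thread only states the constraint on I and L).

```python
# Hypothetical sketch of a BILOU transition system; not spaCy's internals.

def valid_actions(entity_open):
    """Return the moves allowed given whether an entity is on the stack."""
    if entity_open:
        # An open entity can only be continued (I) or closed (L).
        return {"I", "L"}
    # Assumption: with no open entity we may begin one (B), emit a
    # single-token entity (U), or mark the token as outside (O).
    return {"B", "U", "O"}


def apply_actions(tokens, actions):
    """Replay a sequence of actions and collect (start, end) entity spans."""
    spans = []
    stack = []  # holds the start index of the open entity, if any
    for i, (token, action) in enumerate(zip(tokens, actions)):
        if action not in valid_actions(bool(stack)):
            raise ValueError(f"{action} is not valid at token {i} ({token!r})")
        if action == "B":
            stack.append(i)
        elif action == "L":
            spans.append((stack.pop(), i + 1))
        elif action == "U":
            spans.append((i, i + 1))
        # "I" and "O" leave the stack unchanged.
    if stack:
        raise ValueError("entity left open at end of input")
    return spans
```

In a real parser a trained model would score the valid actions at each step; here the actions are supplied by hand, e.g. `apply_actions(["New", "York", "City", "is", "big"], ["B", "I", "L", "O", "O"])`.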

I believe that the addition of syntactic features and gazetteer will raise performance to around the state-of-the-art on these evaluations. However, I don't really buy these data sets as good benchmarks.

I believe these evaluations under-estimate the importance of gazetteers for real-world performance. My plan is to hash the DBPedia entities and hold them in memory. By designing the data structures carefully, and doing a bit of pruning, I think I can do entity linking entirely in-memory. Current systems rely on DB queries, which impose a lot of extra complexity.

from spacy.

AWNystrom avatar AWNystrom commented on April 28, 2024

Thank you very much for the fantastic response. I've found MITIE often does some unexpected things, like usually tagging Amazon as a location, or McDonald's as a person. Would the gazetteer help with this?

from spacy.

honnibal avatar honnibal commented on April 28, 2024

Yes, definitely. These are good examples of the problem, thanks! "Amazon"
occurs a handful of times in the CoNLL data, always as a reference to the
Amazon rainforest. McDonald's doesn't occur at all. If the system is only
trained on this data, it's going to get these really easy cases wrong.

The current model I'm shipping in spaCy will make this kind of mistake.
Clever use of the DBPedia data should correct this. I want to be able to
match whole entity spans, and I want to include some sort of prior about
how "prominent" the entity is, e.g. by using Wikipedia page view stats, or
link counts, or something.
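
A rough sketch of what matching whole spans with a prominence prior could look like. Everything here is an assumption for illustration: the gazetteer entries and scores are made up, `match_spans` is a hypothetical helper, and greedy longest-match is just one simple lookup strategy, not necessarily the one planned for spaCy.

```python
# Toy gazetteer: every entry and score below is made up for illustration.
# Keys are lower-cased token tuples; values are (label, prior) candidates,
# where the prior stands in for page views or link counts.
GAZETTEER = {
    ("amazon",): [("ORG", 0.8), ("LOC", 0.2)],
    ("amazon", "rainforest"): [("LOC", 1.0)],
    ("mcdonald's",): [("ORG", 1.0)],
}
MAX_LEN = max(len(key) for key in GAZETTEER)


def match_spans(tokens):
    """Greedy longest-match lookup; returns (start, end, label, prior)."""
    words = [t.lower() for t in tokens]
    matches = []
    i = 0
    while i < len(words):
        # Try the longest candidate span first, then shrink.
        for n in range(min(MAX_LEN, len(words) - i), 0, -1):
            candidates = GAZETTEER.get(tuple(words[i:i + n]))
            if candidates:
                # Pick the most "prominent" reading of this span.
                label, prior = max(candidates, key=lambda c: c[1])
                matches.append((i, i + n, label, prior))
                i += n
                break
        else:
            i += 1
    return matches
```

With these toy values, a bare "Amazon" resolves to the organisation while "Amazon rainforest" matches as a single location span. In practice the match would more likely feed the statistical model as a feature than override it.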

from spacy.

AWNystrom avatar AWNystrom commented on April 28, 2024

Sounds awesome. Are you thinking bloom filters and count min sketches?

from spacy.

elyase avatar elyase commented on April 28, 2024

The DBPedia gazetteer will be a great addition.

from spacy.

honnibal avatar honnibal commented on April 28, 2024

I'm thinking that won't be necessary.

DBPedia has about 4 million entities. We can probably prune away 25-50% of them, as there's probably a long tail, and we probably want to store 2-3 aliases per entity.

So, we want to store at most 12 million 64-bit hashes, which is only about 100 MB! No need to do anything special.
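
The arithmetic can be checked with a short sketch. The 12 million figure appears to correspond to 4 million entities times 3 aliases, before pruning. The storage layout below (a sorted array of unsigned 64-bit integers searched with bisection) and the choice of BLAKE2b as the hash are assumptions for illustration, not a description of spaCy's data structures.

```python
import hashlib
from array import array
from bisect import bisect_left


def hash64(text):
    """Stable 64-bit hash of a string (BLAKE2b truncated to 8 bytes)."""
    digest = hashlib.blake2b(text.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


def build_table(aliases):
    """Pack alias hashes into a sorted array of unsigned 64-bit integers."""
    return array("Q", sorted({hash64(a) for a in aliases}))


def contains(table, text):
    """Binary-search the sorted hash table for the alias."""
    key = hash64(text)
    pos = bisect_left(table, key)
    return pos < len(table) and table[pos] == key


# Upper bound from the thread: 4 million entities, 3 aliases each.
n_hashes = 4_000_000 * 3
size_mb = n_hashes * 8 / 1e6  # 8 bytes per 64-bit hash

table = build_table(["Amazon", "Amazon.com", "McDonald's"])
```

Here `size_mb` comes out at 96 MB, in line with the "only 100 MB" estimate. A plain sorted array only answers membership queries; mapping a hash to an entity ID would need a parallel array or a hash map, at some extra cost.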

We'll also want a per-word gazetteer, that asks "Does this word begin an entity of category X?". But this is a boolean value, and there's still plenty of room in the lexicon's bit vector --- I think I have about 30 values free. So, this won't take any extra memory at all.
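
As an illustration of the per-word flags, here is a minimal sketch that packs "begins an entity of category X" booleans into spare bits of an integer. The flag names, bit positions, and the dictionary standing in for the lexicon are all hypothetical.

```python
from enum import IntFlag


class GazFlag(IntFlag):
    """Hypothetical flag bits; real bit positions would depend on the lexicon."""
    BEGINS_PERSON = 1 << 0
    BEGINS_ORG = 1 << 1
    BEGINS_LOC = 1 << 2


# Stand-in for the lexicon: word -> integer bit vector.
lexicon_flags = {}


def set_flag(word, flag):
    """Turn on one flag bit for a word, leaving the other bits untouched."""
    lexicon_flags[word] = lexicon_flags.get(word, 0) | flag


def begins_entity(word, flag):
    """Answer 'does this word begin an entity of this category?'."""
    return bool(lexicon_flags.get(word, 0) & flag)


set_flag("Amazon", GazFlag.BEGINS_ORG)
set_flag("Amazon", GazFlag.BEGINS_LOC)
```

Because each category costs a single bit in a field that already exists per word, adding these flags does not grow the per-word storage.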

from spacy.

AWNystrom avatar AWNystrom commented on April 28, 2024

Do you think a gazetteer is better than just manually adding modern training data to the mix?

from spacy.

derekduoba avatar derekduoba commented on April 28, 2024

Hi. I'm seeing quite a few situations where the NER will not tag properly when an entire sentence or phrase does not have any capital letters. Admittedly, this is a problem across the board with most state-of-the-art taggers last I checked. However, Alan Ritter did some work on this topic about a year ago:

https://github.com/aritter/twitter_nlp
https://aritter.github.io/twitter_ner.pdf

tl;dr He created a tagger that performed reasonably well on noisy Twitter text.

Do you have any plans to add support for robust NER in noisy text? Alternatively, do you plan to add the ability to slot-in other NER modules when necessary?

from spacy.

elyase avatar elyase commented on April 28, 2024

I am also seeing situations where a sentence gets wrongly tagged (and parsed) because of wrong capitalization and would also be interested in this use case (twitter, noisy text).

from spacy.

derekduoba avatar derekduoba commented on April 28, 2024

Actually, I wrote a fairly naive NER for Tweets a few months back, and wouldn't mind rewriting and updating it for this project. Of course, this assumes you are A.) looking for contributors, and B.) willing to wait a month or so while I finish off a couple other projects.

from spacy.

honnibal avatar honnibal commented on April 28, 2024

Hi,

I should be rolling out a new model with more robust training within a
week. The new model still lacks a gazetteer, but at least it's trained on
better data. A gazetteer is the next step after that.

from spacy.

honnibal avatar honnibal commented on April 28, 2024

Just pushed version 0.85. The NER should be a bit more robust, although it's still not great.

I'm working on various fixes. One idea is to add corruption to the training data, e.g. swap casing etc. I've always noticed this was an effective trick they use in ASR and OCR, and thought it'd be good to put it in an NLP model. Initial results are promising.
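
A minimal sketch of the casing-corruption idea, assuming the training data is a list of tokens with one label per token. The function name, the corruption rate, and the set of transforms are assumptions, not what spaCy actually does.

```python
import random


def corrupt_casing(tokens, rate=0.2, rng=None):
    """With probability `rate`, re-case every token in the sentence.

    The number of tokens never changes, so the per-token entity labels
    of the original sentence can be reused as they are.
    """
    rng = rng or random.Random()
    if rng.random() >= rate:
        return list(tokens)
    # Pick one transform per sentence: all lower, all upper, or capitalized.
    transform = rng.choice([str.lower, str.upper, str.capitalize])
    return [transform(t) for t in tokens]
```

Applied on each training pass, the model sees the same entities both with and without their usual capitalization, so it cannot rely on casing alone.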

Still working on the gazetteer.

from spacy.

lechatpito avatar lechatpito commented on April 28, 2024

wrt the gazetteer, I think it's great that it's based on DBpedia, and that's a feature we're really looking forward to, as we already use this dataset. However, will it be possible to easily extend the gazetteer with our own lists? For example, we would like to link to restaurant names.

from spacy.

honnibal avatar honnibal commented on April 28, 2024

Yes, definitely.

I want to have a black / grey / white list system, where the grey list is
used as a feature, and the black and white lists are deterministic.
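
One way to read the three-list idea, sketched in Python. This is an interpretation, not the planned design: the list contents are made up, `model_predict` is a hypothetical stand-in for the statistical tagger, and treating the blacklist as "never tag" and the whitelist as "always tag" is an assumption.

```python
# All entries below are made up for illustration.
WHITELIST = {"noma": "ORG"}    # deterministic: always tagged with this label
BLACKLIST = {"may"}            # deterministic: never tagged as an entity
GREYLIST = {"amazon": "ORG"}   # soft: passed to the model as a feature only


def decide(word, model_predict):
    """Apply the deterministic lists first, then defer to the model.

    `model_predict(word, features)` is a hypothetical callable that
    returns an entity label or None.
    """
    key = word.lower()
    if key in BLACKLIST:
        return None
    if key in WHITELIST:
        return WHITELIST[key]
    features = {"grey_label": GREYLIST.get(key)}
    return model_predict(word, features)
```

A user-supplied list, such as restaurant names, could then be loaded into either the white list or the grey list depending on how much it should be trusted.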

from spacy.

AWNystrom avatar AWNystrom commented on April 28, 2024

This sounds fantastic. Can't wait to see how it performs.

from spacy.

ma2rten avatar ma2rten commented on April 28, 2024

It would also be nice if you could provide a case-insensitive model. The current model is basically useless for social media data, such as tweets, where people often write in all lower case.

from spacy.

matichorvat avatar matichorvat commented on April 28, 2024

@honnibal Really excited about the planned addition of DBPedia. Have you made any progress towards that?

from spacy.

forrestbao avatar forrestbao commented on April 28, 2024

@ma2rten I am expecting the lower case NER feature too.

from spacy.

bawongfai avatar bawongfai commented on April 28, 2024

@honnibal is the lower case NER in progress?
If not, would you mind giving some instructions on how to train that?
Thanks. :)

from spacy.

icyc9 avatar icyc9 commented on April 28, 2024

@honnibal Any progress on the lowercase NER?

from spacy.

cmuell89 avatar cmuell89 commented on April 28, 2024

@honnibal And the gazetteer?

from spacy.

lock avatar lock commented on April 28, 2024

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

from spacy.