
Comments (9)

letconex commented on May 18, 2024

Concerning the logic: Is this a viable response?

>>> doc = nlp("This is a majour mistaken.")
>>> print(doc._.outcome_spellCheck)
This is a fact mistaken.
>>> doc = nlp("This is a majour mistake.")
>>> print(doc._.outcome_spellCheck)
This is a major mistake.
>>> doc = nlp("This is a majour mistakes.")
>>> print(doc._.outcome_spellCheck)
This is a for mistakes.
>>> doc = nlp("This is a majour misstake.")
>>> print(doc._.outcome_spellCheck)
This is a minor story.


R1j1t commented on May 18, 2024

That is not the desired response, but it follows from the current logic. If you want to improve accuracy, please try passing a vocabulary file via the vocab_path argument:

vocab_path (str, optional): Vocabulary file path to be used by the model. Defaults to "".

This will help the model prevent false positives. Feel free to open a PR with a fix!
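
A minimal sketch of passing a custom vocabulary, assuming vocab_path is accepted in the pipe config like the other constructor arguments shown in this thread (the file path below is a placeholder):

import spacy
import contextualSpellCheck  # registers the "contextual spellchecker" factory

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(
    "contextual spellchecker",
    config={
        "vocab_path": "/path/to/custom_vocab.txt",  # placeholder path to a custom vocabulary file
        "max_edit_dist": 2,
    },
)

doc = nlp("This is a majour mistake.")
print(doc._.outcome_spellCheck)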


kshitij12345 commented on May 18, 2024

One side effect of using the current transformers tokenizer logic is that it supports multilingual models by default. Otherwise, I am not sure, but I think different languages might require different spell checkers to account for language-specific nuances.


R1j1t commented on May 18, 2024

As mentioned in the README

This package currently focuses on Out of Vocabulary (OOV) word or non-word error (NWE) correction using BERT model.

So let's say you want to perform spell correction on a Japanese sentence:

  1. Provide a Japanese spaCy model: this will break the sentence into tokens. As this model is trained on the Japanese language, it knows the nuances (better than an English model).
  2. Provide a Japanese BERT model (from the transformers models): this will provide the candidate words for OOV words. Note that the vocabulary considered here is that of the transformer model, not the spaCy model.

Below is some code contributed to the repo for the Japanese language:

import spacy
from contextualSpellCheck import ContextualSpellCheck

# Japanese spaCy model: breaks the sentence into tokens
nlp = spacy.load("ja_core_news_sm")

# Japanese BERT model: provides candidate words for OOV tokens
checker = ContextualSpellCheck(
    model_name="cl-tohoku/bert-base-japanese-whole-word-masking",
    max_edit_dist=2,
)
nlp.add_pipe(checker)  # spaCy v2 style: add the component instance

doc = nlp("しかし大勢においては、ここような事故はウィキペディアの拡大には影響を及ぼしていない。")
print(doc._.performed_spellCheck)
print(doc._.outcome_spellCheck)

contextualSpellCheck examples folder
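
For newer spaCy (v3) releases, the component can presumably be added by name with a config instead, as the Portuguese example later in this thread does; a minimal sketch under that assumption:

import spacy
import contextualSpellCheck  # registers the "contextual spellchecker" factory

nlp = spacy.load("ja_core_news_sm")
nlp.add_pipe(
    "contextual spellchecker",
    config={
        "model_name": "cl-tohoku/bert-base-japanese-whole-word-masking",
        "max_edit_dist": 2,
    },
)

doc = nlp("しかし大勢においては、ここような事故はウィキペディアの拡大には影響を及ぼしていない。")
print(doc._.performed_spellCheck)
print(doc._.outcome_spellCheck)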

I hope this answers your question @kshitij12345. Please feel free to provide ideas or references if you think I might have missed something here!


stale commented on May 18, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.



piegu commented on May 18, 2024

The current logic of misspell identification relies on vocab.txt from the transformer model. For not-so-common words, tokenizers break them into subwords, and hence the original entire word might not be present as-is in vocab.txt.

Hi @R1j1t,

First of all, congratulations on your Contextual Spell Checker (CSC) based on spaCy and BERT (transformer model).

As I was searching for this kind of tool, I tested your CSC and can give the following feedback:

  1. Your CSC is a universal spell checker, as it is possible to download the spaCy and BERT models of a language other than English. For example, this is my code for using your CSC in Portuguese in a Colab notebook:
# Installation
!pip install -U pip setuptools wheel
!pip install -U spacy
!pip install contextualSpellCheck

# spaCy model in Portuguese
spacy_model = "pt_core_news_md" # 48MB, or "pt_core_news_sm" (20MB), or "pt_core_news_lg"  (577MB)
!python -m spacy download {spacy_model} 

# BERT model in Portuguese
model_name = "neuralmind/bert-base-portuguese-cased" # or "neuralmind/bert-large-portuguese-cased"

# Importation and instantiation of the spaCy model
import spacy
import contextualSpellCheck
nlp = spacy.load(spacy_model)

# Download BERT model and add contextual spellchecker to the spaCy model
nlp.add_pipe(
    "contextual spellchecker",
    config={
        "model_name": model_name,
        "max_edit_dist": 2,
    },
);

# Sentence with errors ("milões" instead of "milhões")
sentence = "A receita foi de $ 9,4 milões em comparação com o ano anterior de $ 2,7 milões."

# Get sentence with corrections (if errors found by CSC)
doc = nlp(sentence)
print(f'({doc._.performed_spellCheck}) {doc._.outcome_spellCheck}')

# (True) A receita foi de $ 9,4 milhões em comparação com o ano anterior de $ 2,7 milhões.
  2. Your CSC is a unigram spell checker, as it uses the [MASK] token of a BERT model to replace a so-called misspelled word with a token from the BERT tokenizer vocab (see post). That means your CSC cannot correct a bigram error, for example (see the following example).
sentence = "a horta abdominal" # the correct sentence in Portuguese is "aorta abdominal"
doc = nlp(sentence)
print(f'({doc._.performed_spellCheck}) {doc._.outcome_spellCheck}')

# (False) 
# the CSC did not find corrected words with an edit distance < max_edit_dist
  3. Your CSC is a word corrector: it replaces non-vocab words with tokens from the BERT tokenizer vocab (if their edit distance is less than max_edit_dist). That is the true issue, I think (i.e., using a BERT model). In fact, by using BERT models, I do not see how your CSC will be able to correct words instead of replacing them. It is true you can pass an arbitrarily large vocab file that will allow it to detect most misspelled words, but as already said, your CSC will only be able to replace them with one token of the BERT tokenizer vocab (a token is not necessarily a word in the WordPiece BERT tokenizer, which uses subwords as tokens; see the tokenizer sketch after this list). This means that a "solution" would be to use fine-tuned BERT models with a gigantic vocabulary (in order to have whole words instead of subwords). Unfortunately, this kind of fine-tuning would require a huge corpus of texts. And even so, your CSC spell checker would remain a unigram one.
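
For illustration only (not part of contextualSpellCheck): a minimal sketch of the subword issue described in point 3, assuming a standard Hugging Face BERT checkpoint; the exact splits depend on the checkpoint's vocabulary.

from transformers import AutoTokenizer

# Any BERT-style checkpoint; the splits below depend on its vocabulary.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

word = "pseudogene"  # an uncommon word, used purely for illustration
print(tokenizer.tokenize(word))       # likely split into subwords, e.g. ['pseudo', '##gene']
print(word in tokenizer.get_vocab())  # False if the whole word is not a vocab entry

# A single [MASK] can only be filled by one vocab token, so a word that
# exists only as subwords can never be proposed as a replacement.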

Could you consider exploring another type of transformer model, like T5 (or ByT5), which has a seq2seq architecture (a BERT-like encoder and a GPT-like decoder), allowing sentences of different lengths in the model's input and output?
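
A minimal sketch of that seq2seq idea, assuming a hypothetical T5 checkpoint fine-tuned for spelling correction (the model name below is a placeholder, not a real recommendation):

from transformers import pipeline

# Hypothetical fine-tuned checkpoint; replace with a real seq2seq model
# trained on a spelling-correction corpus.
corrector = pipeline("text2text-generation", model="your-org/t5-spell-correction")

print(corrector("a horta abdominal")[0]["generated_text"])
# A seq2seq model can emit an output of a different length than its input
# (e.g. "aorta abdominal"), which a single-[MASK] replacement cannot do.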


R1j1t commented on May 18, 2024

Hey @piegu, first off I want to thank you for your feedback. It feels terrific to have contributors, and even more so ones who help shape the logic! When I started this project, I wanted the library to be generalized for multiple languages, hence the spaCy and BERT approach. I created tasks for myself (#44, #40), and I would like to read more on these topics. But lately, I have been occupied with my day job and have limited my contributions to contextualSpellCheck.

Regarding your 2nd point, I would agree it is something I did not know. As pointed out in the comment by sgugger:

For this task, you need to either use a different model (coded yourself as it's not present in the library) or have your training set contain one [MASK] per token you want to mask. For instance, if you want to mask all the tokens corresponding to one word (a technique called whole-word masking), what is typically done in training scripts is to replace all parts of one word by [MASK]. For pseudogene tokenized as pseudo, ##gene, that would mean having [MASK] [MASK].
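
For illustration only (an assumption, not contextualSpellCheck's implementation): a minimal sketch of that whole-word-masking idea with a standard Hugging Face BERT checkpoint, masking every subword of one word and predicting a token for each masked position.

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "bert-base-cased"  # any BERT-style checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

sentence = "The pseudogene was studied in detail."  # hypothetical example
word = "pseudogene"

enc = tokenizer(sentence, return_tensors="pt")
word_ids = tokenizer(word, add_special_tokens=False)["input_ids"]
ids = enc["input_ids"][0].tolist()

# Locate the subword span of the target word inside the encoded sentence.
span = None
for start in range(len(ids) - len(word_ids) + 1):
    if ids[start:start + len(word_ids)] == word_ids:
        span = list(range(start, start + len(word_ids)))
        break
assert span is not None

# Replace every subword of the word with [MASK] and predict each position.
masked = enc["input_ids"].clone()
for i in span:
    masked[0, i] = tokenizer.mask_token_id
with torch.no_grad():
    logits = model(input_ids=masked, attention_mask=enc["attention_mask"]).logits
predicted = [tokenizer.convert_ids_to_tokens(int(logits[0, i].argmax())) for i in span]
print(predicted)  # one predicted subword token per [MASK]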

I would still want to depend on transformer models, as they add multilingual support. I will try to experiment with your suggestions and think of a solution for this myself.

Hope you like the project. Feel free to contribute!

