
contextualspellcheck's Introduction

spellCheck

Contextual word checker for better suggestions


Types of spelling mistakes

It is essential to understand that identifying whether a candidate is a spelling error is a big task.

Spelling errors are broadly classified as non-word errors (NWE) and real word errors (RWE). If the misspelt string is a valid word in the language, then it is called an RWE, else it is an NWE.

-- Monojit Choudhury et al. (2007)

This package currently focuses on Out of Vocabulary (OOV) word, or non-word error (NWE), correction using the BERT model. The idea behind using BERT is to use the surrounding context when correcting an OOV word. To improve this package further, I would like to extend the functionality to identify RWE, optimise the package, and improve the documentation.
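To illustrate the core idea (not the package's actual implementation), the sketch below uses the Hugging Face transformers fill-mask pipeline directly: masking the suspicious token lets BERT rank replacement candidates from the surrounding context, which is the mechanism this package builds on.

# Sketch only: context-aware candidate ranking with a masked language model.
# This is not contextualSpellCheck's internal code.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-cased")
predictions = fill_mask("Income was $9.4 [MASK] compared to the prior year.")
for p in predictions[:3]:
    print(p["token_str"], round(p["score"], 4))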

Install

The package can be installed using pip. It requires Python 3.6+.

pip install contextualSpellCheck
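The usage examples below also load spaCy's small English model (en_core_web_sm); if it is not already available, it can be downloaded with:

python -m spacy download en_core_web_sm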

Usage

Note: For usage in other languages, check the examples folder.

How to load the package in the spaCy pipeline

>>> import contextualSpellCheck
>>> import spacy
>>> nlp = spacy.load("en_core_web_sm") 
>>> 
>>> ## We require NER to identify if a token is a PERSON
>>> ## also require parser because we use `Token.sent` for context
>>> nlp.pipe_names
['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']
>>> contextualSpellCheck.add_to_pipe(nlp)
>>> nlp.pipe_names
['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer', 'contextual spellchecker']
>>> 
>>> doc = nlp('Income was $9.4 milion compared to the prior year of $2.7 milion.')
>>> doc._.outcome_spellCheck
'Income was $9.4 million compared to the prior year of $2.7 million.'

Or you can add it to the spaCy pipeline manually:

>>> import spacy
>>> import contextualSpellCheck
>>> 
>>> nlp = spacy.load("en_core_web_sm")
>>> nlp.pipe_names
['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']
>>> # You can pass optional parameters to the contextual spellchecker,
>>> # e.g. to set the maximum edit distance, use config={"max_edit_dist": 3}
>>> nlp.add_pipe("contextual spellchecker")
<contextualSpellCheck.contextualSpellCheck.ContextualSpellCheck object at 0x1049f82b0>
>>> nlp.pipe_names
['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer', 'contextual spellchecker']
>>> 
>>> doc = nlp("Income was $9.4 milion compared to the prior year of $2.7 milion.")
>>> print(doc._.performed_spellCheck)
True
>>> print(doc._.outcome_spellCheck)
Income was $9.4 million compared to the prior year of $2.7 million.
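The optional parameters mentioned in the comments above are passed through the config dictionary. A minimal sketch (on a fresh pipeline, using the max_edit_dist and model_name options shown elsewhere in this README):

>>> nlp = spacy.load("en_core_web_sm")
>>> nlp.add_pipe("contextual spellchecker", config={"max_edit_dist": 3, "model_name": "bert-base-cased"})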

After adding the contextual spellchecker to the pipeline, you use the pipeline as usual. The spell check suggestions and other data can be accessed through custom extensions.

Using the pipeline

>>> doc = nlp(u'Income was $9.4 milion compared to the prior year of $2.7 milion.')
>>> 
>>> # Doc Extension
>>> print(doc._.contextual_spellCheck)
True
>>> print(doc._.performed_spellCheck)
True
>>> print(doc._.suggestions_spellCheck)
{milion: 'million', milion: 'million'}
>>> print(doc._.outcome_spellCheck)
Income was $9.4 million compared to the prior year of $2.7 million.
>>> print(doc._.score_spellCheck)
{milion: [('million', 0.59422), ('billion', 0.24349), (',', 0.08809), ('trillion', 0.01835), ('Million', 0.00826), ('%', 0.00672), ('##M', 0.00591), ('annually', 0.0038), ('##B', 0.00205), ('USD', 0.00113)], milion: [('billion', 0.65934), ('million', 0.26185), ('trillion', 0.05391), ('##M', 0.0051), ('Million', 0.00425), ('##B', 0.00268), ('USD', 0.00153), ('##b', 0.00077), ('millions', 0.00059), ('%', 0.00041)]}
>>> 
>>> # Token Extension
>>> print(doc[4]._.get_require_spellCheck)
True
>>> print(doc[4]._.get_suggestion_spellCheck)
'million'
>>> print(doc[4]._.score_spellCheck)
[('million', 0.59422), ('billion', 0.24349), (',', 0.08809), ('trillion', 0.01835), ('Million', 0.00826), ('%', 0.00672), ('##M', 0.00591), ('annually', 0.0038), ('##B', 0.00205), ('USD', 0.00113)]
>>> 
>>> # Span Extension
>>> print(doc[2:6]._.get_has_spellCheck)
True
>>> print(doc[2:6]._.score_spellCheck)
{$: [], 9.4: [], milion: [('million', 0.59422), ('billion', 0.24349), (',', 0.08809), ('trillion', 0.01835), ('Million', 0.00826), ('%', 0.00672), ('##M', 0.00591), ('annually', 0.0038), ('##B', 0.00205), ('USD', 0.00113)], compared: []}

Extensions

To make usage easy, the contextual spellchecker provides custom spaCy extensions which your code can consume, making it easier to get the desired data. contextualSpellCheck provides extensions at the Doc, Span, and Token levels. The tables below summarise the extensions.

spaCy.Doc level extensions

| Extension | Type | Description | Default |
| --- | --- | --- | --- |
| doc._.contextual_spellCheck | Boolean | Whether contextualSpellCheck is added as an extension | True |
| doc._.performed_spellCheck | Boolean | Whether contextualSpellCheck identified any misspellings and performed correction | False |
| doc._.suggestions_spellCheck | {spaCy.Token: str} | If corrections are performed, the mapping of misspelled tokens (spaCy.Token) to suggested words (str) | {} |
| doc._.outcome_spellCheck | str | The corrected sentence (str) | "" |
| doc._.score_spellCheck | {spaCy.Token: List(str, float)} | If corrections are identified, the mapping of misspelled tokens (spaCy.Token) to suggested words (str) with the probability of each correction | None |

spaCy.Span level extensions

| Extension | Type | Description | Default |
| --- | --- | --- | --- |
| span._.get_has_spellCheck | Boolean | Whether contextualSpellCheck identified any misspellings and performed correction in this span | False |
| span._.score_spellCheck | {spaCy.Token: List(str, float)} | If corrections are identified, the mapping of misspelled tokens (spaCy.Token) to suggested words (str) with the probability of each correction, for the tokens in this span | {spaCy.Token: []} |

spaCy.Token level extensions

| Extension | Type | Description | Default |
| --- | --- | --- | --- |
| token._.get_require_spellCheck | Boolean | Whether contextualSpellCheck identified a misspelling and performed correction on this token | False |
| token._.get_suggestion_spellCheck | str | If corrections are performed, the suggested word (str) | "" |
| token._.score_spellCheck | [(str, float)] | If corrections are identified, the suggested words (str) and the probability (float) of each correction | [] |
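For example, a small loop using the token-level extensions above collects per-token suggestions (a sketch reusing the nlp pipeline from the Usage section):

>>> doc = nlp("Income was $9.4 milion compared to the prior year of $2.7 milion.")
>>> for token in doc:
...     if token._.get_require_spellCheck:
...         print(token.text, "->", token._.get_suggestion_spellCheck)
...
milion -> million
milion -> million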

API

At present, there is a simple GET API to get you started. You can run the app locally and play with it.

Query: You can use the endpoint http://127.0.0.1:5000/?query=YOUR-QUERY. Note: your browser will handle the URL encoding of the query text.

GET Request: http://localhost:5000/?query=Income%20was%20$9.4%20milion%20compared%20to%20the%20prior%20year%20of%20$2.7%20milion.

Response:

{
    "success": true,
    "input": "Income was $9.4 milion compared to the prior year of $2.7 milion.",
    "corrected": "Income was $9.4 milion compared to the prior year of $2.7 milion.",
    "suggestion_score": {
        "milion": [
            [
                "million",
                0.59422
            ],
            [
                "billion",
                0.24349
            ],
            ...
        ],
        "milion:1": [
            [
                "billion",
                0.65934
            ],
            [
                "million",
                0.26185
            ],
            ...
        ]
    }
}
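Assuming the example app is running locally at the address above, the endpoint can also be queried from Python; a minimal sketch using requests, which handles the URL encoding for you:

import requests

response = requests.get(
    "http://127.0.0.1:5000/",
    params={"query": "Income was $9.4 milion compared to the prior year of $2.7 milion."},
)
print(response.json()["suggestion_score"])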

Task List

  • use cython for part of the code to improve performance (#39)
  • Improve metric for candidate selection (#40)
  • Add examples for other languages (#41)
  • Update the logic of misspell identification (OOV) (#44)
  • better candidate generation (solved by #44?)
  • add metric by testing on datasets
  • Improve documentation
  • Improve logging in code
  • Add support for Real Word Error (RWE) (Big Task)
  • add multi mask out capability

Completed Tasks

  • specify maximum edit distance for candidateRanking
  • allow user to specify bert model
  • Include transformers deTokenizer to get better suggestions
  • dependency version in setup.py (#38)

Support and contribution

If you like the project, please ⭑ the project and show your support! Also, if you feel the current behaviour is not as expected, please feel free to raise an issue. If you can help with any of the above tasks, please open a PR with the necessary changes to documentation and tests.

Cite

If you are using contextualSpellCheck in your academic work, please consider citing the library using the below BibTeX entry:

@misc{Goel_Contextual_Spell_Check_2021,
  author = {Goel, Rajat},
  doi = {10.5281/zenodo.4642379},
  month = {3},
  title = {{Contextual Spell Check}},
  url = {https://github.com/R1j1t/contextualSpellCheck},
  year = {2021}
}

Reference

Below are some of the projects and works I referred to while developing this package:

  1. Explosion AI. Architecture. May 2020. URL: https://spacy.io/api.
  2. Monojit Choudhury et al. "How difficult is it to develop a perfect spell-checker? A cross-linguistic analysis through complex network approach". In: arXiv preprint physics/0703198 (2007).
  3. Jacob Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 2019. arXiv: 1810.04805 [cs.CL].
  4. Hugging Face. Fast Coreference Resolution in spaCy with Neural Networks. May 2020. URL: https://github.com/huggingface/neuralcoref.
  5. Ines Montani. Chapter 3: Processing Pipelines. May 2020. URL: https://course.spacy.io/en/chapter3.
  6. Eric Mays, Fred J Damerau, and Robert L Mercer. "Context based spelling correction". In: Information Processing & Management 27.5 (1991), pp. 517–522.
  7. Peter Norvig. How to Write a Spelling Corrector. May 2020. URL: http://norvig.com/spell-correct.html.
  8. Yifu Sun and Haoming Jiang. Contextual Text Denoising with Masked Language Models. 2019. arXiv: 1910.14080 [cs.CL].
  9. Thomas Wolf et al. "Transformers: State-of-the-Art Natural Language Processing". In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Online: Association for Computational Linguistics, Oct. 2020, pp. 38–45. URL: https://www.aclweb.org/anthology/2020.emnlp-demos.6.

contextualspellcheck's People

Contributors

adheeshk13, adkiem, alvarocavalcante, dc-aichara, r1j1t, sobolevn, tpanza


contextualspellcheck's Issues

UnicodeDecodeError

'charmap' codec can't decode byte 0x90 in position 3920: character maps to <undefined>
while adding contextualSpellCheck to the spaCy pipeline

[BUG] Installation fails on macOS 12.6.7

Describe the bug
pip install contextualSpellCheck fails on macOS 12.6.7; M1 Pro MacBook Pro
TL;DR:

editdistance/bycython.cpp:216:12: fatal error: 'longintrepr.h' file not found

I guess it's less of a bug than just the fact that it does not build for macOS 12.x yet?

To Reproduce

$ pip install contextualSpellCheck

Expected behavior
Successful installation

Version (please complete the following information):

  • contextualSpellCheck [0.4.3]
  • Spacy: [3.6.0]
  • transformers [4.30.2]
  • Command Line Tools for Xcode []

Additional information

  • I'm using miniforge3
  • This is the complete install prompt:
$ pip install contextualSpellCheck
Collecting contextualSpellCheck
  Using cached contextualSpellCheck-0.4.3-py3-none-any.whl (128 kB)
Collecting torch>=1.4 (from contextualSpellCheck)
  Using cached torch-2.0.1-cp311-none-macosx_11_0_arm64.whl (55.8 MB)
Collecting editdistance==0.6.0 (from contextualSpellCheck)
  Using cached editdistance-0.6.0.tar.gz (29 kB)
  Preparing metadata (setup.py) ... done
Collecting transformers>=4.0.0 (from contextualSpellCheck)
  Using cached transformers-4.30.2-py3-none-any.whl (7.2 MB)
Requirement already satisfied: spacy>=3.0.0 in /Users/manu_neu/miniforge3/envs/spaCy/lib/python3.11/site-packages (from contextualSpellCheck) (3.6.0)
Requirement already satisfied: spacy-legacy<3.1.0,>=3.0.11 in /Users/manu_neu/miniforge3/envs/spaCy/lib/python3.11/site-packages (from spacy>=3.0.0->contextualSpellCheck) (3.0.12)
Requirement already satisfied: spacy-loggers<2.0.0,>=1.0.0 in /Users/manu_neu/miniforge3/envs/spaCy/lib/python3.11/site-packages (from spacy>=3.0.0->contextualSpellCheck) (1.0.4)
Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /Users/manu_neu/miniforge3/envs/spaCy/lib/python3.11/site-packages (from spacy>=3.0.0->contextualSpellCheck) (1.0.9)
Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /Users/manu_neu/miniforge3/envs/spaCy/lib/python3.11/site-packages (from spacy>=3.0.0->contextualSpellCheck) (2.0.7)
Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /Users/manu_neu/miniforge3/envs/spaCy/lib/python3.11/site-packages (from spacy>=3.0.0->contextualSpellCheck) (3.0.8)
Requirement already satisfied: thinc<8.2.0,>=8.1.8 in /Users/manu_neu/miniforge3/envs/spaCy/lib/python3.11/site-packages (from spacy>=3.0.0->contextualSpellCheck) (8.1.10)
Requirement already satisfied: wasabi<1.2.0,>=0.9.1 in /Users/manu_neu/miniforge3/envs/spaCy/lib/python3.11/site-packages (from spacy>=3.0.0->contextualSpellCheck) (1.1.2)
Requirement already satisfied: srsly<3.0.0,>=2.4.3 in /Users/manu_neu/miniforge3/envs/spaCy/lib/python3.11/site-packages (from spacy>=3.0.0->contextualSpellCheck) (2.4.6)
Requirement already satisfied: catalogue<2.1.0,>=2.0.6 in /Users/manu_neu/miniforge3/envs/spaCy/lib/python3.11/site-packages (from spacy>=3.0.0->contextualSpellCheck) (2.0.8)
Requirement already satisfied: typer<0.10.0,>=0.3.0 in /Users/manu_neu/miniforge3/envs/spaCy/lib/python3.11/site-packages (from spacy>=3.0.0->contextualSpellCheck) (0.9.0)
Requirement already satisfied: pathy>=0.10.0 in /Users/manu_neu/miniforge3/envs/spaCy/lib/python3.11/site-packages (from spacy>=3.0.0->contextualSpellCheck) (0.10.2)
Requirement already satisfied: smart-open<7.0.0,>=5.2.1 in /Users/manu_neu/miniforge3/envs/spaCy/lib/python3.11/site-packages (from spacy>=3.0.0->contextualSpellCheck) (5.2.1)
Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in /Users/manu_neu/miniforge3/envs/spaCy/lib/python3.11/site-packages (from spacy>=3.0.0->contextualSpellCheck) (4.65.0)
Requirement already satisfied: numpy>=1.15.0 in /Users/manu_neu/miniforge3/envs/spaCy/lib/python3.11/site-packages (from spacy>=3.0.0->contextualSpellCheck) (1.25.1)
Requirement already satisfied: requests<3.0.0,>=2.13.0 in /Users/manu_neu/miniforge3/envs/spaCy/lib/python3.11/site-packages (from spacy>=3.0.0->contextualSpellCheck) (2.31.0)
Requirement already satisfied: pydantic!=1.8,!=1.8.1,<1.11.0,>=1.7.4 in /Users/manu_neu/miniforge3/envs/spaCy/lib/python3.11/site-packages (from spacy>=3.0.0->contextualSpellCheck) (1.10.11)
Requirement already satisfied: jinja2 in /Users/manu_neu/miniforge3/envs/spaCy/lib/python3.11/site-packages (from spacy>=3.0.0->contextualSpellCheck) (3.1.2)
Requirement already satisfied: setuptools in /Users/manu_neu/miniforge3/envs/spaCy/lib/python3.11/site-packages (from spacy>=3.0.0->contextualSpellCheck) (68.0.0)
Requirement already satisfied: packaging>=20.0 in /Users/manu_neu/miniforge3/envs/spaCy/lib/python3.11/site-packages (from spacy>=3.0.0->contextualSpellCheck) (23.1)
Requirement already satisfied: langcodes<4.0.0,>=3.2.0 in /Users/manu_neu/miniforge3/envs/spaCy/lib/python3.11/site-packages (from spacy>=3.0.0->contextualSpellCheck) (3.3.0)
Collecting filelock (from torch>=1.4->contextualSpellCheck)
  Using cached filelock-3.12.2-py3-none-any.whl (10 kB)
Requirement already satisfied: typing-extensions in /Users/manu_neu/miniforge3/envs/spaCy/lib/python3.11/site-packages (from torch>=1.4->contextualSpellCheck) (4.7.1)
Collecting sympy (from torch>=1.4->contextualSpellCheck)
  Using cached sympy-1.12-py3-none-any.whl (5.7 MB)
Collecting networkx (from torch>=1.4->contextualSpellCheck)
  Using cached networkx-3.1-py3-none-any.whl (2.1 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers>=4.0.0->contextualSpellCheck)
  Using cached huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Collecting pyyaml>=5.1 (from transformers>=4.0.0->contextualSpellCheck)
  Using cached PyYAML-6.0-cp311-cp311-macosx_11_0_arm64.whl (167 kB)
Collecting regex!=2019.12.17 (from transformers>=4.0.0->contextualSpellCheck)
  Using cached regex-2023.6.3-cp311-cp311-macosx_11_0_arm64.whl (288 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers>=4.0.0->contextualSpellCheck)
  Using cached tokenizers-0.13.3-cp311-cp311-macosx_12_0_arm64.whl (3.9 MB)
Collecting safetensors>=0.3.1 (from transformers>=4.0.0->contextualSpellCheck)
  Using cached safetensors-0.3.1-cp311-cp311-macosx_12_0_arm64.whl (401 kB)
Collecting fsspec (from huggingface-hub<1.0,>=0.14.1->transformers>=4.0.0->contextualSpellCheck)
  Using cached fsspec-2023.6.0-py3-none-any.whl (163 kB)
Requirement already satisfied: charset-normalizer<4,>=2 in /Users/manu_neu/miniforge3/envs/spaCy/lib/python3.11/site-packages (from requests<3.0.0,>=2.13.0->spacy>=3.0.0->contextualSpellCheck) (3.2.0)
Requirement already satisfied: idna<4,>=2.5 in /Users/manu_neu/miniforge3/envs/spaCy/lib/python3.11/site-packages (from requests<3.0.0,>=2.13.0->spacy>=3.0.0->contextualSpellCheck) (3.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in /Users/manu_neu/miniforge3/envs/spaCy/lib/python3.11/site-packages (from requests<3.0.0,>=2.13.0->spacy>=3.0.0->contextualSpellCheck) (2.0.3)
Requirement already satisfied: certifi>=2017.4.17 in /Users/manu_neu/miniforge3/envs/spaCy/lib/python3.11/site-packages (from requests<3.0.0,>=2.13.0->spacy>=3.0.0->contextualSpellCheck) (2023.5.7)
Requirement already satisfied: blis<0.8.0,>=0.7.8 in /Users/manu_neu/miniforge3/envs/spaCy/lib/python3.11/site-packages (from thinc<8.2.0,>=8.1.8->spacy>=3.0.0->contextualSpellCheck) (0.7.9)
Requirement already satisfied: confection<1.0.0,>=0.0.1 in /Users/manu_neu/miniforge3/envs/spaCy/lib/python3.11/site-packages (from thinc<8.2.0,>=8.1.8->spacy>=3.0.0->contextualSpellCheck) (0.1.0)
Requirement already satisfied: click<9.0.0,>=7.1.1 in /Users/manu_neu/miniforge3/envs/spaCy/lib/python3.11/site-packages (from typer<0.10.0,>=0.3.0->spacy>=3.0.0->contextualSpellCheck) (8.1.4)
Requirement already satisfied: MarkupSafe>=2.0 in /Users/manu_neu/miniforge3/envs/spaCy/lib/python3.11/site-packages (from jinja2->spacy>=3.0.0->contextualSpellCheck) (2.1.3)
Collecting mpmath>=0.19 (from sympy->torch>=1.4->contextualSpellCheck)
  Using cached mpmath-1.3.0-py3-none-any.whl (536 kB)
Building wheels for collected packages: editdistance
  Building wheel for editdistance (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> [20 lines of output]
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build/lib.macosx-11.0-arm64-cpython-311
      creating build/lib.macosx-11.0-arm64-cpython-311/editdistance
      copying editdistance/__init__.py -> build/lib.macosx-11.0-arm64-cpython-311/editdistance
      copying editdistance/_editdistance.h -> build/lib.macosx-11.0-arm64-cpython-311/editdistance
      copying editdistance/def.h -> build/lib.macosx-11.0-arm64-cpython-311/editdistance
      running build_ext
      building 'editdistance.bycython' extension
      creating build/temp.macosx-11.0-arm64-cpython-311
      creating build/temp.macosx-11.0-arm64-cpython-311/editdistance
      clang -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /Users/manu_neu/miniforge3/envs/spaCy/include -arch arm64 -fPIC -O2 -isystem /Users/manu_neu/miniforge3/envs/spaCy/include -arch arm64 -I./editdistance -I/Users/manu_neu/miniforge3/envs/spaCy/include/python3.11 -c editdistance/_editdistance.cpp -o build/temp.macosx-11.0-arm64-cpython-311/editdistance/_editdistance.o
      clang -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /Users/manu_neu/miniforge3/envs/spaCy/include -arch arm64 -fPIC -O2 -isystem /Users/manu_neu/miniforge3/envs/spaCy/include -arch arm64 -I./editdistance -I/Users/manu_neu/miniforge3/envs/spaCy/include/python3.11 -c editdistance/bycython.cpp -o build/temp.macosx-11.0-arm64-cpython-311/editdistance/bycython.o
      editdistance/bycython.cpp:216:12: fatal error: 'longintrepr.h' file not found
        #include "longintrepr.h"
                 ^~~~~~~~~~~~~~~
      1 error generated.
      error: command '/usr/bin/clang' failed with exit code 1
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for editdistance
  Running setup.py clean for editdistance
Failed to build editdistance
ERROR: Could not build wheels for editdistance, which is required to install pyproject.toml-based projects

Change pre-trained model?

I'm trying to create a spell checker proof-of-concept (POC) for an e-commerce search engine. We're already using the Transformers architecture for other tasks, and I thought about trying it for a spell checker as well.

I've come across this beautiful API and I want to give it a try. I've seen it uses the classical BERT pre-trained model, but I need to use a pre-trained model in Portuguese (such as 'BERTimbau') or a multilingual one (such as miniLM).

It would be good if we could pass the desired pre-trained model as a parameter to the function.

I may be wrong and it may already be implemented; correct me if so. Is there an easy solution, or a place where I can choose my pre-trained model without going low-level?

[BUG] Sentence context greater than 512 characters

I tried to correct spelling mistakes in a large text.

import spacy
import contextualSpellCheck

spacy_nlp = spacy.load(
    'en_core_web_sm',
    # disable=['ner']
    disable=['parser', 'ner'] # disable extra components for efficiency
)
contextualSpellCheck.add_to_pipe(spacy_nlp)

corpus_spacy = [spacy_nlp(doc) for doc in corpus_raw]

At first, I faced this error:
ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: nlp.add_pipe('sentencizer'). Alternatively, add the dependency parser or sentence recognizer, or set sentence boundaries by setting doc[i].is_sent_start.

So, I added the sentencizer component to the pipeline.

import spacy
import contextualSpellCheck

spacy_nlp = spacy.load(
    'en_core_web_sm',
    # disable=['ner']
    disable=['parser', 'ner'] # disable extra components for efficiency
)
spacy_nlp.add_pipe('sentencizer')
contextualSpellCheck.add_to_pipe(spacy_nlp)

corpus_spacy = [spacy_nlp(doc) for doc in corpus_raw]

This time I faced this error:
RuntimeError: The expanded size of the tensor (837) must match the existing size (512) at non-singleton dimension 1. Target sizes: [1, 837]. Tensor sizes: [1, 512]

I guess this is due to the limitations of BERT. However, I believe that there should be a way to catch this error and bypass the spell check.
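Until the length limit is handled inside the package, one possible workaround (a sketch continuing the snippet above, assuming spaCy v3) is to catch the error per document and reprocess that document with the spell checker disabled:

corpus_spacy = []
for raw_doc in corpus_raw:
    try:
        corpus_spacy.append(spacy_nlp(raw_doc))
    except RuntimeError:
        # Document exceeds the model's 512-token limit; run the pipeline
        # again for this document with the spell checker switched off.
        with spacy_nlp.select_pipes(disable=["contextual spellchecker"]):
            corpus_spacy.append(spacy_nlp(raw_doc))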

Missing requirements

I installed this great package using pip install contextualSpellCheck, but when I tried to test it I got a sequence of errors because of missing requirements (PyTorch, Transformers, and editdistance).
No problem here! I just installed these missing packages manually, but it would be nice if they were installed automatically.
So, I recommend modifying the setup.py file with the following configuration.
install_requires=[
    "torch==1.6.0",
    "editdistance==0.5.3",
    "transformers==3.0.2",
]

Spanish contextual spell check

Hi, I loaded this wonderful module into a spaCy Spanish model, and it did not work. I assume this is because the training was done with an English dataset. Is there any way to make this algorithm work with a Spanish spaCy model? Thank you.

[BUG] PyPI package missing required dependencies

Describe the bug
I can't install all the required dependencies using pip install contextualSpellCheck. It seems like the PyPI package doesn't contain the fix.

To Reproduce

python -m venv venv
source venv/bin/activate
pip install contextualSpellCheck
python -c 'import contextualSpellCheck'

ModuleNotFoundError: No module named 'torch'

Expected behavior
torch and the other dependencies should be installed automatically.

Version (please complete the following information):
latest 0.4.0 contextualSpellCheck from PyPI.

Additional information
see #17 and #18.

dependency version number in setup.py

This will help prevent code breakage when a user has an old dependency.

Considerations: we will need to check the lowest working versions to minimise unnecessary updates.

Pipeline dependencies?

Wonderful project; it seems a lot like something I would like to use for DaCy, a Danish NLP pipeline built on spaCy. Currently, it seems like the spelling correction is dependent on the NER and other pipeline components? If so, this will sadly make it less useful for applying it before the rest of the pipeline.

error while adding to pipeline

import spacy
import contextualSpellCheck

nlp = spacy.load('en_core_web_sm')
contextualSpellCheck.add_to_pipe(nlp)

I get the following error

UnicodeDecodeError Traceback (most recent call last)
in
1 nlp = spacy.load('en_core_web_sm')
----> 2 contextualSpellCheck.add_to_pipe(nlp)

C:\ProgramData\Anaconda3\lib\site-packages\contextualSpellCheck\__init__.py in add_to_pipe(nlp, **kwargs)
6
7 def add_to_pipe(nlp, **kwargs):
----> 8 checker = ContextualSpellCheck()
9 nlp.add_pipe(checker)
10 return nlp

C:\ProgramData\Anaconda3\lib\site-packages\contextualSpellCheck\contextualSpellCheck.py in __init__(self, vocab_path, debug, performance)
29 # if want to remove '[unusedXX]' from vocab
30 # words = [line.rstrip() for line in f if not line.startswith('[unused')]
---> 31 words = [line.rstrip() for line in f]
32 self.vocab = Vocab(strings=words)
33 self.BertTokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

C:\ProgramData\Anaconda3\lib\site-packages\contextualSpellCheck\contextualSpellCheck.py in <listcomp>(.0)
29 # if want to remove '[unusedXX]' from vocab
30 # words = [line.rstrip() for line in f if not line.startswith('[unused')]
---> 31 words = [line.rstrip() for line in f]
32 self.vocab = Vocab(strings=words)
33 self.BertTokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

C:\ProgramData\Anaconda3\lib\encodings\cp1252.py in decode(self, input, final)
21 class IncrementalDecoder(codecs.IncrementalDecoder):
22 def decode(self, input, final=False):
---> 23 return codecs.charmap_decode(input,self.errors,decoding_table)[0]
24
25 class StreamWriter(Codec,codecs.StreamWriter):

UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 3920: character maps to <undefined>

Another BERT, BioBERT ?

Hi,
Thanks for this library!
Is it possible to change BERT to BioBERT?
I work on scientific articles, and I would like to correct specific terms, like names of diseases that were transcribed incorrectly. For instance, I have "covet 19" in the transcription instead of "covid 19", and I would like to correct that.
Can you help me please?
Thanks,
Cheers,
Camille

Words being corrected ##ts [BUG]

Describe the bug
Words tagged as incorrect are replaced with a word with hashtags.

To Reproduce

#Steps to reproduce the behavior:
>>> import spacy
>>> nlp = spacy.load('en_core_web_lg', disable=['tagger'])
>>> from contextualSpellCheck import ContextualSpellCheck
2020-10-14 10:24:16.775668: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudart64_101.dll
>>> merge_ents = nlp.create_pipe("merge_entities")
>>> nlp.add_pipe(merge_ents)
>>> spell_checker = ContextualSpellCheck(max_edit_dist=3)
>>> nlp.add_pipe(spell_checker)
>>> sent = 'Everyone has to help to fix the problems of society. There has to be more training, more opportunity to bridge the gap between the haves and the have nots.'
>>> doc = nlp(sent)
>>> correct = doc._.outcome_spellCheck
>>> correct
'Everyone has to help to fix the problems of society. There has to be more training, more opportunity to bridge the gap between the have and the have ##ts.'

Expected behavior
'Everyone has to help to fix the problems of society. There has to be more training, more opportunity to bridge the gap between the have and the have nots.'
or
'Everyone has to help to fix the problems of society. There has to be more training, more opportunity to bridge the gap between the have and the have not.'

Version:

  • contextualSpellCheck 0.3.0
  • Spacy: 2.3.2
  • transformers 3.3.1

Additional information
I checked vocab.txt and there are words with ## in them. I am wondering what these are needed for.

[BUG]

Describe the bug

I apologize in advance if the issue I am about to describe is simply some kind of user error rather than an actual issue. Please understand this is my first time using spaCy and contextualSpellCheck; I believe I am using them correctly, but there is always the chance I am not.

That said, my application uses contextualSpellCheck to check the spelling in tweets and recommend fixes for misspelled words. In doing this, I have found that the spelling corrections are almost always incorrect, and often completely illogical.

For example, in the tweet:

"@user all #smiles when #media is silly joke flirt mischief excitement #pressconference in #antalya #turkey sunday #throwback love happy happy love happy love"

contextual spell check indicates the word "flirt" is misspelled (which it is not) and recommends the illogical spelling correction of "#".

Please see the image below for a few more examples.

[screenshot: illogical_spelling_corrections]

I have a dataset of over 20k tweets and have created a function that processes a given number of these tweets using spaCy and contextualSpellCheck. The function stores all top spelling correction options in a CSV file. Using my function (links provided below), you can easily reproduce this issue and create as many examples of these incorrect spelling suggestions as needed.

To Reproduce

#Steps to reproduce the behavior:

1. Download the colab notebook here: https://github.com/BradenAnderson/Twitter-Sentiment-Analysis/blob/main/01_2_Data_Cleaning_Spacy_and_Spellcheck.ipynb

2. Download the tweet data here: https://github.com/BradenAnderson/Twitter-Sentiment-Analysis/blob/main/Train_Test_Datasets/train_tweets_with_emojis_clean.csv

3. Download two more supporting data files these two links:

https://github.com/BradenAnderson/Twitter-Sentiment-Analysis/blob/main/Supporting_Data_Files/contractions.csv

https://github.com/BradenAnderson/Twitter-Sentiment-Analysis/blob/main/Supporting_Data_Files/sms_speak.csv

4. Run the colab notebook. The test driver is called in cell 47 and will generate an output CSV file that shows you all the recommendations contextualSpellCheck made. You can change the test "start" and "end" index in the test driver function call; that will run the test on a different set of tweets and create a new CSV file.

5. Inspect the output CSV file to determine if the spelling suggestions are reasonable. The output CSV file will have a name formatted as:

num1_to_num2_spellcheck_test_results.csv 

where num1 and num2 are the start and end tweet indexes you passed to the test driver function.

Expected behavior
I expected the spelling recommendations to be at least reasonable. In many cases spelling suggestions involve changing a correctly spelled readable word to simply a punctuation mark like "." or "#".

Version (please complete the following information):

  • contextualSpellCheck version: 0.4.1
  • Spacy version: 3.0.5

Additional information

Please be aware that the way I have structured the code to capture all of this information about what contextualSpellCheck is doing requires a lot of RAM. I do not recommend running the test driver function on more than 50 tweets at one time.

As I mentioned at the beginning I am inexperienced with both Spacy and contextualSpellCheck. I have shared my current use case and implementation and I welcome any advice on how to better use these tools. Beyond getting spellcheck working, I want to find a way to process all of these tweets without generating so many doc objects, as I believe this is what is causing so much RAM usage, which has led to slow performance and crashes.

Despite all of that, as far as I can tell the contextualSpellCheck is not giving reasonable spelling recommendations for this tweet processing application. I have dedicated a significant amount of time to troubleshooting this, and I finally decided to raise the issue with you guys. I would really like to use your tool if it can be used for this task. Please help me understand if this is an actual bug, or some kind of user error.

Thank you,
Braden

How to use it efficiently in the pipeline?

Hi! Thanks for this package. I am trying to use it together with the sentencizer and merge_entities pipes, but I am not sure of the best way to combine them. My objective is to:

  1. parse a long text into sentences
  2. fix possible typos in sentences
  3. tokenize each sentence, find and merge entities within sentences

What is more, when I tried to load the checker in the pipeline, it sometimes returns errors or does not find typos at all.

Information on package version:

system and package info:

>>> spacy.__version__
'2.2.3'
>>> sys.version
'3.7.6 | packaged by conda-forge | (default, Dec 27 2019, 00:09:34) \n[GCC 7.3.0]'

Example of not fixing typo:


>>> nlp = spacy.load('en_core_web_sm')#, disable=["tagger", "parser"])
>>> merge_ents = nlp.create_pipe("merge_entities")
>>> checker = ContextualSpellCheck()
>>> nlp.add_pipe(checker)
>>> nlp.add_pipe(merge_ents)
>>> doc = nlp('Income was $9.4 milion compared to the prior year of $2.7 milion.')
>>> doc._.outcome_spellCheck
'Income was $9.4 milion compared million the prior year of $2.7 milion.'

Example of error output:


>>> nlp = spacy.load('en_core_web_sm', disable=["parser"]) #"tagger", 
>>> merge_ents = nlp.create_pipe("merge_entities")
>>> checker = ContextualSpellCheck()
>>> nlp.add_pipe(checker)
>>> nlp.add_pipe(merge_ents)
>>> doc = nlp('Income was $9.4 milion compared to the prior year of $2.7 milion.')
-------------------------------------------------------------------------------------------------
<stdin> 1 <module>
language.py 435 __call__
doc = proc(doc, **component_cfg.get(name, {}))
contextualSpellCheck.py 74 __call__
candidate = self.candidateGenerator(doc, misspellTokens)
contextualSpellCheck.py 183 candidateGenerator
for i in token.sent:
TypeError:
'NoneType' object is not iterable

I believe the error has something to do with the default tagger pipe. But still, will loading the merge_entities pipe somehow break spell checking?

[BUG] Windows users with Python 3.8+ cannot resolve editdistance==0.5.3 dependency

Describe the bug
A Windows user running Python 3.8 or 3.9 cannot successfully run pip install contextualSpellCheck. (Unless Microsoft C++ Build Tools has been installed ahead of time.)

To Reproduce

conda create -n myenv python=3.9
conda activate myenv
python -m pip install contextualSpellCheck

Building wheels for collected packages: editdistance
  Building wheel for editdistance (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> [12 lines of output]
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build\lib.win-amd64-cpython-39
      creating build\lib.win-amd64-cpython-39\editdistance
      copying editdistance\__init__.py -> build\lib.win-amd64-cpython-39\editdistance
      copying editdistance\_editdistance.h -> build\lib.win-amd64-cpython-39\editdistance
      copying editdistance\def.h -> build\lib.win-amd64-cpython-39\editdistance
      running build_ext
      building 'editdistance.bycython' extension
      error: Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for editdistance
  Running setup.py clean for editdistance
Failed to build editdistance
Installing collected packages: editdistance, contextualSpellCheck
  Running setup.py install for editdistance ... error
  error: subprocess-exited-with-error

  × Running setup.py install for editdistance did not run successfully.
  │ exit code: 1
  ╰─> [14 lines of output]
      running install
      C:\Users\sv182c\work\mambaforge\envs\bellablue\lib\site-packages\setuptools\command\install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
        warnings.warn(
      running build
      running build_py
      creating build
      creating build\lib.win-amd64-cpython-39
      creating build\lib.win-amd64-cpython-39\editdistance
      copying editdistance\__init__.py -> build\lib.win-amd64-cpython-39\editdistance
      copying editdistance\_editdistance.h -> build\lib.win-amd64-cpython-39\editdistance
      copying editdistance\def.h -> build\lib.win-amd64-cpython-39\editdistance
      running build_ext
      building 'editdistance.bycython' extension
      error: Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> editdistance

note: This is an issue with the package mentioned above, not pip.
hint: See above for output from the failure.

Expected behavior
Installation to succeed on Windows for Python 3.8 or 3.9, without having to install Microsoft C++ Build Tools.

Version (please complete the following information):

  • contextualSpellCheck 0.4.2
  • Spacy: 3.4.1
  • transformers 4.21.1

Additional information

The setup.py script for contextualSpellCheck has its editdistance dependency pinned to version 0.5.3. That release of editdistance only has prebuilt binary wheels for up to Python 3.7.

A newer release of editdistance, 0.6.0, contains no code changes but includes binary wheels for more versions of Python (up to 3.9).

So if the editdistance dependency version were simply relaxed to something like >= 0.5.3 or ~= 0.5, then pip would find the 0.6.0 release and use those pre-built binary wheels. Since there are no underlying code changes to that release of editdistance, that would be a simple and low-risk change for contextualSpellCheck.
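For illustration, the suggested change in setup.py would look roughly like this (a sketch of the proposal above, with the other pins taken from the dependency resolution logs in this document, not the exact merged fix):

install_requires=[
    "spacy>=3.0.0",
    "torch>=1.4",
    "transformers>=4.0.0",
    "editdistance>=0.5.3",  # relaxed from ==0.5.3 so newer pre-built wheels can be used
]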

How to specify bert model?

I use spaCy 2.3.1, which supports a Chinese model, and I want to use contextualSpellCheck to check Chinese spelling, but the result is hard to understand, so I think I should specify a Chinese BERT model.
Can I specify it in code, or must I modify the source code to do that?
Thanks very much.

Update the logic of misspell identification

Is your feature request related to a problem? Please describe.
The current logic of misspelling identification relies on vocab.txt from the transformer model. BERT tokenisers break less common words into sub-words and store only the sub-words in vocab.txt. Hence the original word might not be present in vocab.txt and gets identified as misspelt.
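To see why, a quick check with the transformers tokenizer (a sketch; the exact pieces depend on the vocabulary used) shows how a valid but less common word is split into sub-words and is therefore absent from vocab.txt:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
print(tokenizer.tokenize("detokenised"))      # split into sub-word pieces rather than kept as one token
print("detokenised" in tokenizer.get_vocab()) # False, so it would be flagged as a misspelling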

Describe the solution you'd like
Still not clear, need to look into some papers on this.

Describe alternatives you've considered
The alternatives I can think of right now are two-fold:

  • ask the user to provide a list of such words and append them to the vocab.txt from the transformers model
  • if the proposed change is ##x, then check the edit distance from the detokenised form of that word plus the previous word

Additional context
#30 explosion/spaCy#3994

Issue in usage in secured-corporate environment due to SSL certificate verification [BUG]

Describe the bug
Facing an issue due to SSL certificate verification in a secured corporate environment.

To Reproduce

#Steps to reproduce the behavior:
import spacy
import contextualSpellCheck

nlp = spacy.load('en_core_web_sm')
contextualSpellCheck.add_to_pipe(nlp)
doc = nlp(text)
print(doc._.outcome_spellCheck)

Expected behavior
A parameter to turn SSL verification off.

Version (please complete the following information):

  • contextualSpellCheck [e.g. 0.3.0]
  • Spacy: [e.g. 2.3.2]
  • transformers [e.g. 3.1.0]

Improve general code quality for maintenance purposes

Is your feature request related to a problem? Please describe.
In my last contribution, I noticed that some improvements can be made to the code to make it easier to maintain. The linting score reveals that there is some work to do!

Describe the solution you'd like
I would like to create a PR to improve the code style. The changes will not affect the functioning of the lib.

French (doc add)

As requested in #41, here is how I succeeded in running contextualSpellCheck for French.

Use French spaCy model:

nlp = spacy.load("fr_core_news_sm")

Use camembert/camembert-base-ccnet:

nlp.add_pipe("contextual spellchecker", config={"max_edit_dist": 4,"model_name": "camembert/camembert-base-ccnet"})

Need these dependencies:

pip install sentencepiece
pip install protobuf==3.20

Remark: in the result, spaces are lost, so post-processing is needed to restore them properly.

PS: for the flaubert/flaubert_large_cased model, this dependency is needed:

pip install sacremoses

Fine Tuning with custom data

Hello Everybody,

First, congratulations on this amazing lib.

I need to fine-tune with custom data, because my domain is specific. Is that possible? How?

Add examples in our documentation

We have examples of how to use this package in English and Japanese. It would help if we could improve our documentation so that contextualSpellCheck becomes easier for others to understand, and maybe add documentation in other languages.

Adding Damerau, case-insensitivity comparison, Bayesian decision making options

Describe the problem you met

Hi, thanks for this useful lib, I think it can be improved even further! ;-)

I was disappointed by the lib output for this case:

doc = nlp(
    "Visa to dubia ",
)

print(doc._.performed_spellCheck)
print(doc._.outcome_spellCheck)
print(doc._.score_spellCheck)

True
Visa to.
{dubia: [('.', 0.72154), (';', 0.07971), ('?', 0.04526), ('Norway', 0.00669), ('Estonia', 0.00593), ('Taiwan', 0.00555), ('Bermuda', 0.00531), ('Iceland', 0.00479), ('Italy', 0.00466), ('Germany', 0.00437)]}

Describe the solution you'd like
Obviously, the top_n parameter, which is internal to the candidate_ranking method, should be made a part of the constructor.
I made that adjustment locally, and unexpectedly, Cuba became the winner instead of Dubai. Two reasons were immediately evident:

  1. candidate-to-misspelling comparisons are case-sensitive, which is not good for accuracy
  2. symbol swaps are considered to cost the same as one addition plus one deletion, which is clearly a drawback.

Therefore, optional case insensitivity and Damerau-Levenshtein distance instead of plain Levenshtein are desired.
Once implemented locally, I arrived at the desired outcome (Dubai).

However, it felt like a lucky coincidence, as the probabilities coming from the underlying BERT model were not accounted for when making the decision. I thought it would be very natural to handle them in conjunction with the textual similarities in a Bayesian fashion, where the BERT probabilities could be our prior, and the candidate-to-misspelling textual similarity would be the conditional probability of the candidate being the correct guess given what was typed. To soften the impact of possible BERT miscalibration, converting the absolute probabilities to ranks would be desirable.

Summary of my suggestions:
feature request could be implemented by adding a few parameters to the class constructor:

top_n (int, optional): suggestions from underlying ANN model to be considered. Defaults to 10.
lowercased_distance (bool, optional): lowercase candidates before computing edit distance. Defaults to True.
damerau_distance (bool, optional): additionally account for symbol swaps when calculating a distance. Defaults to True.
bayes_selection (bool, optional): use bayes reasoning when selecting the best candidate. Bert probabilities are the prior, textual similarities of candidates to the input are treated as the probabilities B/A that the correct candidate is A, while the input was B. Defaults to True.
ranked_bert_probs (bool, optional): use ranked probs as opposed to the absolute probs values coming from Bert. Defaults to True.

Calling code would become:
contextualSpellCheck.add_to_pipe(nlp, config=dict(top_n=500, lowercased_distance=True, damerau_distance=True, bayes_selection=True, ranked_bert_probs=False))

After the implementation, I arrive at Dubai instead of the dot, which makes me happy )

Describe alternatives you've considered
I was not able to find any alternative solution. Tried a few Spacy models, to no avail.

Additional context
I used

            Doc.set_extension("bayes_probs", default=None)
            Doc.set_extension("bayes_details", default=None)

to provide additional details for the occasional exigent user:

screenshot detailing resulting candidates rating and bayesian details

If you think the community can benefit from this new functionality, I'm happy to merge my PR.

[BUG] Errors while installing in CONDA environment using pip

Describe the bug
When we use the conda environment and try installing this package.

To Reproduce

conda activate
pip install contextualSpellCheck

Stack Trace
  ERROR: Failed cleaning build dir for torch
Failed to build torch
Installing collected packages: sentencepiece, protobuf, sacremoses, tokenizers, transformers, editdistance, torch, contextualSpellCheck
    Running setup.py install for torch ... error
    ERROR: Command errored out with exit status 1:
     command: 'C:\Users\Vishal Shah\anaconda3\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\Vishal Shah\\AppData\\Local\\Temp\\pip-install-xcy0juhd\\torch\\setup.py'"'"'; __file__='"'"'C:\\Users\\Vishal Shah\\AppData\\Local\\Temp\\pip-install-xcy0juhd\\torch\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\Vishal Shah\AppData\Local\Temp\pip-record-ajqwkmnl\install-record.txt' --single-version-externally-managed --compile --install-headers 'C:\Users\Vishal Shah\anaconda3\Include\torch'
         cwd: C:\Users\Vishal Shah\AppData\Local\Temp\pip-install-xcy0juhd\torch\
    Complete output (23 lines):
    running install
    running build_deps
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "C:\Users\Vishal Shah\AppData\Local\Temp\pip-install-xcy0juhd\torch\setup.py", line 225, in <module>
        setup(name="torch", version="0.1.2.post2",
      File "C:\Users\Vishal Shah\anaconda3\lib\site-packages\setuptools\__init__.py", line 165, in setup
        return distutils.core.setup(**attrs)
      File "C:\Users\Vishal Shah\anaconda3\lib\distutils\core.py", line 148, in setup
        dist.run_commands()
      File "C:\Users\Vishal Shah\anaconda3\lib\distutils\dist.py", line 966, in run_commands
        self.run_command(cmd)
      File "C:\Users\Vishal Shah\anaconda3\lib\distutils\dist.py", line 985, in run_command
        cmd_obj.run()
      File "C:\Users\Vishal Shah\AppData\Local\Temp\pip-install-xcy0juhd\torch\setup.py", line 99, in run
        self.run_command('build_deps')
      File "C:\Users\Vishal Shah\anaconda3\lib\distutils\cmd.py", line 313, in run_command
        self.distribution.run_command(command)
      File "C:\Users\Vishal Shah\anaconda3\lib\distutils\dist.py", line 985, in run_command
        cmd_obj.run()
      File "C:\Users\Vishal Shah\AppData\Local\Temp\pip-install-xcy0juhd\torch\setup.py", line 51, in run
        from tools.nnwrap import generate_wrappers as generate_nn_wrappers
    ModuleNotFoundError: No module named 'tools.nnwrap'
    ----------------------------------------
ERROR: Command errored out with exit status 1: 'C:\Users\Vishal Shah\anaconda3\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\Vishal Shah\\AppData\\Local\\Temp\\pip-install-xcy0juhd\\torch\\setup.py'"'"'; __file__='"'"'C:\\Users\\Vishal Shah\\AppData\\Local\\Temp\\pip-install-xcy0juhd\\torch\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\Vishal Shah\AppData\Local\Temp\pip-record-ajqwkmnl\install-record.txt' --single-version-externally-managed --compile --install-headers 'C:\Users\Vishal Shah\anaconda3\Include\torch' Check the logs for full command output.


Expected behavior
A clear and concise description of what you expected to happen.

Version (please complete the following information):

  • contextualSpellCheck [e.g. 0.3.0]
  • Spacy: [e.g. 2.3.2]
  • transformers [e.g. 3.1.0]

Additional information
Add any other information about the problem here.

Rewrite part of library in Cython

At present, the time spent on one sentence is around 0.7 s with spaCy small models on CPU. The assumption is that Cython can bring down the compute time.

outcome_spellCheck and score_spellCheck do not match

Describe the bug
I am not sure if I am using this wrong, but it seems like the data in score_spellCheck does not match the outcome_spellCheck final output. I am assuming that outcome_spellCheck is a result of performing the spelling corrections with the highest probability score?

To Reproduce

import en_core_web_lg
import contextualSpellCheck

nlp = en_core_web_lg.load()
contextualSpellCheck.add_to_pipe(nlp)

tests = [
    "John says he is feeling depresed.",
    "Mary admits to drug adiction and forming bad habbits."
]

for test in tests:
    print("Input: ", test)
    doc = nlp(test)

    print("outcome_spellCheck:", doc._.outcome_spellCheck)
    print("score_spellCheck:")

    for token, suggestions in doc._.score_spellCheck.items():
        for suggestion in suggestions:
            suggested, score = suggestion
            print("  ", token.text, "->", suggested, " (" + str(score) +")")

Output:

Input:  John says he is feeling depresed.
outcome_spellCheck: John says he is feeling depressed.
score_spellCheck:
   depresed -> better  (0.14048)
   depresed -> sick  (0.06199)
   depresed -> fine  (0.04419)
   depresed -> well  (0.04254)
   depresed -> tired  (0.03723)
   depresed -> guilty  (0.03161)
   depresed -> good  (0.03133)
   depresed -> ill  (0.02869)
   depresed -> depressed  (0.02555)
   depresed -> bad  (0.0251)
   
Input:  Mary admits to drug adiction and forming bad habbits.
outcome_spellCheck: Mary admits to drug addiction and forming bad habits.
score_spellCheck:
   adiction -> ##ging  (0.50065)
   adiction -> use  (0.12457)
   adiction -> addiction  (0.09937)
   adiction -> dealing  (0.08391)
   adiction -> abuse  (0.07905)
   adiction -> trafficking  (0.01413)
   adiction -> ##gies  (0.00674)
   adiction -> drinking  (0.0065)
   adiction -> driving  (0.00415)
   adiction -> ##king  (0.00372)
   habbits -> relationships  (0.35707)
   habbits -> habits  (0.20654)
   habbits -> memories  (0.12251)
   habbits -> dreams  (0.0941)
   habbits -> thoughts  (0.02157)
   habbits -> alliances  (0.01681)
   habbits -> bonds  (0.01669)
   habbits -> marriages  (0.01668)
   habbits -> feelings  (0.01495)
   habbits -> plans  (0.01093)

Version (please complete the following information):

  • contextualSpellCheck: 0.3.3
  • Spacy: 2.3.2
  • transformers: 4.1.1

Additional information
Note how outcome_spellCheck has a good final output. However, score_spellCheck has corrections that are wildly unexpected. They seem more like synonyms than actual spelling corrections (e.g. "relationships" is nowhere close to the spelling "habbits"). Note how "depressed" got a miserable 0.02555 score, listed way below other corrections that are much farther from the original word.

[BUG] Apostrophes

Removing "s" after apostrophe
When apostrophes are in sentence yields weird results.

To Reproduce

#Steps to reproduce the behavior:
1. Init spacy model '...'
2. Add contextualSpellCheck '....'
3. supply the sentence "Spell Checking based on Peter Norvig’s blog post."
4. doc._.outcome_spellCheck gives result: "Spell Checking based on Peter Norvig, blog post."

Expected behavior
" 's " should not be touched.

Version (please complete the following information):

  • contextualSpellCheck [e.g. 0.3.0]
  • Spacy: [e.g. 2.3.2]
  • transformers [e.g. 3.1.0]

Bad performance for other language

Hello,
I'm trying to use the contextual spell checker for Spanish. I ran the script in https://github.com/R1j1t/contextualSpellCheck/blob/88bbbb46252c534679b185955fd88c239ed548a7/examples/ja_example.py with the following custom configuration:

import spacy
import contextualSpellCheck

nlp = spacy.load("es_dep_news_trf")

nlp.add_pipe(
	"contextual spellchecker",
	config={
		"model_name": "bert-base-multilingual-cased",
		"max_edit_dist": 2,
	},
)

doc = nlp("La economia a crecido un dos por ciento.")
print(doc._.performed_spellCheck)
print(doc._.outcome_spellCheck)

but I don't get the desired result

La economia a crecido un dos por ciento should be corrected as La economía ha crecido un dos por ciento
Instead, I get La economia a crecido un dos por cento

If I use another pre-trained model (e.g. "model_name": "PlanTL-GOB-ES/roberta-large-bne"), the result is still wrong:
Laeconomiaacrecidoundosporciento. ??
I wonder if I'm using the proper script to run the spellchecker in another language.

Unnecessary dependency on flask

From what I see, flask is only used in the example file modelAPI.py in RESTAPI.

The core code is not at all dependent on flask; however, it is a required dependency.

I think it would be better to just add a note in modelAPI.py to install flask and remove it from requirements.txt.

Thanks for the interesting project.

Facing some error while Installing this package via command prompt

I was trying to install this package by using the command pip install contextualSpellCheck.
I am able to install some of the files, but it's giving me a legacy-install-failure error.
What should I do in this case? Can someone please help?
I am getting an error while installing the editdistance package (which is probably the core component, based on the edit distance algorithm).
Can someone please help regarding this?


Methodology for spell check

Can you please describe the methodology followed in the detection and correction of misspelled words, maybe at a high level, if possible? Thank you.

[BUG] compatibility issue with spacy v3

I'm testing the way it's mentioned in the examples as follows:

import spacy
from contextualSpellCheck import ContextualSpellCheck

nlp = spacy.load("en_core_web_sm")
checker = ContextualSpellCheck(max_edit_dist=100)
nlp.add_pipe(checker)

doc = nlp("Income was $9.4 milion compared to the prior year of $2.7 milion.")
print(doc._.outcome_spellCheck)

Getting error:

`nlp.add_pipe` now takes the string name of the registered component factory, not a callable component. Expected string, but got <contextualSpellCheck.contextualSpellCheck.ContextualSpellCheck object at 0x7f923add6c10> (name: 'None')

Can you please suggest the fix for this??
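For reference, under spaCy v3 the component is added by its registered string name, as shown earlier in this README. A minimal sketch of the equivalent call (assuming a contextualSpellCheck release with spaCy v3 support):

import spacy
import contextualSpellCheck  # registers the "contextual spellchecker" factory

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("contextual spellchecker", config={"max_edit_dist": 100})

doc = nlp("Income was $9.4 milion compared to the prior year of $2.7 milion.")
print(doc._.outcome_spellCheck)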

add proxy setting when initialize "contextual spellchecker"

Is your feature request related to a problem? Please describe.
our work is behind a firewall, need a proxy setting to fetch the transformer model "bert-base-cased", I tried to download via "AutoTokenizer.from_pretrained('bert-base-cased', proxies={'http':PROXY, 'https':PROXY})"; however, looks like it cannot find the path. so I have to change 107 line by adding the proxy:
self.BertTokenizer = AutoTokenizer.from_pretrained(self.model_name, proxies={'http':PROXY, 'https':PROXY})

Describe the solution you'd like
I want the package to accept a proxy setting when initializing the class.

Describe alternatives you've considered
Alternatively, I could download the model via the transformers from_pretrained function, as long as the "contextual spellchecker" can then find it properly.
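A rough sketch of that alternative (not an official feature; the proxy values below are hypothetical): download the model once through the proxy with transformers, save it locally, and point the component at the local path via the existing model_name option, which is passed straight to from_pretrained:

import spacy
import contextualSpellCheck
from transformers import AutoTokenizer, AutoModelForMaskedLM

PROXIES = {"http": "http://proxy.example:8080", "https": "http://proxy.example:8080"}  # hypothetical proxy

# Download once through the proxy and save to a local directory.
AutoTokenizer.from_pretrained("bert-base-cased", proxies=PROXIES).save_pretrained("./bert-base-cased")
AutoModelForMaskedLM.from_pretrained("bert-base-cased", proxies=PROXIES).save_pretrained("./bert-base-cased")

# Point the spell checker at the local copy.
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("contextual spellchecker", config={"model_name": "./bert-base-cased"})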

Additional context
Add any other context or screenshots about the feature request here.

[BUG] Language.factory

Describe the bug
A clear and concise description of what the bug is.
I have installed contextualSpellCheck in Colab.
When I type this

import contextualSpellCheck

then I get this error:

/usr/local/lib/python3.7/dist-packages/contextualSpellCheck/contextualSpellCheck.py in <module>()
     15 
     16 
---> 17 @Language.factory("contextual spellchecker")
     18 class ContextualSpellCheck(object):
     19     """

AttributeError: type object 'Language' has no attribute 'factory'

To Reproduce

#Steps to reproduce the behavior:
1. Init spacy model '...'
2. Add contextualSpellCheck '....'
3. supply the sentence '....'
4. See error

Expected behavior
What was expected?

Version (please complete the following information):

  • contextualSpellCheck [e.g. 0.3.0]
  • Spacy: [e.g. 2.3.2]
  • transformers [e.g. 3.1.0]

Additional information
Add any other information about the problem here.
