spelling's Introduction

Spelling correction based on pretrained transformer models

Purpose

This is an attempt to create a model that is able to fix spelling errors and common typos.

An English work-in-progress model and interactive demo can be found here, and a German version here.

Install

  1. python -m venv venv
  2. source venv/bin/activate
  3. pip install -r requirements.txt

Generate Training Data

To generate the training data, simply run these two scripts:

  1. sh convert_leipzig_data.sh
  2. python generate_dataset.py

Optional: if you want to combine multiple languages, you can edit and run the combine.sh script.

By default, this creates the English dataset. To switch to a different language, change the language tag in those two scripts.
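
The core idea of the generated dataset is to pair clean sentences with synthetically corrupted versions, so the model learns to map noisy text back to clean text. A minimal sketch of that idea (for illustration only; the actual logic lives in generate_dataset.py and may differ):

import random
import string

def corrupt(sentence: str, error_rate: float = 0.1) -> str:
    # Illustrative only: randomly drop, insert, or substitute characters
    # to turn a clean sentence into a noisy training input.
    out = []
    for char in sentence:
        roll = random.random()
        if roll < error_rate / 3:
            continue  # drop the character
        if roll < 2 * error_rate / 3:
            out.append(char)
            out.append(random.choice(string.ascii_lowercase))  # insert a character
        elif roll < error_rate:
            out.append(random.choice(string.ascii_lowercase))  # substitute the character
        else:
            out.append(char)
    return "".join(out)

print(corrupt("This is a clean training sentence."))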

How to train a model:

For English, run sh train_bart_model.sh; for the German model, run sh train_de_bart_model.sh.
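
Once a model is trained (or using the published Hugging Face checkpoints), it can be used with the transformers text2text-generation pipeline. A minimal usage sketch (the max_length value is just an example):

from transformers import pipeline

# Use your own output directory here if you trained a model locally.
fix_spelling = pipeline("text2text-generation", model="oliverguhr/spelling-correction-english-base")

print(fix_spelling("lets do a comparsion", max_length=256))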

Contribute:

This is an open research project; improvements and contributions are welcome. If we achieve promising results, we will publish them in a more formal way (paper). All contributors will be recognized.

Open Questions

  • How to evaluate the quality of the model, apart from using CER on synthetic data? (See the sketch after this list.)
  • What are good data sets to train on?
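
For reference, CER on the synthetic test data can be computed with the Hugging Face evaluate package. A sketch (this is not necessarily how the repository measures it):

import evaluate  # pip install evaluate jiwer

cer = evaluate.load("cer")
score = cer.compute(
    predictions=["thiss is a claen sentense"],
    references=["this is a clean sentence"],
)
print(score)  # character edits divided by the number of reference characters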

Possible Datasets:

spelling's People

Contributors

julienbrochier, oliverguhr

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar

spelling's Issues

OverflowError: out of range integral type conversion attempted

Hi, I'm running your training shell script with the sh train_bart_model.sh command, just as the README says, and this error appears at the end.

{'loss': 0.0193, 'grad_norm': 0.06763239204883575, 'learning_rate': 2.983362019506598e-06, 'epoch': 2.98}
{'loss': 0.0197, 'grad_norm': 0.07400441914796829, 'learning_rate': 1.835915088927137e-06, 'epoch': 2.99}
{'loss': 0.0201, 'grad_norm': 0.07796286791563034, 'learning_rate': 6.884681583476765e-07, 'epoch': 2.99}
{'train_runtime': 8481.4381, 'train_samples_per_second': 105.243, 'train_steps_per_second': 0.411, 'train_loss': 0.0540917304825899, 'epoch': 3.0}
100%|██████████████████████████████████████████████████████████████████████████████| 3486/3486 [2:21:21<00:00, 2.43s/it]
[WARNING|configuration_utils.py:447] 2024-03-30 18:17:42,674 >> Some non-default generation parameters are set in the model config. These should go into a GenerationConfig file (https://huggingface.co/docs/transformers/generation_strategies#save-a-custom-decoding-strategy-with-your-model) instead. This warning will be raised to an exception in v4.41.
Non-default generation parameters: {'early_stopping': True, 'num_beams': 4, 'no_repeat_ngram_size': 3, 'forced_bos_token_id': 0, 'forced_eos_token_id': 2}
***** train metrics *****
epoch = 3.0
train_loss = 0.0541
train_runtime = 2:21:21.43
train_samples = 297536
train_samples_per_second = 105.243
train_steps_per_second = 0.411
100%|████████████████████████████████████████████████████████████████████████████████████| 63/63 [26:08<00:00, 21.08s/it]Traceback (most recent call last):
File "/var/www/nlp/spelling/run_summarization.py", line 708, in
main()
File "/var/www/nlp/spelling/run_summarization.py", line 650, in main
metrics = trainer.evaluate(max_length=max_length, num_beams=num_beams, metric_key_prefix="eval")
File "/var/www/nlp/spelling/venv/lib/python3.10/site-packages/transformers/trainer_seq2seq.py", line 180, in evaluate
return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
File "/var/www/nlp/spelling/venv/lib/python3.10/site-packages/transformers/trainer.py", line 3365, in evaluate
output = eval_loop(
File "/var/www/nlp/spelling/venv/lib/python3.10/site-packages/transformers/trainer.py", line 3656, in evaluation_loop
metrics = self.compute_metrics(EvalPrediction(predictions=all_preds, label_ids=all_labels))
File "/var/www/nlp/spelling/run_summarization.py", line 590, in compute_metrics
decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
File "/var/www/nlp/spelling/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3785, in batch_decode
return [
File "/var/www/nlp/spelling/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3786, in
self.decode(
File "/var/www/nlp/spelling/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3825, in decode
return self._decode(
File "/var/www/nlp/spelling/venv/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 625, in _decode
text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
OverflowError: out of range integral type conversion attempted
100%|████████████████████████████████████████████████████████████████████████████████████| 63/63 [26:09<00:00, 24.91s/it]

My environment is Ubuntu 20.04, 32 GB RAM, 48 cores, RTX 4080.
Sat Mar 30 16:50:57 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:81:00.0 On | N/A |
| 53% 68C P2 196W / 320W | 10981MiB / 16376MiB | 79% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 3901 G /usr/lib/xorg/Xorg 144MiB |
| 0 N/A N/A 4070 G /usr/bin/gnome-shell 66MiB |
| 0 N/A N/A 6775 G ...3/usr/lib/firefox/firefox 11MiB |
| 0 N/A N/A 15644 G ...on=20240329-134507.235000 58MiB |
| 0 N/A N/A 32210 C python 10694MiB |
+-----------------------------------------------------------------------------+

My last checkpoint was 3000.
I don't know whether the third epoch finished.
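
I suspect the generated predictions contain -100 padding values that the fast tokenizer cannot decode. A workaround I'm considering for compute_metrics in run_summarization.py (untested; tokenizer here is the script's module-level tokenizer):

import numpy as np

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    # Replace -100 padding with the pad token id so batch_decode
    # does not fail on negative ids.
    preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # ... rest of the metric computation unchanged ...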
Thanks in advance
Martín

Fail to reproduce results

I'm trying to reproduce results of the English model before making a French one.
Loss stabilizes around 2.0 after 0.2 epochs, slowly decreasing to 1.7.

I'm using provided data and the script generate_dataset.py for dataset generation.

Training parameters are the defaults, except for the batch sizes, which I had to decrease from 4 to 2 (see the sketch after this list):

  • learning_rate: 0.0003
  • train_batch_size: 2
  • eval_batch_size: 2
  • seed: 42
  • gradient_accumulation_steps: 8
  • total_train_batch_size: 16
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 1.0
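
For reference, a sketch of these settings expressed as Seq2SeqTrainingArguments (the repository passes them through train_bart_model.sh to run_summarization.py; the output directory here is a placeholder):

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="output",              # placeholder path
    learning_rate=3e-4,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=8,    # 2 x 8 = total train batch size of 16 on one GPU
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=1.0,
    adam_beta1=0.9,                   # Adam defaults, listed for completeness
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)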

Framework versions:

  • Transformers 4.24.0
  • Pytorch 1.10.0+cu113
  • Datasets 2.6.1
  • Tokenizers 0.13.2

I had to manually download the model and replace its name with the local path inside train_bart_model.sh.
The model and tokenizer configurations seem to have been retrieved correctly. I checked this by comparing the config.json and tokenizer_config.json files of our models, which are generated automatically after training.
The only difference is my transformers version being "4.24.0" instead of "4.19.0.dev0".

I had to convert some files to txt format to share them.
README.md
train_bart_model.txt
config.txt
tokenizer_config.txt

Is there anything I'm doing wrong?

Spell Correction sequence issue with BART based Model

Hey Oliver,

I used the Hugging Face model oliverguhr/spelling-correction-english-base and fine-tuned it on my own (medical) data. But I have observed that BART messes up the input sequence. For example:
I need to consult a Geriatric Psychiatrists inside San Diego --> ['I need to consult a Geriatric Psychiatrists near by San Jose']
I want an opinion from Chiropractic in 90048 --> ['I want an Audio from Cheropractic in 9 January']

Is there any way I can correct this type of issue and force the model to return the corrected sentence in the same format as the input sequence? Any help would be highly appreciated.

Training data origins

Thank you for sharing this project. The already trained English model works really well!

I'm working on a version for the French language.

I have some questions about the wiki.en.train.csv and wiki.en.test.csv files needed in combine.sh.
I would like to know where you got them and whether they could be replaced by a larger .txt Wikipedia dump,
e.g. eng_news_2020_300K-sentences instead of 100K?

Also, mentioning the Leipzig Corpora website under "Possible Datasets" could help with finding files for other languages.

Your Hugging Face shared model produces different results on my end

Hi there,

Since my graphics card is pretty slow, I started playing around with the German and English models you provide on Hugging Face. With the most minimalistic approach, I get different behaviour from the English and German versions.

Inputs (for both variants):
Input 1: das idst ein neuZr test
Input 2: Well maybler ill just write as fast as i can to get this thing stugglinh

German results:
Result 1: [{'generated_text': 'Das ist ein neuer Test. Das ist ein neuer Test. Das ist ein neuer Test. Das'}]
Result 2: [{'generated_text': 'Will maybler will just write as fast as is can to get this thing st'}]

English results:
Result 1: [{'generated_text': 'As idest in near test.'}]
Result 2: [{'generated_text': "Well, maybe I'll just write as fast as I can to get this thing stuck"}]

Code to reproduce:

from transformers import pipeline
fix_spelling = pipeline("text2text-generation", model="oliverguhr/spelling-correction-english-base")
# fix_spelling = pipeline("text2text-generation", model="oliverguhr/spelling-correction-german-base")
print(fix_spelling("das idst ein neuZr test"))
print(fix_spelling("Well maybler ill just write as fast as i can to get this thing stugglinh"))

Am I doing something completely wrong here? Do you have any insight into why this might be? It seems strange to me that these two variants differ in the length of the results they produce, but I am totally new to this.

Thanks a lot for your awesome work,
Michiruf
