spelling's Introduction

Spelling correction based on pretrained transformer models

Purpose

This is an attempt to create a model that is able to fix spelling errors and common typos.

An English work-in-progress model and interactive demo can be found here, and a German version here.

Install

  1. python -m venv venv
  2. source venv/bin/activate
  3. pip install -r requirements.txt

Generate Training Data

To generate the training data, simply run these two scripts:

  1. sh convert_leipzig_data.sh
  2. python generate_dataset.py

Optional: if you want to combine multiple languages, you can edit and run the combine.sh script.

By default, this creates the English dataset. To switch to a different language, change the language tag in those two scripts.
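
The core idea of the generated dataset is to pair clean sentences with synthetically corrupted versions, so the model learns to map noisy text back to clean text. A minimal sketch of that idea (for illustration only; the actual logic lives in generate_dataset.py and may differ):

import random
import string

def corrupt(sentence: str, error_rate: float = 0.1) -> str:
    # Illustrative only: randomly drop, insert, or substitute characters
    # to turn a clean sentence into a noisy training input.
    out = []
    for char in sentence:
        roll = random.random()
        if roll < error_rate / 3:
            continue  # drop the character
        if roll < 2 * error_rate / 3:
            out.append(char)
            out.append(random.choice(string.ascii_lowercase))  # insert a character
        elif roll < error_rate:
            out.append(random.choice(string.ascii_lowercase))  # substitute the character
        else:
            out.append(char)
    return "".join(out)

print(corrupt("This is a clean training sentence."))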

How to train a model:

For English, run sh train_bart_model.sh; for the German model, run sh train_de_bart_model.sh.
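
Once a model is trained (or using the published Hugging Face checkpoints), it can be used with the transformers text2text-generation pipeline. A minimal usage sketch (the max_length value is just an example):

from transformers import pipeline

# Use your own output directory here if you trained a model locally.
fix_spelling = pipeline("text2text-generation", model="oliverguhr/spelling-correction-english-base")

print(fix_spelling("lets do a comparsion", max_length=256))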

Contribute:

This is an open research project; improvements and contributions are welcome. If we achieve promising results, we will publish them in a more formal way (paper). All contributors will be recognized.

Open Questions

  • How to evaluate the quality of the model, apart from using CER on synthetic data? (See the sketch after this list.)
  • What are good data sets to train on?
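
For reference, CER on the synthetic test data can be computed with the Hugging Face evaluate package. A sketch (this is not necessarily how the repository measures it):

import evaluate  # pip install evaluate jiwer

cer = evaluate.load("cer")
score = cer.compute(
    predictions=["thiss is a claen sentense"],
    references=["this is a clean sentence"],
)
print(score)  # character edits divided by the number of reference characters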

Possible Datasets:

spelling's People

Contributors

julienbrochier, oliverguhr

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar

spelling's Issues

OverflowError: out of range integral type conversion attempted

Hi, I'm running your training shell script with the sh train_bart_model.sh command, just as the README says, and this error appears at the end.

{'loss': 0.0193, 'grad_norm': 0.06763239204883575, 'learning_rate': 2.983362019506598e-06, 'epoch': 2.98}
{'loss': 0.0197, 'grad_norm': 0.07400441914796829, 'learning_rate': 1.835915088927137e-06, 'epoch': 2.99}
{'loss': 0.0201, 'grad_norm': 0.07796286791563034, 'learning_rate': 6.884681583476765e-07, 'epoch': 2.99}
{'train_runtime': 8481.4381, 'train_samples_per_second': 105.243, 'train_steps_per_second': 0.411, 'train_loss': 0.0540917304825899, 'epoch': 3.0}
100%|██████████████████████████████████████████████████████████████████████████████| 3486/3486 [2:21:21<00:00, 2.43s/it]
[WARNING|configuration_utils.py:447] 2024-03-30 18:17:42,674 >> Some non-default generation parameters are set in the model config. These should go into a GenerationConfig file (https://huggingface.co/docs/transformers/generation_strategies#save-a-custom-decoding-strategy-with-your-model) instead. This warning will be raised to an exception in v4.41.
Non-default generation parameters: {'early_stopping': True, 'num_beams': 4, 'no_repeat_ngram_size': 3, 'forced_bos_token_id': 0, 'forced_eos_token_id': 2}
***** train metrics *****
epoch = 3.0
train_loss = 0.0541
train_runtime = 2:21:21.43
train_samples = 297536
train_samples_per_second = 105.243
train_steps_per_second = 0.411
100%|████████████████████████████████████████████████████████████████████████████████████| 63/63 [26:08<00:00, 21.08s/it]Traceback (most recent call last):
File "/var/www/nlp/spelling/run_summarization.py", line 708, in
main()
File "/var/www/nlp/spelling/run_summarization.py", line 650, in main
metrics = trainer.evaluate(max_length=max_length, num_beams=num_beams, metric_key_prefix="eval")
File "/var/www/nlp/spelling/venv/lib/python3.10/site-packages/transformers/trainer_seq2seq.py", line 180, in evaluate
return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
File "/var/www/nlp/spelling/venv/lib/python3.10/site-packages/transformers/trainer.py", line 3365, in evaluate
output = eval_loop(
File "/var/www/nlp/spelling/venv/lib/python3.10/site-packages/transformers/trainer.py", line 3656, in evaluation_loop
metrics = self.compute_metrics(EvalPrediction(predictions=all_preds, label_ids=all_labels))
File "/var/www/nlp/spelling/run_summarization.py", line 590, in compute_metrics
decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
File "/var/www/nlp/spelling/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3785, in batch_decode
return [
File "/var/www/nlp/spelling/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3786, in
self.decode(
File "/var/www/nlp/spelling/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3825, in decode
return self._decode(
File "/var/www/nlp/spelling/venv/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 625, in _decode
text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
OverflowError: out of range integral type conversion attempted
100%|████████████████████████████████████████████████████████████████████████████████████| 63/63 [26:09<00:00, 24.91s/it]

My environment is Ubuntu 20.04, 32 GB RAM, 48 cores, RTX 4080.
Sat Mar 30 16:50:57 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:81:00.0 On | N/A |
| 53% 68C P2 196W / 320W | 10981MiB / 16376MiB | 79% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 3901 G /usr/lib/xorg/Xorg 144MiB |
| 0 N/A N/A 4070 G /usr/bin/gnome-shell 66MiB |
| 0 N/A N/A 6775 G ...3/usr/lib/firefox/firefox 11MiB |
| 0 N/A N/A 15644 G ...on=20240329-134507.235000 58MiB |
| 0 N/A N/A 32210 C python 10694MiB |
+-----------------------------------------------------------------------------+

My last checkpoint was 3000.
I don't know whether the third epoch finished.
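
I suspect the generated predictions contain -100 padding values that the fast tokenizer cannot decode. A workaround I'm considering for compute_metrics in run_summarization.py (untested; tokenizer here is the script's module-level tokenizer):

import numpy as np

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    # Replace -100 padding with the pad token id so batch_decode
    # does not fail on negative ids.
    preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # ... rest of the metric computation unchanged ...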
Thanks in advance
Martín

Fail to reproduce results

I'm trying to reproduce results of the English model before making a French one.
Loss stabilizes around 2.0 after 0.2 epochs, slowly decreasing to 1.7.

I'm using provided data and the script generate_dataset.py for dataset generation.

Training parameters are the defaults, except for the batch sizes, which I had to decrease from 4 to 2 (see the sketch after this list):

  • learning_rate: 0.0003
  • train_batch_size: 2
  • eval_batch_size: 2
  • seed: 42
  • gradient_accumulation_steps: 8
  • total_train_batch_size: 16
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 1.0
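
For reference, a sketch of these settings expressed as Seq2SeqTrainingArguments (the repository passes them through train_bart_model.sh to run_summarization.py; the output directory here is a placeholder):

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="output",              # placeholder path
    learning_rate=3e-4,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=8,    # 2 x 8 = total train batch size of 16 on one GPU
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=1.0,
    adam_beta1=0.9,                   # Adam defaults, listed for completeness
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)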

Framework versions:

  • Transformers 4.24.0
  • Pytorch 1.10.0+cu113
  • Datasets 2.6.1
  • Tokenizers 0.13.2

I had to manually download the model and replace its name with the local path inside train_bart_model.sh.
The model and tokenizer configurations seem to have been retrieved correctly. I checked this by comparing the config.json and tokenizer_config.json files of our models, which are generated automatically after training.
The only difference is my transformers version being "4.24.0" instead of "4.19.0.dev0".

I had to convert some files to txt format to share them.
README.md
train_bart_model.txt
config.txt
tokenizer_config.txt

Is there anything I'm doing wrong?

Spell Correction sequence issue with BART based Model

Hey Oliver,

I used the Hugging Face model oliverguhr/spelling-correction-english-base and fine-tuned it on my own (medical) data. But I have observed that BART messes up the input sequence. For example:
I need to consult a Geriatric Psychiatrists inside San Diego --> ['I need to consult a Geriatric Psychiatrists near by San Jose']
I want an opinion from Chiropractic in 90048 --> ['I want an Audio from Cheropractic in 9 January']

Is there any way I can correct this type of issue and force the model to return the corrected sentence in the same format as the input sequence? Any help would be highly appreciated.

Training data origins

Thank you for sharing this project. The already trained English model works really well!

I'm working on a version for the French language.

I have some questions about the wiki.en.train.csv and wiki.en.test.csv files needed in combine.sh.
I would like to know where you got them and whether they could be replaced by a larger .txt Wikipedia dump,
e.g. eng_news_2020_300K-sentences instead of 100K?

Also, mentioning the Leipzig Corpora website under "Possible Datasets" could help with finding files for other languages.

Your Hugging Face shared model produces different results on my end

Hi there,

Since my graphics card is pretty slow, I started playing around with the German and English models you provide on Hugging Face. With the most minimalistic approach, I get different behaviour from the English and German versions.

Inputs (for both variants):
Input 1: das idst ein neuZr test
Input 2: Well maybler ill just write as fast as i can to get this thing stugglinh

German results:
Result 1: [{'generated_text': 'Das ist ein neuer Test. Das ist ein neuer Test. Das ist ein neuer Test. Das'}]
Result 2: [{'generated_text': 'Will maybler will just write as fast as is can to get this thing st'}]

English results:
Result 1: [{'generated_text': 'As idest in near test.'}]
Result 2: [{'generated_text': "Well, maybe I'll just write as fast as I can to get this thing stuck"}]

Code to reproduce:

from transformers import pipeline
fix_spelling = pipeline("text2text-generation", model="oliverguhr/spelling-correction-english-base")
# fix_spelling = pipeline("text2text-generation", model="oliverguhr/spelling-correction-german-base")
print(fix_spelling("das idst ein neuZr test"))
print(fix_spelling("Well maybler ill just write as fast as i can to get this thing stugglinh"))

Am I doing something completely wrong here? Do you have any insight into why this might be? It seems strange to me that these two variants differ in the length of the results they produce, but I am totally new to this.

Thanks a lot for your awesome work,
Michiruf
