notai-tech / fastpunct

Punctuation restoration and spell correction experiments.

License: MIT License

Python 95.53% Shell 4.47%

deep-learning punctuation punctuation-correction text nlp text-correction attention auto-punctuation punctuation-restoration punctuation-marks

fastpunct's Issues

KeyError if input text size greater than around 400 chars

fastPunct punctuation fails with the error quoted below if the input text is longer than around 400 characters.
To reproduce, call the fastPunct.punct method with any input string of more than 400 characters.

input_text_len 407
  File "/opt/conda/lib/python3.7/site-packages/fastpunct/fastpunct.py", line 175, in punct
    return decode(self.model, self.parameters, input_texts, self.allowed_extras, batch_size)
  File "/opt/conda/lib/python3.7/site-packages/fastpunct/fastpunct.py", line 119, in decode
    outputs = [out_dict[text] for text in input_texts_c]
  File "/opt/conda/lib/python3.7/site-packages/fastpunct/fastpunct.py", line 119, in <listcomp>
    outputs = [out_dict[text] for text in input_texts_c]
KeyError: "lorem ipsum is simply dummy text of the printing and typesetting industry. lorem ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. it has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. it was popularised in the dkdd may be dd"
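
A possible workaround, sketched here as an assumption rather than a fix in fastpunct itself, is to split long input on word boundaries so that each piece stays below the roughly 400-character threshold reported above. The helper name chunk_text and the default of 350 characters are hypothetical choices for illustration.

```python
import textwrap
from typing import List


def chunk_text(text: str, max_chars: int = 350) -> List[str]:
    """Split text on whitespace into pieces of at most max_chars characters.

    A single word longer than max_chars is kept whole rather than cut.
    """
    return textwrap.wrap(
        text,
        width=max_chars,
        break_long_words=False,  # never cut inside a word
        break_on_hyphens=False,  # keep hyphenated words together
    )


# Hypothetical usage with the list-based call shown in other issues:
# outputs = fastpunct.punct(chunk_text(long_text))
```

Note that punctuating each chunk independently may produce a sentence break at every chunk boundary, so the results may need to be reviewed where chunks are joined.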

Model Input

Hi!

Thank you so much for the model. I am trying to use it for a project on iOS. I have successfully converted it using CoreMLTools but now I am trying to use it. Would you please be so kind as to provide documentation on the expected input of the model? After conversion I see:

The only thing I did was add:

import coremltools
import tensorflow as tf

and

mlmodel = coremltools.convert(self.model)
mlmodel.save('punctuation.mlmodel')

(both added to __init__)

It would be nice if in future iterations, the input/output could be a little more straightforward.

Thanks!

Improper punctuation

input = ['My name is sid and i want to become a data scientist.']
output = ['Y name is Sid, and I want to become a data scientist.']

It drops the "M" from the start of "My", which is unexpected.

Can't detect end of sentence

Not sure if within the scope, but the model can't detect whether the text input should be separated by the period punctuation. In other words, it can't detect whether a text input actually represents two sentences.

e.g.

fastpunct.punct(
    ["There are three ways to slice a fish on the left on the right and on the middle after you sliced the fish you can go to the house"],
    correct=True,
)

yields

['There are three ways to slice a fish on the left, on the right, and on the middle, after you sliced the fish, you can go to the house.']

Unwanted whitespace around quotation marks

Hi! In some cases I'm getting unwanted whitespace inside the quotation marks, e.g.:
fastpunct.punct('i m going for sure explained my friend')
-> " I'm going for sure ", explained my friend.

But
fastpunct.punct('im going for sure explained my friend')
-> "I'm going for sure", explained my friend.
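
Until the model output is fixed, the stray spaces could be removed in a post-processing step. The following is a minimal sketch, assuming straight double quotes that are balanced within the string; tighten_quotes is a hypothetical helper, not part of fastpunct.

```python
import re


def tighten_quotes(text: str) -> str:
    """Remove whitespace directly inside pairs of straight double quotes."""
    # Lazily match the quoted content so each pair of quotes is handled separately.
    return re.sub(r'"\s*(.*?)\s*"', r'"\1"', text)


print(tighten_quotes('" I\'m going for sure ", explained my friend.'))
```

With unbalanced quotes the pairing shifts, so the helper should only be applied where the quotes are known to come in pairs.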

German Language

How and when would it be possible to use fastPunct for the German language?

Issue while initializing the module

>>> from fastpunct import FastPunct

>>> fastpunct = FastPunct()

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/soldanm/anaconda3/envs/test/lib/python3.7/site-packages/fastpunct/fastpunct.py", line 46, in __init__
    self.tokenizer = T5Tokenizer.from_pretrained(lang_path)
  File "/home/soldanm/anaconda3/envs/test/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 1749, in from_pretrained
    **kwargs,
  File "/home/soldanm/anaconda3/envs/test/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 1782, in _from_pretrained
    init_kwargs = json.load(tokenizer_config_handle)
  File "/home/soldanm/anaconda3/envs/test/lib/python3.7/json/__init__.py", line 296, in load
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
  File "/home/soldanm/anaconda3/envs/test/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/home/soldanm/anaconda3/envs/test/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/soldanm/anaconda3/envs/test/lib/python3.7/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
>>> fastpunct = FastPunct()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/soldanm/anaconda3/envs/test/lib/python3.7/site-packages/fastpunct/fastpunct.py", line 46, in __init__
    self.tokenizer = T5Tokenizer.from_pretrained(lang_path)
  File "/home/soldanm/anaconda3/envs/test/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 1749, in from_pretrained
    **kwargs,
  File "/home/soldanm/anaconda3/envs/test/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 1782, in _from_pretrained
    init_kwargs = json.load(tokenizer_config_handle)
  File "/home/soldanm/anaconda3/envs/test/lib/python3.7/json/__init__.py", line 296, in load
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
  File "/home/soldanm/anaconda3/envs/test/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/home/soldanm/anaconda3/envs/test/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/soldanm/anaconda3/envs/test/lib/python3.7/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Please provide me with some intuition on how to overcome this issue.
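
The message "Expecting value: line 1 column 1 (char 0)" means the file being parsed does not start with a JSON value, for example because it is empty. One way to check, sketched under the assumption that the tokenizer files live in the directory passed as lang_path in the traceback, is to try parsing every .json file there. The helper name find_invalid_json is hypothetical, and which files the directory contains is not known from the report.

```python
import json
from pathlib import Path
from typing import Dict


def find_invalid_json(directory: str) -> Dict[str, str]:
    """Return {file name: error message} for .json files that fail to parse."""
    problems = {}
    for path in sorted(Path(directory).glob("*.json")):
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError) as exc:
            problems[path.name] = str(exc)
    return problems
```

If a file shows up in the result, deleting it so that it can be fetched again may be worth trying; whether the library re-downloads missing files automatically is an assumption.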

Speeding things up (it is slow)

It takes fastpunct around 5-7 seconds to process one short sentence. I have:

Ubuntu 18.04 on aws
Python 3.6.9
Tensorflow 1.14.0

I am wondering what I'm doing wrong. I've tried this both as a straight cmdline call and using zerorpc, which is what I'd ultimately like it to do in order to load the training first. Right now, it's unusable as I basically need real-time results.

Thank you.
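
To narrow down where the time goes, it may help to measure model loading separately from each call. The sketch below assumes nothing about fastpunct internals: load_model is any zero-argument function that returns a callable, and time_load_and_calls is a hypothetical helper.

```python
import time
from typing import Callable, Iterable, List, Tuple


def time_load_and_calls(
    load_model: Callable[[], Callable[[str], object]],
    texts: Iterable[str],
) -> Tuple[float, List[float]]:
    """Return (seconds spent loading, seconds spent on each call)."""
    start = time.perf_counter()
    predict = load_model()  # one-off cost, paid once per process
    load_seconds = time.perf_counter() - start

    call_seconds = []
    for text in texts:
        start = time.perf_counter()
        predict(text)  # per-request cost
        call_seconds.append(time.perf_counter() - start)
    return load_seconds, call_seconds


# Hypothetical usage; the exact punct() signature depends on the installed version:
# load, calls = time_load_and_calls(lambda: FastPunct().punct, ["hello how are you"])
```

If most of the 5-7 seconds turns out to be loading, a long-running process that loads the model once (as with the zerorpc setup described above) should only pay that cost at startup.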

Model corrects even though correct=False

The fastPunct.punct() function takes a correct boolean argument, which is supposed to trigger text correction. However, the model corrects text even when correct is set to False. Steps to reproduce:

model = FastPunct('english', checkpoint_local_path=str(models.get_unzip('zenai-models/punct/FastPunct_2_0_2_en.zip')))
model.punct('effortless', correct=True) --> 'Easy, easy.'
model.punct('effortless', correct=False) --> 'Easy, easy.'

pypi rebuild automation

The PyPI version could be rebuilt automatically from the latest commit on the develop branch.

The proportion for "!" is extremely high

missing hyphen

If the input contains a hyphen, the hyphen is missing from the output. I want to keep the hyphen in the output as well. How can I do this?

Example
Input
Last week it was the return of the world's longest flight -- Singapore to New York JFK. This week comes another new aviation record: the world's longest flight in a single-aisle aircraft. Air Transat flight TS690 flew transatlantic from Montreal, Canada, to Athens, Greece, on Monday -- a journey of 7,600 kilometers, or 4,754 miles. So far, so normal -- except the eight-hour, 32-minute flight was performed in a narrowbody Airbus A321neoLR.
Output
Last week it was the return of the world's longest flight Singapore to New York JFK this week comes another new aviation record: the world's longest flight in a Singleaisle aircraft air Transat flight TS690 flew transatlantic from Montreal Canada to Athens Greece on Monday a journey of 7,600 kilometers, or 475.4 miles so far. So normal Except the Eighthour 32Minute flight was performed in a Narrowbody Airbus A321neolR.

Please add support for Dutch

Hi there,

Can you please add support for Dutch or tell us how to set this up ourselves/train this?

Add Tensorflow/Keras to Required libraries for Pip

If I install fastPunct without TF/Keras already installed, pip neither reports an error nor installs the dependency; the package just fails at runtime. It looks like the fix should be as easy as adding them to the existing list of required packages.
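
Independently of the packaging fix, a caller can detect the missing dependency up front and fail with a clear message. This is a minimal sketch using only the standard library; missing_modules and require_modules are hypothetical helpers, and the module names passed in are up to the caller.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def require_modules(names: Iterable[str]) -> None:
    """Raise ImportError listing every required module that is not installed."""
    missing = missing_modules(names)
    if missing:
        raise ImportError("Missing required packages: " + ", ".join(missing))


# Hypothetical usage before importing fastpunct:
# require_modules(["tensorflow"])
```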

Prev stored model and weights

Hey!

The parameter_dict.pkl file and the fastpunct_eng_weights.h5 file are not uploaded.

No GPU env cannot run the model

I tried it in a Python 3.7 environment without CUDA.

Succefully Downloaded to: /home/ubuntu/.fastPunct_en/params.pkl
2020-07-26 00:38:16.243790: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2020-07-26 00:38:16.243966: E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed call to cuInit: UNKNOWN ERROR (303)
2020-07-26 00:38:16.244079: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (VM-0-7-ubuntu): /proc/driver/nvidia/version does not exist
2020-07-26 00:38:16.244661: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-07-26 00:38:16.545005: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2394445000 Hz
2020-07-26 00:38:16.545635: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fe014000b20 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-07-26 00:38:16.545671: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
Traceback (most recent call last):
  File "", line 1, in <module>
  File "/home/ubuntu/miniconda/lib/python3.7/site-packages/fastpunct/fastpunct.py", line 170, in __init__
    self.model.load_weights(weights_path)
  File "/home/ubuntu/miniconda/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 250, in load_weights
...
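
The libcuda and cuInit lines in the log are TensorFlow reporting that no GPU driver was found; the traceback itself ends in load_weights, and its final error is cut off in the report. To rule the GPU probing out, TensorFlow can be told to ignore GPUs through environment variables set before it is imported. This is a sketch of that general TensorFlow setup and is not confirmed to resolve the load_weights failure.

```python
import os

# Both variables must be set before `import tensorflow` runs anywhere in the process.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"  # hide all CUDA devices, forcing CPU execution
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "1"   # filter INFO-level messages from the C++ backend

# import tensorflow as tf  # import only after the variables above are set
```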

Training on other Dataset

Hello all,
Thank you for your amazing effort in this area. Could you share how we can contribute to this work by adding support for new languages such as Arabic? Also, could you please share the training code?

Best regards,
Abdullah

Why a Seq2Seq instead of Classification Network?

Hey there!

I started reading about text correction with Deep Learning and most certainly read all your blog posts.

But I still wonder why you would choose a Seq2Seq network for punctuation restoration over a classification network that decides, for each token, whether it should be followed by some sort of punctuation. With a Seq2Seq model, you have to make sure the network does not change anything but the punctuation in the sequence; with a classification network you get this out of the box.

Does a Seq2Seq model perform better or is this easier to train for this purpose? If so, could you elaborate on why?
Or is this maybe part of a greater goal to do punctuation restoration, spelling, and grammar correction with one large Seq2Seq network?

Also, what is your network input? As far as I can tell, you are not using word embeddings. I guess this is why your model checkpoint is so small.
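
The invariant mentioned above, that a Seq2Seq model should change nothing but punctuation, can be checked on individual outputs. The sketch below is one possible check, not something taken from the repository: it strips punctuation, ignores case, and compares the remaining words. The function names are hypothetical.

```python
import re
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase the text, drop punctuation characters, and split into words."""
    return re.sub(r"[^\w\s]", "", text).lower().split()


def only_punctuation_changed(source: str, output: str) -> bool:
    """True if source and output contain the same words in the same order."""
    return normalize(source) == normalize(output)


print(only_punctuation_changed(
    "My name is sid and i want to become a data scientist.",
    "Y name is Sid, and I want to become a data scientist.",
))
```

Because punctuation is simply deleted, a dropped hyphen ("single-aisle" becoming "Singleaisle") is not flagged by this check; a stricter comparison would be needed for that case.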

How long does it take to run the example?

Hi, I ran this example on a GeForce RTX 2080 Ti, but it took more than one minute. Is that slower than expected?

Thanks!

Format of training dataset

What is the format of the training dataset? Knowing it would help with fine-tuning the model on contextual data.

Provide Frozen Model

Hi! I am hoping to use this in an iOS project. Could you convert the model to CoreML or at least provide the frozen model (.pb file)? Thanks

How to increase indices sequence length?

Hi. I am trying to restore punctuation on auto generated transcript and I see this in the console:

Token indices sequence length is longer than the specified maximum sequence length for this model (998 > 512). Running this sequence through the model will result in indexing errors

Is it possible to increase the limit?
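
Whether the limit itself can be raised is not answered here. A common alternative, sketched below as an assumption, is to split the transcript so that each piece stays within the 512-token maximum from the warning. The token counter is supplied by the caller; split_by_token_budget is a hypothetical helper.

```python
from typing import Callable, List


def split_by_token_budget(
    text: str,
    count_tokens: Callable[[str], int],
    max_tokens: int = 512,
) -> List[str]:
    """Greedily pack whitespace-separated words into chunks within max_tokens.

    A single word whose token count exceeds max_tokens still becomes its own chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    for word in text.split():
        candidate = current + [word]
        if current and count_tokens(" ".join(candidate)) > max_tokens:
            chunks.append(" ".join(current))  # close the chunk before it overflows
            current = [word]
        else:
            current = candidate
    if current:
        chunks.append(" ".join(current))
    return chunks


# Hypothetical usage with a Hugging Face tokenizer:
# chunks = split_by_token_budget(transcript, lambda s: len(tokenizer.encode(s)))
```

The counter is re-run on the growing chunk for every word, which is simple but slow for very long inputs; it is meant as an illustration rather than an optimized implementation.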

Can't find precision, recall, or F-score in the repo

I can't find any scores for the model, such as precision, recall, or F-score, or the data it was trained on.
Could you please give some idea of how the model performs under ideal conditions?

Error while running the model

Hi,
I am trying to run the fastpunct.py script as is, but I am facing the following issue:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-5-57df30c6aaa4> in <module>
----> 1 print(fastpunct.punct(["call haris mom", "oh i thought you were here", "where are you going", "in theory everyone knows what a comma is", "hey how are you doing", "my name is sheela i am in love with hrithik"]))

<ipython-input-3-de41481cee39> in punct(self, input_texts, batch_size)
    101 
    102     def punct(self, input_texts, batch_size=32):
--> 103         return decode(self.model, self.parameters, input_texts, self.allowed_extras, batch_size)
    104 
    105     def fastpunct(self, input_texts, batch_size=32):

<ipython-input-3-de41481cee39> in decode(model, parameters, input_texts, allowed_extras, batch_size)
     18         curr_char_index  = [i - extra_char_count[j] for j in range(len(input_texts))]
     19         input_encodings = np.argmax(input_sequences, axis=2)
---> 20         cur_inp_list = [input_encodings[_][curr_char_index[_]] for _ in range(len(input_texts))]
     21         output_tokens = model.predict([input_sequences, target_seq_hot], batch_size=batch_size)
     22         sampled_possible_indices = np.argsort(output_tokens[:, i, :])[:, ::-1].tolist()

<ipython-input-3-de41481cee39> in <listcomp>(.0)
     18         curr_char_index  = [i - extra_char_count[j] for j in range(len(input_texts))]
     19         input_encodings = np.argmax(input_sequences, axis=2)
---> 20         cur_inp_list = [input_encodings[_][curr_char_index[_]] for _ in range(len(input_texts))]
     21         output_tokens = model.predict([input_sequences, target_seq_hot], batch_size=batch_size)
     22         sampled_possible_indices = np.argsort(output_tokens[:, i, :])[:, ::-1].tolist()

IndexError: index 43 is out of bounds for axis 0 with size 43

Any suggestions/workarounds?

Unable to download model on EMR; skip the download when params and checkpoint paths are provided

On EMR, the model files need to be downloaded to /tmp, because the default home directory, /var/lib/livy, is inaccessible to users.

I managed to download the files separately, but even passing the weights path as a parameter didn't work.
