
fastpunct's Introduction

fastPunct: Punctuation restoration and spell correction experiments.


Installation:

pip install --upgrade fastpunct

Supported languages:

english

Usage:

As a python module

from fastpunct import FastPunct

# The default language is 'english'
fastpunct = FastPunct()
fastpunct.punct([
    "john smiths dog is creating a ruccus",
    "ys jagan is the chief minister of andhra pradesh",
    "we visted new york last year in may",
])

# ["John Smith's dog is creating a ruccus.",
#  'Ys Jagan is the chief minister of Andhra Pradesh.',
#  'We visted New York last year in May.']

# Punctuation correction with optional spell correction (experimental)

fastpunct.punct([
    'johns son peter is marring estella in jun',
    'kamal hassan is a gud actr',
], correct=True)

# ["John's son Peter is marrying Estella in June.",
#  'Kamal Hassan is a good actor.']

As a docker container

# Start the docker container
docker run -it -p8080:8080 -eBATCH_SIZE=4 notaitech/fastpunct:english

# Run prediction
curl -d '{"data": ["i was hungry i ordered a pizza my name is batman"]}' -H "Content-Type: application/json" "http://localhost:8080/sync"

# {"prediction": ["I was hungry, I ordered a pizza, my name is Batman."], "success": true}
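The same request can be sent from Python. The following is a minimal sketch using only the standard library. It assumes the container started above is listening on localhost:8080 and that the /sync endpoint accepts the JSON body shown in the curl command. The helper names build_request and predict are illustrative and are not part of fastpunct.

```python
import json
import urllib.request

# Assumed endpoint, copied from the curl example above.
SYNC_URL = "http://localhost:8080/sync"


def build_request(texts, url=SYNC_URL):
    """Build a POST request whose JSON body is {"data": [...]}."""
    body = json.dumps({"data": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )


def predict(texts, url=SYNC_URL, timeout=30):
    """Send the texts to the running container and return the parsed JSON reply."""
    with urllib.request.urlopen(build_request(texts, url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling predict(["i was hungry i ordered a pizza my name is batman"]) only works while the container is running; build_request can be inspected without a server.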

fastpunct's People

Contributors

bedapudi6788, harikodali, nempickaxe


fastpunct's Issues

Can't detect end of sentence

Not sure if this is within scope, but the model can't tell where the input text should be split by a period. In other words, it can't detect that a text input actually contains two sentences.

e.g.

fastpunct.punct([
    "There are three ways to slice a fish on the left on the right and on the middle after you sliced the fish you can go to the house"
], correct=True)

yields

['There are three ways to slice a fish on the left, on the right, and on the middle, after you sliced the fish, you can go to the house.']

Why a Seq2Seq instead of Classification Network?

Hey there!

I started reading about text correction with Deep Learning and most certainly read all your blog posts.

But I still wonder why you would choose a Seq2Seq network for punctuation restoration over a classification network that predicts, for each token, whether it should be followed by some punctuation mark. With a Seq2Seq model, you have to make sure the network changes nothing in the sequence except the punctuation; with a classification network, you get this out of the box.

Does a Seq2Seq model perform better or is this easier to train for this purpose? If so, could you elaborate on why?
Or is this maybe part of a greater goal to do punctuation restoration, spelling, and grammar correction with one large Seq2Seq network?

Also, what is your network input? As far as I can tell, you are not using word embeddings, which I guess is why your model checkpoint is so small.
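To make the classification framing in this question concrete, here is a minimal sketch of the reconstruction step only. It assumes a hypothetical classifier has already produced one label per token, where the label is the punctuation mark that follows the token (or an empty string). The label set and the function name apply_labels are illustrative; this is not how fastPunct works.

```python
# Hypothetical label set: marks that end a sentence and trigger capitalization.
SENTENCE_END = {".", "?", "!"}


def apply_labels(tokens, labels):
    """Rebuild text from tokens and one predicted punctuation label per token."""
    if len(tokens) != len(labels):
        raise ValueError("need exactly one label per token")
    pieces = []
    capitalize_next = True  # the first token starts a sentence
    for token, label in zip(tokens, labels):
        word = token[:1].upper() + token[1:] if capitalize_next else token
        pieces.append(word + label)
        capitalize_next = label in SENTENCE_END
    return " ".join(pieces)


# Hand-written labels standing in for classifier predictions.
tokens = "i was hungry i ordered a pizza".split()
labels = ["", "", ".", "", "", "", "."]
restored = apply_labels(tokens, labels)
```

Because the tokens are copied through unchanged, the words themselves cannot be altered, which is the property the question refers to. The trade-off is that spelling cannot be corrected in the same pass.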

missing hyphen

If the input contains a hyphen, the hyphen is missing from the output. I want to keep the hyphen in the output as well. How can I do this?

Example
Input
Last week it was the return of the world's longest flight -- Singapore to New York JFK. This week comes another new aviation record: the world's longest flight in a single-aisle aircraft. Air Transat flight TS690 flew transatlantic from Montreal, Canada, to Athens, Greece, on Monday -- a journey of 7,600 kilometers, or 4,754 miles. So far, so normal -- except the eight-hour, 32-minute flight was performed in a narrowbody Airbus A321neoLR.
Output
Last week it was the return of the world's longest flight Singapore to New York JFK this week comes another new aviation record: the world's longest flight in a Singleaisle aircraft air Transat flight TS690 flew transatlantic from Montreal Canada to Athens Greece on Monday a journey of 7,600 kilometers, or 475.4 miles so far. So normal Except the Eighthour 32Minute flight was performed in a Narrowbody Airbus A321neolR.

Issue while initializing the module

>>> from fastpunct import FastPunct

>>> fastpunct = FastPunct()

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/soldanm/anaconda3/envs/test/lib/python3.7/site-packages/fastpunct/fastpunct.py", line 46, in __init__
    self.tokenizer = T5Tokenizer.from_pretrained(lang_path)
  File "/home/soldanm/anaconda3/envs/test/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 1749, in from_pretrained
    **kwargs,
  File "/home/soldanm/anaconda3/envs/test/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 1782, in _from_pretrained
    init_kwargs = json.load(tokenizer_config_handle)
  File "/home/soldanm/anaconda3/envs/test/lib/python3.7/json/__init__.py", line 296, in load
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
  File "/home/soldanm/anaconda3/envs/test/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/home/soldanm/anaconda3/envs/test/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/soldanm/anaconda3/envs/test/lib/python3.7/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Please provide me with some intuition on how to overcome this issue.

Provide Frozen Model

Hi! I am hoping to use this in an iOS project. Could you convert the model to CoreML or at least provide the frozen model (.pb file)? Thanks

Add Tensorflow/Keras to Required libraries for Pip

If I install fastPunct without TF/Keras already installed, pip neither throws an error nor installs the dependency; fastPunct just fails at runtime. It looks like the fix should be as easy as adding them to the existing list of required packages.
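Until the dependency is declared in the package metadata, a caller can at least fail early with a clear message. The sketch below is one way to do that with the standard library; the function name missing_packages is illustrative, and the module names passed to it ("tensorflow", "keras") are assumptions about what the installed fastPunct version imports.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def require(names):
    """Raise ImportError listing every missing module, instead of failing later."""
    missing = missing_packages(names)
    if missing:
        raise ImportError("missing required packages: " + ", ".join(missing))


# Assumed import names; adjust to whatever the installed version needs.
# require(["tensorflow", "keras"])
```

This only checks that the modules can be located, not that their versions are compatible.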

Unwanted whitespace around quotation marks

Hi! In some cases I'm getting unwanted whitespace around the quotation marks, e.g.:
fastpunct.punct('i m going for sure explained my friend')
-> " I'm going for sure ", explained my friend.

But
fastpunct.punct('im going for sure explained my friend')
-> "I'm going for sure", explained my friend.
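As a stopgap, the output can be post-processed. The sketch below trims whitespace just inside pairs of straight double quotes with a regular expression. It is a workaround applied to the returned string, not a fix in the model, and it assumes the quotes are balanced and do not span lines; tighten_quotes is an illustrative name.

```python
import re

# A double-quoted span; the group captures the content without the whitespace
# that directly follows the opening quote or precedes the closing quote.
_QUOTED = re.compile(r'"\s*(.*?)\s*"')


def tighten_quotes(text):
    """Remove whitespace directly inside each pair of straight double quotes."""
    return _QUOTED.sub(r'"\1"', text)


fixed = tighten_quotes('" I\'m going for sure ", explained my friend.')
```

Output that is already tight passes through unchanged.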

Speeding things up (it is slow)

It takes fastpunct around 5-7 seconds to process one short sentence. I have:

Ubuntu 18.04 on aws
Python 3.6.9
Tensorflow 1.14.0

I am wondering what I'm doing wrong. I've tried this both as a straight command-line call and through zerorpc, which is what I'd ultimately like to use in order to load the trained model first. Right now it's unusable, as I basically need real-time results.

Thank you.
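One way to narrow this down is to time model construction and prediction separately, since a one-off command-line call pays for both. The helper below is a minimal sketch using time.perf_counter; time_call is an illustrative name, and the commented usage assumes the FastPunct class and punct method shown in the README.

```python
import time


def time_call(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Possible usage, assuming fastpunct is installed:
# from fastpunct import FastPunct
# model, load_s = time_call(FastPunct)
# _, first_s = time_call(model.punct, ["hello how are you"])
# _, second_s = time_call(model.punct, ["hello how are you"])
# print(load_s, first_s, second_s)
```

If the load time dominates, keeping one model instance alive in a long-running process would be the part worth optimizing; if every call is slow, the cost is in inference itself.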

No GPU env cannot run the model

I tried it in a Python 3.7 environment without CUDA.

Succefully Downloaded to: /home/ubuntu/.fastPunct_en/params.pkl
2020-07-26 00:38:16.243790: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2020-07-26 00:38:16.243966: E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed call to cuInit: UNKNOWN ERROR (303)
2020-07-26 00:38:16.244079: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (VM-0-7-ubuntu): /proc/driver/nvidia/version does not exist
2020-07-26 00:38:16.244661: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-07-26 00:38:16.545005: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2394445000 Hz
2020-07-26 00:38:16.545635: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fe014000b20 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-07-26 00:38:16.545671: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
Traceback (most recent call last):
  File "", line 1, in
  File "/home/ubuntu/miniconda/lib/python3.7/site-packages/fastpunct/fastpunct.py", line 170, in __init__
    self.model.load_weights(weights_path)
  File "/home/ubuntu/miniconda/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 250, in load_weights
...

Improper punctuation

input = ['My name is sid and i want to become a data scientist.']
output = ['Y name is Sid, and I want to become a data scientist.']

It removes the 'M' from the start, which is weird.

How to increase indices sequence length?

Hi. I am trying to restore punctuation on an auto-generated transcript, and I see this in the console:

Token indices sequence length is longer than the specified maximum sequence length for this model (998 > 512). Running this sequence through the model will result in indexing errors

Is it possible to increase the limit?
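Whether the limit itself can be raised is not answered here. A common workaround is to split the transcript into pieces that each fit under the limit and punctuate them one by one. The sketch below packs whitespace-separated words greedily into chunks. By default it counts words, which is only a rough proxy for model tokens; the count_tokens parameter is a hypothetical hook for passing a real tokenizer-based counter.

```python
def chunk_words(text, max_tokens, count_tokens=None):
    """Greedily pack whitespace-separated words into chunks within max_tokens."""
    if count_tokens is None:
        # Default proxy: one token per word. Real tokenizers usually produce more.
        count_tokens = lambda s: len(s.split())
    chunks, current = [], []
    for word in text.split():
        candidate = current + [word]
        if current and count_tokens(" ".join(candidate)) > max_tokens:
            chunks.append(" ".join(current))  # close the full chunk
            current = [word]                  # start a new one with this word
        else:
            current = candidate
    if current:
        chunks.append(" ".join(current))
    return chunks


pieces = chunk_words("a b c d e", max_tokens=2)
```

A single word that alone exceeds the budget is emitted as its own chunk. Sentences that straddle a chunk boundary may be punctuated poorly, so a budget well below 512 and some manual review at the seams are advisable.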

Error while running the model

Hi,
I am trying to run the fastpunct.py script as is, but I am facing the following issue:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-5-57df30c6aaa4> in <module>
----> 1 print(fastpunct.punct(["call haris mom", "oh i thought you were here", "where are you going", "in theory everyone knows what a comma is", "hey how are you doing", "my name is sheela i am in love with hrithik"]))

<ipython-input-3-de41481cee39> in punct(self, input_texts, batch_size)
    101 
    102     def punct(self, input_texts, batch_size=32):
--> 103         return decode(self.model, self.parameters, input_texts, self.allowed_extras, batch_size)
    104 
    105     def fastpunct(self, input_texts, batch_size=32):

<ipython-input-3-de41481cee39> in decode(model, parameters, input_texts, allowed_extras, batch_size)
     18         curr_char_index  = [i - extra_char_count[j] for j in range(len(input_texts))]
     19         input_encodings = np.argmax(input_sequences, axis=2)
---> 20         cur_inp_list = [input_encodings[_][curr_char_index[_]] for _ in range(len(input_texts))]
     21         output_tokens = model.predict([input_sequences, target_seq_hot], batch_size=batch_size)
     22         sampled_possible_indices = np.argsort(output_tokens[:, i, :])[:, ::-1].tolist()

<ipython-input-3-de41481cee39> in <listcomp>(.0)
     18         curr_char_index  = [i - extra_char_count[j] for j in range(len(input_texts))]
     19         input_encodings = np.argmax(input_sequences, axis=2)
---> 20         cur_inp_list = [input_encodings[_][curr_char_index[_]] for _ in range(len(input_texts))]
     21         output_tokens = model.predict([input_sequences, target_seq_hot], batch_size=batch_size)
     22         sampled_possible_indices = np.argsort(output_tokens[:, i, :])[:, ::-1].tolist()

IndexError: index 43 is out of bounds for axis 0 with size 43

Any suggestions/workarounds?

German Language

How and when would it be possible to use fastPunct for the German language?

Training on other Dataset

Hello all,
Thank you for your amazing effort in this area. Could you share how we can contribute to this work by adding support for new languages like Arabic? Also, could you share the training code, please?

Best regards,
Abdullah

KeyError if input text size greater than around 400 chars

fastPunct punctuation fails with the error quoted below if the input text is longer than roughly 400 characters.
To reproduce, call the punct method on any input string of more than 400 characters.

input_text_len 407
  File "/opt/conda/lib/python3.7/site-packages/fastpunct/fastpunct.py", line 175, in punct
    return decode(self.model, self.parameters, input_texts, self.allowed_extras, batch_size)
  File "/opt/conda/lib/python3.7/site-packages/fastpunct/fastpunct.py", line 119, in decode
    outputs = [out_dict[text] for text in input_texts_c]
  File "/opt/conda/lib/python3.7/site-packages/fastpunct/fastpunct.py", line 119, in
    outputs = [out_dict[text] for text in input_texts_c]
KeyError: "lorem ipsum is simply dummy text of the printing and typesetting industry. lorem ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. it has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. it was popularised in the dkdd may be dd"

Model Input

Hi!

Thank you so much for the model. I am trying to use it for a project on iOS. I have successfully converted it using CoreMLTools but now I am trying to use it. Would you please be so kind as to provide documentation on the expected input of the model? After conversion I see:

[Screenshot: Screen Shot 2020-11-11 at 10 47 34 PM; image not preserved]

The only thing I did was add:

import coremltools
import tensorflow as tf

and

mlmodel = coremltools.convert(self.model)
mlmodel.save('punctuation.mlmodel')

(to the init)

It would be nice if in future iterations, the input/output could be a little more straightforward.

Thanks!

Model corrects even though correct=False

The fastPunct.punct() function takes a correct boolean argument, which is supposed to trigger text correction. However, the model corrects text even when correct is set to False. Steps to reproduce:

model = FastPunct('english', checkpoint_local_path=str(models.get_unzip('zenai-models/punct/FastPunct_2_0_2_en.zip')))
model.punct('effortless', correct=True) --> 'Easy, easy.'
model.punct('effortless', correct=False) --> 'Easy, easy.'
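A check like the following can turn this report into an automated regression test. It compares input and output after lowercasing and stripping punctuation, so any difference that remains means the words themselves were changed. This is a minimal sketch; the function names are illustrative, and dropping apostrophes before comparing is an assumption made so that restored possessives such as "Smith's" still count as punctuation-only changes.

```python
import re


def words_only(text):
    """Lowercase, drop apostrophes, and return the remaining word characters."""
    return re.findall(r"\w+", text.lower().replace("'", ""))


def only_punctuation_changed(source, output):
    """True if source and output contain the same words in the same order."""
    return words_only(source) == words_only(output)


ok = only_punctuation_changed(
    "john smiths dog is creating a ruccus",
    "John Smith's dog is creating a ruccus.",
)
bad = only_punctuation_changed("effortless", "Easy, easy.")
```

Note that a dropped hyphen, as in "single-aisle" becoming "Singleaisle", is also reported as a word change by this check.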
