
wtpsplit's Introduction

wtpsplit🪓

Segment any Text - Robustly, Efficiently, Adaptably⚡

This repository allows you to segment text into sentences or other semantic units. It implements the models from our two papers: WtP ("Where's the Point") and SaT ("Segment any Text").

The namesake WtP is maintained for consistency. Our new follow-up SaT provides robust, efficient and adaptable sentence segmentation across 85 languages at higher performance and lower compute cost. Check out the state-of-the-art results across 8 distinct corpora and 85 languages demonstrated in our Segment any Text paper.

System Figure

Installation

pip install wtpsplit

Usage

from wtpsplit import SaT

sat = SaT("sat-3l")
# optionally run on GPU for better performance
# also supports TPUs via e.g. sat.to("xla:0"), in that case pass `pad_last_batch=True` to sat.split
sat.half().to("cuda")

sat.split("This is a test This is another test.")
# returns ["This is a test ", "This is another test."]

# do this instead of calling sat.split on every text individually for much better performance
sat.split(["This is a test This is another test.", "And some more texts..."])
# returns an iterator yielding lists of sentences for every text

# use our '-sm' models for general sentence segmentation tasks
sat_sm = SaT("sat-3l-sm")
sat_sm.half().to("cuda") # optional, see above
sat_sm.split("this is a test this is another test")
# returns ["this is a test ", "this is another test"]

# use trained lora modules for strong adaptation to language & domain/style
sat_adapted = SaT("sat-3l", style_or_domain="ud", language="en")
sat_adapted.half().to("cuda") # optional, see above
sat_adapted.split("This is a test This is another test.")
# returns ['This is a test ', 'This is another test']

Available Models

If you need a general sentence segmentation model, use the -sm models (e.g., sat-3l-sm). For speed-sensitive applications, we recommend the 3-layer models (sat-3l and sat-3l-sm); they provide a great tradeoff between speed and performance. The best models are our 12-layer models: sat-12l and sat-12l-sm.

Model English Score Multilingual Score
sat-1l 88.5 84.3
sat-1l-sm 88.2 87.9
sat-3l 93.7 89.2
sat-3l-lora 96.7 94.8
sat-3l-sm 96.5 93.5
sat-6l 94.1 89.7
sat-6l-sm 96.9 95.1
sat-9l 94.3 90.3
sat-12l 94.0 90.4
sat-12l-lora 97.3 95.9
sat-12l-sm 97.4 96.0

The scores are the macro-average F1 score across all available datasets for "English", and the macro-average F1 score across all datasets and languages for "Multilingual". The -lora rows denote adaptation via LoRA; check out the paper for details.

For comparison, here are the English scores of some other tools:

Model English Score
PySBD 69.6
SpaCy (sentencizer; monolingual) 92.9
SpaCy (sentencizer; multilingual) 91.5
Ersatz 91.4
Punkt (nltk.sent_tokenize) 92.2
WtP (3l) 93.9

Note that this library also supports previous WtP models. You can use them in essentially the same way as SaT models:

from wtpsplit import WtP

wtp = WtP("wtp-bert-mini")
# similar functionality as for SaT models
wtp.split("This is a test This is another test.")

For more details on WtP and for reproduction instructions, see the WtP doc.

Paragraph Segmentation

Since SaT models are trained to predict newline probability, they can segment text into paragraphs in addition to sentences.

# returns a list of paragraphs, each containing a list of sentences
# adjust the paragraph threshold via the `paragraph_threshold` argument.
sat.split(text, do_paragraph_segmentation=True)
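For example, the nested result can be consumed like this (a minimal sketch based on the return structure described in the comments above):

paragraphs = sat.split(text, do_paragraph_segmentation=True)
for i, paragraph in enumerate(paragraphs):
    print(f"Paragraph {i}:")
    for sentence in paragraph:
        print("  " + sentence)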

Adaptation

SaT can be domain- and style-adapted via LoRA. We provide trained LoRA modules for Universal Dependencies, OPUS100, Ersatz, and TED (i.e., ASR-style transcribed speeches) sentence styles in 81 languages for sat-3l and sat-12l. Additionally, we provide LoRA modules for legal documents (laws and judgements) in 6 languages, code-switching in 4 language pairs, and tweets in 3 languages. For details, we refer to our paper.

We also provide verse segmentation modules for 16 genres for sat-12l-no-limited-lookahead.

Load LoRA modules like this:

# requires both lang_code and style_or_domain
# for available ones, check the <model_repository>/loras folder
sat_lora = SaT("sat-3l", style_or_domain="ud", language="en")
sat_lora.split("Hello this is a test But this is different now Now the next one starts looool")
# now for a highly distinct domain
sat_lora_distinct = SaT("sat-12l", style_or_domain="code-switching", language="es-en")
sat_lora_distinct.split("in the morning over there cada vez que yo decía algo él me decía algo")

You can also freely adjust the segmentation threshold; a higher threshold leads to more conservative segmentation:

sat.split("This is a test This is another test.", threshold=0.4)
# works similarly for lora; but thresholds are higher
sat_lora.split("Hello this is a test But this is different now Now the next one starts looool", threshold=0.7)

Advanced Usage

Get the newline or sentence boundary probabilities for a text:

# returns newline probabilities (supports batching!)
sat.predict_proba(text)
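The probabilities can also be post-processed into custom boundaries (a sketch assuming predict_proba returns one newline probability per character, in line with the newline-probability framing above):

text = "This is a test This is another test."
probs = sat.predict_proba(text)
# pick character positions whose boundary probability exceeds a custom cutoff
boundaries = [i for i, p in enumerate(probs) if p > 0.25]
print(boundaries)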

Load a SaT model in HuggingFace transformers:

# import library to register the custom models 
import wtpsplit
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained("segment-any-text/sat-3l-sm") # or some other model name; see https://huggingface.co/segment-any-text
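A rough inference sketch on top of this (assumptions: the model repository also ships a compatible tokenizer, and the classification head exposes a single newline-probability channel; check the model card for the exact layout):

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("segment-any-text/sat-3l-sm")
inputs = tokenizer("This is a test This is another test.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
# sigmoid turns the per-token logits into boundary probabilities (assumed layout)
probs = torch.sigmoid(logits)[0, :, 0]
print(probs)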

Adapt to your own corpus via LoRA

Our models can be efficiently adapted via LoRA in a powerful way. Only 10-100 segmented training sentences should already improve performance considerably. To do so:

Clone the repository and install requirements:

git clone https://github.com/segment-any-text/wtpsplit
cd wtpsplit
pip install -r requirements.txt
pip install adapters==0.2.1 --no-dependencies
cd ..

Create data in this format:

import torch

torch.save(
    {
        "language_code": {  # e.g., "en"
            "sentence": {
                "dummy-dataset": {  # your dataset name
                    "meta": {
                        # sentences used to train the LoRA module
                        "train_data": ["train sentence 1", "train sentence 2"],
                    },
                    # held-out sentences used for evaluation
                    "data": [
                        "test sentence 1",
                        "test sentence 2",
                    ]
                }
            }
        }
    },
    "dummy-dataset.pth"
)

Create or adapt a config; provide the base model via model_name_or_path and the training data .pth via text_path:

configs/lora/lora_dummy_config.json
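A minimal config sketch, written from Python for convenience (only model_name_or_path and text_path are documented above; the real configs under configs/lora/ contain additional training hyperparameters, so use one of them as a template):

import json

config = {
    "model_name_or_path": "segment-any-text/sat-3l",  # base model to adapt
    "text_path": "dummy-dataset.pth",  # training data from the previous step
}
with open("configs/lora/lora_dummy_config.json", "w") as f:
    json.dump(config, f, indent=4)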

Train LoRA:

python3 wtpsplit/train/train_lora.py configs/lora/lora_dummy_config.json

Once training is done, provide your saved module's path to SaT:

sat_lora_adapted = SaT("model-used", lora_path="dummy_lora_path")
sat_lora_adapted.split("Some domains-specific or styled text")

Adjust the dataset name, language and model in the above to your needs.

Reproducing the paper

configs/ contains the configs for the runs from the paper for base and sm models as well as LoRA modules. Launch training for each of them like this:

python3 wtpsplit/train/train.py configs/<config_name>.json
python3 wtpsplit/train/train_sm.py configs/<config_name>.json
python3 wtpsplit/train/train_lora.py configs/<config_name>.json

In addition:

  • wtpsplit/data_acquisition contains the code for obtaining evaluation data and raw text from the mC4 corpus.
  • wtpsplit/evaluation contains the code for:
    • evaluation (i.e., sentence segmentation results) via intrinsic.py.
    • short-sequence evaluation (i.e., sentence segmentation results for pairs/k-mers of sentences) via intrinsic_pairwise.py.
    • LLM baseline evaluation via llm_sentence.py and legal baseline evaluation via legal_baselines.py.
    • baseline (PySBD, nltk, etc.) evaluation via intrinsic_baselines.py and intrinsic_baselines_multi.py.
    • raw results in JSON format, which are also in evaluation_results/.
    • statistical significance testing code and results in stat_tests/.
    • punctuation annotation experiments in punct_annotation.py and punct_annotation_wtp.py (WtP only).
    • extrinsic evaluation on Machine Translation in extrinsic.py (WtP only).

Make sure to install the packages from requirements.txt beforehand.

Supported Languages

Table with supported languages
iso Name
af Afrikaans
am Amharic
ar Arabic
az Azerbaijani
be Belarusian
bg Bulgarian
bn Bengali
ca Catalan
ceb Cebuano
cs Czech
cy Welsh
da Danish
de German
el Greek
en English
eo Esperanto
es Spanish
et Estonian
eu Basque
fa Persian
fi Finnish
fr French
fy Western Frisian
ga Irish
gd Scottish Gaelic
gl Galician
gu Gujarati
ha Hausa
he Hebrew
hi Hindi
hu Hungarian
hy Armenian
id Indonesian
ig Igbo
is Icelandic
it Italian
ja Japanese
jv Javanese
ka Georgian
kk Kazakh
km Central Khmer
kn Kannada
ko Korean
ku Kurdish
ky Kirghiz
la Latin
lt Lithuanian
lv Latvian
mg Malagasy
mk Macedonian
ml Malayalam
mn Mongolian
mr Marathi
ms Malay
mt Maltese
my Burmese
ne Nepali
nl Dutch
no Norwegian
pa Panjabi
pl Polish
ps Pushto
pt Portuguese
ro Romanian
ru Russian
si Sinhala
sk Slovak
sl Slovenian
sq Albanian
sr Serbian
sv Swedish
ta Tamil
te Telugu
tg Tajik
th Thai
tr Turkish
uk Ukrainian
ur Urdu
uz Uzbek
vi Vietnamese
xh Xhosa
yi Yiddish
yo Yoruba
zh Chinese
zu Zulu

For details, please see our Segment any Text paper.

Citations

For the SaT models, please kindly cite our paper:

@article{frohmann2024segment,
    title={Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation},
    author={Frohmann, Markus and Sterner, Igor and Vuli{\'c}, Ivan and Minixhofer, Benjamin and Schedl, Markus},
    journal={arXiv preprint arXiv:2406.16678},
    year={2024},
    doi={10.48550/arXiv.2406.16678},
    url={https://doi.org/10.48550/arXiv.2406.16678},
}

For the library and the WtP models, please cite:

@inproceedings{minixhofer-etal-2023-wheres,
    title = "Where{'}s the Point? Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation",
    author = "Minixhofer, Benjamin  and
      Pfeiffer, Jonas  and
      Vuli{\'c}, Ivan",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.398",
    pages = "7215--7235"
}

Acknowledgments

This research was funded in whole or in part by the Austrian Science Fund (FWF): P36413, P33526, and DFH-23, and by the State of Upper Austria and the Federal Ministry of Education, Science, and Research, through grant LIT-2021-YOU-215. In addition, Ivan Vulić and Benjamin Minixhofer have been supported through the Royal Society University Research Fellowship 'Inclusive and Sustainable Language Technology for a Truly Multilingual World' (no. 221137) awarded to Ivan Vulić. This research has also been supported with Cloud TPUs from Google's TPU Research Cloud (TRC). This work was also supported by compute credits from a Cohere For AI Research Grant; these grants are designed to support academic partners conducting research with the goal of releasing scientific artifacts and data for good projects. We also thank Simone Teufel for fruitful discussions.


For any questions, please create an issue or send an email to [email protected], and I will get back to you as soon as possible.

wtpsplit's People

Contributors

bminixhofer, dependabot[bot], igorsterner, kornelski, lvaughn, markus583, the0nix, wallies


wtpsplit's Issues

Porting to Android

Hi, I am trying to run the ONNX model on Android and have started with the steps described here: https://github.com/onnx/tutorials/blob/master/tutorials/PytorchCaffe2MobileSqueezeNet.ipynb

import onnx
import caffe2.python.onnx.backend
from onnx import helper

# Load the ONNX GraphProto object. Graph is a standard Python protobuf object
model = onnx.load("model.onnx")

Unfortunately I receive an error:

---------------------------------------------------------------------------
DecodeError                               Traceback (most recent call last)
<ipython-input-8-0e15f43f99e0> in <module>()
      1 # Load the ONNX GraphProto object. Graph is a standard Python protobuf object
----> 2 model = onnx.load("model.onnx")
      3 

2 frames
/usr/local/lib/python3.6/dist-packages/onnx/__init__.py in _deserialize(s, proto)
     95                          '\ntype is {}'.format(type(proto)))
     96 
---> 97     decoded = cast(Optional[int], proto.ParseFromString(s))
     98     if decoded is not None and decoded != len(s):
     99         raise google.protobuf.message.DecodeError(

DecodeError: Error parsing message

Could you please advise what the issue could be? I am using the EN model and Google Colab.

Scoring metric, does definition make sense?

I looked more into the scoring metric and noticed something. You score based on the indices of predicted sentences. However, if you for example split two sentences and predict two arbitrary (true) indices, let's say [23, 83], the scoring is only based on the index 23. Why is that? Because we score the splits: two sentences equal one split, so while 23 marks the split, 83 only marks the end of the text. This makes sense in a way... or maybe not, I am not sure. Because if you think about it, even if the algorithm does not recognize the last symbol as the end of a sentence, it will still output the index 83, since it is given by [len(s) for s in predicted_sentences].

Let's assume you now have three sentences with the true indices [23, 83, 140, 158], and let's say for some reason wtpsplit can't recognize the middle sentence. It would return [23, 140, 158] and a smaller F1 score. However, if I input the sentences separately as [23, 83] and [140, 158], the F1 score would be 1, because 83 and 158 are never considered for scoring. This makes the score dependent on the number of sentences. For example, if I score a dataset by aggregating two lines (which represent a sentence) in a loop, the results would be much better than if I did it with 5 lines or even 10. There is also a risk of losing data, unless you carry the last sentence of each iteration into the next. Sorry for the text blob, but maybe you guys know a best practice for such a problem :)
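A toy version of this scoring illustrates the point (hypothetical code, not wtpsplit's actual evaluation; it assumes the final index is excluded because every prediction produces it by construction):

def boundary_f1(pred, gold):
    # drop the last index: it equals len(text) for any segmentation
    pred, gold = set(pred[:-1]), set(gold[:-1])
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(boundary_f1([23, 140, 158], [23, 83, 140, 158]))  # 0.8: the missed middle boundary hurts
print(boundary_f1([23, 83], [23, 83]))                  # 1.0: scored separately, 83 is never checked
print(boundary_f1([140, 158], [140, 158]))              # 1.0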

InconsistentVersionWarning issue every time I start wtp

Hi, every time I instantiate a WtP object I get the following warning:

InconsistentVersionWarning: Trying to unpickle estimator LogisticRegression from version 1.2.2 when using version 1.3.0. This might lead to breaking code or invalid results. Use at your own risk. For more info pl
ease refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
warnings.warn(

and ends. Any ideas on how to tackle this?

Can't run it on macOS

I want to give the module a try, but I am getting the following error by running the basic example on the readme. Module was installed with pip.

splitter = NNSplit.load("en")
AttributeError: type object 'NNSplit' has no attribute 'load'

Incorrect splits

Please report here issues similar to #18, i.e., text where it is easy for humans to see the correct split but NNSplit gets it wrong.

I'm not entirely satisfied with the quality of the models yet, such cases might help improve it.

Can nnsplit use an http proxy?

For some reason I can't directly fetch the resources required by nnsplit. For example,

splitter = NNSplit.load("fr")
nnsplit.ResourceError: network error fetching "model.onnx" for "fr"

I'm pretty sure this is the local network issue because when I switch to other networks, it works.

So I'm wondering if there's any method to use an http proxy instead of directly sending a network request? I've tried to set the environment variables like http_proxy and https_proxy on windows and they didn't work.

ImportError in Python (NNSplit)

Hi, I was trying the simple example in Python from the documentation and I'm getting an ImportError:

from nnsplit import NNSplit
splitter = NNSplit.load("en")

# returns `Split` objects
splits = splitter.split(["This is a test This is another test."])[0]

# a `Split` can be iterated over to yield smaller splits or stringified with `str(...)`.
for sentence in splits:
   print(sentence)

When executing this example I'm getting the following error:

Traceback (most recent call last):
  File "nnsplit.py", line 1, in <module>
    from nnsplit import NNSplit
  File "G:\OneDrive\projects\s\nnsplit.py", line 1, in <module>
    from nnsplit import NNSplit
ImportError: cannot import name 'NNSplit' from partially initialized module 'nnsplit' (most likely due to a circular import) (G:\OneDrive\projects\s\nnsplit.py)

I have installed the packages in a new conda environment, executing pip list installed I have:

pip list installed
Package         Version
--------------- -------------------
certifi         2020.12.5
nnsplit         0.5.7.post0
numpy           1.20.3
onnxruntime     1.7.0
onnxruntime-gpu 1.7.0
pip             21.1.1
protobuf        3.17.1
setuptools      52.0.0.post20210125
six             1.16.0
tqdm            4.61.0
wheel           0.36.2
wincertstore    0.2

rust installation failure on 0.3.1

Followed your Rust instructions and am hitting this failure:

   Compiling nnsplit v0.3.1
error: couldn't read /home/alex/.cargo/registry/src/github.com-1ecc6299db9ec823/nnsplit-0.3.1/src/../../models.csv: No such file or directory (os error 2)
  --> /home/alex/.cargo/registry/src/github.com-1ecc6299db9ec823/nnsplit-0.3.1/src/model_loader.rs:10:23
   |
10 |         let raw_csv = include_str!("../../models.csv");
   |                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   |
   = note: this error originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)

error: aborting due to previous error

error: could not compile `nnsplit`.

When I look inside .cargo/registry/src/github.com-1ecc6299db9ec823/nnsplit-0.3.1/ I see there is no models.csv. If I manually add it by copying down raw, it still fails on the path.

Simplified Chinese model does not detect sentence boundaries correctly

Hi,

I have tried the Simplified Chinese model on the demo page and it seems that sentence boundary and token detection are not correct.

I have 2 ideas why that could happen:

  1. Period in Chinese is 。
  2. There are no white spaces between words. Possibly it is better to use something like https://github.com/voidism/pywordseg to split on words as a preprocessing step

It looks like issue 2 causes tokens to not be detected correctly either. I have compared with https://github.com/voidism/pywordseg results and they do not match. But I am not sure here, because I have compared Spacy, pywordseg and the Stanford Word Segmenter and all of them provide different results.

Publish 0.3.x python wheels for Linux/non-macOS platforms

After running into #13, I tried to use the Python bindings instead. It worked, but I noticed that it installed version 0.2.2 (I saw it didn't match up with the documentation in the README).

After digging into it a little bit, I saw that 0.2.2 was the last release with a platform-agnostic wheel available. All 0.3.x wheels seem to be built specifically for macOS, and are not installable on my Linux/Ubuntu machine.

I'm wondering if there are some easy adjustments that could be made to make publishing wheels for all platforms possible again (or at least Linux/Ubuntu 😇)?

[Bug] Pandas not listed in `install_requires`, resulting in import error

When I tried to import wtpsplit to try it out, the program failed with an import error.

It appears that wtpsplit uses pandas (and imports it at the top level), but does not list it in setup.py, so it doesn't automatically get installed when wtpsplit is installed.

I can make a PR if you'd like (though I know it is a very small thing, lol).

Model(s) use word capitalisation to segment

Hi,

The models tested in English and a few other languages seem to rely on capitalisation to detect sentence boundaries. On our dataset, if the capitalisation at the start of target sentences is retained, the F1 score is as high as 0.90 for certain model+style+threshold combinations. If the sentence boundary starts are lowercased, then the best F1 score drops to 0.3.

Example:
with 'wtp-bert-mini' the sentence 'We are running a test We should should get two sentences' will split but 'We are running a test we should should get two sentence' won't split.

I am not sure if this is expected behaviour or an issue.

Thanks

get_threshold does not work

Hi!
I'm trying to test functionality from README.md, this step:

from wtpsplit import WtP

wtp = WtP("wtp-canine-s-12l")

wtp.get_threshold("en", "ud")
AttributeError                            Traceback (most recent call last)

<ipython-input-41-b7dd80e9f417> in <cell line: 1>()
----> 1 wtp.get_threshold("en", "ud")

1 frames

/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in __getattr__(self, name)
   1612             if name in modules:
   1613                 return modules[name]
-> 1614         raise AttributeError("'{}' object has no attribute '{}'".format(
   1615             type(self).__name__, name))
   1616 

AttributeError: 'LACanineForTokenClassification' object has no attribute 'get_threshold'

Colab:
torch 2.0.1+cu118
huggingface-hub-0.15.1
safetensors-0.3.1
skops-0.7.post0
tokenizers-0.13.3
transformers-4.30.2
wtpsplit-1.0.1

Missing file in NPM package?

I'm trying to import nnsplit in a JavaScript project, and webpack is failing with:

./node_modules/nnsplit/nnsplit.bundle/nnsplit_javascript_bg.wasm
Module not found: Can't resolve './nnsplit_javascript_bg.js' in '/tmp/experiment/node_modules/nnsplit/nnsplit.bundle'

Looking in node_modules/nnsplit/nnsplit.bundle, indeed the file nnsplit_javascript_bg.js is referenced by package.json, but missing from the filesystem.

(Not sure though whether that's the real culprit, as the nodejs example seems to work as intended.)

ValueError when using wtp-canine-s-12l-no-adapters on Danish

When using wtp-canine-s-12l-no-adapters for Danish with style "ud", I encounter a ValueError on one specific text.

Specs:

Python version: 3.9.15

Steps to reproduce:

In a clean environment, I only install wtpsplit (and missing requirement pandas).

text = 'Vinderne af Club Syds quiz er fundet\n06 februar 2012 kl. 16.58\nVinderne af Club Syds quiz er fundet. Stort tillykke til de tre vindere af en iPad. Quizzen fortsætter i denne uge, hvor præmierne er tre flotte fladskærms-TV.\nSidste uges rigtige svar var:\nFredericia Stadion (Monjasa Park)\nPræmierne er en iPad til hver af de heldige vindere, og de er nu på vej til:\nJørgen Ladegaard\ni Asperup\nIngelise Smith Hansen\ni Haderslev\nog \nGudrun Zederkof\nLunderskov\n'
model = WtP("wtp-canine-s-12l-no-adapters")
sents = model.split(text, lang_code="da", style="ud")

Output:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.env/lib/python3.9/site-packages/wtpsplit/__init__.py", line 285, in split
    return next(
  File "/.env/lib/python3.9/site-packages/wtpsplit/__init__.py", line 365, in _split
    for text, probs in zip(
  File "/.env/lib/python3.9/site-packages/wtpsplit/__init__.py", line 232, in _predict_proba
    outer_batch_logits = extract(
  File "/.env/lib/python3.9/site-packages/wtpsplit/extract.py", line 175, in extract
    out = model(
  File "/.env/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/.env/lib/python3.9/site-packages/transformers/models/canine/modeling_canine.py", line 1521, in forward
    outputs = self.canine(
  File "/.env/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/.env/lib/python3.9/site-packages/transformers/models/canine/modeling_canine.py", line 1145, in forward
    molecule_attention_mask = self._downsample_attention_mask(
  File "/.env/lib/python3.9/site-packages/transformers/models/canine/modeling_canine.py", line 1061, in _downsample_attention_mask
    batch_size, char_seq_len = char_attention_mask.shape
ValueError: too many values to unpack (expected 2)

node.js: unload wasm module

Hello, thanks for this project!
I'm trying to unload the wasm module in my node code:

const nnsplit = require("nnsplit");

function run() {
    nnsplit.NNSplit.new("/root/nnsplit_models/en/model.onnx")
    .then(splitter => {
        return splitter.split(["This is a test This is another test."])
    })
    .then(results => {
        let splits = results[0];
        console.log(splits.parts.map((x) => x.text)); // to log sentences, or x.parts to get the smaller subsplits
    })
    .catch(error => {
        console.error(error);
    })
}
run();

when running this script, the console stays open, since some resource must still be released. I assume it is the tractjs model that should be released in some way (likely using destroy())

Thanks

The split does not look right for this particular case.

Hi,

My sentence is as shown below:
What's working and what needs to change? Not everybody Dr.Jones, has the opportunity to watch themselves after they've had a date to see what they're doing right or wrong, so that you will only know what to do in the next day. Yeah, but it's such an important exercise that they needed to do. Last week they went on their first date, which is a huge step for our single wives, and a great time for us to watch your dates..

When I split it using nnsplit the split sentences are shown below:

  • What's working and what needs to change?
  • Not everybody Dr.
  • Jones, has the opportunity to watch themselves after they've had a date to see what they're doing right or wrong, so that you will only know what to do in the next day.
  • Yeah, but it's such an important exercise that they needed to do.
  • Last week they went on their first date, which is a huge step for our single wives, and a great time for us to watch your dates..

I don't think this is right. Will you please let me know if these splits can be improved.

Python bindings frequently segfault

Python bindings frequently cause a segfault. This has apparently been resolved by PyO3/pyo3@7b1e8a6 which is not yet released.

Current solution: Depend on a pyo3@master:

pyo3 = {git = "https://github.com/PyO3/pyo3", rev = "e6f8fa7"}

and a rust-numpy fork which uses the same rev:

numpy = {git = "https://github.com/bminixhofer/rust-numpy"}

in bindings/python/Cargo.toml.

This should be changed as soon as PyO3 0.10.x is released.

Error when load model

When I run this command:
splitter = NNSplit.load('en')
This error occurred:

thread '<unnamed>' panicked at 'Once instance has previously been poisoned', library/std/src/sync/once.rs:394:21
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
pyo3_runtime.PanicException: Once instance has previously been poisoned

Inconsistent results with same sentences

I found that the usage of your splitter model gives very inconsistent results. Take for example the amharic language (lang_code="am").

If I take for example these two sentences from the flores 200 test dataset:

  1. ሪንግ ከተፎካካሪ የደህንነት ኩባንያም ADT ኮርፖሬሽን፣ ጋር ክስ መስርቷል።
  2. አንደ የሙከራ ክትባት የኢቦላን ገዳይነት ቢቀንስም፣ እስካሁን፣ ነባር በሽታዎችን እንዲያክም አመቺ ሆኖ የቀረበ ምንም መድሃኒት የለም።

If I concatenate these two strings and feed them into wtp.split() it will produce 10 sentences:

  1. ሪንግ
  2. ከተፎካካሪ
  3. የደህንነት ኩባንያም ADT
  4. ኮርፖሬሽን፣ ጋር ክስ መስርቷል።
  5. አንደ የሙከራ ክትባት
  6. የኢቦላን ገዳይነት ቢቀንስም፣
  7. እስካሁን፣
  8. ነባር በሽታዎችን እንዲያክም
  9. አመቺ ሆኖ የቀረበ
  10. ምንም መድሃኒት የለም።

However if I give the algorithm more (7) sentences and concatenate them into a string it splits them all perfectly:
(Notice that the two sentences in the example above are included in the text below, as sentences 3 and 4.)

1.ፓናሉ ለንግድ መጀመር ገንዘብ በተከለከለበት በ2013 በሻርክ ታንክ ምዕራፍ ላይ ከቀረበ ወዲህ ሽያጭ እንደጨመረ ሲሚኖፍ ተነግሯል።
2. በ2017 መጨረሻ ላይ፣ ሲሚኖፍ በሽያጭ የቴሌቪዥን ጣቢያ ላይ ቀርቦ ነበር።
3. ሪንግ ከተፎካካሪ የደህንነት ኩባንያም ADT ኮርፖሬሽን፣ ጋር ክስ መስርቷል።
4. አንደ የሙከራ ክትባት የኢቦላን ገዳይነት ቢቀንስም፣ እስካሁን፣ ነባር በሽታዎችን እንዲያክም አመቺ ሆኖ የቀረበ ምንም መድሃኒት የለም።
5. አንድ የጸረ እንግዳ አካል፣ ZMapp፣ በዚህ መስክ ላይ ተስፋን አሳይቶ ነበር፣ ግን መደበኛ ጥናቶች ሞትን ለመከላከል ከተፈለገው ጥቅም ያነሰ እንዳለው ያሳያል።
6. በPALM ሙከራ፣ ZMapp እንደ መቆጣጠሪያ ያገለግል ነበር፣ ማለት ተመራማሪዎች እንደ መነሻ ይጠቀሙበት እና ከሌሎች ሶስት ህክምናዎች ጋር ያነጻጽሩታል።
7. የአሜሪካ ጂምናስቲ የዩናይትድ ስቴትስ ኦሎፒክ ኮሚቴ ደብዳቤ ይደግፋል እናም በሙሉ አስፈላጊነት የኦሎምፒክ ቤተሰብ ደህንነቱ የተጠበቀ አካባቢ ለሁሉም አትሌቶቻችን ማስተዋወቅ እንዳለበት ይቀበላል።

Can you explain this behaviour? It makes your algorithm very unpredictable, to be honest, and I fear this problem is also present in other languages, if I did not make any mistake. I called the splitter with the appropriate language at all times. Let me know what you think of this.

Build Python 3.9 Wheels

When trying to install into Python 3.9 it will not install a version later than 0.2.2. I am not certain, but I believe this is because wheels are only built for versions 3.6, 3.7 and 3.8. Would it be possible to add wheels for the 3.9 version?

PanicException with 0.4.*

After installing the new version, I get the following exception when running NNSplit.load("en").

PanicException: called `Result::unwrap()` on an `Err` value: PyErr { type: Py(0x5632cc1b1140, PhantomData) }

This occurs both with and without onnxruntime-gpu installed.

How to use Universal Dependencies style?

I load the model from a local path:

from wtpsplit import WtP
wtp = WtP("/data/share/HuggingFace/custom/benjamin/wtp-bert-mini/")

wtp.split("This is a test This is another test.", lang_code="en", style="ud")

it returns:
ValueError: This model does not have any associated mixtures. Maybe they are missing from the model directory?

Where can I download the associated mixtures?

Recursion in init

Hi - hopefully this is just something I am constructing incorrectly, but I am getting recursion in __init__ which results in an error with wtpsplit==1.2.0.

My code is running inside joblib but is just doing:

self.sentence_splitter = WtP("wtp-canine-s-12l")
for sentence in self.sentence_splitter.split(text, lang_code=self.language):
    yield sentence

And I get:

process_worker
    call_item = call_queue.get(block=True, timeout=timeout)
  File "/usr/lib/python3.8/multiprocessing/queues.py", line 116, in get
    return _ForkingPickler.loads(res)
  File "/usr/local/lib/python3.8/dist-packages/wtpsplit/__init__.py", line 115, in __getattr__
    return getattr(self.model, name)
  File "/usr/local/lib/python3.8/dist-packages/wtpsplit/__init__.py", line 115, in __getattr__
    return getattr(self.model, name)
  File "/usr/local/lib/python3.8/dist-packages/wtpsplit/__init__.py", line 115, in __getattr__
    return getattr(self.model, name)
  [Previous line repeated 988 more times]
RecursionError: maximum recursion depth exceeded

Let me know if any thoughts

Thanks

Jon

Opus100 FR not in mixtures

Hi,

Table 8 in the paper indicates that the training data includes OPUS100 FR. However, it seems to not be present in the mixtures I checked.

wtp = WtP("wtp-canine-s-12l")
wtp.split("Bonjour", lang_code="fr", style="opus100")

Language wishlist

A list of languages currently considered for training and adding to the Repo:

  • Swedish
  • Norwegian
  • French
  • Turkish
  • Simplified Chinese
  • Russian
  • Ukrainian
  • Catalan
  • Dutch
  • Farsi
  • Italian
  • Portuguese
  • Spanish
  • Vietnamese
  • Traditional Chinese

I'll see if I can train models for languages on this list. If you want to speed it up, just train it yourself following https://github.com/bminixhofer/nnsplit/blob/master/train/train.ipynb :)

Could not find a mixture for the Universal Dependencies (UD) style in Thai language

I have been trying to use wtpsplit for the Thai language by using the 'ud' style as:

# specify language code to be 'th' and style='ud' according to the paper
wtp.split(text, lang_code="th", style='ud')

However, it returned an error:

ValueError: Could not find a mixture for the style 'ud'.

I also checked in the language_info.csv file and found that the UD style is also supported in the Thai language as UD_Thai-PUD

I have tried another supported style such as OPUS100 and found that it is usable; only the UD style returns an error. Is this an error, or did I misunderstand something?

Thank you

show progress

I'm currently using nnsplit on a fairly big dataset. Is it possible to track progress on a long list of inputs?

use concrete error types

There is a bit of an issue with the error bounds in rust when being as lax as Box<dyn Error> - most error frameworks expect the error type bounded to be Error + Send + 'static.

For a library it's common to implement a custom error type which is then exposed to the user, which wraps all possible internal error types. Currently the tool of choice (imho) is thiserror.

Moving to concrete error types rather than dyn boxes would be a much appreciated step.

Error loading model to GPU

Version: 1.2.0

from wtpsplit import WtP

wtp = WtP("wtp-canine-s-12l-no-adapters")
wtp.to("cuda")

throws,

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
test.ipynb Cell 6 in 4
      [1] from wtpsplit import WtP
      [3] wtp = WtP("wtp-canine-s-12l-no-adapters")
----> [4] wtp.to("cuda")

File (site-packages/wtpsplit/__init__.py:115), in WtP.__getattr__(self, name)
    114 def __getattr__(self, name):
--> 115     return getattr(self.model, name)

AttributeError: 'PyTorchWrapper' object has no attribute 'to'

The issue is due to the nested wtp.model.model not being handled by the __getattr__ method.
Calling wtp.model.model.to("cuda") works.

Control where the model is downloaded to?

Hi,
This is more of a minor feature request. I'm trying to use NNSplit in a container, which has a read-only file system except for the /tmp dir. It would be groovy if one could provide a local path to load the model from/download to. Perhaps this is in the python interface already, but I couldn't see it.

I know you can specify a path when calling NNSplit(), but this gets more complicated as I'm including it in a module that then gets included in another project.

Anyway, nice work and thanks!

EN model training in Google Colab

Hello, with use of Google Colab I was able to train a model for Russian language.
But when I start training a model for English language with trainer.fit(model), it floods output (hundreds of messages):

[W108] The rule-based lemmatizer did not find POS annotation for the token ')'. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'.
[W108] The rule-based lemmatizer did not find POS annotation for the token 'Marjorie'. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'.
[W108] The rule-based lemmatizer did not find POS annotation for the token 'But'. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'.
[W108] The rule-based lemmatizer did not find POS annotation for the token 'and'. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'.
[W108] The rule-based lemmatizer did not find POS annotation for the token 'Daw'. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'.
[W108] The rule-based lemmatizer did not find POS annotation for the token 'It'. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'.
[W108] The rule-based lemmatizer did not find POS annotation for the token '('. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'.
[W108] The rule-based lemmatizer did not find POS annotation for the token 'made'. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'.

and the connection to the runtime is lost.

Do you have an idea how I could get rid of these output messages?

Link to Colab: https://colab.research.google.com/drive/1xjrD1ZvkzLypbuYaywkf7yHy9UAjYq-g?usp=sharing
(the model is a bit changed to have int8 input)

Performance of Rust crate

I am getting fairly poor performance in release mode (CPU)… 2kB/s.

Is there a guide on using the GPU?

Unusual splits in short sentence

Hello, thank you for your great work!

I noticed unusual splits in a short sentence. I assume this is due to the name.

Is there any way to detect this?

from wtpsplit import WtP

wtp = WtP("wtp-canine-s-12l")

issue = """‘Make sure it does,’ Vaughn said."""
expected = ["""‘Make sure it does,’ Vaughn said."""]

wtp.split(issue, lang_code="en")

# wrong ['‘Make sure it does,’ ', 'Vaughn ', 'said.']

wtp.split(issue,  lang_code="en",  style="ud")

# wrong ['‘Make sure it does,’ ', 'Vaughn said.']

wtp.split(issue,  lang_code="en",  style="opus100")

# correct ['‘Make sure it does,’ Vaughn said.']

wtp.split(issue,  lang_code="en",  style="ersatz")

# wrong ['‘Make sure it does,’ ', 'Vaughn said.']

wtp.split(issue,  lang_code="en", threshold=0.99)

# correct ['‘Make sure it does,’ Vaughn said.']

Tested: Version 1.0.1 , colab CPU

ImportError: /lib64/libm.so.6: version `GLIBC_2.27' not found

I run it on CentOS 7 and Python 3.8.3, and I just want to run it on CPU, not GPU. I get the following error:

>>> import nnsplit
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: /lib64/libm.so.6: version `GLIBC_2.27' not found (required by /home/zyb/miniconda3/lib/python3.8/site-packages/nnsplit.cpython-38-x86_64-linux-gnu.so)

Apple Silicon / Arm support

What will it take to get support on Apple Silicon / ARM? I'm happy to help out with testing if that can be useful.

More language support.

Hi, many thanks for your project.

In the README, it says:

Alternatively, you can also load your own model.

Where can I find models for other languages besides English and German? Or could you tell me how to train my own model for other languages step by step? I'm happy to contribute by providing more models.

Thank you,
Guangrui Wang

For GPU, ONNX WtP model is around 2x slower than PyTorch.

import time
from wtpsplit import WtP

wtp = WtP("wtp-bert-mini", ort_providers=["CUDAExecutionProvider"])


def make_sentence(seg):
    sentences = wtp.split(seg, lang_code="en", style="ud", threshold=0.975)
    sentences = [x.strip() for x in sentences]
    return sentences

timelist_fox = []

for i in range(20):
  start = time.time()
  input_text = "The quick brown fox jumps over the lazy dog. El zorro marrón rápido salta sobre el perro perezoso. I went to see the p. t. barnum circus today!"
  sentences = make_sentence(input_text)
  end = time.time()
  print(sentences)
  print("Runtime for sentence segmentation", end - start)
  timelist_fox.append(end - start)

print()
# Get average runtime
print("Average runtime for sentence segmentation", sum(timelist_fox)/len(timelist_fox))

And I get

['The quick brown fox jumps over the lazy dog.', 'El zorro marrón rápido salta sobre el perro perezoso.', 'I went to see the p. t. barnum circus today!']
Runtime for sentence segmentation 0.09200406074523926
Runtime for sentence segmentation 0.15698647499084473
Runtime for sentence segmentation 0.07426166534423828
[... 17 more runs, each printing the same split, with runtimes between ~0.038 and ~0.11 seconds ...]

Average runtime for sentence segmentation 0.07711464166641235

Whereas if I replace the line wtp = WtP("wtp-bert-mini", ort_providers=["CUDAExecutionProvider"]) with

wtp = WtP("wtp-bert-mini")
wtp.half().to("cuda")

I get

['The quick brown fox jumps over the lazy dog.', 'El zorro marrón rápido salta sobre el perro perezoso.', 'I went to see the p. t. barnum circus today!']
Runtime for sentence segmentation 3.6466240882873535
Runtime for sentence segmentation 0.021060943603515625
Runtime for sentence segmentation 0.014858007431030273
[... 17 more runs, each printing the same split, with runtimes between ~0.013 and ~0.026 seconds ...]

Average runtime for sentence segmentation 0.19848165512084961

Although the PyTorch implementation is slower on average because of the outlier from the first run, removing that initial outlier makes it faster on average than the ONNX run.

I see the inputs are not bound to the GPU in https://github.com/bminixhofer/wtpsplit/blob/main/wtpsplit/extract.py. Could you please try binding them to see if it is faster?

Use ONNX models everywhere due to TorchScript instability

Hey there! I was trying to run the Rust example from the README, but got the following error on cargo run:

Error: Compat { error: TorchError { c_error: "The following operation failed in the TorchScript interpreter.\nTraceback of TorchScript, serialized code (most recent call last):\n  File \"code/__torch__/torch/nn/quantized/dynamic/modules/rnn.py\", line 195, in __setstate__\n    state: Tuple[Tuple[Tensor, Optional[Tensor]], bool]) -> None:\n    _72, _73, = (state)[0]\n    _74 = ops.quantized.linear_prepack(_72, _73)\n          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE\n    self.param = _74\n    self.training = (state)[1]\n\nTraceback of TorchScript, original code (most recent call last):\n  File \"/usr/local/lib/python3.6/dist-packages/torch/nn/quantized/dynamic/modules/rnn.py\", line 29, in __setstate__\n    @torch.jit.export\n    def __setstate__(self, state):\n        self.param = torch.ops.quantized.linear_prepack(*state[0])\n                     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE\n        self.training = state[1]\nRuntimeError: Didn\'t find engine for operation quantized::linear_prepack NoQEngine\n" } }

Let me know if there is any more info you need for debugging!

Any string that isn't a multiple of 4 causes an assert failure

Hi,

Any string whose length isn't a multiple of 4 causes an assert failure at line 548 in models.py:
"assert char_encoding.shape[1] % self.conv.stride[0] == 0"

stride is initialised to config.downsampling_rate (4) in modeling_canine.py in the transformers lib.

Sample code causing assert failure (length of input string is 35):
from wtpsplit import WtP
wtp = WtP("wtp-canine-s-12l")
wtp.split("This is a test This is another test", lang_code="en")

Sample code that works (with added full-stop that makes the length of input string to become 36):
from wtpsplit import WtP
wtp = WtP("wtp-canine-s-12l")
wtp.split("This is a test This is another test.", lang_code="en")

Hi, pip install nnsplit doesn't work

Hello, first of all, nnsplit is really cool, it's really great stuff. :)
I'd really like to run nnsplit on my local computer, but an error occurs when I try to pip install nnsplit:

ERROR: Could not find a version that satisfies the requirement nnsplit (from versions: 0.0.1, 0.1.0, 0.1.1, 0.1.2, 0.1.3, 0.1.4, 0.2.0, 0.2.1, 0.2.2)
ERROR: No matching distribution found for nnsplit

Can I get some help?

Unable to use own trained onnx models

Hello and first of all: thank you for a great library!

I've tried to train my own model using an unusual input data format following the train Python notebook you've provided. However, after the training, when trying to load the custom model via NNSplit.load("en/model.onnx") call in python bindings, I get this:

nnsplit.ResourceError: model not found: "en/model.onnx"

I may be wrong, but it seems the current logic of model_loader.rs does not allow custom local paths, only the ones that are listed in the models.csv:

https://github.com/bminixhofer/nnsplit/blob/a5a15815382029bf5c3438fd4753f644847d4dbf/nnsplit/src/model_loader.rs#L59

Effectively limiting the available models to the pretrained ones.

Async - Skops import is failing

I am trying with 1.2.1 and 1.2.3, but I have issues like:

1.2.1:

  File "/usr/local/lib/python3.10/site-packages/exorde/prepare_batch.py", line 10, in <module>
    from wtpsplit import WtP
  File "/usr/local/lib/python3.10/site-packages/wtpsplit/__init__.py", line 11, in <module>
    import skops.io as sio
ModuleNotFoundError: No module named 'skops.io'

or with just your latest version 1.2.3:


   from wtpsplit import WtP
  File "/usr/local/lib/python3.10/site-packages/wtpsplit/__init__.py", line 11, in <module>
    import skops.io as sio
  File "/usr/local/lib/python3.10/site-packages/skops/io/__init__.py", line 1, in <module>
    from ._persist import dump, dumps, get_untrusted_types, load, loads
  File "/usr/local/lib/python3.10/site-packages/skops/io/_persist.py", line 22, in <module>
    module = importlib.import_module(module_name, package="skops.io")
  File "/usr/local/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/usr/local/lib/python3.10/site-packages/skops/io/_general.py", line 16, in <module>
    from ._trusted_types import (
  File "/usr/local/lib/python3.10/site-packages/skops/io/_trusted_types.py", line 17, in <module>
    SCIPY_UFUNC_TYPE_NAMES = get_public_type_names(module=scipy.special, oftype=np.ufunc)
  File "/usr/local/lib/python3.10/site-packages/skops/io/_utils.py", line 230, in get_public_type_names
    {
  File "/usr/local/lib/python3.10/site-packages/skops/io/_utils.py", line 234, in <setcomp>
    and (type_name := get_type_name(obj)).startswith(module_name)
  File "/usr/local/lib/python3.10/site-packages/skops/io/_utils.py", line 179, in get_type_name
    return f"{get_module(t)}.{t.__name__}"
  File "/usr/local/lib/python3.10/site-packages/skops/io/_utils.py", line 86, in get_module
    return whichmodule(obj, obj.__name__)
  File "/usr/local/lib/python3.10/site-packages/skops/io/_utils.py", line 49, in whichmodule
    if _getattribute(module, name)[0] is obj:
  File "/usr/local/lib/python3.10/site-packages/skops/io/_utils.py", line 24, in _getattribute
    obj = getattr(obj, subpath)
TypeError: __getattr__() missing 1 required positional argument: 'name'

I am using Python 3.10.11
Any ideas? I can't seem to simply import your lib.

`AttributeError: 'InferenceSession' object has no attribute '_providers' Segmentation fault (core dumped)`

I was trying to segment sentences for my transcribing program, but I ran into this error when I first tried using it.

Full Error

Traceback (most recent call last):
  File "/home/runner/Voice-Synthasizer/venv/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 280, in __init__
    self._create_inference_session(providers, provider_options)
  File "/home/runner/Voice-Synthasizer/venv/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 307, in _create_inference_session
    sess = C.InferenceSession(session_options, self._model_path, True, self._read_config_from_model)
RuntimeError: /onnxruntime_src/onnxruntime/core/platform/posix/env.cc:142 onnxruntime::{anonymous}::PosixThread::PosixThread(const char*, int, unsigned int (*)(int, Eigen::ThreadPoolInterface*), Eigen::ThreadPoolInterface*, const onnxruntime::ThreadOptions&) pthread_setaffinity_np failed


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "speech.py", line 436, in <module>
    transcript, source_align_data = transcript_audio(input_path, True, transcript_path, granularity=granularity)
  File "speech.py", line 271, in transcript_audio
    sentence_segmenter = NNSplit.load("en")
  File "backend.py", line 6, in create_session
  File "/home/runner/Voice-Synthasizer/venv/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 283, in __init__
    print("EP Error using {}".format(self._providers))
AttributeError: 'InferenceSession' object has no attribute '_providers'
Segmentation fault (core dumped)
