
dadmatools's Introduction

DadmaTools: A Python NLP Library for Persian

Named Entity Recognition | Part of Speech Tagging | Dependency Parsing | Informal To Formal
Constituency Parsing | Chunking | Kasreh Ezafe Detection
Spellchecker | Normalizer | Tokenizer | Lemmatizer | Sentiment Analysis

DadmaTools

DadmaTools is a repository of Natural Language Processing resources for the Persian language. The aim is to make Persian NLP easier and more practical for practitioners in industry, and the project is therefore licensed to allow commercial use. The project features code examples showing how to use the models in popular NLP frameworks such as spaCy and Transformers, as well as deep learning frameworks such as PyTorch. Furthermore, DadmaTools supports common Persian embeddings and Persian datasets. For more details on how to use this tool, read the instructions below.


Installation

To get started with DadmaTools in your Python project, simply install it via pip. Note that the default pip package does not install every optional NLP library, because we want you to be free to limit the dependencies to what you actually use. If you prefer, an installation option that pulls in all required dependencies is provided as well.

Install with pip

To get started using DadmaTools, simply install the project with pip:

pip install dadmatools 

Note that the default installation of DadmaTools does install other NLP libraries such as spaCy and supar.

You can check the requirements.txt file to see which versions of the packages have been tested.

Install from github

Alternatively, you can install the latest version directly from GitHub using:

pip install git+https://github.com/Dadmatech/dadmatools.git

NLP Models

Natural Language Processing is an active area of research, and it consists of many different tasks. The DadmaTools repository provides an overview of Persian models for some of the most basic NLP tasks (and is continuously evolving).

Here is the list of NLP tasks we currently cover in the repository. These NLP tasks are defined as pipelines: a pipeline string listing the desired tasks is created and passed to the model, so the user can load only the tasks they need without loading the others. Each task has an abbreviation, as follows:

  • Named Entity Recognition: ner
  • Part of speech tagging: pos
  • Dependency parsing: dep
  • Constituency parsing: cons
  • Kasreh Ezafe Detection: kasreh
  • Chunking: chunk
  • Lemmatizing: lem
  • Tokenizing: tok
  • Spellchecker: spellchecker
  • Normalizing: def-norm (default configuration; see the note below)
  • Informal to formal: itf
  • Sentiment analysis: sent

Note that the normalizer can also be used outside the pipeline, since it supports several configurations (the default configuration is available in the pipeline under the name def-norm). If no pipeline is passed to the model, only the tokenizer is loaded by default.
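
For example, a minimal pipeline that loads only the lemmatizer and POS tagger could be built like this (a small sketch of the pipeline interface described in the Pipeline section below; the tokenizer is always loaded even if 'tok' is omitted):

import dadmatools.pipeline.language as language

# only the lemmatizer and POS tagger are requested, so the heavier models
# (parsers, NER, ...) are not loaded; the tokenizer is loaded by default
pips = 'lem,pos'
nlp = language.Pipeline(pips)
doc = nlp('دادماتولز ابزاری برای پردازش زبان فارسی است')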

Normalizer

Cleans text and unifies characters.

Note: a value of None means no action is taken for that option.

from dadmatools.normalizer import Normalizer

normalizer = Normalizer(
    full_cleaning=False,
    unify_chars=True,
    refine_punc_spacing=True,
    remove_extra_space=True,
    remove_puncs=False,
    remove_html=False,
    remove_stop_word=False,
    replace_email_with="<EMAIL>",
    replace_number_with=None,
    replace_url_with="",
    replace_mobile_number_with=None,
    replace_emoji_with=None,
    replace_home_number_with=None
)

text = """
<p>
دادماتولز اولین نسخش سال ۱۴۰۰ منتشر شده. 
امیدواریم که این تولز بتونه کار با متن رو براتون شیرین‌تر و راحت‌تر کنه
لطفا با ایمیل [email protected] با ما در ارتباط باشید
آدرس گیت‌هاب هم که خب معرف حضور مبارک هست:
 https://github.com/Dadmatech/DadmaTools
</p>
"""
normalized_text = normalizer.normalize(text)
# <p> دادماتولز اولین نسخش سال 1400 منتشر شده. امیدواریم که این تولز بتونه کار با متن رو براتون شیرین‌تر و راحت‌تر کنه لطفا با ایمیل <EMAIL> با ما در ارتباط باشید آدرس گیت‌هاب هم که خب معرف حضور مبارک هست: </p>

# full cleaning
normalizer = Normalizer(full_cleaning=True)
normalized_text = normalizer.normalize(text)
# دادماتولز نسخش سال منتشر تولز بتونه کار متن براتون شیرین‌تر راحت‌تر کنه ایمیل ارتباط آدرس گیت‌هاب معرف حضور مبارک
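
The options can also be combined for lighter cleanups. The sketch below uses only constructor arguments listed above (all other options keep their defaults) to strip HTML tags and mask URLs and numbers:

from dadmatools.normalizer import Normalizer

# a lighter configuration: remove HTML tags and mask URLs and numbers;
# all other options keep their default values
normalizer = Normalizer(
    remove_html=True,
    replace_url_with="<URL>",
    replace_number_with="<NUM>",
)
normalized_text = normalizer.normalize(text)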

Pipeline

Contains the tokenizer, lemmatizer, POS tagger, dependency parser, constituency parser, Kasreh Ezafe detector, spellchecker, informal-to-formal converter, named entity recognizer, and sentiment analyzer.

import dadmatools.pipeline.language as language

# all supported tasks are requested here
# as the tokenizer is the default tool, it is loaded even if 'tok' is not listed
pips = 'tok,lem,pos,dep,chunk,cons,spellchecker,kasreh,itf,ner,sent'
nlp = language.Pipeline(pips)
# doc is a spaCy object
doc = nlp('کشور بزرگ ایران توانسته در طی سال‌ها اغشار مختلفی از قومیت‌های گوناگون رو به خوبی تو خودش  جا بده')

The doc object has several extensions. First, doc contains sentences, each of which is a list of Token objects, and each Token has its own extensions. Note that we defined our own extensions in DadmaTools as well. If the pipeline component related to a specific extension is not called, that extension will have no value.

To see the results more clearly, you can use this code:

print(doc)
{'spellchecker': {'orginal': 'کشور بزرگ ایران توانسته در طی سال\u200cها اغشار مختلفی از قومیت\u200cهای گوناگون رو به خوبی تو خودش  جا بده', 'corrected': 'کشور بزرگ ایران توانسته در طی سال\u200cها اقشار مختلفی از قومیت\u200cهای گوناگون رو به خوبی تو خودش جا بده', 'checked_words': [('اغشار', 'اقشار')]}, 'itf': ' کشور بزرگ ایران توانسته در طی سال\u200cها اغشار مختلفی از قومیت های گوناگون را به خوبی در خودش جا بده', 'sentences': [{'id': 1, 'tokens': [{'id': 1, 'text': 'کشور', 'upos': 'NOUN', 'xpos': 'N_PL', 'feats': 'Number=Plur|Person=2|Polarity=Neg|Tense=Pres', 'head': 14, 'deprel': 'nsubj', 'lemma': 'کشور', 'ner': 'O', 'kasreh': 'S-kasreh'}, {'id': 2, 'text': 'بزرگ', 'upos': 'ADJ', 'xpos': 'ADJ', 'feats': 'Degree=Pos', 'head': 8, 'deprel': 'amod', 'lemma': 'بزرگ', 'ner': 'O', 'kasreh': 'S-kasreh'}, {'id': 3, 'text': 'ایران', 'upos': 'SCONJ', 'xpos': 'CON', 'feats': 'Number=Plur|Person=3|PronType=Prs', 'head': 2, 'deprel': 'nmod:poss', 'lemma': 'ایران', 'ner': 'S-loc', 'kasreh': 'O'}, {'id': 4, 'text': 'توانسته', 'upos': 'VERB', 'xpos': 'V_PP', 'feats': 'Number=Sing|Person=3|VerbForm=Part', 'head': 14, 'deprel': 'aux', 'lemma': 'توانست#توان', 'ner': 'O', 'kasreh': 'O'}, {'id': 5, 'text': 'در', 'upos': 'ADP', 'xpos': 'P', 'head': 14, 'deprel': 'case', 'lemma': 'در', 'ner': 'O', 'kasreh': 'O'}, {'id': 6, 'text': 'طی', 'upos': 'ADP', 'xpos': 'P', 'head': 5, 'deprel': 'fixed', 'lemma': 'طی', 'ner': 'O', 'kasreh': 'S-kasreh'}, {'id': 7, 'text': 'سال\u200cها', 'upos': 'AUX', 'xpos': 'V_PRS', 'feats': 'Number=Sing|Person=3|Tense=Pres', 'head': 14, 'deprel': 'fixed', 'lemma': 'سال', 'ner': 'O', 'kasreh': 'O'}, {'id': 8, 'text': 'اغشار', 'upos': 'NOUN', 'xpos': 'N_PL', 'feats': 'Number=Plur', 'head': 19, 'deprel': 'nsubj', 'lemma': 'اغشار', 'ner': 'O', 'kasreh': 'S-kasreh'}, {'id': 9, 'text': 'مختلفی', 'upos': 'ADJ', 'xpos': 'ADJ', 'feats': 'Degree=Pos', 'head': 8, 'deprel': 'amod', 'lemma': 'مختلفی', 'ner': 'O', 'kasreh': 'O'}, {'id': 10, 'text': 'از', 'upos': 'ADP', 'xpos': 'P', 'head': 15, 'deprel': 'case', 'lemma': 'از', 'ner': 'O', 'kasreh': 'O'}, {'id': 11, 'text': 'قومیت\u200cهای', 'upos': 'NOUN', 'xpos': 'N_PL', 'feats': 'Number=Plur', 'head': 8, 'deprel': 'nmod:poss', 'lemma': 'قومیت', 'ner': 'O', 'kasreh': 'S-kasreh'}, {'id': 12, 'text': 'گوناگون', 'upos': 'ADJ', 'xpos': 'ADJ', 'feats': 'Degree=Pos', 'head': 11, 'deprel': 'amod', 'lemma': 'گوناگون', 'ner': 'O', 'kasreh': 'O'}, {'id': 13, 'text': 'رو', 'upos': 'PART', 'xpos': 'CLITIC', 'head': 8, 'deprel': 'case', 'lemma': 'رو', 'ner': 'O', 'kasreh': 'O'}, {'id': 14, 'text': 'به', 'upos': 'ADP', 'xpos': 'P', 'head': 19, 'deprel': 'case', 'lemma': 'به', 'ner': 'O', 'kasreh': 'O'}, {'id': 15, 'text': 'خوبی', 'upos': 'ADJ', 'xpos': 'ADJ', 'feats': 'Degree=Pos', 'head': 14, 'deprel': 'advcl', 'lemma': 'خوب', 'ner': 'O', 'kasreh': 'O'}, {'id': 16, 'text': 'تو', 'upos': 'ADP', 'xpos': 'P', 'feats': 'Number=Sing|Person=2|PronType=Prs', 'head': 19, 'deprel': 'case', 'lemma': 'تو', 'ner': 'O', 'kasreh': 'O'}, {'id': 17, 'text': 'خودش', 'upos': 'PRON', 'xpos': 'PRO', 'feats': 'Number=Sing|Person=3|PronType=Prs|Reflex=Yes', 'head': 19, 'deprel': 'obl', 'lemma': 'خودش', 'ner': 'O', 'kasreh': 'O'}, {'id': 18, 'text': 'جا', 'upos': 'VERB', 'xpos': 'PREV', 'feats': 'Number=Sing|Person=3|Tense=Pres', 'head': 19, 'deprel': 'compound:lvc', 'lemma': 'جا', 'ner': 'O', 'kasreh': 'O'}, {'id': 19, 'text': 'بده', 'upos': 'VERB', 'xpos': 'V_SUB', 'feats': 'Mood=Sub', 'head': 0, 'deprel': 'root', 'lemma': 'داد#ده', 'ner': 'O', 'kasreh': 'O'}]}], 'lang': 'persian', 
'sentiment': [{'label': 'positive', 'score': 0.7366364598274231}]}
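
The printed result above is a plain Python dictionary. If the pipeline result can be treated as that dict (an assumption; adapt if your version exposes the data differently), you can walk sentences and tokens directly:

# a hedged sketch: extract per-token fields from the structure shown above
sentences = doc['sentences'] if isinstance(doc, dict) else []
for sentence in sentences:
    for token in sentence['tokens']:
        print(token['id'], token['text'], token['lemma'], token['upos'], token.get('ner', 'O'))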

Loading Persian NLP Datasets

We provide an easy-to-use way to load some popular Persian NLP datasets.

Here is the list of supported datasets.

Dataset              Task
-------------------  --------------------------
PersianNER           Named Entity Recognition
ARMAN                Named Entity Recognition
Peyma                Named Entity Recognition
FarsTail             Textual Entailment
FaSpell              Spell Checking
PersianNews          Text Classification
PerUDT               Universal Dependency
PnSummary            Text Summarization
SnappfoodSentiment   Sentiment Classification
TEP                  Text Translation (eng-fa)
WikipediaCorpus      Corpus
PersianTweets        Corpus

All datasets are iterators and can be used as shown below:

from dadmatools.datasets import FarsTail
from dadmatools.datasets import SnappfoodSentiment
from dadmatools.datasets import Peyma
from dadmatools.datasets import PerUDT
from dadmatools.datasets import PersianTweets
from dadmatools.datasets import PnSummary


farstail = FarsTail()
#len of dataset
print(len(farstail.train))

#like a generator
print(next(farstail.train))

#dataset details
pn_summary = PnSummary()
print('PnSummary dataset information: ', pn_summary.info)

#loop over dataset
snpfood_sa = SnappfoodSentiment()
for i, item in enumerate(snpfood_sa.test):
    print(item['comment'], item['label'])

# get the first token's lemma of each dev item
perudt = PerUDT()
for token_list in perudt.dev:
    print(token_list[0]['lemma'])

# get the NER tag of the first token of the first Peyma item
peyma = Peyma()
print(next(peyma.data)[0]['tag'])

#corpus 
tweets = PersianTweets()
print('tweets count : ', len(tweets.data))
print('sample tweet: ', next(tweets.data))
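
As a small worked example, the sketch below collects the SnappfoodSentiment test split into plain Python lists (assuming only the 'comment' and 'label' keys used in the loop above), e.g. to feed an external classifier or to inspect the label distribution:

from collections import Counter
from dadmatools.datasets import SnappfoodSentiment

# collect the SnappfoodSentiment test split into plain lists
snpfood_sa = SnappfoodSentiment()
texts, labels = [], []
for item in snpfood_sa.test:
    texts.append(item['comment'])
    labels.append(item['label'])

print(len(texts))
print(Counter(labels).most_common())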

Get dataset information:

from dadmatools.datasets import get_all_datasets_info

get_all_datasets_info().keys()
#dict_keys(['Persian-NEWS', 'fa-wiki', 'faspell', 'PnSummary', 'TEP', 'PerUDT', 'FarsTail', 'Peyma', 'snappfoodSentiment', 'Persian-NER', 'Arman', 'PerSent'])

#specify task
get_all_datasets_info(tasks=['NER', 'Sentiment-Analysis'])

The output will be:

{"ARMAN": {"description": "ARMAN dataset holds 7,682 sentences with 250,015 sentences tagged over six different classes.\n\nOrganization\nLocation\nFacility\nEvent\nProduct\nPerson",
  "filenames": ["train_fold1.txt",
   "train_fold2.txt",
   "train_fold3.txt",
   "test_fold1.txt",
   "test_fold2.txt",
   "test_fold3.txt"],
  "name": "ARMAN",
  "size": {"test": 7680, "train": 15361},
  "splits": ["train", "test"],
  "task": "NER",
  "version": "1.0.0"},
 "PersianNer": {"description": "source: https://github.com/Text-Mining/Persian-NER",
  "filenames": ["Persian-NER-part1.txt",
   "Persian-NER-part2.txt",
   "Persian-NER-part3.txt",
   "Persian-NER-part4.txt",
   "Persian-NER-part5.txt"],
  "name": "PersianNer",
  "size": 976599,
  "splits": [],
  "task": "NER",
  "version": "1.0.0"},
 "Peyma": {"description": "source: http://nsurl.org/2019-2/tasks/task-7-named-entity-recognition-ner-for-farsi/",
  "filenames": ["peyma/600K", "peyma/300K"],
  "name": "Peyma",
  "size": 10016,
  "splits": [],
  "task": "NER",
  "version": "1.0.0"},
 "snappfoodSentiment": {"description": "source: https://huggingface.co/HooshvareLab/bert-fa-base-uncased-sentiment-snappfood",
  "filenames": ["snappfood/train.csv",
   "snappfood/test.csv",
   "snappfood/dev.csv"],
  "name": "snappfoodSentiment",
  "size": {"dev": 6274, "test": 6972, "train": 56516},
  "splits": ["train", "test", "dev"],
  "task": "Sentiment-Analysis",
  "version": "1.0.0"}}

Loading Persian Word Embeddings

To start using embeddings, please install fasttext:

pip install fasttext

Download, load, and use some pre-trained Persian word embeddings.

DadmaTools supports the GloVe, fastText, and word2vec formats.

from dadmatools.embeddings import get_embedding, get_all_embeddings_info, get_embedding_info
from pprint import pprint

pprint(get_all_embeddings_info())

#get embedding information of specific embedding
embedding_info = get_embedding_info('glove-wiki')

#### load embedding ####
word_embedding = get_embedding('glove-wiki')

#get vector of the word
print(word_embedding['سلام'])

#vocab
vocab = word_embedding.get_vocab()

### some useful functions ###
print(word_embedding.top_nearest("زمستان", 10))
print(word_embedding.similarity('کتب', 'کتاب'))
print(word_embedding.embedding_text('امروز هوای خوبی بود'))
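
As an illustration of how these vectors can be used, the following sketch computes the cosine similarity of two sentence vectors produced by embedding_text (this assumes embedding_text returns a fixed-length numeric vector, as the usage above suggests; numpy is the only extra dependency):

import numpy as np

# cosine similarity between two sentence embeddings (assumes fixed-length vectors)
def cosine(u, v):
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

v1 = word_embedding.embedding_text('امروز هوای خوبی بود')
v2 = word_embedding.embedding_text('هوا امروز آفتابی است')
print(cosine(v1, v2))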

The following word embeddings are currently supported:

Name                       Embedding Algorithm   Corpus
-------------------------  --------------------  ----------------------
glove-wiki                 GloVe                 Wikipedia
fasttext-commoncrawl-bin   fastText              CommonCrawl
fasttext-commoncrawl-vec   fastText              CommonCrawl
word2vec-conll             word2vec              Persian CoNLL17 corpus

Evaluation

We have compared our POS tagging, dependency parsing, and lemmatization models to Stanza and Hazm.

PerDT (F1 score)

Toolkit      POS Tagger (UPOS)   Dependency Parser (UAS/LAS)   Lemmatizer
DadmaTools   97.52%              95.36% / 92.54%               99.14%
stanza       97.35%              93.34% / 91.05%               98.97%
hazm         -                   -                             89.01%

Seraji (F1 score)

Toolkit      POS Tagger (UPOS)   Dependency Parser (UAS/LAS)   Lemmatizer
DadmaTools   97.83%              92.5% / 89.23%                -
stanza       97.43%              87.20% / 83.89%               -
hazm         -                   -                             86.93%

Tehran University Treebank (F1 score)

Toolkit                                          Constituency Parser
DadmaTools (without preprocessing)               82.88%
Stanford (with some preprocessing on POS tags)   80.28%

How to use

You can see the code and its output in Colab.

Open In Colab

Cite

@inproceedings{etezadi-etal-2022-dadmatools,
    title = "{D}adma{T}ools: Natural Language Processing Toolkit for {P}ersian Language",
    author = "Etezadi, Romina  and
      Karrabi, Mohammad  and
      Zare, Najmeh  and
      Sajadi, Mohamad Bagher  and
      Pilehvar, Mohammad Taher",
    booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: System Demonstrations",
    month = jul,
    year = "2022",
    address = "Hybrid: Seattle, Washington + Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.naacl-demo.13",
    pages = "124--130",
    abstract = "We introduce DadmaTools, an open-source Python Natural Language Processing toolkit for the Persian language. The toolkit is a neural pipeline based on spaCy for several text processing tasks, including normalization, tokenization, lemmatization, part-of-speech, dependency parsing, constituency parsing, chunking, and ezafe detecting. DadmaTools relies on fine-tuning of ParsBERT using the PerDT dataset for most of the tasks. Dataset module and embedding module are included in DadmaTools that support different Persian datasets, embeddings, and commonly used functions for them. Our evaluations show that DadmaTools can attain state-of-the-art performance on multiple NLP tasks. The source code is freely available at https://github.com/Dadmatech/DadmaTools.",
}

dadmatools's People

Contributors

dadmatech, farhaaaaa, mjavadhpour, mohammadkarrabi, n-zare, njzr, roetezadi, sadeghjafari5528, smhd001


dadmatools's Issues

word embedding

Thanks for developing such a nice library;

  1. I wonder why English words were not removed before training, e.g. for GloVe?
  2. I wonder if you could also share the code used to train the word embeddings (fastText, GloVe)?

Thanks.

How can I train Dadmatech/Nevise on a custom dataset?

Congratulations on the tool you have provided.
I had a few questions about the Nevise model:
Is this model trained only on the FaSpell dataset?
If another dataset was used:
What is the structure of the dataset you used to train this model?

Does the dataset include correct and misspelled labels for words, or have you used sentences and labeled each of them as correct or misspelled?

How many sentences (or whatever) did you use to train the model?

Is it possible to access the data you used?

What is the method of training the model? Is the model being trained continuously or is it trained once and can be used now?

Error when importing "import dadmatools.pipeline.language as language"

Hello, I have an issue: when I try to run import dadmatools.pipeline.language as language on my local machine, I get this error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 220: character maps to undefined
How can I fix this?

This is the full trace of the error:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
Cell In[34], line 1
----> 1 import dadmatools.pipeline.language as language
      3 # here lemmatizer and pos tagger will be loaded
      4 # as tokenizer is the default tool, it will be loaded as well even without calling
      5 pips = 'lem'

File c:\Users\Lenovo\AppData\Local\Programs\Python\Python39\lib\site-packages\dadmatools\pipeline\__init__.py:1
----> 1 from .language import Pipeline
      2 from .tpipeline import TPipeline
      3 from .language import supported_langs, langwithner, remove_with_path

File c:\Users\Lenovo\AppData\Local\Programs\Python\Python39\lib\site-packages\dadmatools\pipeline\language.py:4
      1 from typing import List
      3 from .config import config as master_config
----> 4 from .informal2formal.main import Informal2Formal
      5 from .models.base_models import Multilingual_Embedding
      6 from .models.classifiers import TokenizerClassifier, PosDepClassifier, NERClassifier, SentenceClassifier, \
      7     KasrehClassifier

File c:\Users\Lenovo\AppData\Local\Programs\Python\Python39\lib\site-packages\dadmatools\pipeline\informal2formal\main.py:6
      4 import yaml
      5 from .download_utils import download_dataset
----> 6 import dadmatools.pipeline.informal2formal.utils as utils
      7 from .formality_transformer import FormalityTransformer
      8 from dadmatools.pipeline.persian_tokenization.tokenizer import SentenceTokenizer

File c:\Users\Lenovo\AppData\Local\Programs\Python\Python39\lib\site-packages\dadmatools\pipeline\informal2formal\utils.py:10
      7 from dadmatools.pipeline.persian_tokenization.tokenizer import WordTokenizer
      8 from dadmatools.normalizer import Normalizer
---> 10 normalizer = Normalizer()
     11 tokenizer = WordTokenizer('cache/dadmatools')
     12 # tokenizer = WordTokenizer(separate_emoji=True)

File c:\Users\Lenovo\AppData\Local\Programs\Python\Python39\lib\site-packages\dadmatools\normalizer.py:32, in Normalizer.__init__(self, full_cleaning, unify_chars, refine_punc_spacing, remove_extra_space, remove_puncs, remove_html, remove_stop_word, replace_email_with, replace_number_with, replace_url_with, replace_mobile_number_with, replace_emoji_with, replace_home_number_with)
     30 self.remove_puncs = remove_puncs
     31 self.remove_stop_word = remove_stop_word
---> 32 self.STOPWORDS = open(prefix+save_dir+'stopwords-fa.py').read().splitlines()
     33 self.PUNCS = string.punctuation.replace('<', '').replace('>', '') + '،؟'
     34 if full_cleaning:

File c:\Users\Lenovo\AppData\Local\Programs\Python\Python39\lib\encodings\cp1252.py:23, in IncrementalDecoder.decode(self, input, final)
     22 def decode(self, input, final=False):
---> 23     return codecs.charmap_decode(input,self.errors,decoding_table)[0]

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 220: character maps to <undefined>

Cannot load fa_tokenizer.pt

I have downloaded fa_tokenizer.pt manually from the URL https://www.dropbox.com/s/bajpn68bp11o78s/fa_ewt_tokenizer.pt?dl=1. It's 636k in size. Its md5 is:

2097a125c5f85b36d569857bd60d51b7  fa_tokenizer.pt

It cannot be loaded, however:

import dadmatools.pipeline.language as language

# here lemmatizer and pos tagger will be loaded
# as tokenizer is the default tool, it will be loaded as well even without calling
pips = 'tok,lem,pos,dep,chunk,cons,spellchecker,kasreh' 
nlp = language.Pipeline(pips)

# you can see the pipeline with this code
print(nlp.analyze_pipes(pretty=True))

# doc is an SpaCy object
doc = nlp('از قصهٔ کودکیشان که می‌گفت، گاهی حرص می‌خورد!')
Model fa_tokenizer exists in /Users/evar/.pernlp/fa_tokenizer.pt
2022-11-21 09:05:41,580 Cannot load model from /Users/evar/anaconda/envs/p310/lib/python3.10/site-packages/dadmatools/saved_models/fa_tokenizer/fa_tokenizer.pt
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Input In [5], in <cell line: 6>()
      3 # here lemmatizer and pos tagger will be loaded
      4 # as tokenizer is the default tool, it will be loaded as well even without calling
      5 pips = 'tok,lem,pos,dep,chunk,cons,spellchecker,kasreh' 
----> 6 nlp = language.Pipeline(pips)
      8 # you can see the pipeline with this code
      9 print(nlp.analyze_pipes(pretty=True))

File ~/anaconda/envs/p310/lib/python3.10/site-packages/dadmatools/pipeline/language.py:258, in Pipeline.__new__(cls, pipeline)
    257 def __new__(cls, pipeline):
--> 258     language = NLP('fa', pipeline)
    259     nlp = language.nlp
    260     return nlp

File ~/anaconda/envs/p310/lib/python3.10/site-packages/dadmatools/pipeline/language.py:64, in NLP.__init__(self, lang, pipelines)
     58 # if 'def-norm' in pipelines:
     59 #     global normalizer_model
     60 #     normalizer_model = normalizer.load_model()
     61 #     self.nlp.add_pipe('normalizer', first=True)
     63 global tokenizer_model
---> 64 tokenizer_model = tokenizer.load_model()
     65 self.nlp.add_pipe('tokenizer')
     67 global mwt_model

File ~/anaconda/envs/p310/lib/python3.10/site-packages/dadmatools/models/tokenizer.py:125, in load_model()
    123 mwt_dict = load_mwt_dict(args['mwt_json_file'])
    124 use_cuda = args['cuda'] and not args['cpu']
--> 125 trainer = Trainer(model_file=args['save_dir'], use_cuda=use_cuda)
    126 loaded_args, vocab = trainer.args, trainer.vocab
    128 for k in loaded_args:

File ~/anaconda/envs/p310/lib/python3.10/site-packages/dadmatools/models/tokenization/trainer.py:19, in Trainer.__init__(self, args, vocab, model_file, use_cuda)
     16 self.use_cuda = use_cuda
     17 if model_file is not None:
     18     # load everything from file
---> 19     self.load(model_file)
     20 else:
     21     # build model from scratch
     22     self.args = args

File ~/anaconda/envs/p310/lib/python3.10/site-packages/dadmatools/models/tokenization/trainer.py:85, in Trainer.load(self, filename)
     83 def load(self, filename):
     84     try:
---> 85         checkpoint = torch.load(filename, lambda storage, loc: storage)
     86     except BaseException:
     87         logger.error("Cannot load model from {}".format(filename))

File ~/anaconda/envs/p310/lib/python3.10/site-packages/torch/serialization.py:713, in load(f, map_location, pickle_module, **pickle_load_args)
    711             return torch.jit.load(opened_file)
    712         return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
--> 713 return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)

File ~/anaconda/envs/p310/lib/python3.10/site-packages/torch/serialization.py:938, in _legacy_load(f, map_location, pickle_module, **pickle_load_args)
    936 assert key in deserialized_objects
    937 typed_storage = deserialized_objects[key]
--> 938 typed_storage._storage._set_from_file(
    939     f, offset, f_should_read_directly,
    940     torch._utils._element_size(typed_storage.dtype))
    941 if offset is not None:
    942     offset = f.tell()

RuntimeError: unexpected EOF, expected 312321 more bytes. The file might be corrupted.

I am using dadmatools==1.5.2, Python 3.10, macOS 12.2.1.

Improve NER

NER needs some important labels, such as date, and we should also improve the generalization of the model.

Error on loading parsbert

Hi
I have an issue while loading models with python3.9:

RuntimeError Traceback (most recent call last)
Input In [31], in
1 import dadmatools.pipeline.language as language
3 pips = 'tok,lem,pos,dep,chunk,cons'
----> 4 nlp = language.Pipeline(pips)
5 def dadmatokenize(text):
6 doc = nlp(text)

File ~/repos/Nasim/nasim_venv/lib/python3.9/site-packages/dadmatools/pipeline/language.py:216, in Pipeline.new(cls, pipeline)
215 def new(cls, pipeline):
--> 216 language = NLP('fa', pipeline)
217 nlp = language.nlp
218 return nlp

File ~/repos/Nasim/nasim_venv/lib/python3.9/site-packages/dadmatools/pipeline/language.py:71, in NLP.init(self, lang, pipelines)
69 if 'dep' or 'chunk' in pipelines:
70 global depparser_model
---> 71 depparser_model = dp.load_model()
72 self.nlp.add_pipe('dependancyparser')
74 if 'pos' or 'chunk' in pipelines:

File ~/repos/Nasim/nasim_venv/lib/python3.9/site-packages/dadmatools/models/dependancy_parser.py:148, in load_model()
145 config['target_dir'] = prefix + config['target_dir']
146 config['embeddings']['BertEmbeddings-0']['bert_model_or_path'] = prefix + config['embeddings-saved-dir']
--> 148 student=create_model(config)
149 base_path=Path(config['target_dir'])/config['model_name']
151 return student

File ~/repos/Nasim/nasim_venv/lib/python3.9/site-packages/dadmatools/models/dependancy_parser.py:120, in create_model(config)
118 tagger = tagger.load(base_path / "final-model.pt")
119 elif (base_path).exists():
--> 120 tagger = tagger.load(base_path)
121 else:
122 assert 0, str(base_path)+ ' not exist!'

File ~/repos/Nasim/nasim_venv/lib/python3.9/site-packages/dadmatools/models/flair/nn.py:105, in Model.load(cls, model, device)
102 # load_big_file is a workaround by https://github.com/highway11git to load models on some Mac/Windows setups
103 # see flairNLP/flair#351
104 f = flair.file_utils.load_big_file(str(model_file))
--> 105 state = torch.load(f, map_location=device)
107 model = cls._init_model_with_state_dict(state, testing = device=='cpu')
109 model.eval()

File ~/repos/Nasim/nasim_venv/lib/python3.9/site-packages/torch/serialization.py:600, in load(f, map_location, pickle_module, **pickle_load_args)
595 if _is_zipfile(opened_file):
596 # The zipfile reader is going to advance the current file position.
597 # If we want to actually tail call to torch.jit.load, we need to
598 # reset back to the original position.
599 orig_position = opened_file.tell()
--> 600 with _open_zipfile_reader(opened_file) as opened_zipfile:
601 if _is_torchscript_zip(opened_zipfile):
602 warnings.warn("'torch.load' received a zip file that looks like a TorchScript archive"
603 " dispatching to 'torch.jit.load' (call 'torch.jit.load' directly to"
604 " silence this warning)", UserWarning)

File ~/repos/Nasim/nasim_venv/lib/python3.9/site-packages/torch/serialization.py:242, in _open_zipfile_reader.init(self, name_or_buffer)
241 def init(self, name_or_buffer) -> None:
--> 242 super(_open_zipfile_reader, self).init(torch._C.PyTorchFileReader(name_or_buffer))

RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

Does not recognize dadmatools as a package in Windows

Hello, when I import the library it throws an error that dadmatools is not a package:
import dadmatools.pipline.language ModuleNotFoundError: No module named 'dadmatools.pipline'; 'dadmatools' is not a package

Is there a way to correct the half-spaces?

Hello again,
As you may know, correcting half-spaces (zero-width non-joiners) in languages such as Persian can affect the performance of other NLP tools and applications. Is there a way in DadmaTools to correct half-spaces?
Thank you

Installation Error in Ubuntu

ERROR: Could not build wheels for spacy, tokenizers which use PEP 517 and cannot be installed directly

I installed Rust, but the error still remains.

Thank you in advance.

parsbert error

file /home/milad/anaconda3/lib/python3.9/site-packages/dadmatools/saved_models/parsbert/parsbert/config.json not found


OSError Traceback (most recent call last)
~/anaconda3/lib/python3.9/site-packages/transformers/configuration_utils.py in get_config_dict(cls, pretrained_model_name_or_path, **kwargs)
511 # Load from URL or cache if already cached
--> 512 resolved_config_file = cached_path(
513 config_file,

~/anaconda3/lib/python3.9/site-packages/transformers/file_utils.py in cached_path(url_or_filename, cache_dir, force_download, proxies, resume_download, user_agent, extract_compressed_file, force_extract, use_auth_token, local_files_only)
1377 # File, but it doesn't exist.
-> 1378 raise EnvironmentError(f"file {url_or_filename} not found")
1379 else:

OSError: file /home/milad/anaconda3/lib/python3.9/site-packages/dadmatools/saved_models/parsbert/pa

Constituency parser

I am using 'import dadmatools.models.constituency_parser as cons' as written in the test file, but I get a 'ModuleNotFoundError: No module named 'dadmatools.models'' error. How can I solve it?

pip can't install dadmatools

I tried to use the Colab notebook you provided, but unfortunately it can't install dadmatools properly.

Here is the error:

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting dadmatools
  Using cached dadmatools-1.5.2-py3-none-any.whl (862 kB)
Collecting bpemb>=0.3.3 (from dadmatools)
  Using cached bpemb-0.3.4-py3-none-any.whl (19 kB)
Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (from dadmatools) (3.8.1)
Requirement already satisfied: folium>=0.2.1 in /usr/local/lib/python3.10/dist-packages (from dadmatools) (0.14.0)
Requirement already satisfied: spacy>=3.0.0 in /usr/local/lib/python3.10/dist-packages (from dadmatools) (3.5.2)
Collecting sklearn>=0.0 (from dadmatools)
  Using cached sklearn-0.0.post5.tar.gz (3.7 kB)
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> See above for output.
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  Preparing metadata (setup.py) ... error
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

I really want to use your tool but I don't know how. Thanks for your project.

Potential performance Issue: Slow read_csv() Function with pandas 1.3.3

Issue Description:

Hello.
Hello.
I have discovered a performance degradation in the read_csv function of pandas 1.3.3. I notice that some parts of the repository depend on pandas 1.3.3 in dadmatools/requirements.txt, and some other dependencies require pandas below 1.4. I am not sure whether this performance problem in pandas affects this repository. I found some discussions on the pandas GitHub related to this issue, including #44158 and #44610.
I also found that dadmatools/pipeline/informal2formal/utils.py and dadmatools/pipeline/informal2formal/VerbHandler.py use the affected API. There may be more files using it.

Suggestion

I would recommend considering an upgrade to a different version of pandas >= 1.4 or exploring other solutions to optimize the performance of read_csv.
Any other workarounds or solutions would be greatly appreciated.
Thank you!

Verb suffixes

Hello and regards. Thank you for developing this tool.
I wanted to extract the suffix (personal ending) of a verb, but I didn't know whether this is possible at all.
For example, for the verb می‌رویم, it should return 'یم'. A few more examples of this form:
خواهم نوشت => م
می‌نویسند => ند
Is this possible for Persian, and can it be done with this tool?
I would appreciate a reply.
