openpecha / pybo

🦜 NLP for Tibetan, in Python.

Home Page: https://esukhia.github.io/pybo/

License: Apache License 2.0

Python 100.00%

nlp computational-linguistics search ngrams language-models linguistics toolkit tibetan tibetan-nlp

pybo's Introduction

PYBO - Tibetan NLP in Python

Overview

pybo tokenizes Tibetan text into words. Its command-line tool is called bo.

Basic usage

Getting started

Requires Python 3.

python3 -m pip install pybo

Tokenizing a string

drupchen@drupchen:~$ bo tok-string "༄༅། །རྒྱ་གར་སྐད་དུ། བོ་དྷི་སཏྭ་ཙརྻ་ཨ་བ་ཏ་ར། བོད་སྐད་དུ། བྱང་ཆུབ་སེམས་དཔའི་སྤྱོད་པ་ལ་འཇུག་པ། །
སངས་རྒྱས་དང་བྱང་ཆུབ་སེམས་དཔའ་ཐམས་ཅད་ལ་ཕྱག་འཚལ་ལོ། །བདེ་གཤེགས་ཆོས་ཀྱི་སྐུ་མངའ་སྲས་བཅས་དང༌། །ཕྱག་འོས་ཀུན་ལའང་གུས་པར་ཕྱག་འཚལ་ཏེ། །བདེ་གཤེགས་
སྲས་ཀྱི་སྡོམ་ལ་འཇུག་པ་ནི། །ལུང་བཞིན་མདོར་བསྡུས་ནས་ནི་བརྗོད་པར་བྱ། །"
Loading Trie... (2s.)
༄༅།_། རྒྱ་གར་ སྐད་ དུ །_ བོ་ དྷི་ སཏྭ་ ཙརྻ་ ཨ་བ་ ཏ་ ར །_ བོད་སྐད་ དུ །_ བྱང་ཆུབ་ སེམས་དཔ འི་ སྤྱོད་པ་ ལ་ འཇུག་པ །_། སངས་རྒྱས་ དང་ བྱང་ཆུབ་
སེམས་དཔའ་ ཐམས་ཅད་ ལ་ ཕྱག་ འཚལ་ ལོ །_། བདེ་གཤེགས་ ཆོས་ ཀྱི་ སྐུ་ མངའ་ སྲས་ བཅས་ དང༌ །_། ཕྱག་འོས་ ཀུན་ ལ འང་ གུས་པ ར་ ཕྱག་ འཚལ་
ཏེ །_། བདེ་གཤེགས་ སྲས་ ཀྱི་ སྡོམ་ ལ་ འཇུག་པ་ ནི །_། ལུང་ བཞིན་ མདོར་བསྡུས་ ནས་ ནི་ བརྗོད་པ ར་ བྱ །_།

Tokenizing a list of files

The command to tokenize a list of files in a directory:

bo tok <path-to-directory>

For example, to tokenize the file text.txt in the directory ./document/, which has the following content:

བཀྲ་ཤི་ས་བདེ་ལེགས་ཕུན་སུམ་ཚོགས། །རྟག་ཏུ་བདེ་བ་ཐོབ་པར་ཤོག། །

I use the command:

$ bo tok ./document/

...which creates a file text.txt in a directory ./document_pybo containing:

བཀྲ་ ཤི་ ས་ བདེ་ལེགས་ ཕུན་སུམ་ ཚོགས །_། རྟག་ ཏུ་ བདེ་བ་ ཐོབ་པ ར་ ཤོག །_།
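The output above suggests a simple way to go back from segmented text to running text: tokens are separated by spaces, and a space that belonged to the original text appears as an underscore inside a token. The sketch below relies on that reading of the sample output. It is not part of pybo, and the function name untokenize is hypothetical.

```python
def untokenize(segmented: str) -> str:
    """Rebuild running text from space-separated tokens.

    Assumptions (inferred from the sample output, not from pybo's code):
    tokens are separated by whitespace, and an underscore inside a token
    stands for a space in the original text.
    """
    tokens = segmented.split()
    # Restore the original spaces, then glue the tokens back together.
    return "".join(token.replace("_", " ") for token in tokens)


if __name__ == "__main__":
    print(untokenize("ཤོག །_།"))
```

A token that contains a literal underscore in the source text would be restored incorrectly, and the original line breaks are not recovered.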

Sorting Tibetan words

$ bo kakha to-sort.txt

The expected input is one word or entry per line in a .txt file. The file will be overwritten.

FNR - Find and Replace with a list of regexes

bo fnr <in-dir> <regex-file> -o <out-dir> -t <tag>

The -o and -t options are optional.

Text files should be UTF-8 plain text files. The regexes should be in the following format:

<find-pattern><tab>-<tab><replace-pattern>
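To make the rule format concrete, here is a minimal sketch of how such a file could be parsed and applied with Python's re module. It is not the code behind bo fnr: the function names parse_rules and apply_rules are hypothetical, and skipping blank lines and applying rules in file order are assumptions.

```python
import re
from typing import List, Tuple

SEPARATOR = "\t-\t"  # <tab>-<tab>, as in the format described above


def parse_rules(text: str) -> List[Tuple[str, str]]:
    """Parse one rule per line: <find-pattern><tab>-<tab><replace-pattern>."""
    rules = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines are ignored
        parts = line.split(SEPARATOR, 1)
        if len(parts) != 2:
            raise ValueError(f"line {number}: expected <find><tab>-<tab><replace>")
        rules.append((parts[0], parts[1]))
    return rules


def apply_rules(text: str, rules: List[Tuple[str, str]]) -> str:
    """Apply every rule in order (assumption: file order matters)."""
    for find, replace in rules:
        text = re.sub(find, replace, text)
    return text


if __name__ == "__main__":
    sample_rules = "colou?r\t-\tcolor\n(\\d+)-(\\d+)\t-\t\\2-\\1\n"
    print(apply_rules("colour 12-34", parse_rules(sample_rules)))
```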

Acknowledgements

pybo is an open source library for Tibetan NLP.

We are always open to cooperation in introducing new features, tool integrations and testing solutions.

Many thanks to the companies and organizations who have supported pybo's development, especially:

Khyentse Foundation for contributing USD22,000 to kickstart the project
The Barom/Esukhia canon project for sponsoring training data curation
BDRC for contributing 2 staff for 6 months for data curation

third_party/rules.txt is taken from tibetan-collation.

Contributing

First, clone this repo. Create a virtual environment and activate it. Then install the dependencies:

$ pip install -e .
$ pip install -r requirements-dev.txt

Next, set up pre-commit by installing the pre-commit git hook:

$ pre-commit install

Please follow the Angular commit message format for commit messages. We have set up python-semantic-release to publish the pybo package automatically based on commit messages.

That's all. Enjoy contributing 🎉🎉🎉

License

contributors:

Drupchen
Élie Roux
Ngawang Trinley
Tenzin
Joyce Mackzenzie for reworking the logo

pybo's Issues

Customised tok command

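Change tok to accept a tags argument such as "rtpls", where each letter selects one output field: r = raw-text, t = clean-text, p = pos, l = lemma, s = sense. The output fields should appear in the order in which the tags are given.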

Example with all tags

$ pybo tok --tags rtpls <text>
output:
<raw-text1>/<clean-text1>/<pos1>/<lemma1>/<sense1> <raw-text2>/<clean-text2>/<pos2>/<lemma2>/<sense2>

Example with a few tags

$ pybo tok --tags tp <text>
output:
<clean-text1>/<pos1> <clean-text2>/<pos2>
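The requested behaviour could be sketched as follows. This only illustrates the proposal and is not existing pybo code: the token dictionaries, their keys, the sample values and the function name format_tokens are all hypothetical.

```python
from typing import Dict, List

# Tag letter -> token field, following the mapping proposed in this issue.
FIELDS = {
    "r": "raw_text",
    "t": "clean_text",
    "p": "pos",
    "l": "lemma",
    "s": "sense",
}


def format_tokens(tokens: List[Dict[str, str]], tags: str) -> str:
    """Join the selected fields with "/" in the order given by `tags`."""
    unknown = [tag for tag in tags if tag not in FIELDS]
    if unknown:
        raise ValueError(f"unknown tag(s): {''.join(unknown)}")
    return " ".join(
        "/".join(token[FIELDS[tag]] for tag in tags) for token in tokens
    )


if __name__ == "__main__":
    # Placeholder tokens; real tokens would come from the tokenizer.
    tokens = [
        {"raw_text": "w1 ", "clean_text": "w1", "pos": "NOUN", "lemma": "w1", "sense": "s1"},
        {"raw_text": "w2", "clean_text": "w2", "pos": "VERB", "lemma": "w2", "sense": "s2"},
    ]
    print(format_tokens(tokens, "tp"))
    print(format_tokens(tokens, "pt"))
```

Because tokens are separated by spaces in the output, a raw-text field that itself contains spaces would need escaping; the sketch leaves that question open.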

Untokenise functionality needed

guessing dadrak

In some contexts (e.g. phonetics), it would be useful to know whether a given syllable has a da drag; this can often be guessed from context.

For instance, གྱུར, བསྐོར or བསྟན don't always have an invisible da drag, but when they are followed by a particle, the particle's form indicates whether there is a da drag:

བསྐོར་ཅིང --> བསྐོར has a da drag
བསྐོར་ཞིང --> བསྐོར has no da drag
བསྟན་ཀྱང --> བསྟན has a da drag
བསྟན་ཡང --> བསྟན has no da drag

etc.

There are of course a few words that conventionally take a da drag, but there seems to be no consensus on the list, apart from ཀུན, ཤིན and འོན.
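A minimal sketch of the particle-based guess described above, covering only the two particle pairs from the examples (ཅིང versus ཞིང, and ཀྱང versus ཡང). The function name guess_da_drag is hypothetical, and restricting the check to syllables ending in ན, ར or ལ is an assumption based on traditional grammar rather than something stated in this issue.

```python
from typing import Optional

TSEK = "\u0f0b"  # syllable delimiter
SHAD = "\u0f0d"  # clause-final punctuation

# Assumption: a da drag is only considered after these final letters.
CANDIDATE_FINALS = {"ན", "ར", "ལ"}

# Particle forms taken from the examples in this issue.
IMPLIES_DA_DRAG = {"ཅིང", "ཀྱང"}
IMPLIES_NO_DA_DRAG = {"ཞིང", "ཡང"}


def guess_da_drag(syllable: str, particle: str) -> Optional[bool]:
    """Return True or False when the particle decides, None when it does not."""
    syllable = syllable.strip(TSEK + SHAD + " ")
    particle = particle.strip(TSEK + SHAD + " ")
    if not syllable or syllable[-1] not in CANDIDATE_FINALS:
        return None
    if particle in IMPLIES_DA_DRAG:
        return True
    if particle in IMPLIES_NO_DA_DRAG:
        return False
    return None


if __name__ == "__main__":
    pairs = [("བསྐོར", "ཅིང"), ("བསྐོར", "ཞིང"), ("བསྟན", "ཀྱང"), ("བསྟན", "ཡང")]
    for syllable, particle in pairs:
        print(syllable + TSEK + particle, guess_da_drag(syllable, particle))
```

Returning None keeps "the context does not decide" separate from "no da drag", which matters for words that take a da drag only by convention.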

I have installed pybo with python3 -m pip install pybo
However, I cannot run the following command:
pybo tok-string "༄༅། །རྒྱ་གར་སྐད་དུ། བོ་དྷི་སཏྭ་ཙརྻ་ཨ་བ་ཏ་ར། བོད་སྐད་དུ། བྱང་ཆུབ་སེམས་དཔའི་སྤྱོད་པ་ལ་འཇུག་པ། །
སངས་རྒྱས་དང་བྱང་ཆུབ་སེམས་དཔའ་ཐམས་ཅད་ལ་ཕྱག་འཚལ་ལོ། །བདེ་གཤེགས་ཆོས་ཀྱི་སྐུ་མངའ་སྲས་བཅས་དང༌། །ཕྱག་འོས་ཀུན་ལའང་གུས་པར་ཕྱག་འཚལ་ཏེ། །བདེ་གཤེགས་
སྲས་ཀྱི་སྡོམ་ལ་འཇུག་པ་ནི། །ལུང་བཞིན་མདོར་བསྡུས་ནས་ནི་བརྗོད་པར་བྱ། །"

which outputs "Command 'pybo' not found, did you mean:"

syl-based content shelving and reinsertion

User Stories

#	As a...	I want to...	so that...
1.	researcher	segment a collection of Tibetan texts	I can do statistics in AntConc
2.	Tibetan text proofreader	mark potential errors	I can catch and correct more mistakes
3.	corpus researcher for Amdo dialects	create several custom profiles	I can do statistics on different spoken dialects
4.	corpus researcher on literary Tibetan	create a custom profile for the Kangyur	I can do accurate statistics on the Kangyur and Tengyur
5.

Rule based segmentation steps (for story 3 & 4)

Segment a volume with the default profile
Create a word list from the volume, ordered by frequency
Manually clean up the word list
Use the wordlist as the main list
Segment the volume again
Edit the custom profile (word /remove /adjustments) until the segmentation is good
Merge custom profile with main profile
Repeat with a second volume
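The second step above (a word list ordered by frequency) could be sketched like this, assuming the segmented volume is plain text with tokens separated by whitespace, as in the tok output. The function name frequency_list is hypothetical and this is not part of pybo.

```python
from collections import Counter
from typing import List, Tuple


def frequency_list(segmented_text: str) -> List[Tuple[str, int]]:
    """Count whitespace-separated tokens, most frequent first.

    Ties are broken by code point order so the result is deterministic.
    """
    counts = Counter(segmented_text.split())
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    sample = "a b a c b a"  # stand-in for the content of a segmented volume
    for word, count in frequency_list(sample):
        print(f"{word}\t{count}")
```

Punctuation tokens are counted like any other token here; they would have to be filtered out during the manual cleanup step or in an extra pass.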