pybo's Introduction

PYBO - Tibetan NLP in Python

Overview

bo tokenizes Tibetan text into words.

Basic usage

Getting started

Requires Python 3.

python3 -m pip install pybo

Tokenizing a string

drupchen@drupchen:~$ bo tok-string "༄༅། །རྒྱ་གར་སྐད་དུ། བོ་དྷི་སཏྭ་ཙརྻ་ཨ་བ་ཏ་ར། བོད་སྐད་དུ། བྱང་ཆུབ་སེམས་དཔའི་སྤྱོད་པ་ལ་འཇུག་པ། །
སངས་རྒྱས་དང་བྱང་ཆུབ་སེམས་དཔའ་ཐམས་ཅད་ལ་ཕྱག་འཚལ་ལོ། །བདེ་གཤེགས་ཆོས་ཀྱི་སྐུ་མངའ་སྲས་བཅས་དང༌། །ཕྱག་འོས་ཀུན་ལའང་གུས་པར་ཕྱག་འཚལ་ཏེ། །བདེ་གཤེགས་
སྲས་ཀྱི་སྡོམ་ལ་འཇུག་པ་ནི། །ལུང་བཞིན་མདོར་བསྡུས་ནས་ནི་བརྗོད་པར་བྱ། །"
Loading Trie... (2s.)
༄༅།_། རྒྱ་གར་ སྐད་ དུ །_ བོ་ དྷི་ སཏྭ་ ཙརྻ་ ཨ་བ་ ཏ་ ར །_ བོད་སྐད་ དུ །_ བྱང་ཆུབ་ སེམས་དཔ འི་ སྤྱོད་པ་ ལ་ འཇུག་པ །_། སངས་རྒྱས་ དང་ བྱང་ཆུབ་
སེམས་དཔའ་ ཐམས་ཅད་ ལ་ ཕྱག་ འཚལ་ ལོ །_། བདེ་གཤེགས་ ཆོས་ ཀྱི་ སྐུ་ མངའ་ སྲས་ བཅས་ དང༌ །_། ཕྱག་འོས་ ཀུན་ ལ འང་ གུས་པ ར་ ཕྱག་ འཚལ་
ཏེ །_། བདེ་གཤེགས་ སྲས་ ཀྱི་ སྡོམ་ ལ་ འཇུག་པ་ ནི །_། ལུང་ བཞིན་ མདོར་བསྡུས་ ནས་ ནི་ བརྗོད་པ ར་ བྱ །_།

Tokenizing a list of files

The command to tokenize a list of files in a directory:

bo tok <path-to-directory>

For example, to tokenize the file text.txt in a directory ./document/ with the following content:

བཀྲ་ཤི་ས་བདེ་ལེགས་ཕུན་སུམ་ཚོགས། །རྟག་ཏུ་བདེ་བ་ཐོབ་པར་ཤོག། །

I use the command:

$ bo tok ./document/

...which creates a file text.txt in a directory ./document_pybo containing:

བཀྲ་ ཤི་ ས་ བདེ་ལེགས་ ཕུན་སུམ་ ཚོགས །_། རྟག་ ཏུ་ བདེ་བ་ ཐོབ་པ ར་ ཤོག །_།

Sorting Tibetan words

$ bo kakha to-sort.txt

The expected input is one word or entry per line in a .txt file. The file will be overwritten.

FNR - Find and Replace with a list of regexes

bo fnr <in-dir> <regex-file> -o <out-dir> -t <tag>

-o and -t are optional

Text files should be UTF-8 plain text files. The regexes should be in the following format:

<find-pattern><tab>-<tab><replace-pattern>
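
As an illustration of the rule format above, here is a minimal Python sketch that parses such rules and applies them in order. It is not pybo's own implementation; the function names `parse_rules` and `apply_rules` are hypothetical, and it assumes each pattern is a regular expression in Python's `re` syntax.

```python
import re


def parse_rules(text):
    """Parse lines of the form <find-pattern>TAB-TAB<replace-pattern>."""
    rules = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        parts = line.split("\t")
        if len(parts) != 3 or parts[1] != "-":
            raise ValueError(f"malformed rule: {line!r}")
        # Assumption: the find pattern uses Python regex syntax.
        rules.append((re.compile(parts[0]), parts[2]))
    return rules


def apply_rules(text, rules):
    """Apply every rule to the text, in the order the rules were listed."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


# Hypothetical rule: collapse runs of spaces into a single space.
rules = parse_rules(" +\t-\t \n")
print(apply_rules("a   b", rules))
```

Applying the rules sequentially means a later rule sees the output of earlier ones, so their order in the file matters.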

Acknowledgements

  • pybo is an open source library for Tibetan NLP.

We are always open to cooperation in introducing new features, tool integrations and testing solutions.

Many thanks to the companies and organizations who have supported pybo's development, especially:

Contributing

First, clone this repo. Create a virtual environment and activate it. Then install the dependencies:

$ pip install -e .
$ pip install -r requirements-dev.txt

Next, set up pre-commit by installing the pre-commit git hook:

$ pre-commit install

Please follow the Angular commit message format for commit messages. We have set up python-semantic-release to publish the pybo package automatically based on commit messages.

That's all. Enjoy contributing 🎉🎉🎉

License

The Python code is Copyright (C) 2019 Esukhia, provided under Apache 2.

contributors:

pybo's People

Contributors

10zinten, actions-user, drupchen, kaldan007, ngawangtrinley

pybo's Issues

Customised tok command

Change tok to accept a tags argument, "rtpls": r = raw-text, t = clean-text, p = pos, l = lemma, s = sense. The output should follow the order of the tags given in the argument.

Examples of all tags

$ pybo tok --tags rtpls <text>
output:
<raw-text1>/<clean-text1>/<pos1>/<lemma1>/<sense1> <raw-text2>/<clean-text2>/<pos2>/<lemma2>/<sense2>

Example of a few tags

$ pybo tok --tags tp <text>
output:
<clean-text1>/<pos1> <clean-text2>/<pos2>
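
The proposed behaviour could be sketched in Python as follows. This is only an illustration of the request, not existing pybo code: the token dictionaries, the field names, and the functions `format_token` and `format_tokens` are all hypothetical.

```python
# Hypothetical mapping from the one-letter tags to token fields.
TAG_FIELDS = {
    "r": "raw_text",
    "t": "clean_text",
    "p": "pos",
    "l": "lemma",
    "s": "sense",
}


def format_token(token, tags):
    """Join the requested fields of one token with '/', in tag order."""
    fields = []
    for tag in tags:
        if tag not in TAG_FIELDS:
            raise ValueError(f"unknown tag: {tag!r}")
        fields.append(token[TAG_FIELDS[tag]])
    return "/".join(fields)


def format_tokens(tokens, tags):
    """Format every token and separate the results with spaces."""
    return " ".join(format_token(token, tags) for token in tokens)


# Placeholder token, mirroring the placeholders used in the examples above.
token = {
    "raw_text": "<raw-text1>",
    "clean_text": "<clean-text1>",
    "pos": "<pos1>",
    "lemma": "<lemma1>",
    "sense": "<sense1>",
}
print(format_tokens([token], "tp"))
```

Iterating over the tag string itself, rather than over a fixed field list, is what makes the output order follow the order the user typed.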

guessing dadrak

In some contexts (e.g. phonetics), it would be useful to know whether or not there is a da drag on a certain syllable. This can often be guessed from context.

For instance གྱུར, བསྐོར or བསྟན don't always have an invisible da drag, but when they are followed by a particle, the particle's form indicates that there is or isn't a da drag:

  • བསྐོར་ཅིང --> བསྐོར has a da drag
  • བསྐོར་ཞིང --> བསྐོར has no da drag
  • བསྟན་ཀྱང --> བསྟན has a da drag
  • བསྟན་ཡང --> བསྟན has no da drag

etc.

There are of course a few words that conventionally take a da drag, but it seems there's no consensus on the list, apart from ཀུན, ཤིན and འོན
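
A rough Python sketch of the guess described in this issue is given below. It only covers the two particle pairs from the examples, and it assumes that a da drag can only follow syllables ending in ན, ར or ལ. The function name `guess_dadrak` is hypothetical and this is not part of pybo.

```python
TSHEG = "་"

# Assumption: only syllables ending in these suffixes can carry a da drag.
DADRAK_SUFFIXES = ("ན", "ར", "ལ")

# Particle forms taken from the examples above.
WITH_DADRAK = {"ཅིང", "ཀྱང"}
WITHOUT_DADRAK = {"ཞིང", "ཡང"}


def guess_dadrak(syllable, particle):
    """Return True or False when the particle settles the question, else None."""
    syllable = syllable.rstrip(TSHEG)
    particle = particle.rstrip(TSHEG)
    if not syllable.endswith(DADRAK_SUFFIXES):
        return None  # outside the scope of this sketch
    if particle in WITH_DADRAK:
        return True
    if particle in WITHOUT_DADRAK:
        return False
    return None  # the particle gives no information


for pair in ("བསྐོར་ཅིང", "བསྐོར་ཞིང", "བསྟན་ཀྱང", "བསྟན་ཡང"):
    syllable, particle = pair.split(TSHEG)
    print(pair, guess_dadrak(syllable, particle))
```

Returning `None` for undecided cases keeps them distinct from a confident "no da drag", which matters for the words whose status has no agreed list.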

Command 'pybo' not found

I installed pybo with python3 -m pip install pybo
However, I cannot run the following command:
pybo tok-string "༄༅། །རྒྱ་གར་སྐད་དུ། བོ་དྷི་སཏྭ་ཙརྻ་ཨ་བ་ཏ་ར། བོད་སྐད་དུ། བྱང་ཆུབ་སེམས་དཔའི་སྤྱོད་པ་ལ་འཇུག་པ། །
སངས་རྒྱས་དང་བྱང་ཆུབ་སེམས་དཔའ་ཐམས་ཅད་ལ་ཕྱག་འཚལ་ལོ། །བདེ་གཤེགས་ཆོས་ཀྱི་སྐུ་མངའ་སྲས་བཅས་དང༌། །ཕྱག་འོས་ཀུན་ལའང་གུས་པར་ཕྱག་འཚལ་ཏེ། །བདེ་གཤེགས་
སྲས་ཀྱི་སྡོམ་ལ་འཇུག་པ་ནི། །ལུང་བཞིན་མདོར་བསྡུས་ནས་ནི་བརྗོད་པར་བྱ། །"

which outputs "Command 'pybo' not found, did you mean:"

User Stories

# | As a... | I want to... | so that...
1 | researcher | segment a collection of Tibetan texts | I can do statistics in AntConc
2 | Tibetan text proofreader | mark potential errors | I can catch and correct more mistakes
3 | corpus researcher for Amdo dialects | create several custom profiles | I can do statistics on different spoken dialects
4 | corpus researcher on literary Tibetan | create a custom profile for the Kangyur | I can do accurate statistics on the Kangyur and Tengyur
5 | | |

Rule based segmentation steps (for story 3 & 4)

  1. Segment a volume with the default profile
  2. Create a word list from the volume, ordered by frequency
  3. Manually clean up the word list
  4. Use the wordlist as the main list
  5. Segment the volume again
  6. Edit the custom profile (word /remove /adjustments) till the segmentation is good
  7. Merge custom profile with main profile
  8. Repeat with a second volume

Steps for story # & #

pybo silently processes a file as a folder

We need either:
a. an error message asking for a folder
b. detect it's a file and process as such

This is actually a case of "duck typing" and we probably don't even need to ask users if it's a folder or a file in the first place
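
One way option (b) could work is sketched below in Python using `pathlib`: the same argument is accepted whether it names a file or a folder. The function name `collect_input_files` and the restriction to .txt files are assumptions made for illustration, not pybo's actual behaviour.

```python
from pathlib import Path


def collect_input_files(path):
    """Return the files to process for a path that is a file or a folder."""
    path = Path(path)
    if path.is_file():
        return [path]
    if path.is_dir():
        # Assumption: only .txt files directly inside the folder are processed.
        return sorted(
            p for p in path.iterdir() if p.is_file() and p.suffix == ".txt"
        )
    raise FileNotFoundError(f"no such file or directory: {path}")
```

Raising an explicit error for a missing path also covers option (a), since the command no longer fails silently.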
