
Missing syllables and punctuation (botok issue, closed)

openpecha commented on August 21, 2024

Missing syllables and punctuation


Comments (3)

thubtenrigzin commented on August 21, 2024

Here is the complete script:

# pybo is the earlier name of the botok package.
from pybo import BoTokenizer

issues = ["འཐུང་བུད་", "ཨ་དྷྱིད་ཤུ་ཀ་ར་", "ཀི་བི་ཏི་སྭཱ་", "ལང་ཏང་ཙེ་དང་བྱེ་", "ད་མེད་བྷ་གར་", "རབ་བསྐུས་ནས།",
        "གདབ། །ཨོཾ་ན་", "བི་སི་ནི་", "བསྐོལ། །རྡོ་རྗེ་", "བསྐུས་ཤིང་མཉེས་", "སྦོམ་ཞིང་ཆེ་", "བྷ་ག་ཁ་ཆེ་", "།ཨོཾ་གི་རི་ཧི་རི་ཙི་རི། །ཨཱ་ཨཱ་ཤུ་མ་ཤ་", "བཟླས་བྱས་",
        "བསྣམས། །རྩི་", "ནཱ་ཤ་", "གནོས་སྨྱོ་བྱེད་བརྗེད་", "གདོད་ཟིན་པ", "བསྲེས་དམར་ནག་", "བརྗེ་ཞིང་བསྐྱར་", "འཆང་མའི།", "དམོད་གཟུག་",
          "བཤགས་ན་བུ་", "མཐོལ་མགོ་ལ་", "ཧུ་ཧཾ།", "སྲི་མོ་བཛྲ་ནོ་ཏི་སྟ་ཀཱི་", "གུམ་དང་།", "ཡོལ་གྱིས", "སྐུད་སྣ་", "བཀྲ་མ།", "གདོད་པར་བྱ", "བསྒྲིབས་ཡོངས་སུ་", "དྲངས་ནས།", "རཱུ་ཏྲ་ཀྵ་གནས་",
          "ལྡང་པ་ན།", "བསྲུབས་བྱས་པས། །ལྟེ་ལྐོག་", "བསྟུན་ལ་ཉམས་", "ཥ་ཡིག་རྣམ་", "འཛོམ། །རྣོ་", "པྲི་ཡིག་དམར་", "གཏུམ་བྱེད་དང་", "ཞིབ་བས་སྦལ།", "གཅོད་འཁོར་ལོ་", "བཏུལ་མཚམས་བཅད་པ",
          "ཞལ་བྷ་ག་", "བསྐུར་ལས་ཀྱི་", "འཁོས་དུ། །ཆེ་", "ནུ་ཧེ་རུ་", "བརྩེགས་རྣམ་པར་", "བྷ་གར་", "ནུ་ཡེ་ཤེས་", "བརྩེགས་ངེས་པ་", "བཟླས་བསྐུལ་གསུང་", "བྷ་གར་འཁྱིལ། །ཨོཾ་",
          "བྷ་གར་སྦྱོར་", "བརྒྱུད་སྐུ་གདུང་", "སྒལ་བརྒྱུད་ཞབས་", "བརྩེགས་ཆེ་མཆོག་", "།་གླེན་ལྐུགས་"]
details = ["བུད་", "ཀ་", "ཏི་", "དང་ and དང་", "གར་", "ནས་", "space and །", "སི་", "space and །", "ཤིང་", "ཞིང་", "ག་", "རི་ རི་ ཨཱ་ མ་", "བྱས་",
           "space and །", "ཤ་", "བྱེད་", "ཟིན་", "དམར་", "ཞིང་", "མའི།", "གཟུག་", "ན་", "མགོ་", "ཧཾ", "མོ་ སྟ་", "དང་", "གྱིས", "སྣ་", "མ", "པར་", "ཡོངས་", "ནས", "ཀྵ་",
           "པ་", "བྱས་ ལྐོག་", "ལ་", "ཡིག་", "space and །", "ཡིག་", "བྱེད་", "བས་", "འཁོར་", "མཚམས་བཅད་", "ག་", "ལས་", "དུ", "ཧེ་", "རྣམ་", "གར་", "ཡེ་", "ངེས་", "བསྐུལ་", "གར་ space and །",
           "གར་", "སྐུ་", "བརྒྱུད་", "ཆེ་", "་ before གླེན་"]

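# Tokenize with the POS profile.
tok = BoTokenizer("POS")

# Rejoin the token contents; whatever `details` lists is absent from this output.
for detail, issue in zip(details, issues):
    tokens = tok.tokenize(issue)
    result = "".join(t.content for t in tokens)
    print("Missing : %s for %s - Tokeniser output : %s" % (detail, issue, result))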
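The `details` list above was written by hand. As a rough sketch of how the dropped pieces could be found automatically, the helper below diffs an input string against the rejoined tokenizer output using the standard library's `difflib`. The name `dropped_spans` is hypothetical and not part of pybo or botok, and the second argument in the example is a hand-written stand-in rather than real tokenizer output.

```python
from difflib import SequenceMatcher


def dropped_spans(original, rejoined):
    """Return the substrings of `original` that are absent from `rejoined`."""
    # autojunk=False keeps the matcher from ignoring frequent characters.
    matcher = SequenceMatcher(None, original, rejoined, autojunk=False)
    dropped = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # "delete" and "replace" both mean original[i1:i2] did not survive.
        if tag in ("delete", "replace"):
            dropped.append(original[i1:i2])
    return dropped


# Hand-written stand-in for a lossy output; no tokenizer is called here.
print(dropped_spans("འཐུང་བུད་", "འཐུང་"))
```

In the loop above, `dropped_spans(issue, result)` could replace the manual `details` entry for each sample, assuming `result` is the rejoined output as computed there.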


thubtenrigzin commented on August 21, 2024

I found out that the issues differ depending on the profile you choose.

Here is another script to show what I mean:

Here I have switched to the GMD profile instead of POS.

# pybo is the earlier name of the botok package.
from pybo import BoTokenizer

issues = ["འཐུང་བུད་", "ཨ་དྷྱིད་ཤུ་ཀ་ར་", "ཀི་བི་ཏི་སྭཱ་", "etc...", "ལང་ཏང་ཙེ་དང་བྱེ་", "ད་མེད་བྷ་གར་", "རབ་བསྐུས་ནས།",
        "གདབ། །ཨོཾ་ན་", "བི་སི་ནི་", "བསྐོལ། །རྡོ་རྗེ་", "བསྐུས་ཤིང་མཉེས་", "སྦོམ་ཞིང་ཆེ་", "བྷ་ག་ཁ་ཆེ་", "།ཨོཾ་གི་རི་ཧི་རི་ཙི་རི། །ཨཱ་ཨཱ་ཤུ་མ་ཤ་", "བཟླས་བྱས་",
        "བསྣམས། །རྩི་", "ནཱ་ཤ་", "གནོས་སྨྱོ་བྱེད་བརྗེད་", "གདོད་ཟིན་པ", "བསྲེས་དམར་ནག་", "བརྗེ་ཞིང་བསྐྱར་", "འཆང་མའི།", "དམོད་གཟུག་",
          "བཤགས་ན་བུ་", "མཐོལ་མགོ་ལ་", "ཧུ་ཧཾ།", "སྲི་མོ་བཛྲ་ནོ་ཏི་སྟ་ཀཱི་", "གུམ་དང་།", "ཡོལ་གྱིས", "སྐུད་སྣ་", "བཀྲ་མ།", "གདོད་པར་བྱ", "བསྒྲིབས་ཡོངས་སུ་", "དྲངས་ནས།", "རཱུ་ཏྲ་ཀྵ་གནས་",
          "ལྡང་པ་ན།", "བསྲུབས་བྱས་པས། །ལྟེ་ལྐོག་", "བསྟུན་ལ་ཉམས་", "ཥ་ཡིག་རྣམ་", "འཛོམ། །རྣོ་", "པྲི་ཡིག་དམར་", "གཏུམ་བྱེད་དང་", "ཞིབ་བས་སྦལ།", "གཅོད་འཁོར་ལོ་", "བཏུལ་མཚམས་བཅད་པ",
          "ཞལ་བྷ་ག་", "བསྐུར་ལས་ཀྱི་", "འཁོས་དུ། །ཆེ་", "ནུ་ཧེ་རུ་", "བརྩེགས་རྣམ་པར་", "བྷ་གར་", "ནུ་ཡེ་ཤེས་", "བརྩེགས་ངེས་པ་", "བཟླས་བསྐུལ་གསུང་", "བྷ་གར་འཁྱིལ། །ཨོཾ་",
          "བྷ་གར་སྦྱོར་", "བརྒྱུད་སྐུ་གདུང་", "སྒལ་བརྒྱུད་ཞབས་", "བརྩེགས་ཆེ་མཆོག་", "།་གླེན་ལྐུགས་"]
details = ["Now བུད་ appears", "ཀ་ also", "but with ཏི་, the problem remains", "", "དང་ and དང་", "གར་", "ནས་", "space and །", "སི་", "space and །", "ཤིང་", "ཞིང་", "ག་", "རི་ རི་ ཨཱ་ མ་", "བྱས་",
           "space and །", "ཤ་", "བྱེད་", "ཟིན་", "དམར་", "ཞིང་", "མའི།", "གཟུག་", "ན་", "མགོ་", "ཧཾ", "མོ་ སྟ་", "དང་", "གྱིས", "སྣ་", "མ", "པར་", "ཡོངས་", "ནས", "ཀྵ་",
           "པ་", "བྱས་ ལྐོག་", "ལ་", "ཡིག་", "space and །", "ཡིག་", "བྱེད་", "བས་", "འཁོར་", "མཚམས་བཅད་", "ག་", "ལས་", "དུ", "ཧེ་", "རྣམ་", "གར་", "ཡེ་", "ངེས་", "བསྐུལ་", "གར་ space and །",
           "གར་", "སྐུ་", "བརྒྱུད་", "ཆེ་", "་ before གླེན་"]

# Same test, this time with the GMD profile.
tok = BoTokenizer("GMD")
print()
print("Here BoTokenizer uses the GMD profile instead of POS; the issues are not the same...")
print()
# `details` holds a free-form note per sample, so it is printed as-is.
for detail, issue in zip(details, issues):
    tokens = tok.tokenize(issue)
    result = "".join(t.content for t in tokens)
    print("%s for %s - Tokeniser output : %s" % (detail, issue, result))
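To see at a glance which samples behave differently between profiles, the two runs could be compared side by side. The sketch below is only an illustration: `compare_profiles` is a hypothetical helper, and the two lambdas are fake stand-ins so the block runs without pybo installed.

```python
def compare_profiles(samples, tokenizers):
    """Return the samples whose rejoined output differs between tokenizers.

    `tokenizers` maps a profile name to a callable that takes a string and
    returns the rejoined tokenizer output for it.
    """
    differing = {}
    for sample in samples:
        outputs = {name: rejoin(sample) for name, rejoin in tokenizers.items()}
        # More than one distinct output means the profiles disagree.
        if len(set(outputs.values())) > 1:
            differing[sample] = outputs
    return differing


# Hypothetical stand-ins, not real pybo profiles.
fake_tokenizers = {
    "POS": lambda text: text.replace(" ", ""),  # pretends to drop spaces
    "GMD": lambda text: text,                   # pretends to be lossless
}
print(compare_profiles(["a b", "ab"], fake_tokenizers))
```

With pybo, each callable could wrap a tokenizer the same way the loops above do, for example `lambda s, tok=tok: "".join(t.content for t in tok.tokenize(s))`.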


thubtenrigzin commented on August 21, 2024

These issues are fixed in the 0.2.0 release.
