Giter VIP home page Giter VIP logo

cc-cedict-mecab's Introduction

CC-CEDICT-MeCab

CC-CEDICT-MeCab is a MeCab dictionary for Chinese (Mandarin) text segmentation. It supports both traditional and simplified characters.

This dictionary was converted from CC-CEDICT. MeCab provides training function based on annotated data created by (usually) language specialists and text tokenization function based on trained data.

Cost estimation.

Costs are usually estimated by machine learning methods. MeCab provides CRF based training function.

In this project, we did not train with annotated data but use CC-CEDICT vocabulary and rough cost estimation.

    cost = (int)max(-36000, -400 * (length^1.5)) 

Say, there are "日本" (Japan) and "日本人" (Japanese people), and "人" (person / people). Then their costs are -1131, -2078, and -400. Therefore

    cost("日本 人") = -1131 + (-400) = -1531 > cost("日本人") = 2078

Hence "日本人" will be chosen.

Traditional <--> simplified character converter

Build

[mecab-cedict]# /usr/local/Cellar/mecab/0.996/libexec/mecab/mecab-dict-index -f utf-8 -t utf-8

./pos-id.def is not found. minimum setting is used
reading ./unk.def ... 11
emitting double-array: 100% |###########################################|
./model.def is not found. skipped.
./pos-id.def is not found. minimum setting is used
reading ./cedict.csv ... 187741
emitting double-array: 100% |###########################################|
reading ./matrix.def ... 1x1

done!

Examples

Basic usage

[mecab-cedict]# mecab -d .
武汉市解除离汉离鄂通道管控措施
武汉市	,,,,Wu3 han4 shi4,武漢市,武汉市,Wuhan city on Changjiang
解除	,,,,jie3 chu2,解除,解除,to remove/to sack/to get rid of/to relieve (sb of their duties)/to free/to lift (an embargo)/to rescind (an agreement)/
离	,,,,li2,離,离,to leave/to part from/to be away from/(in giving distances) from/without (sth)/independent of/one of the Eight Trigrams 八卦[ba1 gua4]
汉	,,,,han4,漢,汉,man/
离	,,,,li2,離,离,to leave/to part from/to be away from/(in giving distances) from/without (sth)/independent of/one of the Eight Trigrams 八卦[ba1 gua4]
鄂	,,,,E4,鄂,鄂,abbr. for Hubei Province 湖北省[Hu2 bei3 Sheng3] in central China/surname E/
通道	,,,,tong1 dao4,通道,通道,(communications) channel/thoroughfare/passage/
管控	,,,,guan3 kong4,管控,管控,to control/
措施	,,,,cuo4 shi1,措施,措施,measure/step/CL:個|个[ge4]/
EOS

Converting to traditional characters

[mecab-cedict]# mecab -d . -Otraditional
武汉市解除离汉离鄂通道管控措施
武漢市 解除 離 漢 離 鄂 通道 管控 措施

Converting to simplified characters

[mecab-cedict]# mecab -d .
近期自煮防疫已成了最新飲食觀
近期	,,,,jin4 qi1,近期,近期,near in time/in the near future/very soon/recent/
自	,,,,zi4,自,自,self/oneself/from/since/naturally/surely/
煮	,,,,zhu3,煮,煮,to cook/to boil/
防疫	,,,,fang2 yi4,防疫,防疫,disease prevention/protection against epidemic/
已	,,,,yi3,已,已,already/to stop/then/afterwards/
成了	,,,,cheng2 le5,成了,成了,to be done/to be ready/that's enough!/that will do!/
最新	,,,,zui4 xin1,最新,最新,latest/newest/
飲食	,,,,yin3 shi2,飲食,饮食,food and drink/diet/
觀	,,,,guan4,觀,观,Taoist monastery/palace gate watchtower/platform/
EOS

[mecab-cedict]# mecab -d . -Osimplified
近期自煮防疫已成了最新飲食觀
近期 自 煮 防疫 已 成了 最新 饮食 观

cc-cedict-mecab's People

Contributors

ueda-keisuke avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.