
pypinyin-g2pw's Introduction

pypinyin-g2pW

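Improves the accuracy of pypinyin using g2pW.

Features:

  • Pinyin accuracy can be improved by training the model.
  • Features and usage conventions are largely consistent with pypinyin, and multiple pinyin styles are supported.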

Usage

Installing dependencies

  1. Install PyTorch

  2. Download and extract G2PWModel:

    wget https://storage.googleapis.com/esun-ai/g2pW/G2PWModel-v2-onnx.zip
    unzip G2PWModel-v2-onnx.zip
    
  3. Install git-lfs

  4. Download bert-base-chinese:

    git lfs install
    git clone https://huggingface.co/bert-base-chinese
    
  5. Install this project:

    pip install pypinyin-g2pw
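
Before initializing the library, it can help to confirm that the two downloaded directories are where the code expects them. The following is a minimal sketch, not part of pypinyin-g2pw: the helper name `missing_model_dirs` is hypothetical, and the default directory names `G2PWModel` and `bert-base-chinese` are assumptions taken from the steps above and the usage example that follows.

```python
from pathlib import Path


def missing_model_dirs(model_dir="G2PWModel", model_source="bert-base-chinese"):
    """Return the given paths that are not existing directories.

    Hypothetical helper for illustration; the default names are assumed
    from the download steps and may differ on your machine.
    """
    candidates = [Path(model_dir), Path(model_source)]
    return [str(path) for path in candidates if not path.is_dir()]


if __name__ == "__main__":
    missing = missing_model_dirs()
    if missing:
        print("Missing model directories:", ", ".join(missing))
    else:
        print("Both model directories are present.")
```

An empty list means both paths exist and can be passed to `model_dir` and `model_source`.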
    

Usage example

>>> from pypinyin import Style
>>> from pypinyin_g2pw import G2PWPinyin

# Point model_dir and model_source at the downloaded model data directories
>>> g2pw = G2PWPinyin(model_dir='G2PWModel/',
...                   model_source='bert-base-chinese/',
...                   v_to_u=False, neutral_tone_with_five=True)
>>> han = '然而,他红了20年以后,他竟退出了大家的视线。'

# def lazy_pinyin(self, hans, style=Style.NORMAL, errors='default', strict=True, **kwargs)
# Use the lazy_pinyin method to get the pinyin; each parameter has the same meaning and effect as in pypinyin.
# The v_to_u and neutral_tone_with_five parameters can only be set when initializing G2PWPinyin.

>>> g2pw.lazy_pinyin(han)
['ran', 'er', ',', 'ta', 'hong', 'le', '20', 'nian', 'yi', 'hou', ',', 'ta', 'jing', 'tui', 'chu', 'le', 'da', 'jia', 'de', 'shi', 'xian', '。']

>>> g2pw.lazy_pinyin(han, style=Style.TONE)
['rán', 'ér', ',', 'tā', 'hóng', 'le', '20', 'nián', 'yǐ', 'hòu', ',', 'tā', 'jìng', 'tuì', 'chū', 'le', 'dà', 'jiā', 'de', 'shì', 'xiàn', '。']

>>> g2pw.lazy_pinyin(han, style=Style.TONE3)
['ran2', 'er2', ',', 'ta1', 'hong2', 'le5', '20', 'nian2', 'yi3', 'hou4', ',', 'ta1', 'jing4', 'tui4', 'chu1', 'le5', 'da4', 'jia1', 'de5', 'shi4', 'xian4', '。']
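
The Style.TONE3 output above mixes pinyin syllables with pass-through tokens such as punctuation and digits. As a rough illustration of post-processing that list, the sketch below splits each syllable into a (syllable, tone) pair and leaves other tokens untouched. The helper `split_tone3` is hypothetical and not part of pypinyin or pypinyin-g2pw; it assumes tokens shaped like the ones in the example output.

```python
import re

# A TONE3 syllable: lowercase letters followed by one tone digit (1-5).
_TONE3 = re.compile(r"^([a-zü]+)([1-5])$")


def split_tone3(tokens):
    """Split TONE3 tokens into (syllable, tone) pairs.

    Tokens that are not syllables (punctuation, numbers) are returned
    with a tone of None. Hypothetical helper for illustration only.
    """
    pairs = []
    for token in tokens:
        match = _TONE3.match(token)
        if match:
            pairs.append((match.group(1), int(match.group(2))))
        else:
            pairs.append((token, None))
    return pairs


# Input copied from the first tokens of the TONE3 example output.
print(split_tone3(['ran2', 'er2', ',', 'ta1', 'hong2', 'le5', '20']))
```

Because the example initializes G2PWPinyin with neutral_tone_with_five=True, neutral-tone syllables such as 'le5' carry an explicit digit and are split like any other syllable.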

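Offline use

By default, even when offline model data is used, the transformers module used by the program still downloads some model metadata from huggingface.co. Setting the environment variables TRANSFORMERS_OFFLINE=1 and HF_DATASETS_OFFLINE=1 disables this metadata fetching, which allows fully offline use. See the official transformers documentation for details.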
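
The two variables can be exported in the shell, or set from Python before the library is imported. The sketch below shows the second option; the helper name `enable_offline_mode` is hypothetical. Setting the variables before the import is the cautious ordering, on the assumption that the underlying libraries read them when they are first loaded.

```python
import os


def enable_offline_mode():
    """Set the two environment variables named in the text above.

    Hypothetical helper; call it before importing pypinyin_g2pw so the
    variables are already in place when transformers is loaded.
    """
    os.environ["TRANSFORMERS_OFFLINE"] = "1"
    os.environ["HF_DATASETS_OFFLINE"] = "1"


enable_offline_mode()

# Only import the library after the variables are set, for example:
# from pypinyin_g2pw import G2PWPinyin
```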

Model training

See the instructions in the official g2pW documentation.

pypinyin-g2pw's People

Contributors

gitycc, mozillazg, shuishu

pypinyin-g2pw's Issues

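Incorrect pinyin for some Chinese characters

For example, when testing with the character "康", I get the incorrect result [['kāng', 'kàng']], although "康" has only one reading, kāng.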
