
fastcws

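A lightweight, high-performance Chinese word segmentation project

Animated demo

As the title says, fastcws is very fast. In the animated demo, a cold start (loading only) takes 0.125 s, and a cold start plus segmenting 180,000 characters takes 0.35 s. By a rough estimate, that is on the order of a million characters per second on a single core.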
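The rough estimate can be reproduced from the two timings, assuming the 0.125 s load time is subtracted from the 0.35 s total to isolate the time spent segmenting:

```latex
\[
\frac{180\,000\ \text{characters}}{0.35\ \text{s} - 0.125\ \text{s}}
= \frac{180\,000\ \text{characters}}{0.225\ \text{s}}
= 800\,000\ \text{characters/s}
\]
```

That is about 0.8 million characters per second on one core, measured on a single run.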

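Command-line tool

The fastcws command-line tool (located at src/tools/fastcws when built from source) reads from stdin, segments the input sentence by sentence, and writes the result to stdout: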

$ fastcws
在春风吹拂的季节翩翩起舞
在/春风/吹拂/的/季节/翩翩起舞/

A pipe makes it easy to segment a file and save the result to another file:

$ cat input.txt | fastcws > output.txt
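If the segmented output is consumed by another program, it has to be split back into words. The sketch below assumes the default `/` delimiter shown in the example output, including the trailing `/` at the end of each line. The function `next_word` is a hypothetical helper, not part of fastcws. Splitting on the `/` byte is safe for UTF-8 text because that byte never occurs inside a multi-byte sequence.

```c
#include <stddef.h>

/* Hypothetical helper (not part of fastcws): finds the next word in a
 * delimiter-separated line such as "a/bc/def/".
 * *cursor points at the unread remainder of the line and is advanced on
 * success. Returns 1 and sets *word_begin / *word_len when a word was
 * found, or 0 when the line is exhausted. delim must not be '\0'. */
int next_word(const char **cursor, char delim,
              const char **word_begin, size_t *word_len)
{
    const char *p = *cursor;

    /* Skip empty fields, e.g. the one produced by a trailing delimiter. */
    while (*p != '\0' && *p == delim) {
        p++;
    }
    if (*p == '\0' || *p == '\n') {
        *cursor = p;
        return 0;
    }

    /* Scan to the end of the current word. */
    const char *end = p;
    while (*end != '\0' && *end != '\n' && *end != delim) {
        end++;
    }

    *word_begin = p;
    *word_len = (size_t)(end - p);
    /* Step over the delimiter so the next call starts at the next word. */
    *cursor = (*end == delim) ? end + 1 : end;
    return 1;
}
```

Call it in a loop on each line read from output.txt. The returned slices point into the line buffer and are not NUL-terminated.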

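The tool also supports a custom delimiter, loading a dictionary or an HMM model from a file, and more; see fastcws --help for details.

Windows notes

On Windows the default encoding is UTF-16, but this project currently uses UTF-8 as its only encoding.

When you type input directly at the command prompt, this is not a concern, because the tool uses nowide to convert automatically: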

$ fastcws
在春风吹拂的季节翩翩起舞
在/春风/吹拂/的/季节/翩翩起舞/

When you segment a file through a pipe, make sure the file is saved as UTF-8 without a BOM; otherwise segmentation may misbehave or produce errors:

$ type input.txt | fastcws.exe > output.txt

Here input.txt must be saved as UTF-8.
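One way to guard against a stray BOM is to check the first three bytes of the data before handing it to the segmenter. The following is a minimal sketch; `utf8_bom_length` is a hypothetical helper, not a fastcws function. The UTF-8 byte order mark is the byte sequence EF BB BF.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of fastcws): returns the number of leading
 * bytes to skip, which is 3 if buf starts with the UTF-8 byte order mark
 * (EF BB BF) and 0 otherwise. */
size_t utf8_bom_length(const char *buf, size_t len)
{
    static const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };

    if (len >= 3 && memcmp(buf, bom, 3) == 0) {
        return 3;
    }
    return 0;
}
```

A caller would advance its buffer pointer by the returned value (and shrink the length accordingly) before writing the text to the tool or passing it to the C API.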

C library

The project is written in C++17, but the compiled shared library exposes a stable C API for calling the segmenter:

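// #include "libfastcws.h"

// Initialize the library, then allocate a result object to receive the words.
fastcws_init();
fastcws_result* result = fastcws_alloc_result();

// Segment a UTF-8 string; a nonzero return value signals an error.
int err = fastcws_word_break("在春风吹拂的季节翩翩起舞", result);
if (err) {
	...
}
// Iterate over the words; each one is reported as a pointer plus a length.
const char *word_begin;
size_t word_len;
while(fastcws_result_next(result, &word_begin, &word_len) == 0) {
	...
}
// Release the result object.
fastcws_result_free(result);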

As you can see, segmentation is zero-copy, which is why performance is so good.
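Zero-copy here presumably means that each word is reported as a (pointer, length) slice rather than as a newly allocated string. That reading is an assumption based on the `word_begin` and `word_len` parameters, and under it the slices are not NUL-terminated. To keep a word beyond the lifetime of the underlying buffers, or to pass it to a function that expects a C string, copy it out explicitly. `copy_word` below is a hypothetical helper, not part of the fastcws API.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of fastcws): copies the slice
 * [word_begin, word_begin + word_len) into dst and NUL-terminates it.
 * Returns 0 on success, or -1 if dst cannot hold the word plus the
 * terminating NUL. */
int copy_word(const char *word_begin, size_t word_len,
              char *dst, size_t dst_size)
{
    if (dst_size == 0 || word_len >= dst_size) {
        return -1;
    }
    memcpy(dst, word_begin, word_len);
    dst[word_len] = '\0';
    return 0;
}
```

To print a word without copying it, standard C's precision specifier works directly on the slice: `printf("%.*s/", (int)word_len, word_begin);`.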

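The C API likewise supports loading a dictionary or an HMM model from a file. See the examples directory for more samples.

Note again that the data passed in must be UTF-8 encoded.

Build and install

As with most CMake projects: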

git submodule update --init --recursive
cmake -S . -B build
cmake --build build
cmake --build build --target install
