Berserker

Berserker (BERt chineSE woRd toKenizER) is a Chinese tokenizer built on top of Google's BERT model.

Installation

pip install basaka

Usage

import berserker

berserker.load_model() # A one-off installation
berserker.tokenize('姑姑想過過過兒過過的生活。') # ['姑姑', '想', '過', '過', '過兒', '過過', '的', '生活', '。']

Benchmark

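The table below compares Berserker's F1 measure with previously reported results on the SIGHAN 2005 datasets. Berserker has the highest score on MSR and AS, and is within 0.2 points of the best reported score on PKU and CITYU.

The Berserker results were obtained by training for 15 epochs on each dataset with a batch size of 64.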

Model                 PKU     CITYU   MSR     AS
--------------------  ------  ------  ------  ------
Liu et al. (2016)     96.8    --      97.3    --
Yang et al. (2017)    96.3    96.9    97.5    95.7
Zhou et al. (2017)    96.0    --      97.8    --
Cai et al. (2017)     95.8    95.6    97.1    --
Chen et al. (2017)    94.3    95.6    96.0    94.6
Wang and Xu (2017)    96.5    --      98.0    --
Ma et al. (2018)      96.1    97.2    98.1    96.2
--------------------  ------  ------  ------  ------
Berserker             96.6    97.1    98.4    96.5

Reference: Ji Ma, Kuzman Ganchev, David Weiss - State-of-the-art Chinese Word Segmentation with Bi-LSTMs
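
For readers unfamiliar with the metric, the following is a minimal sketch of how a word-level F1 score for segmentation is commonly computed: a predicted word counts as correct only if both of its character boundaries match a gold word. The function names are hypothetical and are not part of Berserker, and the official SIGHAN scoring script may differ in details.

```python
def word_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_f1(gold, predicted):
    """Word-level F1 between a gold and a predicted segmentation of the same text."""
    gold_spans = word_spans(gold)
    pred_spans = word_spans(predicted)
    correct = len(gold_spans & pred_spans)  # words whose boundaries both match
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


gold = ['姑姑', '想', '過', '過', '過兒', '過過', '的', '生活', '。']
# Hypothetical imperfect prediction that merges two single-character words
predicted = ['姑姑', '想', '過過', '過兒', '過過', '的', '生活', '。']
score = segmentation_f1(gold, predicted)
```

Here 7 of the 8 predicted words match the 9 gold words, so precision is 7/8, recall is 7/9, and the F1 score is 14/17.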

Limitation

Since Berserker is based on BERT, it has a large model size (~300MB) and runs slowly on CPU. Berserker is just a proof of concept of what can be achieved with BERT.

Currently, the default model provided is trained on the SIGHAN 2005 PKU dataset. We plan to release more pretrained models in the future.

Architecture

Berserker is fine-tuned on TPU from the pretrained Chinese BERT model. On top of BERT sits a single dense layer, which is applied to every token to produce a sequence of 0/1 outputs, where 1 denotes a split.
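
The decoding step implied by this description can be sketched as follows. This is an illustration only, not Berserker's actual code: the function name is hypothetical, the input is assumed to be one label per character, and a label of 1 is assumed to mean "split after this character" (the text above does not say whether the split falls before or after the token).

```python
def labels_to_tokens(chars, labels):
    """Group characters into words using per-character 0/1 split labels.

    Assumption: label 1 means a word boundary directly after the character.
    """
    tokens = []
    current = ""
    for ch, label in zip(chars, labels):
        current += ch
        if label == 1:
            tokens.append(current)
            current = ""
    if current:
        # Flush trailing characters if the last label was not a split
        tokens.append(current)
    return tokens


tokens = labels_to_tokens("想過過的生活", [1, 1, 1, 1, 0, 1])
# tokens == ['想', '過', '過', '的', '生活']
```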

Training

The source code for training is provided under the trainer subdirectory. Feel free to contact me if you need any help reproducing the results.

Bonus Video

Yachae!! BERSERKER!!

berserker's Issues

TypeError: 'function' object is not iterable

func = lambda x: tqdm(x, total=total_size_mb, unit='MB', unit_scale=True) if verbose else lambda x:x
temp_func=func(r.iter_content(1024 * 1024))
print(temp_func)
for chunk in temp_func:
    f.write(chunk)
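
A likely cause, judging only from the snippet above: in Python a lambda body extends as far as possible, so the first line parses as lambda x: (tqdm(...) if verbose else (lambda x: x)). When verbose is false, calling func returns the inner lambda rather than the chunks, and iterating over it raises the TypeError. The sketch below reproduces this with a hypothetical stand-in for tqdm (fake_progress) and shows one possible fix, which is to parenthesise each lambda so that the conditional chooses between two functions.

```python
def fake_progress(iterable):
    """Hypothetical stand-in for tqdm(...): yields the items unchanged."""
    for item in iterable:
        yield item


verbose = False
chunks = [b"a", b"b"]

# Same shape as the snippet in the issue: the conditional is inside the lambda body
buggy = lambda x: fake_progress(x) if verbose else lambda x: x
buggy_result = buggy(chunks)  # a function, not an iterable

# Possible fix: choose between two complete lambdas
fixed = (lambda x: fake_progress(x)) if verbose else (lambda x: x)
fixed_result = fixed(chunks)  # the original chunks when verbose is False
```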
