tfstbd

Sentence and token boundary detector implemented with TensorFlow. This repository is the model development part of the project.

Training custom model

Obtain a dataset with sentences and tokens already split, in CoNLL-U format.

Universal Dependencies is a good choice, but you may have more data of your own. Copy your *.conllu files (or just the train part) to the "data/prepare/" folder.
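For orientation, here is a minimal sketch of how sentences and tokens can be read from CoNLL-U text. It relies only on the format itself (tab-separated columns, `#` comment lines, a blank line after each sentence). The function name `read_conllu` is hypothetical and is not part of this project. The sketch keeps plain integer IDs and skips multiword ranges and empty nodes, which is an assumption; the converter in this repository may treat them differently.

```python
def read_conllu(lines):
    """Yield (sentence_text, tokens) pairs from CoNLL-U formatted lines."""
    text, tokens = None, []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            # A blank line closes the current sentence.
            if tokens:
                yield text, tokens
            text, tokens = None, []
        elif line.startswith("#"):
            # Sentence-level comment; "# text = ..." holds the raw sentence.
            if line.startswith("# text = "):
                text = line[len("# text = "):]
        else:
            columns = line.split("\t")
            # Assumption: skip multiword ranges (1-2) and empty nodes (1.1).
            if columns[0].isdigit():
                tokens.append(columns[1])  # FORM column
    if tokens:
        yield text, tokens


sample = """# text = Hello, world!
1\tHello\thello\tINTJ\t_\t_\t0\troot\t_\tSpaceAfter=No
2\t,\t,\tPUNCT\t_\t_\t1\tpunct\t_\t_
3\tworld\tworld\tNOUN\t_\t_\t1\tvocative\t_\tSpaceAfter=No
4\t!\t!\tPUNCT\t_\t_\t1\tpunct\t_\t_

""".splitlines()

sentences = list(read_conllu(sample))
```

Each yielded pair holds the raw sentence text and its token forms, which is the information a boundary detector needs as labels.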

Convert the *.conllu files (with auto-augmentation) into the dataset format accepted by the trainer.

tfkstbd-dataset data/prepare/ data/ready/

Prepare a configuration file with hyperparameters. Start from config/default.json in this module's repository.

Extract a vocabulary of the most frequent non-alphanumeric n-grams from the train dataset. This will include "<start" and "end>" n-grams too.

tfkstbd-vocab data/ready/ config/default.json data/vocabulary.pkl
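The vocabulary step can be pictured roughly as follows. This is a sketch under stated assumptions, not the code behind `tfkstbd-vocab`: it assumes each token is wrapped in `<` and `>` boundary markers before character n-grams are taken, which would explain why "<start" and "end>" style n-grams count as non-alphanumeric. The helper names and the `minn`, `maxn` and `min_freq` values are hypothetical and are not taken from config/default.json.

```python
from collections import Counter


def char_ngrams(token, minn=2, maxn=4):
    """Return character n-grams of a token wrapped in '<' and '>' markers."""
    wrapped = "<" + token + ">"
    return [
        wrapped[i:i + n]
        for n in range(minn, maxn + 1)
        for i in range(len(wrapped) - n + 1)
    ]


def build_vocabulary(tokens, min_freq=2):
    """Count n-grams that are not purely alphanumeric and keep frequent ones."""
    counts = Counter(
        gram
        for token in tokens
        for gram in char_ngrams(token)
        if not gram.isalnum()
    )
    # most_common() sorts by descending frequency.
    return [gram for gram, freq in counts.most_common() if freq >= min_freq]


tokens = ["end", "end", ".", "U.S."]
vocab = build_vocabulary(tokens)
```

With this toy input, boundary n-grams such as "<end" and "end>" survive the frequency cut, while purely alphanumeric n-grams such as "en" are never counted.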

Run training.

The first run only computes the starting metrics, so you should repeat this step multiple times.

tfkstbd-train data/ready/ data/vocabulary.pkl config/default.json model/

Optionally use --eval_data data/ready_eval/ to evaluate the model and --export_path export/ to export it. You can also provide the --threads_count NN flag if you have a lot (>8) of CPU cores.

Test your model on a plain text file.

tfkstbd-infer export/<model_version> some_text_document.txt

No training

{'accuracy': 0.96256346, 'accuracy_baseline': 0.96256346, 'auc': 0.5837398, 'auc_precision_recall': 0.07934569, 'average_loss': 0.30893928, 'label/mean': 0.03743653, 'loss': 4676.206, 'precision': 0.0, 'prediction/mean': 0.23343459, 'recall': 0.0, 'global_step': 1, 'f1': 0.0}
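A short check helps in reading these numbers. The sketch below copies a few values from the dictionary above and shows that the accuracy equals what always predicting the majority (non-boundary) class would give, which matches the zero precision and recall. The helper `f1_score` is hypothetical and written only for this illustration.

```python
metrics = {
    "accuracy": 0.96256346,
    "accuracy_baseline": 0.96256346,
    "label/mean": 0.03743653,
    "precision": 0.0,
    "recall": 0.0,
}

# Always predicting the majority (negative) class yields this accuracy.
majority_accuracy = max(metrics["label/mean"], 1.0 - metrics["label/mean"])


def f1_score(precision, recall):
    """Harmonic mean of precision and recall, defined as 0 when both are 0."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


f1 = f1_score(metrics["precision"], metrics["recall"])
```

Only about 3.7% of labels are positive, so accuracy alone says little here; precision, recall and F1 are the metrics to watch as training proceeds.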

TODO: urldecode, entities? г/кВт∙ч. тонн/ТВт∙ч) КП. АМ.

TODO: focal loss
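For reference, a minimal sketch of binary focal loss for a single example, in plain Python. It is not code from this project: the function name is hypothetical, and the defaults `gamma=2.0` and `alpha=0.25` are the values commonly used with focal loss, assumed here for illustration only.

```python
import math


def binary_focal_loss(label, probability, gamma=2.0, alpha=0.25, eps=1e-7):
    """Focal loss for one example; label is 0 or 1, probability is P(label=1)."""
    # Clip to avoid log(0).
    p = min(max(probability, eps), 1.0 - eps)
    # Probability assigned to the true class, and its class weight.
    p_t = p if label == 1 else 1.0 - p
    alpha_t = alpha if label == 1 else 1.0 - alpha
    # The (1 - p_t) ** gamma factor down-weights well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy_negative = binary_focal_loss(0, 0.05)
hard_positive = binary_focal_loss(1, 0.05)
```

A confidently correct negative contributes far less loss than a missed positive, which is the property that makes this loss a candidate for heavily imbalanced boundary labels.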

https://github.com/Koziev/rutokenizer
