
cotk's Introduction

Conversational Toolkits


cotk is an open-source, lightweight framework for model building and evaluation. We provide standard datasets and evaluation suites in the domain of general language generation. It is easy to use and lets you focus on designing your models!

Features included:

  • Light-weight and easy to start with; no boilerplate needed to construct models.
  • Predefined standard datasets in the domains of language modeling, dialog generation, and more.
  • Predefined evaluation suites: test your model with multiple metrics in several lines.
  • A dashboard to show experiments and compare your models with others' fairly.
  • Long-term maintenance and consistent development.

This project is part of dialtk (Toolkits for Dialog System by Tsinghua University); you can follow dialtk or cotk on our home page.

Note: the master branch is the development branch. The newest release is v0.1.0.

Quick links

Index

Installation

Requirements

  • python 3
  • numpy >= 1.13
  • nltk >= 3.4
  • tqdm >= 4.30
  • checksumdir >= 1.1
  • pytorch >= 1.0.0 (optional, accelerates the calculation of some metrics)
  • transformers (optional, used for pretrained models)

We support Unix, Windows, and macOS.

Install from pip

You can simply get the latest stable version from pip using

    pip install cotk

Install from source code

  • Clone the cotk repository
    git clone https://github.com/thu-coai/cotk.git
  • Install cotk via pip
    cd cotk
    pip install -e .

Quick Start

Let's skim through the whole package to find what you want.

Dataloader

Load commonly used datasets and perform preprocessing:

  • Download online resources or import from local path
  • Split training set, development set and test set
  • Construct vocabulary list
    >>> import cotk.dataloader
    >>> # automatically download online resources
    >>> dataloader = cotk.dataloader.MSCOCO("resources://MSCOCO_small")
    >>> # or download from a url
    >>> dl_url = cotk.dataloader.MSCOCO("http://cotk-data.s3-ap-northeast-1.amazonaws.com/mscoco_small.zip#MSCOCO")
    >>> # or import from local file
    >>> dl_zip = cotk.dataloader.MSCOCO("./MSCOCO.zip#MSCOCO")

    >>> print("Dataset is split into:", dataloader.fields.keys())
    dict_keys(['train', 'dev', 'test'])

Inspect vocabulary list

    >>> print("Vocabulary size:", dataloader.frequent_vocab_size)
    Vocabulary size: 2597
    >>> print("First 10 tokens in vocabulary:", dataloader.frequent_vocab_list[:10])
    First 10 tokens in vocabulary: ['<pad>', '<unk>', '<go>', '<eos>', '.', 'a', 'A', 'on', 'of', 'in']

Convert between ids and strings

    >>> print("Convert string to ids", \
    ...           dataloader.convert_tokens_to_ids(["<go>", "hello", "world", "<eos>"]))
    Convert string to ids [2, 6107, 1875, 3]
    >>> print("Convert ids to string", \
    ...           dataloader.convert_ids_to_tokens([2, 1379, 1897, 3]))
    Convert ids to string ['hello', 'world']

Iterate over batches

    >>> for data in dataloader.get_batches("train", batch_size=1):
    ...     print(data)
    {'sent':
        array([[ 2, 181, 13, 26, 145, 177, 8, 22, 12, 5, 1, 1099, 4, 3]]),
        # <go> This is an old photo of people and a <unk> wagon.
     'sent_allvocabs':
        array([[ 2, 181, 13, 26, 145, 177, 8, 22, 12, 5, 3755, 1099, 4, 3]]),
        # <go> This is an old photo of people and a horse-drawn wagon.
     'sent_length': array([14])}
    ......

Or use a while loop (another iteration method) if you prefer:

    >>> dataloader.restart("train", batch_size=1)
    >>> while True:
    ...    data = dataloader.get_next_batch("train")
    ...    if data is None: break
    ...    print(data)

Note: if you want to know more about Dataloader, please refer to the docs of dataloader.

Metrics

We found that different papers use different versions of the same metric, which leads to unfair comparisons between models. For example, whether unk is considered, or whether the mean NLL in perplexity is taken across sentences or across tokens, can introduce huge differences.
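To see why the averaging choice matters, here is a toy illustration (not cotk code; the NLL values are made up) comparing per-token and per-sentence averaging:

```python
import math

# Hypothetical per-token negative log-likelihoods for two sentences
# of different lengths (values invented for illustration).
nlls = [[2.0, 2.0], [4.0, 4.0, 4.0, 4.0]]

# Per-token mean: sum every token's NLL, divide by the total token count.
per_token = math.exp(sum(sum(s) for s in nlls) / sum(len(s) for s in nlls))

# Per-sentence mean: average each sentence's mean NLL first, then average those.
per_sent = math.exp(sum(sum(s) / len(s) for s in nlls) / len(nlls))

print(per_token)  # ~28.03 (mean NLL = 20/6)
print(per_sent)   # ~20.09 (mean NLL = 3.0)
```

Both are legitimate definitions of "perplexity", yet they disagree whenever sentence lengths vary, which is exactly why a shared implementation matters.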

We provide a unified implementation of metrics, where a hashvalue is provided for checking whether the same data is used. The metric object receives mini-batches.

    >>> import cotk.metric
    >>> metric = cotk.metric.SelfBleuCorpusMetric(dataloader, gen_key="gen")
    >>> metric.forward({
    ...    "gen":
    ...        [[2, 181, 13, 26, 145, 177, 8, 22, 12, 5, 3755, 1099, 4, 3],
    ...         [2, 46, 145, 500, 1764, 207, 11, 5, 93, 7, 31, 4, 3]]
    ... })
    >>> print(metric.close())
    {'self-bleu': 0.02253475750490193, 'self-bleu hashvalue': 'f7d75c0d0dbf53ffba4b845d1f61487fd2d6d3c0594b075c43111816c84c65fc'}

You can merge multiple metrics together by cotk.metric.MetricChain.

    >>> metric = cotk.metric.MetricChain()
    >>> metric.add_metric(cotk.metric.SelfBleuCorpusMetric(dataloader, gen_key="gen"))
    >>> metric.add_metric(cotk.metric.FwBwBleuCorpusMetric(dataloader, reference_test_list=dataloader.get_all_batch('test')['sent_allvocabs'], gen_key="gen"))
    >>> metric.forward({
    ...    "gen":
    ...        [[2, 181, 13, 26, 145, 177, 8, 22, 12, 5, 3755, 1099, 4, 3],
    ...         [2, 46, 145, 500, 1764, 207, 11, 5, 93, 7, 31, 4, 3]]
    ... })
    >>> print(metric.close())
    100%|██████████| 1000/1000 [00:00<00:00, 5281.95it/s]
    {'self-bleu': 0.02253475750490193, 'self-bleu hashvalue': 'f7d75c0d0dbf53ffba4b845d1f61487fd2d6d3c0594b075c43111816c84c65fc', 'fw-bleu': 0.28135593382545376, 'bw-bleu': 0.027021522872801896, 'fw-bw-bleu': 0.04930753293488745, 'fw-bw-bleu hashvalue': '60a39f381e065e8df6fb5eb272984128c9aea7dee4ba50a43bfb768395a70762'}

We also provide recommended metrics for selected dataloaders.

    >>> metric = dataloader.get_inference_metric(gen_key="gen")
    >>> metric.forward({
    ...    "gen":
    ...        [[2, 181, 13, 26, 145, 177, 8, 22, 12, 5, 3755, 1099, 4, 3],
    ...         [2, 46, 145, 500, 1764, 207, 11, 5, 93, 7, 31, 4, 3]]
    ... })
    >>> print(metric.close())
    100%|██████████| 1000/1000 [00:00<00:00, 4857.36it/s]
    100%|██████████| 1250/1250 [00:00<00:00, 4689.29it/s]
    {'self-bleu': 0.02253475750490193, 'self-bleu hashvalue': 'f7d75c0d0dbf53ffba4b845d1f61487fd2d6d3c0594b075c43111816c84c65fc', 'fw-bleu': 0.3353037449663603, 'bw-bleu': 0.027327995838287513, 'fw-bw-bleu': 0.050537105917262654, 'fw-bw-bleu hashvalue': 'c254aa4008ae11b1bc4955e7cd1f7f3aad34b664178a585a218b1474970e3f23', 'gen': [['inside', 'is', 'an', 'elephant', 'shirt', 'of', 'people', 'and', 'a', 'grasslands', 'pulls', '.'], ['An', 'elephant', 'girls', 'baggage', 'sidewalk', 'with', 'a', 'clock', 'on', 'it', '.']]}

Note: if you want to know more about metrics, please refer to the docs of metrics.

Predefined Models

We have provided some baselines for classical tasks; see Model Zoo in the docs for details.

You can also use cotk download thu-coai/MODEL_NAME/master to get the code.

Issues

You are welcome to create an issue if you want to request a feature, report a bug or ask a general question.

Contributions

We welcome contributions from the community.

  • If you want to make a big change, we recommend first creating an issue with your design.
  • Small contributions can be directly made by a pull request.
  • If you would like to contribute to our library, see the issues to find out what we need.

Team

cotk is maintained and developed by the Tsinghua University Conversational AI group (THU-coai). Check our main page (in Chinese).

License

Apache License 2.0


cotk's Issues

[Model] LSTM language modelling

Write a model for the Language Generation Dataloader, in either TensorFlow or PyTorch.

If you write it in TensorFlow, please use a newer version such as 1.13.

Tests are required.

[BUG] typo in metric.py

Describe the bug
PerlplexityMetric -> PerplexityMetric

Move ./tests/dataloader/test_metric to ./tests/metric/test_metric

[Model] CVAE

Refer to Learning Discourse-level Diversity for Neural Dialog Models using Conditional Variational Autoencoders

[Models] CopyNet

Refer to Incorporating Copying Mechanism in Sequence-to-Sequence Learning.

[Enhancement] Adapt test for metric using allvocabs

Description:

The dataloader has now added new attributes: valid vocabs and invalid vocabs.
Valid vocabs are the vocabularies used by models;
all vocabs (= valid vocabs + invalid vocabs) are the vocabularies used by metrics.
If a word is not in all vocabs, it is an unknown word, which is ignored by metrics.

Metric unit tests must be adapted for the new metrics.

Requirements:

  • Pull invalid_vocab branch
  • FakeDataloader should have new attributes like all_vocab_size, ...
  • Bleu & Recorder metrics have to use all vocabs
  • Perplexity uses a smoothing algorithm (you can see the code in PerlplexityMetric as a reference):
    • If models predict valid vocab words, perplexity is calculated as before.
    • If models predict UNK, the probability is divided evenly among the invalid vocab words.
    • If the reference is UNK, the word is ignored.
      So you have to write tests for the new PerplexityMetric and MultiturnPerplexityMetric,
      trying to cover the 3 conditions above.
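The three smoothing rules above can be sketched as follows. This is an illustrative reimplementation, not the actual metric code; the function name and argument layout are assumptions.

```python
import math

def smoothed_nll(ref_ids, probs, frequent_vocab_size, all_vocab_size, unk_id=1):
    """Mean NLL under the three smoothing rules described above.

    probs: one dict per position mapping token id -> model probability
    (the model's distribution covers the valid vocab only).
    """
    n_invalid = all_vocab_size - frequent_vocab_size
    total, count = 0.0, 0
    for pos, ref in enumerate(ref_ids):
        if ref == unk_id:
            continue                            # rule 3: reference is UNK -> ignored
        if ref < frequent_vocab_size:
            p = probs[pos][ref]                 # rule 1: valid vocab, probability as-is
        else:
            p = probs[pos][unk_id] / n_invalid  # rule 2: UNK mass split evenly
        total += -math.log(p)
        count += 1
    return total / count
```

Perplexity would then be exp of this mean; a unit test can feed one reference token from each of the three cases and check each branch.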

[Maintenance] Refactor dataloader of SwitchBoard

  • _build_vocab has to use multi_ref data
  • rename to inference metric; embedding should have a default realization (use word vectors from GloVe)
  • add unittest for unique feature of SwitchBoard

  • add hashvalue

[Enhancement] gather download links of data

Gather the download links of data and make a 'dataset_config.json' in ./contk/dataloader:

    {
        "MSCOCO": "https://XXXX"
    }

It is best to reference the original links; gzip or other compressed formats can be used.

[Enhancement] Metrics check whether models use the same data

Problems

It may be hard to evaluate 2 models on the same test data in the same way.
So it's important to make the metrics able to tell which data is used.

Proposal A

Bind the metrics to the dataloader. Data must be processed in the same order.

Drawback:

  • must be in same order

Proposal B

Make a hash value of the data, which can tell whether the data differs.

Drawback:

  • hard to find bugs
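Proposal B could look roughly like this sketch (an assumption, not the actual cotk implementation); sorting the serialized samples also sidesteps Proposal A's ordering drawback:

```python
import hashlib
import json

def data_hash(samples):
    """sha256 digest over token-id lists, independent of sample order."""
    h = hashlib.sha256()
    # Serialize each sample deterministically, then sort so that iteration
    # order does not change the digest.
    for s in sorted(json.dumps(sample) for sample in samples):
        h.update(s.encode("utf-8"))
    return h.hexdigest()

a = data_hash([[2, 181, 13, 3], [2, 46, 145, 3]])
b = data_hash([[2, 46, 145, 3], [2, 181, 13, 3]])  # same data, different order
assert a == b
```

Two models reporting the same digest can then be assumed to have been evaluated on the same test data.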

[BUG] bleu will crash

Describe the bug
BleuMetric crashes when len(hypothesis) == 1, possibly because of SmoothingFunction.

It's an upstream bug; just comment on it and give up.

To Reproduce

checked

[Model] HRED

Refer to Building end-to-end dialogue systems using generative hierarchical neural network models

[Enhancement] Make unit test for models

Requirement

  • Run models test only in cpu mode
  • Just check the arguments and the connection with the main library
  • Don't need to check performance
  • make the test standalone, because it may need packages like tensorflow or pytorch.

[Model] SeqGAN

Refer to SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient

[Feature] Use a stable link on github for data

Users may use the same id to download the same data from different sources:

e.g. "glove": default, from GitHub
"glove~github": explicitly from GitHub
"glove~tsinghua": explicitly download from coai.tsinghua

[Enhancement] Vocab List in Dataloader

For the implementation of #8 (CopyNet), the dataloader should change its behavior.

In our mind, there should be 3 vocab lists:

  • For model training, the smallest. It only includes words from the train set. Call it set $V.
  • For metrics, bigger. The model will be evaluated on this vocab list, which includes words from the train set and test set. Call it set $M. Almost all models can't generate words from $M-$V, because they have never seen them. However, CopyNet can generate words from $M-$V via its copy mechanism, so it's necessary to take these words into account when we implement metrics. $M-$V can be expressed as the UNK token for some models; the dataloader has to translate it into a uniform distribution on $M-$V.
  • The whole space of words, including those not seen in any of the data. Call it set $N. We don't care about the words in $N-$M and ignore them when evaluating models, as in #37. $N-$M is the TRUE UNK.

Require:

  • Change the behavior of dataloader, metric.
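To make the set relations concrete, here is a tiny illustration with made-up words, using Python sets to stand in for the vocab lists:

```python
# $V: model (train-set) vocab; $M: metric vocab; $N: all possible words.
V = {"<pad>", "<unk>", "<go>", "<eos>", "a", "dog"}
M = V | {"wagon", "horse-drawn"}   # adds test-set words
N = M | {"zeppelin"}               # adds words never seen in any data

# Words most models can't generate but CopyNet can copy:
copy_only = M - V                  # {'wagon', 'horse-drawn'}
# TRUE UNK, ignored when evaluating models:
true_unk = N - M                   # {'zeppelin'}
assert V <= M <= N                 # the lists nest: V within M within N
```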

[BUG] fix hred test

Describe the bug
The hred test is wrong.

Why is the number of turns of generated sentences greater than the number of turns of the reference?

[Feature] Report system

Write a script that pushes results to the dashboard.

Command:

    cotk-report [--result result.json] [--only-upload] [--entry main] [other parameters]

result: the file containing the test results.
only-upload: push results without running the model.
entry: the entry point of the model.

If run in only-upload mode, the result should be comparable.
If run in full mode, the result should be reproducible.

Provide a list of APIs for the dashboard.
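A minimal sketch of how the proposed command line could be parsed; the flag names come from this issue, while everything else (defaults, help strings) is an assumption:

```python
import argparse

parser = argparse.ArgumentParser(prog="cotk-report")
parser.add_argument("--result", default="result.json",
                    help="file containing the test results")
parser.add_argument("--only-upload", action="store_true",
                    help="push results without running the model")
parser.add_argument("--entry", default="main",
                    help="entry point of the model")

# Example invocation corresponding to only-upload mode.
args = parser.parse_args(["--result", "result.json", "--only-upload"])
assert args.only_upload and args.entry == "main"
```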
