
mt-dnn's Introduction

Multi-Task Deep Neural Networks for Natural Language Understanding

MT-DNN is an open-source natural language understanding (NLU) toolkit that makes it easy for researchers and developers to train customized deep learning models. Built upon PyTorch and Transformers, MT-DNN is designed to facilitate rapid customization for a broad spectrum of NLU tasks, using a variety of objectives (classification, regression, structured prediction) and text encoders (e.g., RNNs, BERT, RoBERTa, UniLM).

A unique feature of MT-DNN is its built-in support for robust and transferable learning using the adversarial multi-task learning paradigm. To enable efficient production deployment, MT-DNN supports multi-task knowledge distillation, which can substantially compress a deep neural model without a significant performance drop. We demonstrate the effectiveness of MT-DNN on a wide range of NLU applications across general and biomedical domains.

This repository is a pip installable package that implements the Multi-Task Deep Neural Networks (MT-DNN) for Natural Language Understanding, as described in the following papers:

Xiaodong Liu*, Pengcheng He*, Weizhu Chen and Jianfeng Gao
Multi-Task Deep Neural Networks for Natural Language Understanding
ACL 2019
*: Equal contribution

Xiaodong Liu, Pengcheng He, Weizhu Chen and Jianfeng Gao
Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding
arXiv version

Pengcheng He, Xiaodong Liu, Weizhu Chen and Jianfeng Gao
Hybrid Neural Network Model for Commonsense Reasoning
arXiv version

Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao and Jiawei Han
On the Variance of the Adaptive Learning Rate and Beyond
arXiv version

Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao and Tuo Zhao
SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization
arXiv version

Xiaodong Liu, Yu Wang, Jianshu Ji, Hao Cheng, Xueyun Zhu, Emmanuel Awa, Pengcheng He, Weizhu Chen, Hoifung Poon, Guihong Cao, Jianfeng Gao
The Microsoft Toolkit of Multi-Task Deep Neural Networks for Natural Language Understanding
arXiv version

Xiaodong Liu, Hao Cheng, Pengcheng He, Weizhu Chen, Yu Wang, Hoifung Poon and Jianfeng Gao
Adversarial Training for Large Neural Language Models
arXiv version

Pip install package

A setup.py file is provided to simplify the installation of this package.

  1. To install the package, please run the command below (from the repository root):

    pip install -e .
  2. Running this command tells pip to install the mt-dnn package from source in development mode. This means that any updates to the mt-dnn source directory are immediately reflected in the installed package without needing to reinstall, which is very useful for a package that is updated frequently.

  3. It is also possible to install directly from GitHub, which is the best way to use the package in external projects (while still reflecting updates to the source, since it is installed as an editable '-e' package).

    pip install -e git+ssh://git@github.com/microsoft/mt-dnn.git@master#egg=mtdnn
  4. Either command above makes mt-dnn available in your conda virtual environment. You can verify that it was properly installed by running:

    pip list | grep mtdnn
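
    You can also sanity-check the installation from Python. A minimal sketch (the top-level module name mtdnn is taken from the egg name used above):

    # Hedged check: import the installed package and print where it was installed from.
    import importlib
    mtdnn_module = importlib.import_module("mtdnn")
    print("mtdnn installed at:", mtdnn_module.__file__)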

For Mixed Precision and Distributed Training, please install NVIDIA apex by following the instructions here.

Run an example

An example Jupyter notebook is provided to show a runnable end-to-end example using the MNLI dataset. The notebook reads and loads the MNLI data provided here for your convenience. This dataset is mainly used for natural language inference (NLI) tasks, where the inputs are sentence pairs and the labels are entailment indicators.

NOTE: The MNLI data is very large and requires Git LFS to be installed on your machine in order to pull it down.
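
If you are unsure whether the data was pulled correctly, the minimal sketch below checks that the MNLI file is real data rather than a small Git LFS pointer file. The path and the file name train.tsv are assumptions based on the sample_data/MNLI layout used in the steps below; adjust them to your checkout.

    import os

    # Assumed location of the MNLI train split inside the repository's sample data.
    mnli_train = os.path.join("sample_data", "MNLI", "train.tsv")

    with open(mnli_train, encoding="utf-8") as f:
        first_line = f.readline()

    # A Git LFS pointer file is tiny and starts with "version https://git-lfs...";
    # the real MNLI train split is a large tab-separated file.
    if first_line.startswith("version https://git-lfs"):
        raise RuntimeError(f"{mnli_train} is an LFS pointer; run `git lfs install && git lfs pull` first.")
    print(f"{mnli_train}: {os.path.getsize(mnli_train) / 1e6:.1f} MB of data")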

How To Use
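
The steps below assume that the MT-DNN classes have been imported. A minimal sketch of the imports (the module paths are assumptions inferred from the package layout, e.g. mtdnn/data_builder_mtdnn.py; adjust them to match your installed version):

    import os

    # Module paths below are assumptions; verify them against your mtdnn install.
    from mtdnn.configuration_mtdnn import MTDNNConfig
    from mtdnn.tasks.config import MTDNNTaskDefs
    from mtdnn.tokenizer_mtdnn import MTDNNTokenizer
    from mtdnn.data_builder_mtdnn import MTDNNDataBuilder
    from mtdnn.process_mtdnn import MTDNNDataProcess
    from mtdnn.modeling_mtdnn import MTDNNModel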

  1. Create a model configuration object, MTDNNConfig, with the necessary parameters to initialize the MT-DNN model. Initialization without any parameters defaults to a configuration that initializes a BERT model. This configuration object can be initialized with training and learning parameters such as batch_size and learning_rate. Please consult the class implementation for all parameters.

    BATCH_SIZE = 16
    MULTI_GPU_ON = True
    MAX_SEQ_LEN = 128
    NUM_EPOCHS = 5
    config = MTDNNConfig(batch_size=BATCH_SIZE, 
                        max_seq_len=MAX_SEQ_LEN, 
                        multi_gpu_on=MULTI_GPU_ON)
  2. Define the task parameters to train for and initialize an MTDNNTaskDefs object. The definition can cover a single task or multiple tasks to train. MTDNNTaskDefs can take a Python dict, YAML, or JSON file with the task definition(s).

    DATA_DIR = "../../sample_data/"
    DATA_SOURCE_DIR = os.path.join(DATA_DIR, "MNLI")
    tasks_params = {
                    "mnli": {
                        "data_format": "PremiseAndOneHypothesis",
                        "encoder_type": "BERT",
                        "dropout_p": 0.3,
                        "enable_san": True,
                        "labels": ["contradiction", "neutral", "entailment"],
                        "metric_meta": ["ACC"],
                        "loss": "CeCriterion",
                        "kd_loss": "MseCriterion",
                        "n_class": 3,
                        "split_names": [
                            "train",
                            "dev_matched",
                            "dev_mismatched",
                            "test_matched",
                            "test_mismatched",
                        ],
                        "data_source_dir": DATA_SOURCE_DIR,
                        "data_process_opts": {"header": True, "is_train": True, "multi_snli": False,},
                        "task_type": "Classification",
                    },
                }
    
    # Define the tasks
    task_defs = MTDNNTaskDefs(tasks_params)
  3. Create a data tokenizing object, MTDNNTokenizer. Based on the model's initial checkpoint, it wraps the corresponding Hugging Face transformers tokenizer to encode the data into MT-DNN format. This becomes the input to the data building stage.

    tokenizer = MTDNNTokenizer(do_lower_case=True)
    
    # Testing out the tokenizer  
    print(tokenizer.encode("What NLP toolkit do you recommend", "MT-DNN is a fantastic toolkit"))  
    
    # ([101, 2054, 17953, 2361, 6994, 23615, 2079, 2017, 16755, 102, 11047, 1011, 1040, 10695, 2003, 1037, 10392, 6994, 23615, 102], None, [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
    
  4. Create a data builder object, MTDNNDataBuilder. This class converts the data into MT-DNN format depending on the task, and creates the vectorized data for each task.

    ## Load and build data
    data_builder = MTDNNDataBuilder(tokenizer=tokenizer,
                                    task_defs=task_defs,
                                    data_dir=DATA_SOURCE_DIR,
                                    canonical_data_suffix="canonical_data",
                                    dump_rows=True)
    
    ## Build data to MTDNN Format as an iterable of each specific task
    vectorized_data = data_builder.vectorize()
    
  5. Create a data preprocessing object, MTDNNDataProcess. This creates the training, development, and test PyTorch dataloaders needed for training and testing. We also retrieve from it the training options required to initialize the model correctly for all tasks.

    data_processor = MTDNNDataProcess(config=config, 
                                    task_defs=task_defs, 
                                    vectorized_data=vectorized_data)
    
    # Retrieve the multi task train, dev and test dataloaders
    multitask_train_dataloader = data_processor.get_train_dataloader()
    dev_dataloaders_list = data_processor.get_dev_dataloaders()
    test_dataloaders_list = data_processor.get_test_dataloaders()
    
    # Get training options to initialize model
    decoder_opts = data_processor.get_decoder_options_list()
    task_types = data_processor.get_task_types_list()
    dropout_list = data_processor.get_tasks_dropout_prob_list()
    loss_types = data_processor.get_loss_types_list()
    kd_loss_types = data_processor.get_kd_loss_types_list()
    tasks_nclass_list = data_processor.get_task_nclass_list()
    num_all_batches = data_processor.get_num_all_batches()
  6. Now we can create an MTDNNModel.

    model = MTDNNModel(
        config,
        task_defs,
        pretrained_model_name="bert-base-uncased",
        num_train_step=num_all_batches,
        decoder_opts=decoder_opts,
        task_types=task_types,
        dropout_list=dropout_list,
        loss_types=loss_types,
        kd_loss_types=kd_loss_types,
        tasks_nclass_list=tasks_nclass_list,
        multitask_train_dataloader=multitask_train_dataloader,
        dev_dataloaders_list=dev_dataloaders_list,
        test_dataloaders_list=test_dataloaders_list,
    )
  7. At this point we can fit the MT-DNN model and create predictions. fit takes an optional epochs parameter that overrides the epochs set in the MTDNNConfig object.

    model.fit(epochs=NUM_EPOCHS)
  8. The predict function can take an optional checkpoint path, trained_model_chckpt. This can be used for inference and for running evaluations on an already trained PyTorch MT-DNN model, optionally loading a previously trained model as the checkpoint.

    # Predict using a PyTorch model checkpoint
    checkpt = "./checkpoint/model_4.pt"
    model.predict(trained_model_chckpt=checkpt)

Pre-process your data in the correct format

Depending on the data_format you have set in the configuration object MTDNNConfig, please follow the corresponding data format below to prepare your data:

  • PremiseOnly : single text, i.e. premise. Data format is "id" \t "label" \t "premise" .

  • PremiseAndOneHypothesis : two texts, i.e. one premise and one hypothesis. Data format is "id" \t "label" \t "premise" \t "hypothesis".

  • PremiseAndMultiHypothesis : one text as premise and multiple candidates of texts as hypothesis. Data format is "id" \t "label" \t "premise" \t "hypothesis_1" \t "hypothesis_2" \t ... \t "hypothesis_n".

  • Sequence : sequence tagging. Data format is "id" \t "label" \t "premise".
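
As an illustration, the minimal sketch below writes a few rows in the PremiseAndOneHypothesis format; the file name mnli_sample.tsv and the example sentences are purely illustrative.

    import csv

    # Each row: id \t label \t premise \t hypothesis  (PremiseAndOneHypothesis format)
    rows = [
        ("1", "entailment", "A soccer game with multiple males playing.", "Some men are playing a sport."),
        ("2", "contradiction", "A man inspects the uniform of a figure.", "The man is sleeping."),
    ]
    with open("mnli_sample.tsv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerows(rows)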

FAQ

Did you share the pretrained mt-dnn models?

Yes, we released the pretrained shared embeddings (trained via MTL) that are aligned to the BERT base/large models: mt_dnn_base.pt and mt_dnn_large.pt.
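
If you want to sanity-check a downloaded checkpoint before using it, here is a minimal sketch with plain PyTorch (nothing here is specific to this package; the printed keys are simply whatever the checkpoint contains):

    import torch

    # Load the released checkpoint on CPU and peek at its contents.
    state = torch.load("mt_dnn_base.pt", map_location="cpu")
    if isinstance(state, dict):
        print("Top-level keys:", list(state.keys())[:10])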

How can we obtain the data and pre-trained models to try out?

We have provided a download script to assist with this.

Why don't SciTail/SNLI enable SAN?

For the SciTail/SNLI tasks, the purpose is to test the generalization of the learned embeddings and how easily they adapt to a new domain, rather than to rely on complicated model structures, so as to allow a direct comparison with BERT. Thus, we use a linear projection in all of the domain adaptation settings.
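
In the task definition format used above, this corresponds to disabling the SAN answer module for such a task. A hedged sketch (the snli task name and the other values are illustrative, mirroring the MNLI definition shown earlier):

    snli_params = {
        "snli": {  # hypothetical task name, for illustration only
            "data_format": "PremiseAndOneHypothesis",
            "encoder_type": "BERT",
            "enable_san": False,  # use a linear projection head instead of the SAN answer module
            "labels": ["contradiction", "neutral", "entailment"],
            "metric_meta": ["ACC"],
            "loss": "CeCriterion",
            "n_class": 3,
            "task_type": "Classification",
        },
    }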

What is the difference between V1 and V2?

The difference is in the QNLI dataset. Please refer to the official GLUE homepage for more details. If you want to formulate QNLI as a pairwise ranking task, as in our paper, make sure that you use the old QNLI data.
Then run the prepro script with the flag: > sh experiments/glue/prepro.sh --old_glue
If you have issues accessing the old version of the data, please contact the GLUE team.

Did you fine-tune a single task for your GLUE leaderboard submission?

We can use the multi-task refinement model to run the prediction and produce a reasonable result. But to achieve a better result, it requires fine-tuning on each task. It is worth noting that the arXiv paper is a little out of date and uses the old GLUE dataset. We will update the paper accordingly.

Notes and Acknowledgments

The BERT PyTorch implementation is from: https://github.com/huggingface/pytorch-pretrained-BERT
BERT: https://github.com/google-research/bert
We also used some code from: https://github.com/kevinduh/san_mrc

Related Projects/Codebase

  1. Pretrained UniLM: https://github.com/microsoft/unilm
  2. Pretrained Response Generation Model: https://github.com/microsoft/DialoGPT
  3. Internal MT-DNN repo: https://github.com/microsoft/mt-dnn

How do I cite MT-DNN?

@inproceedings{liu2019mt-dnn,
    title = "Multi-Task Deep Neural Networks for Natural Language Understanding",
    author = "Liu, Xiaodong and He, Pengcheng and Chen, Weizhu and Gao, Jianfeng",
    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/P19-1441",
    pages = "4487--4496"
}


@article{liu2019mt-dnn-kd,
  title={Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding},
  author={Liu, Xiaodong and He, Pengcheng and Chen, Weizhu and Gao, Jianfeng},
  journal={arXiv preprint arXiv:1904.09482},
  year={2019}
}


@article{he2019hnn,
  title={A Hybrid Neural Network Model for Commonsense Reasoning},
  author={He, Pengcheng and Liu, Xiaodong and Chen, Weizhu and Gao, Jianfeng},
  journal={arXiv preprint arXiv:1907.11983},
  year={2019}
}


@article{liu2019radam,
  title={On the Variance of the Adaptive Learning Rate and Beyond},
  author={Liu, Liyuan and Jiang, Haoming and He, Pengcheng and Chen, Weizhu and Liu, Xiaodong and Gao, Jianfeng and Han, Jiawei},
  journal={arXiv preprint arXiv:1908.03265},
  year={2019}
}


@article{jiang2019smart,
  title={SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization},
  author={Jiang, Haoming and He, Pengcheng and Chen, Weizhu and Liu, Xiaodong and Gao, Jianfeng and Zhao, Tuo},
  journal={arXiv preprint arXiv:1911.03437},
  year={2019}
}

Contact Information

For help or issues using MT-DNN, please submit a GitHub issue.

For personal communication related to this package, please contact Xiaodong Liu ([email protected]), Yu Wang ([email protected]), Pengcheng He ([email protected]), Weizhu Chen ([email protected]), Jianshu Ji ([email protected]), Emmanuel Awa ([email protected]) or Jianfeng Gao ([email protected]).

Contributing

This project welcomes contributions and suggestions. For more details, please check the complete steps for contributing to this repo here.

mt-dnn's Issues

Error!!!

Why can't I find the canonical_data folder in MNLI? When running the example provided on the webpage, I got the following error:

AssertionError                            Traceback (most recent call last)
/data-tmp/TM-DNN/MT-DNN-master/mtdnn/data_builder_mtdnn.py in load_and_build_data(self, dump_rows)
    131         task_load_func = self.supported_tasks_loader_map[name]
--> 132         data = task_load_func(in_file_path, data_opts)
    133         processed_rows = process_data_and_dump_rows(

/data-tmp/TM-DNN/MT-DNN-master/mtdnn/tasks/utils.py in load_mnli(file_path, **kwargs)
    125         blocks = line.strip().split("\t")
--> 126         assert len(blocks) > 9
    127         if blocks[-1] == "-":

AssertionError:

During handling of the above exception, another exception occurred:

OSError                                   Traceback (most recent call last)
<ipython-input> in <module>
      5     data_dir=DATA_SOURCE_DIR_MNLI,
      6     canonical_data_suffix="canonical_data",
----> 7     dump_rows=True,
      8 )
      9

/data-tmp/TM-DNN/MT-DNN-master/mtdnn/data_builder_mtdnn.py in __init__(self, tokenizer, task_defs, do_lower_case, data_dir, canonical_data_suffix, dump_rows)
    196         )
    197         self.processed_tasks_data = self.task_data_loader.load_and_build_data(
--> 198             self.save_to_file
    199         )
    200

/data-tmp/TM-DNN/MT-DNN-master/mtdnn/data_builder_mtdnn.py in load_and_build_data(self, dump_rows)
    145         )
    146         except Exception as ex:
--> 147             raise IOError(ex)
    148         return processed_data
    149

OSError:

This happened in the notebook under: Data Preprocessing > Create the Data Builder Object.

Is this related to my PyTorch version being 1.5.0?

Train MTDNN using pretrained models in Huggingface?

I am trying to train MT-DNN for sequence classification on a language other than English. What changes do I need to make in the model configuration object MTDNNConfig to account for the pretrained model and its vocabulary size?

I tried making changes to MTDNNConfig.vocab_size and MTDNNConfig.init_checkpoint, but it led to assertion errors in MTDNNModel.supported_init_checkpoints(). What are these supported init checkpoints? @namisan

Hyperparameters for RoBERTa

Dear Authors,

The current code, as used in the example notebook, gives very good results with BERT but poor results with RoBERTa. This is surprising because the papers show that using RoBERTa yields much better results.
My guess is therefore that some hyperparameters were different when RoBERTa was used in the papers. If this is the case, could a good set of hyperparameters for RoBERTa be disclosed? If this is not the case, how do you explain the drop in performance?

Best Regards,
Antoine

Formatting Data for Custom Task

Hi,

I have two classification tasks, one with 1000 classes and another with 100. I wish to train a model on both of these tasks.
I understand defining the task params, as specified in the classification example notebook.
However, I do not understand how to format my datasets. Could you help me with that?

Looking closely at the sample data, I can see:
uid: a unique identifier
token: a list of tokens (how do I generate this; should I use the BERT tokenizer?)
label: the label string
type: I don't understand what this is (is this the positional encoding in BERT?)

Thanks
