
bartpho's Introduction

Table of contents

  1. Introduction
  2. Using BARTpho with transformers
  3. Using BARTpho with fairseq
  4. Notes

BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese

We present BARTpho in two versions, BARTpho-syllable and BARTpho-word, the first public large-scale monolingual sequence-to-sequence models pre-trained for Vietnamese. BARTpho uses the "large" architecture and the pre-training scheme of the sequence-to-sequence denoising autoencoder BART, making it especially suitable for generative NLP tasks. We conduct experiments to compare BARTpho with its competitor mBART on the downstream task of Vietnamese text summarization and show that, in both automatic and human evaluations, BARTpho outperforms the strong baseline mBART and improves the state of the art. We further evaluate and compare BARTpho and mBART on the Vietnamese capitalization and punctuation restoration tasks, and also find that BARTpho is more effective than mBART on these two tasks.

The general architecture and experimental results of BARTpho can be found in our paper:

@inproceedings{bartpho,
    title     = {{BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese}},
    author    = {Nguyen Luong Tran and Duong Minh Le and Dat Quoc Nguyen},
    booktitle = {Proceedings of the 23rd Annual Conference of the International Speech Communication Association},
    year      = {2022}
}

Please CITE our paper when BARTpho is used to help produce published results or incorporated into other software.

Using BARTpho with transformers

Installation

  • Install transformers with pip: pip install transformers, or install transformers from source.
    Note that a slow tokenizer for BARTpho has been merged into the main transformers branch. A fast tokenizer for BARTpho is still under discussion, as detailed in this pull request. Users who would like to use the fast tokenizer can instead install transformers from the following branch:
git clone --single-branch --branch fast_tokenizers_BARTpho_PhoBERT_BERTweet https://github.com/datquocnguyen/transformers.git
cd transformers
pip install -e .
  • Install sentencepiece and tokenizers with pip: pip install sentencepiece tokenizers
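
To verify the installation, one quick sanity check is to load the BARTpho tokenizer and tokenize a short sentence. This is an illustrative sketch only; pass use_fast=True instead of use_fast=False if transformers was installed from the fast-tokenizer branch above.

import transformers
from transformers import AutoTokenizer

print(transformers.__version__)
# Load the slow BARTpho tokenizer (use_fast=True requires the branch above)
tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-syllable", use_fast=False)
print(tokenizer.tokenize("Chúng tôi là những nghiên cứu viên."))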

Pre-trained models

Model                          #params   Arch.   Max length   Input text
vinai/bartpho-syllable-base    132M      base    1024         Syllable level
vinai/bartpho-syllable         396M      large   1024         Syllable level
vinai/bartpho-word-base        150M      base    1024         Word level
vinai/bartpho-word             420M      large   1024         Word level
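
All four checkpoints accept inputs of at most 1024 tokens. When feeding longer documents through transformers, truncation can be requested at tokenization time; a minimal sketch (the truncation arguments are standard transformers tokenizer options, not BARTpho-specific):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-syllable")
long_text = "Chúng tôi là những nghiên cứu viên. " * 500  # an arbitrarily long document
# Truncate anything beyond the models' 1024-token maximum length
encoded = tokenizer(long_text, truncation=True, max_length=1024, return_tensors="pt")
print(encoded["input_ids"].shape)  # torch.Size([1, 1024])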

Example usage

import torch
from transformers import AutoModel, AutoTokenizer

# BARTpho-syllable
syllable_tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-syllable")
bartpho_syllable = AutoModel.from_pretrained("vinai/bartpho-syllable")
TXT = 'Chúng tôi là những nghiên cứu viên.'
input_ids = syllable_tokenizer(TXT, return_tensors='pt')['input_ids']
with torch.no_grad():
    features = bartpho_syllable(input_ids)

# BARTpho-word (expects word-segmented input; see the Notes section)
word_tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-word")
bartpho_word = AutoModel.from_pretrained("vinai/bartpho-word")
TXT = 'Chúng_tôi là những nghiên_cứu_viên .'
input_ids = word_tokenizer(TXT, return_tensors='pt')['input_ids']
with torch.no_grad():
    features = bartpho_word(input_ids)
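
The features returned above are a standard transformers Seq2SeqModelOutput. As a short follow-up sketch, reusing the variables from the example above, the final hidden states of the decoder and encoder can be inspected as follows:

# Decoder and encoder final hidden states, both of shape (batch_size, sequence_length, hidden_size)
print(features.last_hidden_state.shape)
print(features.encoder_last_hidden_state.shape)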

Using BARTpho with fairseq

Installation

There is an issue with the encode function in the BART hub_interface, as discussed in pull request facebookresearch/fairseq#3905. Until that pull request is merged, please install fairseq as follows:

git clone https://github.com/datquocnguyen/fairseq.git
cd fairseq
pip install --editable ./
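
To confirm that this editable install is the one being picked up (rather than a previously installed fairseq), a minimal check is to import the package and print where it is loaded from:

import fairseq

# Should point into the cloned fairseq directory installed with `pip install --editable ./`
print(fairseq.__file__)
print(fairseq.__version__)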

Pre-trained models

Model              #params   Download                       Input text
BARTpho-syllable   396M      fairseq-bartpho-syllable.zip   Syllable level
BARTpho-word       420M      fairseq-bartpho-word.zip       Word level
  • unzip fairseq-bartpho-syllable.zip
  • unzip fairseq-bartpho-word.zip

Example usage

from fairseq.models.bart import BARTModel

# Load BARTpho-syllable model
model_folder_path = '/PATH-TO-FOLDER/fairseq-bartpho-syllable/'
spm_model_path = '/PATH-TO-FOLDER/fairseq-bartpho-syllable/sentence.bpe.model'
bartpho_syllable = BARTModel.from_pretrained(model_folder_path, checkpoint_file='model.pt', bpe='sentencepiece', sentencepiece_model=spm_model_path).eval()
# Input syllable-level/raw text
sentence = 'Chúng tôi là những nghiên cứu viên.'
# Apply SentencePiece to the input text
tokenIDs = bartpho_syllable.encode(sentence, add_if_not_exist=False)
# Extract features from BARTpho-syllable
last_layer_features = bartpho_syllable.extract_features(tokenIDs)

# Load BARTpho-word model
model_folder_path = '/PATH-TO-FOLDER/fairseq-bartpho-word/'
bpe_codes_path = '/PATH-TO-FOLDER/fairseq-bartpho-word/bpe.codes'
bartpho_word = BARTModel.from_pretrained(model_folder_path, checkpoint_file='model.pt', bpe='fastbpe', bpe_codes=bpe_codes_path).eval()
# Input word-level text
sentence = 'Chúng_tôi là những nghiên_cứu_viên .'
# Apply BPE to the input text
tokenIDs = bartpho_word.encode(sentence, add_if_not_exist=False)
# Extract features from BARTpho-word
last_layer_features = bartpho_word.extract_features(tokenIDs)
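
extract_features returns the final-layer decoder features as a tensor of shape (1, sequence_length, hidden_size). A short follow-up sketch, using fairseq's standard return_all_hiddens option to also retrieve the intermediate layers (assuming this option behaves as in the upstream BART hub interface):

# Final-layer features for the word-level example above
print(last_layer_features.shape)
# Request all layers' hidden states instead (returns a list of tensors)
all_layer_features = bartpho_word.extract_features(tokenIDs, return_all_hiddens=True)
print(len(all_layer_features))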

Notes

  • Before fine-tuning BARTpho on a downstream task, users should perform Vietnamese tone normalization on the downstream task's data, as this pre-processing step was also applied to the pre-training corpus. A Python script for Vietnamese tone normalization is available at HERE.
  • For BARTpho-word, users should use VnCoreNLP to segment input raw texts, as it was also used to perform both Vietnamese tone normalization and word segmentation on the pre-training corpus (see the segmentation sketch after this list).
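
A minimal word-segmentation sketch for BARTpho-word, assuming the py_vncorenlp wrapper for VnCoreNLP is installed (pip install py_vncorenlp) and Java is available; the save_dir path is illustrative:

import py_vncorenlp

# Download the VnCoreNLP model files into a local folder (an absolute path is expected)
py_vncorenlp.download_model(save_dir='/absolute/path/to/vncorenlp')
# Load only the word-segmentation annotator
rdrsegmenter = py_vncorenlp.VnCoreNLP(annotators=["wseg"], save_dir='/absolute/path/to/vncorenlp')

text = 'Chúng tôi là những nghiên cứu viên.'
# Returns a list of word-segmented sentences, e.g. ['Chúng_tôi là những nghiên_cứu_viên .']
print(rdrsegmenter.word_segment(text))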

License

MIT License

Copyright (c) 2021 VinAI Research

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

bartpho's Issues

Using BARTpho for Text summarization on Vietnamese

Thank you so much for this fantastic work. I'm currently doing research on the BART model and have some questions regarding BARTpho that need elaborating.
I have been using the training paradigm from https://github.com/yixinL7/BRIO for text summarization; it uses the pre-trained BART model facebook/bart-large-cnn as a baseline and achieves pretty good results on the CNN/DM dataset. I figured I could replace the baseline model with BARTpho and do the same with my custom dataset, but the cross-validation results during training were quite poor no matter how I changed the configuration.
So my questions are:

  • Do I need to fine-tune the BARTpho model on my custom dataset? I guess the facebook/bart-large-cnn model achieves good results because it was already trained on the CNN/DM dataset.
  • If I do, can you show me how to fine-tune BARTpho on Colab? Since it is a large model, I don't think Colab has enough resources.

Cannot load BartPhoTokenizer when loading the syllable tokenizer

When I try to load the syllable tokenizer, I get the error "module transformers.models.bartpho has no attribute BartPhoTokenizer".
I used transformers 4.15.0. I tried cleaning my folder ~/.cache/huggingface/transformers but I still get the same error.
Can you help me? Thanks.

How can I fine-tune BARTpho for text summarization?

When I load BARTpho-word via AutoModelForSeq2SeqLM to fine-tune it for text summarization, decoder_start_token_id is None. How can I load the model correctly, or how can I fine-tune BARTpho for text summarization on my dataset? My code is below:
model = AutoModelForSeq2SeqLM.from_pretrained(model_args.model_name_or_path)
print(model.config.decoder_start_token_id)  # prints None

Train model with unsupervised denoising objective

Hi authors,
I plan to pre-train the BARTpho model on my custom Vietnamese datasets with a denoising objective (text infilling + sentence permutation, as suggested in your paper). I have checked all issues and found this related one: #8. However, I still cannot find any example/notebook in the HF link you gave that shows how to pre-train BART on a custom dataset in a denoising manner.

Could you please provide me with a link on how to pre-train BART? I would be very grateful.

Multiple Mask Tokens

I want to ask about multiple mask tokens. For example, with TXT = "chúng tôi [mask] nghiên [mask] viên", I want to return the top_k for the [mask] at position 1 and the top_k for the [mask] at position 2 at the same time. Does the model support this?

Where is the `config.json` file?

Hi @datquocnguyen,

I am very interested in your project. I followed your tutorial and tried to train a new model, but I cannot find config.json in fairseq-bartpho-word.zip. Can you tell me how to get it?

Thank you.

Fine-tuning BARTpho with vinai/bartpho-syllable: error "undefined fairseqs_ids_to_tokens", and 'unk' when using vinai/bartpho-word

  1. Undefined fairseqs_ids_to_tokens error: I'm trying to fine-tune based on vinai/bartpho-syllable; can you suggest a solution for this issue?

I have tried following the instructions here: [https://github.com//issues/2#issuecomment-1146988402]

I have additionally followed the instructions, using the CSV file format shown in the attached image.

  2. When I use vinai/bartpho-word for training, there is no issue with the training process itself, but the prediction results show the appearance of 'unk' tokens.

Cannot load BartPhoTokenizer from pretrained when working with huggingface/transformers

Hi authors,

I get an error when I load the tokenizer from pretrained as below:

from transformers import AutoModel, AutoTokenizer
bartpho_syllable = AutoModel.from_pretrained("vinai/bartpho-syllable")
syllable_tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-syllable", use_fast=True)

AttributeError: module transformers.models.bartpho has no attribute BartphoTokenizer

I'm using Google Colab. Could you let me know how to overcome this issue?
Thanks.
