
G-MAP

G-MAP: General Memory-Augmented Pre-trained Language Model for Domain Tasks

Source code for the EMNLP 2022 main-conference paper "G-MAP: General Memory-Augmented Pre-trained Language Model for Domain Tasks".

Our experiments are built on the Hugging Face Transformers framework, version 4.4.1.

The memory-augmented strategies of G-MAP are shown in the figure below.
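
As a rough illustration of the idea (not the authors' exact design, which is described in the paper), a memory-augmented layer lets the domain-specific encoder cross-attend to the hidden states of a frozen general PLM (the "memory") and gate the retrieved information back into its own representation. The PyTorch class below is a hypothetical sketch of such a fusion layer; all names and dimensions are illustrative.

import torch
import torch.nn as nn

class MemoryAugmentedFusion(nn.Module):
    """Sketch of a memory-augmented fusion layer: cross-attention from the
    domain model's hidden states to a frozen general PLM's hidden states,
    followed by a learned gate. Illustrative only, not the exact G-MAP layer."""

    def __init__(self, hidden_size: int, num_heads: int = 12):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.gate = nn.Linear(2 * hidden_size, 1)

    def forward(self, domain_hidden, general_memory):
        # domain_hidden:  (batch, seq, hidden) from the domain-specific encoder
        # general_memory: (batch, seq, hidden) from the frozen general PLM
        attended, _ = self.cross_attn(domain_hidden, general_memory, general_memory)
        # Per-token gate controls how much general-domain knowledge is mixed back in.
        g = torch.sigmoid(self.gate(torch.cat([domain_hidden, attended], dim=-1)))
        return domain_hidden + g * attended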

Files:

.
├── domain_classification         # G-MAP code for domain-specific classification tasks
│   ├── data/...                  # domain-specific classification datasets
│   ├── domain_models/...         # pretrained domain-specific models
│   ├── mem_roberta/...           # main source code for classification tasks
│   ├── models/...                # pretrained general model
│   └── output_model_dir/...      # saved output models
│
└── domain_qa_ner                 # G-MAP code for domain-specific QA and NER tasks
    ├── data_ner/...              # domain-specific NER datasets
    ├── data_qa/...               # domain-specific QA datasets
    ├── domain_models/...         # pretrained domain-specific models
    └── mem_roberta/...           # main source code for QA and NER tasks

The code for the domain-specific document classification tasks is in domain_classification/. This corresponds to the domain-specific classification experiments in the paper.

The code for the domain-specific question answering (QA) and NER tasks is in domain_qa_ner/. This corresponds to the domain QA and NER experiments in the paper.

The scripts ship with sensible default hyperparameters and can be used for both training and inference on the downstream tasks :)


Setup:

Domain-specific model

For convenience, we suggest directly using the open-source domain-specific models from Hugging Face.

DAPT models

Available DAPT models:

allenai/cs_roberta_base
allenai/biomed_roberta_base
allenai/reviews_roberta_base
allenai/news_roberta_base
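
For example, any of these checkpoints can be loaded directly with the transformers library (a minimal sketch; swap in any of the model IDs listed above):

from transformers import AutoModel, AutoTokenizer

# Load a DAPT domain-specific RoBERTa checkpoint from the Hugging Face hub.
tokenizer = AutoTokenizer.from_pretrained("allenai/biomed_roberta_base")
model = AutoModel.from_pretrained("allenai/biomed_roberta_base")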

Datasets

Domain-specific Classification tasks

The experiments are conducted on eight classification tasks from four domains including biomedical sciences, computer science, news and reviews. They are:

  • ChemProt: a manually annotated chemical–protein interaction dataset extracted from 5,031 abstracts for relation classification;
  • RCT: contains approximately 200,000 abstracts from PubMed with the role of each sentence clearly identified;
  • CitationIntent: contains around 2,000 citations annotated for their function;
  • SciERC: consists of 500 scientific abstracts annotated for relation classification;
  • HyperPartisan: contains 645 articles from Hyperpartisan news with either an extreme left-wing or right-wing standpoint, used for partisanship classification;
  • AGNews: consists of 127,600 categorized articles from more than 2,000 news sources for topic classification;
  • Amazon: consists of 145,251 reviews on Women’s and Men’s Clothing & Accessories, each representing users’ implicit feedback on items with a binary label signifying whether the majority of customers found the review helpful;
  • IMDB: 50,000 balanced positive and negative reviews from the Internet Movie Database for sentiment classification.

The domain-specific datasets can be downloaded and preprocessed with the code released for the Don't Stop Pretraining ACL 2020 paper. Please cite Gururangan et al. (2020) if you use their datasets.

Then, convert the datasets to .tsv files:

python ../domain_classification/data/convert_dont_stop_corpus.py

Domain-specific QA tasks

For the medical QA task, please download the emrQA dataset from https://github.com/panushri25/emrQA.

For the News QA task, please download the dataset from NewsQA.

Domain-specific NER tasks

These datasets are available from https://huggingface.co/datasets:

  • conll2003

  • ncbi_disease

  • wnut_17
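
Each of these can be pulled from the Hugging Face hub with the datasets library, e.g. for a quick inspection (a minimal sketch; the NER training script below loads them via local dataset scripts instead):

from datasets import load_dataset

# Download one of the NER datasets listed above from the Hugging Face hub;
# swap in "conll2003" or "wnut_17" as needed.
ner_dataset = load_dataset("ncbi_disease")
print(ner_dataset["train"][0])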


Training and inference example:

Training of domain-specific document classification tasks

cd domain_classification;
cd mem_roberta;
# Choose the tasks in run_main.py
########################################
cmd1 ='''
   pip install tokenizers
   
   pip install regex
   
   python3 -m torch.distributed.launch --nproc_per_node=4 --master_port 9588 mem_roberta/run_finetune.py --task_name {task_name} --max_seq_length {max_seq_length} --per_device_train_batch_size {per_device_train_batch_size} --learning_rate {learning_rate} --num_train_epochs {num_train_epochs} --output_dir {output_dir} --data_dir {data_dir} --Ngram_path {Ngram_path} --general_model {general_model} --domain_model {domain_model} --seed {seed}  
    '''.format(
        task_name='chemprot',
        max_seq_length=256,
        per_device_train_batch_size=16,
        learning_rate=4e-5,
        num_train_epochs = 14,
        output_dir=local_data_dir_output,
        data_dir=os.path.join(local_data_dir, "chemprot"), # chemprot, citation_intent, ag, amazon
        Ngram_path=local_data_dir+'/pmi_chemprot_ngram.txt',
        general_model=os.path.join(args.model_url, 'Roberta_base'),
        domain_model=os.path.join(args.model_url, 'Chem_bio'), # Chem_bio, CS, News, Reviews
        seed = 42
        )
########################################


python run_main.py --data_url /xxx --model_url /xxx --save_dir /xxx

Training of domain-specific QA tasks

cd domain_qa_ner;
cd mem_roberta;
# Choose the QA tasks in run_main.py
#######################################
cmd1 ='''
   pip install tokenizers
   
   pip install regex
   
   pip install transformers
   
   python3 -m torch.distributed.launch --nproc_per_node=4 --master_port 9585 mem_roberta/run_finetune_qa_medication.py  --output_dir {output_dir} --data_url {data_dir} --general_model {general_model} --domain_model {domain_model} --seed {seed}  
    '''.format(
        output_dir=local_data_dir_output,
        data_dir=os.path.join(local_data_dir, "medication_new"),  # medicationQA, NewsQA
        general_model=os.path.join(args.model_url, 'Roberta_base'),
        domain_model=os.path.join(args.model_url, 'Chem_bio'),# News, Chem_bio
        seed = 42
        )

#######################################

python run_main.py --data_url /xxx --model_url /xxx --save_dir /xxx

Training of domain-specific NER tasks

cd domain_qa_ner;
cd mem_roberta;
# Choose the NER tasks in run_finetune_ner.py
#######################################
......
        dataset_name = '../data_ner/ncbi_disease/ncbi_disease.py'
        # dataset_name = '../data_ner/wnut_17/wnut_17.py'
        # dataset_name = '../data_ner/conll2003/conll2003.py'
        raw_datasets = load_dataset(
            dataset_name,
            data_args.dataset_config_name,
            cache_dir=model_args.cache_dir,
            use_auth_token=True if model_args.use_auth_token else None,
        )
......
#######################################

# Choose the domain-specific models in run_finetune_ner.py
#######################################
......
chembio = '../domain_models/Chem_bio/pytorch_model.bin'
news = '../domain_models/News/pytorch_model.bin'
domain_model_path = chembio  # switch to `news` for the News domain
......
#######################################

python run_finetune_ner.py

For details of the methods and results, please refer to our paper.

@article{Wan2022GMAPGM,
  title={G-MAP: General Memory-Augmented Pre-trained Language Model for Domain Tasks},
  author={Zhongwei Wan and Yichun Yin and Wei Zhang and Jiaxin Shi and Lifeng Shang and Guangyong Chen and Xin Jiang and Qun Liu},
  journal={EMNLP},
  year={2022},
  volume={abs/2212.03613}
}


