This is a Bert2Bert `EncoderDecoderModel` trained on the Liputan6 canonical dataset. The model is based on this documentation and this notebook.
Colab:

```shell
!pip install torch
!pip install transformers[torch]
!pip install evaluate
!pip install datasets
```

Cmd:

```shell
pip install torch
pip install transformers[torch]
pip install evaluate
pip install datasets
git clone https://github.com/zanuura/Bert2Bert_Summarization_Liputan6
```
```python
from transformers import EncoderDecoderModel, AutoTokenizer, pipeline
import datasets

model = EncoderDecoderModel.from_pretrained("Bert2Bert_Summarization_Liputan6/model/")  # insert the path
model.to("cuda")  # move the model to the GPU; the evaluation code below sends its inputs there
tokenizer = AutoTokenizer.from_pretrained("Bert2Bert_Summarization_Liputan6/model/")  # you can also change the tokenizer, e.g. to bert-base-uncased
```
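`from_pretrained` loads from a local directory produced by `save_pretrained`. Assuming the clone above, the model folder would typically contain the standard Transformers files (exact names can vary by Transformers version, e.g. `model.safetensors` instead of `pytorch_model.bin`):

```
Bert2Bert_Summarization_Liputan6/model/
├── config.json              # encoder-decoder architecture and generation settings
├── pytorch_model.bin        # model weights
├── tokenizer_config.json
├── special_tokens_map.json
└── vocab.txt                # BERT WordPiece vocabulary
```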
## Evaluation on the Liputan6 test set
```python
# Load ROUGE for validation (via the evaluate package installed above;
# datasets.load_metric is deprecated and removed in recent datasets releases)
import evaluate

rouge = evaluate.load("rouge")

def generate_summary(batch):
    # Tokenize the articles and move the tensors to the GPU
    inputs = tokenizer(batch["clean_article"], padding="max_length", truncation=True, max_length=512, return_tensors="pt")
    input_ids = inputs.input_ids.to("cuda")
    attention_mask = inputs.attention_mask.to("cuda")

    outputs = model.generate(input_ids, attention_mask=attention_mask)
    batch["pred"] = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    return batch

# test_data is the Liputan6 canonical test split (loading is not shown here)
batch_size = 16  # adjust to fit your GPU memory
results = test_data.map(generate_summary, batched=True, batch_size=batch_size, remove_columns=["clean_article"])

pred_str = results["pred"]
label_str = results["clean_summary"]

rouge_output = rouge.compute(predictions=pred_str, references=label_str, rouge_types=["rouge2"])
print(rouge_output["rouge2"])
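ROUGE-2 scores a summary by how many bigrams it shares with the reference. As a rough illustration of what the metric measures (a simplified sketch with plain whitespace tokenization, not the stemmed tokenization the `rouge` metric actually applies), the F1 variant can be computed by hand:

```python
from collections import Counter

def rouge2_f1(prediction: str, reference: str) -> float:
    """Bigram-overlap F1 between two whitespace-tokenized strings."""
    def bigrams(text):
        tokens = text.lower().split()
        return Counter(zip(tokens, tokens[1:]))

    pred_bg, ref_bg = bigrams(prediction), bigrams(reference)
    overlap = sum((pred_bg & ref_bg).values())  # clipped count of shared bigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_bg.values())
    recall = overlap / sum(ref_bg.values())
    return 2 * precision * recall / (precision + recall)

print(rouge2_f1("the cat sat on the mat", "the cat sat on the mat"))  # identical -> 1.0
print(rouge2_f1("the cat sat", "the cat ran"))  # one of two bigrams matches -> 0.5
```

A score of 1.0 means every bigram matches; unrelated texts score 0.0.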
References:
- EncoderDecoderModel
- BertGeneration
- Bert2Bert CNN/DailyMail notebook
- Liputan6 dataset info
- Datasets
Hope you enjoy it!