
paraphrase-any-question-with-t5-text-to-text-transfer-transformer-'s Introduction

Paraphrase any question with T5 (Text-To-Text Transfer Transformer) - Pretrained model and training script provided

Using this program, you can generate paraphrases of any given question.

A detailed Medium blog post explaining the necessary steps can be found here.

Input

The input to our program will be any general question that you can think of -

Which course should I take to get started in data science?

Output

The output will be paraphrased versions of the same question. Paraphrasing a question means creating a new question that expresses the same meaning using a different choice of words.

Paraphrased questions generated by the T5 model:

0: What should I learn to become a data scientist?
1: How do I get started with data science?
2: How would you start a data science career?
3: How can I start learning data science?
4: How do you get started in data science?
5: What's the best course for data science?
6: Which course should I start with for data science?
7: What courses should I follow to get started in data science?
8: What degree should be taken by a data scientist?
9: Which course should I follow to become a Data Scientist?  

Inference code

The t5-pretrained-question-paraphraser.ipynb notebook has all the code to run the model on any given question of your choice and generate paraphrases for it.
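
If you just want the gist without opening the notebook, inference amounts to loading the fine-tuned checkpoint and sampling from model.generate with a paraphrase prefix. A minimal sketch (the checkpoint path is a placeholder for wherever you keep the pretrained model; the sampling parameters mirror those used in the issue threads below):

# Minimal inference sketch. "t5_paraphraser/" is a placeholder for the
# directory holding the fine-tuned checkpoint; the tokenizer is plain t5-base.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = T5ForConditionalGeneration.from_pretrained("t5_paraphraser/").to(device)
tokenizer = T5Tokenizer.from_pretrained("t5-base")

question = "Which course should I take to get started in data science?"
encoding = tokenizer.encode_plus("paraphrase: " + question, return_tensors="pt")
input_ids = encoding["input_ids"].to(device)
attention_mask = encoding["attention_mask"].to(device)

# Top-k / top-p sampling gives more varied paraphrases than plain beam search.
outputs = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    do_sample=True,
    max_length=64,
    top_k=120,
    top_p=0.95,
    num_return_sequences=5,
)
for i, out in enumerate(outputs):
    print(i, ":", tokenizer.decode(out, skip_special_tokens=True))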

Training the model

The training and validation datasets are in the paraphrase_data folder. Install the necessary libraries from requirements.txt, then run train.py on any GPU machine.

Training this model for 2 epochs (the default) took about 20 hours on a p2.xlarge instance (AWS EC2).
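
For orientation, the issues below mention that train.py defines a T5Finetuner LightningModule. Stripped of argument parsing, validation and checkpointing, the core of such a setup looks roughly like this (a simplified sketch, not the exact script; the batch field names are assumptions):

# Simplified sketch of the fine-tuning core (the real train.py also handles
# argument parsing, validation, LR scheduling and checkpointing, and must
# define a train_dataloader() hook returning the tokenized pairs).
import pytorch_lightning as pl
import torch
from transformers import T5ForConditionalGeneration

class T5Finetuner(pl.LightningModule):
    def __init__(self, lr=3e-4):
        super().__init__()
        self.model = T5ForConditionalGeneration.from_pretrained("t5-base")
        self.lr = lr

    def training_step(self, batch, batch_idx):
        # batch: tokenized "paraphrase: <question>" sources and target ids
        outputs = self.model(
            input_ids=batch["source_ids"],
            attention_mask=batch["source_mask"],
            labels=batch["target_ids"],
        )
        return outputs.loss  # cross-entropy over the target tokens

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=self.lr)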


paraphrase-any-question-with-t5-text-to-text-transfer-transformer-'s Issues

Failed to run train.py

I need to retrain this model on a non-question-based paraphrase dataset for research purposes. I got the following errors when I ran train.py, with Python 3.7.6, torch 1.5.1, transformers 3.0.2, and pytorch_lightning 0.8.5.

RuntimeError: An attempt has been made to start a new process before the current process has finished its bootstrapping phase.

BrokenPipeError: [Errno 32] Broken pipe
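
For reference, this pair of errors is the classic symptom of spawn-based multiprocessing (e.g. on Windows): the script's entry point needs to be guarded so DataLoader workers don't re-execute it on import. A sketch of the guard, assuming train.py's top-level logic can be moved into a main() function:

# On platforms that spawn new processes instead of forking (e.g. Windows),
# any script that starts DataLoader workers must guard its entry point;
# otherwise each worker re-imports the script and hits the bootstrapping error.
def main():
    ...  # build the model and Trainer here, then call trainer.fit(model)

if __name__ == "__main__":
    main()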

Unable to adapt to multi-GPU training.

This code worked perfectly on a single GPU with slightly different tasks/datasets. Thank you very much. But when I tried to move it to multi-GPU training, it raised an exception:

MisconfigurationException: No `training_step()` method defined. Lightning `Trainer` expects as minimum a `training_step()`, `training_dataloader()` and `configure_optimizers()` to be defined.

To move this code to multiple GPUs, I added os.environ['CUDA_VISIBLE_DEVICES'] = '0,1' before importing torch, and added this

device_ids = [0,1]
model = torch.nn.DataParallel(model, device_ids=device_ids)
model = model.cuda() 

right after defining model = T5Finetuner(args).

No further modification was made to the original code (which I had run successfully). How does that exception happen?
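
For context, wrapping a LightningModule in torch.nn.DataParallel hides its training_step and dataloader hooks behind the wrapper, which matches the exception above: Lightning inspects the object it is given and no longer sees a LightningModule. The Lightning-native route is to leave the module unwrapped and ask the Trainer for multiple GPUs. A sketch under that assumption (T5Finetuner and args as defined in train.py; argument names are from the 0.8.x-era API used in this thread):

# Sketch: let Lightning manage the devices instead of wrapping the module.
# `gpus` and `distributed_backend` are the 0.8.x-era Trainer arguments;
# newer releases call them `devices` and `strategy`.
import pytorch_lightning as pl

model = T5Finetuner(args)  # plain LightningModule, no DataParallel wrapper
trainer = pl.Trainer(gpus=2, distributed_backend="dp")
trainer.fit(model)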

Exporting the model

This is a great model and the Jupyter notebook is super helpful! Inference works great in the notebook, but I'm curious if you could help me figure out how to export the model so I can use it with ONNX, PyTorch Mobile, etc.
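
One caveat worth noting up front: T5 is an encoder-decoder model, so a single export call can't capture the autoregressive generation loop; a common approach is to export the encoder and decoder separately and reimplement decoding around the exported graphs. A rough, untested sketch of exporting just the encoder (paths, names and opset are assumptions):

# Sketch: export only the T5 encoder to ONNX. Generation (the decoder loop,
# sampling, etc.) still has to be reimplemented around the ONNX runtime.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained("t5_paraphrase/epoch2/")  # path from this thread
model.eval()
model.config.return_dict = False  # tuple outputs export more cleanly
tokenizer = T5Tokenizer.from_pretrained("t5-base")

dummy = tokenizer("paraphrase: a dummy question?", return_tensors="pt")
torch.onnx.export(
    model.encoder,
    (dummy["input_ids"], dummy["attention_mask"]),
    "t5_encoder.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["hidden_states"],
    dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                  "attention_mask": {0: "batch", 1: "seq"}},
    opset_version=12,
)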

Different behavior on Google Colab and a local PC

I've tried train.py on both Google Colab and my own laptop, with the same pytorch-lightning and transformers versions on both. The earlier steps work fine, but at the progress-bar stage of training (model.fit), Colab works fine while my laptop shows ?it/s all night, even though I set both batch sizes to 1 and truncated the dataset to only 10k examples. Is this because my laptop's GPU is out of memory? How can I fix it?

It gets stuck like this:
[screenshot of the stalled progress bar]
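
For what it's worth, a first sanity check here is whether training is on the GPU at all: an out-of-memory condition normally raises an explicit CUDA error rather than silently stalling, so a frozen ?it/s bar more often points at a CPU fallback or stuck DataLoader workers. A minimal check (not from the repo, just illustrative):

# Quick sanity check: is CUDA visible, and is anything allocated on it?
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    print("Allocated MiB:", torch.cuda.memory_allocated(0) / 1024**2)

Setting the DataLoader's num_workers to 0 is another quick way to rule out worker deadlocks.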

t5-base model download

Hi,
This solution is really helpful, but unfortunately, due to proxy issues on my workstation, the t5-base model is not getting downloaded. I have downloaded the model and .bin files manually from S3 and Hugging Face. Any idea how they can be linked? I am still getting an error:
OSError: Model name 't5-base' was not found in tokenizers model name list
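
For reference, from_pretrained also accepts a local directory, so manually downloaded files can be used without network access. A sketch (the directory name is a placeholder):

# Load t5-base from a local folder instead of downloading it. The folder
# must contain config.json, pytorch_model.bin and spiece.model (the
# SentencePiece vocab; its absence triggers the tokenizer error above).
from transformers import T5ForConditionalGeneration, T5Tokenizer

local_dir = "./t5-base-local"  # placeholder: wherever the files were saved
tokenizer = T5Tokenizer.from_pretrained(local_dir)
model = T5ForConditionalGeneration.from_pretrained(local_dir)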

Model seems to perform tasks other than paraphrasing too.

@ramsrigouthamg I trained the model using the scripts in this repo, but the model seems to be performing other tasks, like sentiment analysis, instead of paraphrasing. The model's predictions are shown below.


Truth:
It is located at 142 South Rexford Drive in Beverly Hills . It is opposite the First Church of Christ , scientist , Beverly Hills , California .

Prediction:
True
False
________________________________________________________________________________
Daniel Armstrong is an Australian film director who is also known for his work as a writer , producer and editor .

Truth:
Daniel Armstrong is an Australian film director . Armstrong is also known for his work as a writer , producer and editor .

Prediction:
True
False
________________________________________________________________________________
Magnus turned around and reformed the British invasion of Williams by attacking the Eric Young and Orlando Jordan team .

Truth:
Magnus turned heel and reformed the British Invasion with Williams by attacking the team of Eric Young and Orlando Jordan .

Prediction:
W. Magnus in. Magnus turned around and reformed the British invasion of Williams by attacking the Eric Young and Orlando Jordan team.
"Magnus turned around and reformed the invasion of Williams by attacking the Eric Young and Orlando Jordan team ".
- In his own words., Magnus turned around and reformed the British invasion of Williams by attacking the Eric Young and Orlando Jordan team.
Magnus took over the British invasion of Williams by attacking the Eric Young and Orlando Jordan team.
Great: Magnus changed course and reversed the British invasion of Williams by attacking the Eric Young and Orlando Jordan team.

I think (and hope) something is going wrong in the way the model is called and not the training process itself. This is how I call the model -

import json
import logging
import torch
from datetime import datetime

import numpy as np
import pandas as pd
from transformers import T5ForConditionalGeneration, T5Tokenizer

from tqdm import tqdm

def set_seed(seed):
  torch.manual_seed(seed)
  if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)

set_seed(42)

logging.basicConfig(level=logging.ERROR)
device = "cuda:1"

model_args = {
    "overwrite_output_dir": True,
    "max_seq_length": 256,
    "eval_batch_size": 32,
    "num_train_epochs": 1,
    "use_multiprocessing": True,
    "num_beams": None,
    "do_sample": True,
    "max_length": 50,
    "top_k": 120,
    "top_p": 0.95,
    "num_return_sequences": 5,
}

prefix = "paraphrasing"  # note: shadowed below by the per-row 'paraphrase' prefix from the dataframe

# Load the trained model
model = T5ForConditionalGeneration.from_pretrained('t5_paraphrase/epoch2/')
model = model.to(device)
tokenizer = T5Tokenizer.from_pretrained('t5-base')

# Load the evaluation data
df = pd.read_csv("paraphrase_data/val.tsv", sep="\t")
df.columns = ["input_text", "target_text"]
df.insert(0, "prefix", ['paraphrase']*len(df), True)

# Prepare the data for testing
to_predict = [
    prefix + ": " + str(input_text)
    for prefix, input_text in zip(df["prefix"].tolist(), df["input_text"].tolist())
]
truth = df["target_text"].tolist()
print(to_predict[:5])

preds = []
# Get the model predictions
for sentence in tqdm(to_predict):
    encoding = tokenizer.encode_plus(sentence,pad_to_max_length=True, return_tensors="pt")
    input_ids, attention_masks = encoding["input_ids"].to(device), encoding["attention_mask"].to(device)
    # sample with top_k = 50 and top_p = 0.95, returning 5 sequences
    beam_outputs = model.generate(
        input_ids=input_ids,
        attention_mask=attention_masks,
        do_sample=True,
        max_length=256,
        top_k=50,
        top_p=0.95,
        early_stopping=True,
        num_return_sequences=5
    )
    final_outputs =[]
    for beam_output in beam_outputs:
        sent = tokenizer.decode(beam_output, skip_special_tokens=True,clean_up_tokenization_spaces=True)
        if sent.lower() != sentence.lower() and sent not in final_outputs:
            final_outputs.append(sent)
    preds.append(final_outputs)

# Saving the predictions if needed
with open(f"predictions/predictions_{datetime.now()}.txt", "w") as f:
    for i, text in enumerate(df["input_text"].tolist()):
        f.write(str(text) + "\n\n")

        f.write("Truth:\n")
        f.write(truth[i] + "\n\n")

        f.write("Prediction:\n")
        for pred in preds[i]:
            f.write(str(pred) + "\n")
        f.write(
            "________________________________________________________________________________\n"
        )

Unable to run train.py

Apologies for accidentally closing the previous issue... I'm using an Nvidia GeForce RTX SUPER series graphics card, and I'm running it locally.
