tomhosking / hrq-vae

Hierarchical Sketch Induction for Paraphrase Generation (Hosking et al., ACL 2022)

License: MIT License

Jupyter Notebook 15.66% Python 84.34%

hrq-vae's Introduction

Tom Hosking

I'm a PhD student in NLP at Edinburgh supervised by Mirella Lapata, working on discrete latent variable models for language generation.

Projects

HIRO: Hierarchical Indexing for Retrieval-Augmented Opinion Summarization - Tom Hosking, Hao Tang & Mirella Lapata

Human Feedback is not Gold Standard - Tom Hosking, Phil Blunsom & Max Bartolo (ICLR 2024)

Hercules - Code for the paper "Attributable and Scalable Opinion Summarization", Tom Hosking, Hao Tang & Mirella Lapata (ACL 2023)

HRQ-VAE - Code for the paper "Hierarchical Sketch Induction for Paraphrase Generation", Tom Hosking, Hao Tang & Mirella Lapata (ACL 2022)

Separator - Code for the paper "Factorising Meaning and Form for Intent-Preserving Paraphrasing", Tom Hosking & Mirella Lapata (ACL 2021)

TorchSeq - a sequence modelling framework, built in PyTorch

McKenzie - a Slurm scheduler job monitor

My neural question generation model implementation

hrq-vae's People

Contributors

Stargazers

Watchers

Forkers

casually-pylearner gary-code harshraj172 tjudoubi linjianli lihaofan3 coverquick

hrq-vae's Issues

How to support other languages?

Is it possible to support other languages such as Chinese? What modifications do I need to make? Thank you.

Can't get the example to work.

Hi
I am trying out this model for generating paraphrases.
I am following the guidelines in the README to test the model, but running the code raises an error
that I don't know how to resolve:
Exception: Tokenizer needs to be initialized with a model name before use!

Please help me out with this. Thanks.
test1_error.txt

can't install torchseq

Hiya,

I'll leave this on the torchseq repository as well.

When I run

python3 -m pip install git+https://github.com/tomhosking/[email protected]

via terminal, I get this issue

Installing build dependencies ... done
  Getting requirements to build wheel ... error
  error: subprocess-exited-with-error
  
  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [32 lines of output]
      Traceback (most recent call last):
        File "/Users/voi/Library/Python/3.8/lib/python/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 363, in <module>
          main()
        File "/Users/voi/Library/Python/3.8/lib/python/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 345, in main
          json_out['return_val'] = hook(**hook_input['kwargs'])
        File "/Users/voi/Library/Python/3.8/lib/python/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 130, in get_requires_for_build_wheel
          return hook(config_settings)
        File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/site-packages/setuptools/build_meta.py", line 146, in get_requires_for_build_wheel
          return self._get_build_requires(
        File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/site-packages/setuptools/build_meta.py", line 127, in _get_build_requires
          self.run_setup()
        File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/site-packages/setuptools/build_meta.py", line 142, in run_setup
          exec(compile(code, __file__, 'exec'), locals())
        File "setup.py", line 7, in <module>
          setup(
        File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/site-packages/setuptools/__init__.py", line 165, in setup
          return distutils.core.setup(**attrs)
        File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/site-packages/setuptools/_distutils/core.py", line 108, in setup
          _setup_distribution = dist = klass(attrs)
        File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/site-packages/setuptools/dist.py", line 429, in __init__
          _Distribution.__init__(self, {
        File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/site-packages/setuptools/_distutils/dist.py", line 293, in __init__
          self.finalize_options()
        File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/site-packages/setuptools/dist.py", line 721, in finalize_options
          ep(self)
        File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/site-packages/setuptools/dist.py", line 727, in _finalize_setup_keywords
          ep.require(installer=self.fetch_build_egg)
        File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/site-packages/pkg_resources/__init__.py", line 2483, in require
          items = working_set.resolve(reqs, env, installer, extras=self.extras)
        File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/site-packages/pkg_resources/__init__.py", line 790, in resolve
          raise VersionConflict(dist, req).with_context(dependent_req)
      pkg_resources.VersionConflict: (setuptools 49.2.1 (/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/site-packages), Requirement.parse('setuptools>=58.0'))
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.
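The last line of the traceback above reports a `VersionConflict`: the build asks for `setuptools>=58.0`, but the interpreter found version 49.2.1. A minimal sketch of a pre-install check follows. The helper names `version_tuple` and `meets_minimum` are hypothetical, and the comparison only handles plain dotted numeric versions, not the full PEP 440 grammar.

```python
from importlib import metadata


def version_tuple(version: str) -> tuple:
    """Turn a plain dotted version such as '49.2.1' into (49, 2, 1).

    Simplification: only the leading digits of each component are used,
    and anything after a '+' (a local build tag) is ignored.
    """
    parts = []
    for component in version.split("+")[0].split("."):
        digits = ""
        for ch in component:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Return True if `installed` is at least `minimum`."""
    a, b = version_tuple(installed), version_tuple(minimum)
    # Pad to equal length so that '58' compares equal to '58.0'.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


if __name__ == "__main__":
    try:
        installed = metadata.version("setuptools")
    except metadata.PackageNotFoundError:
        installed = None
    if installed is None:
        print("setuptools is not installed for this interpreter")
    else:
        print("setuptools", installed, "meets >=58.0:", meets_minimum(installed, "58.0"))
```

If the check fails, upgrading setuptools for the same interpreter (for example with `python3 -m pip install --upgrade setuptools`) is the usual first step, though whether that resolves this particular install has not been confirmed here.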

how to train for other language?

Hi, thanks for such a great repo.
I have already created a dataset for my target language, in the same format as wikianswers-triples-chunk-extendstop-realexemplars-resample-drop30-N5-R100,
and put it in the data folder.
For the second stage I need bert-base-multilingual-uncased.embeddings.pt.
How can I create this file?

I have already tried this:
############################################################
from transformers import BertModel, BertTokenizer, BertConfig
import torch

enc = BertTokenizer.from_pretrained("bert-base-multilingual-uncased")

text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
tokenized_text = enc.tokenize(text)

masked_index = 8
tokenized_text[masked_index] = '[MASK]'
indexed_tokens = enc.convert_tokens_to_ids(tokenized_text)
segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])
dummy_input = [tokens_tensor, segments_tensors]

model = BertModel.from_pretrained("bert-base-multilingual-cased", torchscript=True)

traced_model = torch.jit.trace(model, [tokens_tensor, segments_tensors])
torch.jit.save(traced_model, "bert-base-multilingual-uncased.embeddings.pt")
############################################

but got error:
/root/anaconda3/envs/hrq/lib/python3.7/site-packages/torch/serialization.py:781: UserWarning: 'torch.load' received a zip file that looks like a TorchScript archive dispatching to 'torch.jit.load' (call 'torch.jit.load' directly to silence this warning)
  " silence this warning)", UserWarning)
Traceback (most recent call last):
  File "/root/anaconda3/envs/hrq/bin/torchseq", line 8, in <module>
    sys.exit(main())
  File "/root/anaconda3/envs/hrq/lib/python3.7/site-packages/torchseq/main.py", line 110, in main
    use_cuda=(not args.cpu),
  File "/root/anaconda3/envs/hrq/lib/python3.7/site-packages/torchseq/agents/para_agent.py", line 65, in __init__
    self.model = BottleneckAutoencoderModel(self.config, src_field=self.src_field)
  File "/root/anaconda3/envs/hrq/lib/python3.7/site-packages/torchseq/models/bottleneck_autoencoder.py", line 35, in __init__
    self.seq_encoder = SequenceEncoder(config)
  File "/root/anaconda3/envs/hrq/lib/python3.7/site-packages/torchseq/models/encoder.py", line 24, in __init__
    self.embeddings.weight.data = Tokenizer().get_embeddings(config.prepro.tokenizer)
TypeError: Variable data has to be a tensor, but got RecursiveScriptModule
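
The `TypeError` above suggests that the loader assigns the file contents directly to `embeddings.weight.data`, so the file would need to hold a plain tensor, whereas `torch.jit.save` writes a whole TorchScript module. A minimal sketch follows, assuming the `.embeddings.pt` file is expected to contain only the input embedding matrix; that is inferred from the traceback, not confirmed against torchseq. The helper `save_embedding_matrix` is hypothetical. With a Hugging Face model, the embedding layer can be obtained with `model.get_input_embeddings()`. Note also that the posted script loads the uncased tokenizer but the cased model, so the two vocabularies would not match.

```python
import os
import tempfile

import torch


def save_embedding_matrix(embedding: torch.nn.Embedding, path: str) -> None:
    """Save only the embedding weight matrix as a plain tensor (hypothetical helper)."""
    # detach() drops autograd history; clone() gives the saved tensor its own storage.
    torch.save(embedding.weight.detach().clone(), path)


if __name__ == "__main__":
    # Stand-in for a real model's embedding layer. With Hugging Face transformers
    # this would be BertModel.from_pretrained(name).get_input_embeddings().
    layer = torch.nn.Embedding(num_embeddings=10, embedding_dim=4)

    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.embeddings.pt")
        save_embedding_matrix(layer, path)
        loaded = torch.load(path)
        # The loaded object is a Tensor, so it could be assigned to `weight.data`.
        print(type(loaded).__name__, tuple(loaded.shape))
```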

'Config' object has no attribute 'encoder_outputs'

Hello!!
When I use the checkpoint model and load the config file, I get the error: 'Config' object has no attribute 'encoder_outputs'.

torch version

I ran into an issue while trying your models out.

I made a new environment.
I installed the requirements.txt.
I added the models to a models folder.

When I tried to run the code in any of the provided notebooks, I encountered the following error:

Exception                                 Traceback (most recent call last)
Input In [1], in <cell line: 39>()
     37 config = Config(cfg_dict)
     38 checkpoint_path = path_to_model + "/model/checkpoint.pt"
---> 39 agent = ParaphraseAgent(config=config, run_id=None,  output_path=None, data_path=DATA_PATH, silent=False, verbose=False, training_mode=False)
     41 # Load the checkpoint
     42 agent.load_checkpoint(checkpoint_path)

File C:\Workspace\Anaconda3\envs\gen\lib\site-packages\torchseq\agents\para_agent.py:76, in ParaphraseAgent.__init__(self, config, run_id, output_path, data_path, silent, training_mode, verbose, cache_root, use_cuda)
     73 if training_mode:
     74     self.create_optimizer()
---> 76 self.set_device(use_cuda)
     78 self.create_samplers()

File C:\Workspace\Anaconda3\envs\gen\lib\site-packages\torchseq\agents\base.py:26, in BaseAgent.set_device(self, use_cuda)
     24 if use_cuda and not self.cuda_available:
     25     self.logger.error("Use CUDA is set to true, but not CUDA devices were found!")
---> 26     raise Exception("No CUDA devices found")
     28 self.cuda = self.cuda_available & use_cuda
     30 if self.cuda:

Exception: No CUDA devices found

When I checked the installed torch version, it was 1.10.2+cpu.

The requirements.txt in torchseq specified torch=1.11.0, so I am a bit confused about how 1.10.2+cpu was installed.

So I uninstalled torch and reinstalled it using the following command:
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113

The install succeeded, but with the following error:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behavior is the source of the following dependency conflicts.
torchseq 0.0.1 requires torch==1.10.2, but you have torch 1.11.0+cu113 which is incompatible.
Successfully installed torch-1.11.0+cu113 torchaudio-0.11.0+cu113 torchvision-0.12.0+cu113
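
The version strings in this report (`1.10.2+cpu`, `1.11.0+cu113`) carry a local build tag after the `+`, which distinguishes a CPU-only wheel from a CUDA one. A small sketch for splitting that tag is shown here; the helper names are hypothetical, and in a live environment the string would come from `torch.__version__`. A missing tag says nothing either way about CUDA support.

```python
def split_build_tag(version: str) -> tuple:
    """Split 'X.Y.Z+tag' into ('X.Y.Z', 'tag'); the tag is '' when absent."""
    release, _, tag = version.partition("+")
    return release, tag


def is_cpu_only(version: str) -> bool:
    """Heuristic: a '+cpu' tag marks a CPU-only wheel."""
    return split_build_tag(version)[1] == "cpu"


if __name__ == "__main__":
    for v in ("1.10.2+cpu", "1.11.0+cu113"):
        print(v, "->", split_build_tag(v), "cpu-only:", is_cpu_only(v))
```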

I checked my torch version and it is now 1.11.0+cu113.

However, when I run the code provided, I encounter the following error:

Some unexpected keys were found in the loaded checkpoint: 
Some unexpected keys were found in the loaded checkpoint: 
seq_encoder.embedding_projection.weight_g
seq_encoder.embedding_projection.weight_v
bottleneck.module_list.1.quantizer._alpha
bottleneck.module_list.1.quantizer._ema_cluster_size0
bottleneck.module_list.1.quantizer._ema_cluster_size1
bottleneck.module_list.1.quantizer._ema_cluster_size2
bottleneck.module_list.1.quantizer._ema_w.0
bottleneck.module_list.1.quantizer._ema_w.1
bottleneck.module_list.1.quantizer._ema_w.2

Exception                                 Traceback (most recent call last)
Input In [2], in <cell line: 45>()
     42 instance.model.eval()
     44 # Finally, run inference
---> 45 _, _, (pred_output, _, _), _ = instance.inference(data_loader.test_loader)
     47 print(pred_output)

File C:\Workspace\Anaconda3\envs\gen\lib\site-packages\torchseq\agents\model_agent.py:554, in ModelAgent.inference(self, data_loader, memory_keys_to_return, metric_hooks, use_test, training_loop)
    552 with torch.no_grad():
    553     num_samples = 0
--> 554     for batch_idx, batch in enumerate(
    555         tqdm(data_loader, desc="Validating after {:} epochs".format(self.current_epoch), disable=self.silent)
    556     ):
    557         batch = {k: (v.to(self.device) if k[-5:] != "_text" and k[0] != "_" else v) for k, v in batch.items()}
    559         curr_batch_size = batch[[k for k in batch.keys() if k[-5:] != "_text"][0]].size()[0]

File C:\Workspace\Anaconda3\envs\gen\lib\site-packages\tqdm\std.py:1183, in tqdm.__iter__(self)
   1180 # If the bar is disabled, then just walk the iterable
   1181 # (note: keep this check outside the loop for performance)
   1182 if self.disable:
-> 1183     for obj in iterable:
   1184         yield obj
   1185     return

File C:\Workspace\Anaconda3\envs\gen\lib\site-packages\torch\utils\data\dataloader.py:530, in _BaseDataLoaderIter.__next__(self)
    528 if self._sampler_iter is None:
    529     self._reset()
--> 530 data = self._next_data()
    531 self._num_yielded += 1
    532 if self._dataset_kind == _DatasetKind.Iterable and \
    533         self._IterableDataset_len_called is not None and \
    534         self._num_yielded > self._IterableDataset_len_called:

File C:\Workspace\Anaconda3\envs\gen\lib\site-packages\torch\utils\data\dataloader.py:1224, in _MultiProcessingDataLoaderIter._next_data(self)
   1222 else:
   1223     del self._task_info[idx]
-> 1224     return self._process_data(data)

File C:\Workspace\Anaconda3\envs\gen\lib\site-packages\torch\utils\data\dataloader.py:1250, in _MultiProcessingDataLoaderIter._process_data(self, data)
   1248 self._try_put_index()
   1249 if isinstance(data, ExceptionWrapper):
-> 1250     data.reraise()
   1251 return data

File C:\Workspace\Anaconda3\envs\gen\lib\site-packages\torch\_utils.py:457, in ExceptionWrapper.reraise(self)
    453 except TypeError:
    454     # If the exception takes multiple arguments, don't try to
    455     # instantiate since we don't know how to
    456     raise RuntimeError(msg) from None
--> 457 raise exception

Exception: Caught Exception in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "C:\Workspace\Anaconda3\envs\gen\lib\site-packages\torch\utils\data\_utils\worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "C:\Workspace\Anaconda3\envs\gen\lib\site-packages\torch\utils\data\_utils\fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "C:\Workspace\Anaconda3\envs\gen\lib\site-packages\torch\utils\data\_utils\fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "C:\Workspace\Anaconda3\envs\gen\lib\site-packages\torchseq\datasets\json_dataset.py", line 98, in __getitem__
    return JsonDataset.to_tensor(
  File "C:\Workspace\Anaconda3\envs\gen\lib\site-packages\torchseq\datasets\json_dataset.py", line 121, in to_tensor
    parsed = JsonDataInstance(sample, [f["to"] for f in fields])
  File "C:\Workspace\Anaconda3\envs\gen\lib\site-packages\torchseq\datasets\json_dataset.py", line 210, in __init__
    _doc = Tokenizer().tokenise(sample[f])
  File "C:\Workspace\Anaconda3\envs\gen\lib\site-packages\torchseq\utils\singleton.py", line 6, in __call__
    cls._instances[cls] = super(Singleton, cls).__call__(*args, **kwargs)
  File "C:\Workspace\Anaconda3\envs\gen\lib\site-packages\torchseq\utils\tokenizer.py", line 86, in __init__
    raise Exception("Tokenizer needs to be initialized with a model name before use!")
Exception: Tokenizer needs to be initialized with a model name before use!

I was wondering if you could help me identify where I went wrong here?
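
The traceback above shows that `Tokenizer` is built through a singleton wrapper (`torchseq/utils/singleton.py`) and raises when it is first constructed without a model name; here that first construction happens inside a DataLoader worker process. One possible explanation, offered as a hypothesis rather than a diagnosis, is that the worker starts without the instance that the main process configured, which is typical when worker processes are spawned rather than forked, as on Windows. The sketch below reproduces the pattern in plain Python. The `Singleton` and `Tokenizer` classes are simplified stand-ins, not torchseq's code, and `simulate_fresh_process` is a hypothetical helper.

```python
class Singleton(type):
    """Metaclass that keeps one instance per class (simplified stand-in)."""

    _instances = {}

    def __call__(cls, *args, **kwargs):
        if cls not in cls._instances:
            cls._instances[cls] = super().__call__(*args, **kwargs)
        return cls._instances[cls]


class Tokenizer(metaclass=Singleton):
    """Stand-in tokenizer that must be configured on first construction."""

    def __init__(self, model_name=None):
        if model_name is None:
            raise Exception("Tokenizer needs to be initialized with a model name before use!")
        self.model_name = model_name


def simulate_fresh_process():
    """Hypothetical helper: forget all instances, as a newly spawned process would."""
    Singleton._instances.clear()


if __name__ == "__main__":
    Tokenizer("bert-base-uncased")   # the main process configures the singleton
    print(Tokenizer().model_name)    # later calls reuse the configured instance

    simulate_fresh_process()         # a worker starts with no instance
    try:
        Tokenizer()
    except Exception as err:
        print("worker-side failure:", err)
```

If this hypothesis holds, loading data in the main process (`num_workers=0` on a PyTorch `DataLoader`) would avoid the problem; whether torchseq exposes that setting in its config is not confirmed here.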

Lower performance with trained model

Hi, I have also described my problem at L-Zhe/BTmPG#2.
On MSCOCO data, the model's performance is poor: BLEU 8.79, self-BLEU 18.56.
Could you please share some tips for training the model on MSCOCO data?
Thanks!

How to get the upper bound in the paper?

I want to know how to generate sentences using the sketch of the selected exemplar. In other words, where can I get the oracle sketch of the selected exemplar? Looking forward to your reply.

Regarding evaluation with multiple references

Hi, thanks for your great project!

I wonder how to evaluate with multiple references (e.g., MSCOCO).

BLEU (a precision-based score) supports multiple references inherently, but what about ROUGE scores?
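
One common convention for multi-reference ROUGE is to score the prediction against each reference separately and keep the best score; averaging over references is another option. The sketch below illustrates this with a deliberately simplified ROUGE-1 F1 (lowercased whitespace tokens, no stemming), so its numbers will not match an official ROUGE implementation. The function names are hypothetical, and nothing here is claimed to be the evaluation protocol used in the paper.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap on whitespace tokens."""
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((pred & ref).values())  # counts clipped to the smaller side
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def multi_ref_rouge1(prediction, references, reduce=max):
    """Score against every reference, then reduce (max by default)."""
    scores = [rouge1_f1(prediction, r) for r in references]
    return reduce(scores) if scores else 0.0


if __name__ == "__main__":
    prediction = "a man rides a horse"
    references = ["a man is riding a horse", "someone on horseback"]
    print("best over references:", multi_ref_rouge1(prediction, references))
    print("mean over references:",
          multi_ref_rouge1(prediction, references, reduce=lambda s: sum(s) / len(s)))
```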

Training/Dev/Test split: splitforgeneval vs. training-triples

Hi, thanks for sharing the wonderful project and data. I am trying to use the released data for training my own T5-based paraphrasing model. However, there are multiple sets of train/dev/test.jsonl files under different folders. For example,

for paralex:

wikianswers-para-splitforgeneval
training-triples/wikianswers-triples-chunk-extendstop-realexemplars-resample-drop30-N5-R100/ (BTW, the name is also not the same as specified under the conf folder)

for qqp:

qqp-splitforgeneval
training-triples/qqp-clusters-chunk-extendstop-realexemplars-resample-drop30-N5-R100/

I also found there might be potential "overlaps" between train and test sets under the same folder, for example,

grep 'Do astrology really work' qqp-splitforgeneval/test.jsonl
{"tgt": "Dose astrology really work?", "syn_input": "Dose astrology really work?", "sem_input": "Do astrology really work?", "paras": ["Dose astrology really work?"]}

VS.

grep 'Dose astrology really work?' qqp-splitforgeneval/train.jsonl
{"tgt": "Dose astrology really work?", "syn_input": "Dose astrology really work?", "sem_input": "Does Rashi prediction really work?", "paras": ["Dose astrology really work?", "Does astrology works?", "Do astrology really work?", "Does astrology really work, I mean the online astrology?"]}

My questions are

What is the relationship between qqp-splitforgeneval and training-triples?
If I want to compare results with the paper, which sets should I use: splitforgeneval or training-triples? (I do not need the "syn_input" utterances.)
Is it safe to assume there are no overlaps among the train/dev/eval sets under the same folder? (For example, is it possible for a test "sem_input" to appear in train.jsonl under a different folder?)
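
The overlap question can be checked mechanically. The sketch below collects every sentence that appears in the training rows (using the field names visible in the grep output above: `tgt`, `sem_input`, `syn_input`, `paras`) and reports test rows whose `sem_input` or `tgt` also occurs there. The helper names are hypothetical, and exact string matching is an assumption; near-duplicates would need normalisation first.

```python
import json


def sentences_in(lines, fields=("tgt", "sem_input", "syn_input", "paras")):
    """Collect every string found in the given fields of JSONL rows."""
    seen = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        for field in fields:
            value = row.get(field)
            if isinstance(value, str):
                seen.add(value)
            elif isinstance(value, list):
                seen.update(v for v in value if isinstance(v, str))
    return seen


def overlapping_rows(train_lines, test_lines, test_fields=("sem_input", "tgt")):
    """Return, per test row, the test sentences that also occur in training."""
    train_sentences = sentences_in(train_lines)
    hits = []
    for line in test_lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        shared = [row[f] for f in test_fields
                  if isinstance(row.get(f), str) and row[f] in train_sentences]
        if shared:
            hits.append(shared)
    return hits


if __name__ == "__main__":
    # Rows copied from the grep output quoted in this issue.
    train = [json.dumps({
        "tgt": "Dose astrology really work?",
        "syn_input": "Dose astrology really work?",
        "sem_input": "Does Rashi prediction really work?",
        "paras": ["Dose astrology really work?", "Does astrology works?",
                  "Do astrology really work?",
                  "Does astrology really work, I mean the online astrology?"],
    })]
    test = [json.dumps({
        "tgt": "Dose astrology really work?",
        "syn_input": "Dose astrology really work?",
        "sem_input": "Do astrology really work?",
        "paras": ["Dose astrology really work?"],
    })]
    print(overlapping_rows(train, test))
```

For real files, the two arguments can be open file handles, for example `open("qqp-splitforgeneval/train.jsonl")` and `open("qqp-splitforgeneval/test.jsonl")`, since both functions only iterate over lines.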

Thanks and I appreciate your help.