hrq-vae's Introduction

I'm a PhD student in NLP at Edinburgh supervised by Mirella Lapata, working on discrete latent variable models for language generation.

Projects

HIRO: Hierarchical Indexing for Retrieval-Augmented Opinion Summarization - Tom Hosking, Hao Tang & Mirella Lapata

Human Feedback is not Gold Standard - Tom Hosking, Phil Blunsom & Max Bartolo (ICLR 2024)

Hercules - Code for the paper "Attributable and Scalable Opinion Summarization", Tom Hosking, Hao Tang & Mirella Lapata (ACL 2023)

HRQ-VAE - Code for the paper "Hierarchical Sketch Induction for Paraphrase Generation", Tom Hosking, Hao Tang & Mirella Lapata (ACL 2022)

Separator - Code for the paper "Factorising Meaning and Form for Intent-Preserving Paraphrasing", Tom Hosking & Mirella Lapata (ACL 2021)

TorchSeq - a sequence modelling framework, built in PyTorch

McKenzie - a Slurm scheduler job monitor

My neural question generation model implementation

hrq-vae's People

Contributors

tomhosking

hrq-vae's Issues

Can't get the example to work.

Hi,
I am trying out this model for generating paraphrases.
I am following the guidelines in the README to test the model, but when I run the code
I get an error that I don't know how to resolve:
Exception: Tokenizer needs to be initialized with a model name before use!

Please help me out with this. Thanks.
test1_error.txt

can't install torchseq

Hiya,

I'll leave this on the torchseq repository as well.

When I run

python3 -m pip install git+https://github.com/tomhosking/[email protected]

via terminal, I get this issue

Installing build dependencies ... done
  Getting requirements to build wheel ... error
  error: subprocess-exited-with-error
  
  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [32 lines of output]
      Traceback (most recent call last):
        File "/Users/voi/Library/Python/3.8/lib/python/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 363, in <module>
          main()
        File "/Users/voi/Library/Python/3.8/lib/python/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 345, in main
          json_out['return_val'] = hook(**hook_input['kwargs'])
        File "/Users/voi/Library/Python/3.8/lib/python/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 130, in get_requires_for_build_wheel
          return hook(config_settings)
        File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/site-packages/setuptools/build_meta.py", line 146, in get_requires_for_build_wheel
          return self._get_build_requires(
        File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/site-packages/setuptools/build_meta.py", line 127, in _get_build_requires
          self.run_setup()
        File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/site-packages/setuptools/build_meta.py", line 142, in run_setup
          exec(compile(code, __file__, 'exec'), locals())
        File "setup.py", line 7, in <module>
          setup(
        File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/site-packages/setuptools/__init__.py", line 165, in setup
          return distutils.core.setup(**attrs)
        File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/site-packages/setuptools/_distutils/core.py", line 108, in setup
          _setup_distribution = dist = klass(attrs)
        File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/site-packages/setuptools/dist.py", line 429, in __init__
          _Distribution.__init__(self, {
        File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/site-packages/setuptools/_distutils/dist.py", line 293, in __init__
          self.finalize_options()
        File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/site-packages/setuptools/dist.py", line 721, in finalize_options
          ep(self)
        File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/site-packages/setuptools/dist.py", line 727, in _finalize_setup_keywords
          ep.require(installer=self.fetch_build_egg)
        File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/site-packages/pkg_resources/__init__.py", line 2483, in require
          items = working_set.resolve(reqs, env, installer, extras=self.extras)
        File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/site-packages/pkg_resources/__init__.py", line 790, in resolve
          raise VersionConflict(dist, req).with_context(dependent_req)
      pkg_resources.VersionConflict: (setuptools 49.2.1 (/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/site-packages), Requirement.parse('setuptools>=58.0'))
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.
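
The last line of the traceback reports that setuptools 49.2.1 is installed while the build requires setuptools>=58.0. A minimal sketch of how that comparison could be checked before installing is shown below. The helper names (parse_version, meets_minimum, installed_version) are hypothetical and only handle plain dotted numeric versions, not the full version grammar that pip itself uses.

```python
import re
from importlib.metadata import PackageNotFoundError, version  # Python 3.8+


def parse_version(text):
    """Turn a dotted version such as '49.2.1' into a tuple of ints.

    Simplified on purpose: parsing stops at the first component that does
    not start with a digit, so suffixes like 'dev0' are ignored.
    """
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if `installed` is at least `minimum` (numeric parts only)."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad the shorter tuple with zeros so that '58' compares equal to '58.0'.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None
```

With the numbers from the log, meets_minimum("49.2.1", "58.0") is False. If installed_version("setuptools") reports something older than 58.0 for the interpreter that runs the build, upgrading it with python3 -m pip install --upgrade setuptools and retrying is one thing to try; whether that resolves this particular install is not confirmed by the thread.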

How to train for another language?

Hi, thanks for such a great repo.
I have already created a dataset for my target language, in the same format as wikianswers-triples-chunk-extendstop-realexemplars-resample-drop30-N5-R100,
and put it in the data folder.
For the second stage I need bert-base-multilingual-uncased.embeddings.pt.
How can I obtain this file?

I have already tried this:
############################################################
from transformers import BertModel, BertTokenizer, BertConfig
import torch

enc = BertTokenizer.from_pretrained("bert-base-multilingual-uncased")

text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
tokenized_text = enc.tokenize(text)

masked_index = 8
tokenized_text[masked_index] = '[MASK]'
indexed_tokens = enc.convert_tokens_to_ids(tokenized_text)
segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])
dummy_input = [tokens_tensor, segments_tensors]

model = BertModel.from_pretrained("bert-base-multilingual-cased", torchscript=True)

traced_model = torch.jit.trace(model, [tokens_tensor, segments_tensors])
torch.jit.save(traced_model, "bert-base-multilingual-uncased.embeddings.pt")
############################################

but got error:
/root/anaconda3/envs/hrq/lib/python3.7/site-packages/torch/serialization.py:781: UserWarning: 'torch.load' received a zip file that looks like a TorchScript archive dispatching to 'torch.jit.load' (call 'torch.jit.load' directly to silence this warning)
" silence this warning)", UserWarning)
Traceback (most recent call last):
File "/root/anaconda3/envs/hrq/bin/torchseq", line 8, in
sys.exit(main())
File "/root/anaconda3/envs/hrq/lib/python3.7/site-packages/torchseq/main.py", line 110, in main
use_cuda=(not args.cpu),
File "/root/anaconda3/envs/hrq/lib/python3.7/site-packages/torchseq/agents/para_agent.py", line 65, in init
self.model = BottleneckAutoencoderModel(self.config, src_field=self.src_field)
File "/root/anaconda3/envs/hrq/lib/python3.7/site-packages/torchseq/models/bottleneck_autoencoder.py", line 35, in init
self.seq_encoder = SequenceEncoder(config)
File "/root/anaconda3/envs/hrq/lib/python3.7/site-packages/torchseq/models/encoder.py", line 24, in init
self.embeddings.weight.data = Tokenizer().get_embeddings(config.prepro.tokenizer)
TypeError: Variable data has to be a tensor, but got RecursiveScriptModule

torch version

Ran into an issue while trying your models out.

Made a new environment.
I installed the requirements.txt.
I added the models to a models folder.

When I tried to run your code in any of the notebooks provided I encountered the following error:

Exception                                 Traceback (most recent call last)
Input In [1], in <cell line: 39>()
     37 config = Config(cfg_dict)
     38 checkpoint_path = path_to_model + "/model/checkpoint.pt"
---> 39 agent = ParaphraseAgent(config=config, run_id=None,  output_path=None, data_path=DATA_PATH, silent=False, verbose=False, training_mode=False)
     41 # Load the checkpoint
     42 agent.load_checkpoint(checkpoint_path)

File C:\Workspace\Anaconda3\envs\gen\lib\site-packages\torchseq\agents\para_agent.py:76, in ParaphraseAgent.__init__(self, config, run_id, output_path, data_path, silent, training_mode, verbose, cache_root, use_cuda)
     73 if training_mode:
     74     self.create_optimizer()
---> 76 self.set_device(use_cuda)
     78 self.create_samplers()

File C:\Workspace\Anaconda3\envs\gen\lib\site-packages\torchseq\agents\base.py:26, in BaseAgent.set_device(self, use_cuda)
     24 if use_cuda and not self.cuda_available:
     25     self.logger.error("Use CUDA is set to true, but not CUDA devices were found!")
---> 26     raise Exception("No CUDA devices found")
     28 self.cuda = self.cuda_available & use_cuda
     30 if self.cuda:

Exception: No CUDA devices found

When I checked the torch version that was installed, it was 1.10.2+cpu.

Your requirements.txt in torchseq was torch=1.11.0, so I am a bit confused about how 1.10.2+cpu was installed.

So I uninstalled torch and reinstalled it using the following command:
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113

The install succeeded, but reported the following error:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behavior is the source of the following dependency conflicts.
torchseq 0.0.1 requires torch==1.10.2, but you have torch 1.11.0+cu113 which is incompatible.
Successfully installed torch-1.11.0+cu113 torchaudio-0.11.0+cu113 torchvision-0.12.0+cu113

I checked my torch version and it is 1.11.0+cu113.

However, when I proceed to run the code provided, I encounter the following error:

Some unexpected keys were found in the loaded checkpoint: 
Some unexpected keys were found in the loaded checkpoint: 
seq_encoder.embedding_projection.weight_g
seq_encoder.embedding_projection.weight_v
bottleneck.module_list.1.quantizer._alpha
bottleneck.module_list.1.quantizer._ema_cluster_size0
bottleneck.module_list.1.quantizer._ema_cluster_size1
bottleneck.module_list.1.quantizer._ema_cluster_size2
bottleneck.module_list.1.quantizer._ema_w.0
bottleneck.module_list.1.quantizer._ema_w.1
bottleneck.module_list.1.quantizer._ema_w.2

Exception                                 Traceback (most recent call last)
Input In [2], in <cell line: 45>()
     42 instance.model.eval()
     44 # Finally, run inference
---> 45 _, _, (pred_output, _, _), _ = instance.inference(data_loader.test_loader)
     47 print(pred_output)

File C:\Workspace\Anaconda3\envs\gen\lib\site-packages\torchseq\agents\model_agent.py:554, in ModelAgent.inference(self, data_loader, memory_keys_to_return, metric_hooks, use_test, training_loop)
    552 with torch.no_grad():
    553     num_samples = 0
--> 554     for batch_idx, batch in enumerate(
    555         tqdm(data_loader, desc="Validating after {:} epochs".format(self.current_epoch), disable=self.silent)
    556     ):
    557         batch = {k: (v.to(self.device) if k[-5:] != "_text" and k[0] != "_" else v) for k, v in batch.items()}
    559         curr_batch_size = batch[[k for k in batch.keys() if k[-5:] != "_text"][0]].size()[0]

File C:\Workspace\Anaconda3\envs\gen\lib\site-packages\tqdm\std.py:1183, in tqdm.__iter__(self)
   1180 # If the bar is disabled, then just walk the iterable
   1181 # (note: keep this check outside the loop for performance)
   1182 if self.disable:
-> 1183     for obj in iterable:
   1184         yield obj
   1185     return

File C:\Workspace\Anaconda3\envs\gen\lib\site-packages\torch\utils\data\dataloader.py:530, in _BaseDataLoaderIter.__next__(self)
    528 if self._sampler_iter is None:
    529     self._reset()
--> 530 data = self._next_data()
    531 self._num_yielded += 1
    532 if self._dataset_kind == _DatasetKind.Iterable and \
    533         self._IterableDataset_len_called is not None and \
    534         self._num_yielded > self._IterableDataset_len_called:

File C:\Workspace\Anaconda3\envs\gen\lib\site-packages\torch\utils\data\dataloader.py:1224, in _MultiProcessingDataLoaderIter._next_data(self)
   1222 else:
   1223     del self._task_info[idx]
-> 1224     return self._process_data(data)

File C:\Workspace\Anaconda3\envs\gen\lib\site-packages\torch\utils\data\dataloader.py:1250, in _MultiProcessingDataLoaderIter._process_data(self, data)
   1248 self._try_put_index()
   1249 if isinstance(data, ExceptionWrapper):
-> 1250     data.reraise()
   1251 return data

File C:\Workspace\Anaconda3\envs\gen\lib\site-packages\torch\_utils.py:457, in ExceptionWrapper.reraise(self)
    453 except TypeError:
    454     # If the exception takes multiple arguments, don't try to
    455     # instantiate since we don't know how to
    456     raise RuntimeError(msg) from None
--> 457 raise exception

Exception: Caught Exception in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "C:\Workspace\Anaconda3\envs\gen\lib\site-packages\torch\utils\data\_utils\worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "C:\Workspace\Anaconda3\envs\gen\lib\site-packages\torch\utils\data\_utils\fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "C:\Workspace\Anaconda3\envs\gen\lib\site-packages\torch\utils\data\_utils\fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "C:\Workspace\Anaconda3\envs\gen\lib\site-packages\torchseq\datasets\json_dataset.py", line 98, in __getitem__
    return JsonDataset.to_tensor(
  File "C:\Workspace\Anaconda3\envs\gen\lib\site-packages\torchseq\datasets\json_dataset.py", line 121, in to_tensor
    parsed = JsonDataInstance(sample, [f["to"] for f in fields])
  File "C:\Workspace\Anaconda3\envs\gen\lib\site-packages\torchseq\datasets\json_dataset.py", line 210, in __init__
    _doc = Tokenizer().tokenise(sample[f])
  File "C:\Workspace\Anaconda3\envs\gen\lib\site-packages\torchseq\utils\singleton.py", line 6, in __call__
    cls._instances[cls] = super(Singleton, cls).__call__(*args, **kwargs)
  File "C:\Workspace\Anaconda3\envs\gen\lib\site-packages\torchseq\utils\tokenizer.py", line 86, in __init__
    raise Exception("Tokenizer needs to be initialized with a model name before use!")
Exception: Tokenizer needs to be initialized with a model name before use!

I was wondering if you could help me identify where I went wrong here?
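
The final traceback passes through torchseq\utils\singleton.py inside a DataLoader worker process. One plausible reading, offered as an assumption rather than a confirmed diagnosis, is that the tokenizer singleton was initialised in the main process but not in the worker. On Windows, worker processes are started fresh and do not inherit in-memory state. The sketch below reproduces that behaviour with a hypothetical stand-in Tokenizer class; it is not torchseq's actual implementation.

```python
class Singleton(type):
    """Metaclass that keeps one cached instance per class."""

    _instances = {}

    def __call__(cls, *args, **kwargs):
        # Only construct on the first call; later calls reuse the cache.
        if cls not in cls._instances:
            cls._instances[cls] = super().__call__(*args, **kwargs)
        return cls._instances[cls]


class Tokenizer(metaclass=Singleton):
    """Hypothetical stand-in that mimics the error message in the traceback."""

    def __init__(self, model_name=None):
        if model_name is None:
            raise Exception("Tokenizer needs to be initialized with a model name before use!")
        self.model_name = model_name


def simulate_fresh_worker():
    """Empty the cache, as a newly spawned process would see it."""
    Singleton._instances.clear()
```

Calling Tokenizer("bert-base-uncased") once makes every later Tokenizer() call return the cached instance. After simulate_fresh_worker(), a bare Tokenizer() raises the exception again. If this is what happens here, loading data in the main process (num_workers=0 on the DataLoader) or initialising the tokenizer in a worker_init_fn are two things to try; how torchseq exposes either setting is not shown in this thread.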

Lower performance with trained model

Hi, my problem is also described in L-Zhe/BTmPG#2.
On MSCOCO data, the model's performance is not good: BLEU 8.79, self-BLEU 18.56.
Could you please tell me some tricks for training the model on MSCOCO data?
Thanks!

How to get the upper bound in the paper?

I want to know how to generate sentences using the sketch of the selected exemplar. In other words, where can I get the oracle sketch of the selected exemplar? Looking forward to your reply.

Regarding evaluation with multiple references

Hi, thanks for your great project!

I wonder how to evaluate with multiple references (e.g., MSCOCO).

The BLEU (precision-based) score supports multiple references inherently, but how about ROUGE scores?
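
One common convention for multi-reference ROUGE is to score the candidate against each reference separately and keep the best score, or the mean. The sketch below illustrates this with a simplified whitespace-tokenised ROUGE-1 F1. It is not the evaluation used in the paper, and the function names are hypothetical.

```python
from collections import Counter


def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 between two strings (lowercased, whitespace tokens)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection clips each token count to the smaller of the two.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def multi_ref_rouge1(candidate, references, aggregate="max"):
    """Score against every reference, then aggregate with 'max' or 'mean'."""
    if not references:
        raise ValueError("at least one reference is required")
    scores = [rouge1_f1(candidate, ref) for ref in references]
    if aggregate == "max":
        return max(scores)
    if aggregate == "mean":
        return sum(scores) / len(scores)
    raise ValueError(f"unknown aggregate: {aggregate!r}")
```

Taking the maximum rewards a candidate for matching any one reference well, while the mean rewards agreement with all of them. Whichever is chosen should be reported alongside the score, since the two are not comparable.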

Training/Dev/Test split: splitforgeneval vs. training-triples

Hi, thanks for sharing the wonderful project and data. I am trying to use the released data for training my own T5-based paraphrasing model. However, there are multiple sets of train/dev/test.jsonl files under different folders. For example,

for paralex:

  1. wikianswers-para-splitforgeneval
  2. training-triples/wikianswers-triples-chunk-extendstop-realexemplars-resample-drop30-N5-R100/ (BTW, the name is also not the same as specified under the conf folder)

for qqp:

  1. qqp-splitforgeneval
  2. training-triples/qqp-clusters-chunk-extendstop-realexemplars-resample-drop30-N5-R100/

I also found there might be potential "overlaps" between train and test sets under the same folder, for example,

grep 'Do astrology really work' qqp-splitforgeneval/test.jsonl
{"tgt": "Dose astrology really work?", "syn_input": "Dose astrology really work?", "sem_input": "Do astrology really work?", "paras": ["Dose astrology really work?"]}

VS.

grep 'Dose astrology really work?' qqp-splitforgeneval/train.jsonl
{"tgt": "Dose astrology really work?", "syn_input": "Dose astrology really work?", "sem_input": "Does Rashi prediction really work?", "paras": ["Dose astrology really work?", "Does astrology works?", "Do astrology really work?", "Does astrology really work, I mean the online astrology?"]}

My questions are

  1. What is the relationship between qqp-splitforgeneval and training-triples?
  2. If I want to compare the results with the paper, which sets should I use, i.e. splitforgeneval or training-triples? (I do not need the "syn_input" utterances.)
  3. Is it safe to assume there are no overlaps among the train/dev/test sets under the same folder? (Also, is it possible for a test "sem_input" to appear in train.jsonl under a different folder?)
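
The overlap found with grep above can be checked systematically. The sketch below assumes one JSON object per line with the "tgt", "sem_input" and "paras" fields shown in the examples. The function names are hypothetical, and matching is exact after stripping surrounding whitespace.

```python
import json

TEXT_FIELDS = ("tgt", "sem_input", "paras")


def collect_sentences(jsonl_lines, fields=TEXT_FIELDS):
    """Gather every sentence found in the given fields of each JSONL record."""
    sentences = set()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        for field in fields:
            value = record.get(field, [])
            # "paras" holds a list of strings, the other fields a single string.
            if isinstance(value, str):
                value = [value]
            sentences.update(text.strip() for text in value)
    return sentences


def find_overlap(train_lines, test_lines,
                 train_fields=TEXT_FIELDS, test_fields=("sem_input",)):
    """Return test inputs that also occur anywhere in the training records."""
    train = collect_sentences(train_lines, train_fields)
    test = collect_sentences(test_lines, test_fields)
    return train & test
```

Both arguments can be open file handles, for example for qqp-splitforgeneval/train.jsonl and qqp-splitforgeneval/test.jsonl, since iterating a file yields its lines. Applied to the two records quoted above, the result contains "Do astrology really work?".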

Thanks and I appreciate your help.
