
vec2text's Introduction

vec2text

This library contains code for text embedding inversion. It can train various architectures that reconstruct text sequences from embeddings, as well as run pre-trained inversion models. This repository accompanies the paper "Text Embeddings Reveal (Almost) As Much As Text".

To get started, install the package from PyPI:

pip install vec2text

Link to Colab Demo

Development

If you're training a model, you'll need to set up nltk:

import nltk
nltk.download('punkt')

Before pushing any code, please run pre-commit:

pre-commit run --all

Usage

The library can be used to embed text and then invert it, or invert directly from embeddings. First you'll need to construct a Corrector object which wraps the necessary models, embedders, and tokenizers:

Load a model via load_pretrained_corrector

corrector = vec2text.load_pretrained_corrector("text-embedding-ada-002")

Load a model via load_corrector

If you have trained your own custom models using vec2text, you can load them using the load_corrector function.

inversion_model = vec2text.models.InversionModel.from_pretrained("jxm/gtr__nq__32")
corrector_model = vec2text.models.CorrectorEncoderModel.from_pretrained("jxm/gtr__nq__32__correct")

corrector = vec2text.load_corrector(inversion_model, corrector_model)

Both the vec2text.models.InversionModel and vec2text.models.CorrectorEncoderModel classes inherit from transformers.PreTrainedModel, so you can pass in either a Hugging Face model name or a path to a local directory.
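For example, loading from local checkpoint directories might look like this (a minimal sketch; the ./checkpoints/... paths are hypothetical placeholders for directories saved with save_pretrained):

import vec2text

# Hypothetical local directories containing weights saved via save_pretrained()
inversion_model = vec2text.models.InversionModel.from_pretrained("./checkpoints/my_inversion_model")
corrector_model = vec2text.models.CorrectorEncoderModel.from_pretrained("./checkpoints/my_corrector_model")

corrector = vec2text.load_corrector(inversion_model, corrector_model)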

Invert text with invert_strings

vec2text.invert_strings(
    [
        "Jack Morris is a PhD student at Cornell Tech in New York City",
        "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity"
    ],
    corrector=corrector,
)
['Morris is a PhD student at Cornell University in New York City',
 'It was the age of incredulity, the age of wisdom, the age of apocalypse, the age of apocalypse, it was the age of faith, the age of best faith, it was the age of foolishness']

By default, this will make a single guess (using the hypothesizer). For better results, you can run multiple correction steps:

vec2text.invert_strings(
    [
        "Jack Morris is a PhD student at Cornell Tech in New York City",
        "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity"
    ],
    corrector=corrector,
    num_steps=20,
)
['Jack Morris is a PhD student in tech at Cornell University in New York City',
 'It was the best time of the epoch, it was the worst time of the epoch, it was the best time of the age of wisdom, it was the age of incredulity, it was the age of betrayal']

And for even better results, you can increase the size of the search space by setting sequence_beam_width to a positive integer:

vec2text.invert_strings(
    [
        "Jack Morris is a PhD student at Cornell Tech in New York City",
        "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity"
    ],
    corrector=corrector,
    num_steps=20,
    sequence_beam_width=4,
)
['Jack Morris is a PhD student at Cornell Tech in New York City',
 'It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity']

Note that this technique has to store sequence_beam_width * sequence_beam_width hypotheses at each step, so if you set it too high, you'll run out of GPU memory.
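If you're unsure how wide a beam your GPU can handle, one option is to retry with a narrower beam whenever you hit an out-of-memory error. Below is a minimal sketch (assuming a recent PyTorch that exposes torch.cuda.OutOfMemoryError):

import torch
import vec2text

def invert_with_fallback(strings, corrector, num_steps=20, beam_widths=(8, 4, 1)):
    # Try progressively narrower sequence beams until inversion fits in GPU memory.
    for width in beam_widths:
        try:
            return vec2text.invert_strings(
                strings,
                corrector=corrector,
                num_steps=num_steps,
                sequence_beam_width=width,
            )
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()  # release cached blocks before retrying with a smaller beam
    raise RuntimeError("inversion ran out of GPU memory even with the smallest beam width")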

Invert embeddings with invert_embeddings

If you only have embeddings, you can invert them directly:

import openai
import torch

def get_embeddings_openai(text_list, model="text-embedding-ada-002") -> torch.Tensor:
    client = openai.OpenAI()
    response = client.embeddings.create(
        input=text_list,
        model=model,
        encoding_format="float",  # override default base64 encoding...
    )
    # response.data preserves the input order; each item exposes an .embedding list of floats
    outputs = [e.embedding for e in response.data]
    return torch.tensor(outputs)


embeddings = get_embeddings_openai([
       "Jack Morris is a PhD student at Cornell Tech in New York City",
       "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity"
])


vec2text.invert_embeddings(
    embeddings=embeddings.cuda(),
    corrector=corrector
)
['Morris is a PhD student at Cornell University in New York City',
 'It was the age of incredulity, the age of wisdom, the age of apocalypse, the age of apocalypse, it was the age of faith, the age of best faith, it was the age of foolishness']

This function also takes the same optional hyperparameters, num_steps and sequence_beam_width.
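For example, passing both hyperparameters looks just like the invert_strings calls above:

vec2text.invert_embeddings(
    embeddings=embeddings.cuda(),
    corrector=corrector,
    num_steps=20,
    sequence_beam_width=4,
)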

Similarly, you can invert gtr-base embeddings with the following example:

import vec2text
import torch
from transformers import AutoModel, AutoTokenizer, PreTrainedTokenizer, PreTrainedModel


def get_gtr_embeddings(text_list,
                       encoder: PreTrainedModel,
                       tokenizer: PreTrainedTokenizer) -> torch.Tensor:

    inputs = tokenizer(text_list,
                       return_tensors="pt",
                       max_length=128,
                       truncation=True,
                       padding="max_length",).to("cuda")

    with torch.no_grad():
        model_output = encoder(input_ids=inputs['input_ids'], attention_mask=inputs['attention_mask'])
        hidden_state = model_output.last_hidden_state
        embeddings = vec2text.models.model_utils.mean_pool(hidden_state, inputs['attention_mask'])

    return embeddings


encoder = AutoModel.from_pretrained("sentence-transformers/gtr-t5-base").encoder.to("cuda")
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/gtr-t5-base")
corrector = vec2text.load_pretrained_corrector("gtr-base")

embeddings = get_gtr_embeddings([
       "Jack Morris is a PhD student at Cornell Tech in New York City",
       "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity"
], encoder, tokenizer)

vec2text.invert_embeddings(
    embeddings=embeddings.cuda(),
    corrector=corrector,
    num_steps=20,
)
['Jack Morris Morris is a PhD student at  Cornell Tech in New York City ',
'It was the best of times, it was the worst of times, it was the age of wisdom, it was the epoch of foolishness']
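The mean_pool helper used above is a standard attention-masked mean over the encoder's token states; conceptually it does roughly the following (a simplified re-implementation for illustration, not the library's exact code):

import torch

def mean_pool_sketch(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # hidden_states: (batch, seq_len, hidden_dim); attention_mask: (batch, seq_len)
    mask = attention_mask.unsqueeze(-1).type_as(hidden_states)  # (batch, seq_len, 1)
    summed = (hidden_states * mask).sum(dim=1)                  # zero out padding, sum real tokens
    counts = mask.sum(dim=1).clamp(min=1e-9)                    # number of real tokens per example
    return summed / counts                                      # (batch, hidden_dim)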

Interpolation

You can mix two embeddings together for interesting results. Given embeddings of the previous two inputs, we can invert their mean:

vec2text.invert_embeddings(
    embeddings=embeddings.mean(dim=0, keepdim=True).cuda(),
    corrector=corrector
)
['Morris was in the age of physics, the age of astronomy, the age of physics, the age of physics PhD at New York']

Or do linear interpolation (this isn't particularly interesting, feel free to submit a PR with a cooler example):

import numpy as np

for alpha in np.arange(0.0, 1.0, 0.1):
  mixed_embedding = torch.lerp(input=embeddings[0], end=embeddings[1], weight=alpha)
  text = vec2text.invert_embeddings(
      embeddings=mixed_embedding[None].cuda(),
      corrector=corrector,
      num_steps=20,
      sequence_beam_width=4,
  )[0]
  print(f'alpha={alpha:.1f}\t', text)

alpha=0.0	 Jack Morris is a PhD student at Cornell Tech in New York City
alpha=0.1	 Jack Morris is a PhD student at Cornell Tech in New York City
alpha=0.2	 Jack Morris is a PhD student at Cornell Tech in New York City
alpha=0.3	 Jack Morris is a PhD student at Cornell Institute of Technology in New York City
alpha=0.4	 Jack Morris was a PhD student at Cornell Tech in New York City It is the epoch of wisdom, it is the epoch of incredulity
alpha=0.5	 Jack Morris is a Ph.D. student at Cornell Tech in New York City It was the epoch of wisdom, it was the epoch of incredulity, it was the epoch of times
alpha=0.6	 James Morris is a PhD student at New York Tech It was the epoch of wisdom, it was the age of incredulity, it was the best of times
alpha=0.7	 It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of incredulity, it was the epoch of incredulity at Morris, Ph.D
alpha=0.8	 It was the best of times, it was the worst of times, it was the epoch of wisdom, it was the age of incredulity, it was the age of incredulity
alpha=0.9	 It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of incredulity, it was the age of belief, it was the epoch of foolishness

Training a model

Most of the code in this repository facilitates training inversion models, which happens in essentially three steps:

  1. Training a 'zero-step' model to generate text from embeddings
  2. Using the zero-step model to generate 'hypotheses', the training data for the correction model
  3. Training a correction model conditioned on (true embedding, hypothesis, hypothesis embedding) tuples to generate corrected text

Steps 2 and 3 happen together by simply executing the training script. Our code also supports precomputing hypotheses using DDP, which is useful because hypothesis generation on the full MSMARCO can take quite some time (even a few days) on a single GPU. Also note that you'll need a good amount of disk space; for example, storing full-precision ada-2 embeddings for all 8.8m documents from MSMARCO takes 54 GB of disk space.
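That 54 GB figure follows directly from the embedding dimensionality: ada-2 embeddings are 1536-dimensional, so at full (float32) precision the storage works out as below (a back-of-the-envelope check, ignoring any container overhead):

num_docs = 8_800_000     # MSMARCO documents
dim = 1536               # text-embedding-ada-002 dimensionality
bytes_per_float = 4      # float32, i.e. "full precision"

total_gb = num_docs * dim * bytes_per_float / 1e9
print(f"{total_gb:.1f} GB")  # ~54.1 GB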

Example: training a GTR corrector

Here's how you might train the zero-step model for GTR:

python run.py --per_device_train_batch_size 128 --per_device_eval_batch_size 128 --max_seq_length 128 --model_name_or_path t5-base --dataset_name msmarco --embedder_model_name gtr_base --num_repeat_tokens 16 --embedder_no_grad True --num_train_epochs 100 --max_eval_samples 500 --eval_steps 20000 --warmup_steps 10000 --bf16=1 --use_wandb=1 --use_frozen_embeddings_as_input True --experiment inversion --lr_scheduler_type constant_with_warmup --exp_group_name oct-gtr --learning_rate 0.001 --output_dir ./saves/gtr-1 --save_steps 2000

Note that there are a lot of options to change things about the data and model architecture. If you want to train the small GTR inverter from the paper, this command will work, but you'll have to reduce the maximum sequence length to 32. Once this model is trained, add its path to the file aliases.py under the key gtr_msmarco__msl128__100epoch (a sketch of this entry follows the next command) and then run the following command to train the corrector:

python run.py --per_device_train_batch_size 32 --per_device_eval_batch_size 32 --max_seq_length 128 --model_name_or_path t5-base --dataset_name msmarco --embedder_model_name gtr_base --num_repeat_tokens 16 --embedder_no_grad True --num_train_epochs 100 --max_eval_samples 500 --eval_steps 20000 --warmup_steps 10000 --bf16=1 --use_wandb=1 --use_frozen_embeddings_as_input True --experiment corrector --lr_scheduler_type constant_with_warmup --exp_group_name oct-gtr --learning_rate 0.001 --output_dir ./saves/gtr-corrector-1 --save_steps 2000 --corrector_model_alias gtr_msmarco__msl128__100epoch
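For reference, the aliases.py entry referenced by --corrector_model_alias might look roughly like this (a sketch; CHECKPOINT_FOLDERS_DICT is the dictionary described in the upload instructions below, and the path is whatever --output_dir you trained the inverter into):

# In vec2text/aliases.py (sketch of the added entry)
CHECKPOINT_FOLDERS_DICT["gtr_msmarco__msl128__100epoch"] = "./saves/gtr-1"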

If using DDP, run the same command with torchrun run.py instead of python run.py. You can upload these models to the Hugging Face Hub using our script by running python scripts/upload_model.py <model_alias> <model_hub_name>.
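For example, a single-node, 4-GPU launch might look like this (a sketch; --nproc_per_node is a standard torchrun flag, and the remaining arguments stay exactly as in the commands above):

torchrun --nproc_per_node=4 run.py [same arguments as the corrector command above]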

Pre-trained models

Currently we only support models for inverting OpenAI text-embedding-ada-002 embeddings but are hoping to add more soon. (We can provide the GTR inverters used in the paper upon request.)

Our models come in one of two forms: a zero-step 'hypothesizer' model that makes an initial guess at the text behind an embedding, and a 'corrector' model that iteratively corrects and re-embeds text to bring it closer to the target embedding. We also support sequence-level beam search, which makes multiple corrective guesses at each step and takes the one closest to the ground-truth embedding.

How to upload a pre-trained model to the HuggingFace model hub

  1. Add your model to CHECKPOINT_FOLDERS_DICT in aliases.py. This tells our codebase (i) what the name (alias) of your model is and (ii) the folder where its weights are stored.
  2. Log into the model hub using huggingface-cli login
  3. From the project root directory, run python scripts/upload_model.py <model_alias> <hf_alias>, where <model_alias> is the key of the model you added to aliases.py and <hf_alias> will be the model's name on HuggingFace (see the example below)
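Putting steps 2 and 3 together (the alias is the example key from the training section above; the Hub name is a hypothetical placeholder):

huggingface-cli login
python scripts/upload_model.py gtr_msmarco__msl128__100epoch your-username/gtr-msmarco-inverter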

pre-commit

pip install isort black flake8 mypy --upgrade

pre-commit run --all

Evaluate the models from the papers

Here's how to load and evaluate the sequence-length 32 GTR inversion model in the paper:

from vec2text import analyze_utils

experiment, trainer = analyze_utils.load_experiment_and_trainer_from_pretrained(
     "jxm/gtr__nq__32__correct"
)
train_datasets = experiment._load_train_dataset_uncached(
    model=trainer.model,
    tokenizer=trainer.tokenizer,
    embedder_tokenizer=trainer.embedder_tokenizer
)

val_datasets = experiment._load_val_datasets_uncached(
    model=trainer.model,
    tokenizer=trainer.tokenizer,
    embedder_tokenizer=trainer.embedder_tokenizer
)
trainer.args.per_device_eval_batch_size = 16
trainer.sequence_beam_width = 1
trainer.num_gen_recursive_steps = 20
trainer.evaluate(
    eval_dataset=train_datasets["validation"]
)

Sample model-training command for Language Model Inversion

This repository was also used to train language model inverters for our paper Language Model Inversion.

This is the dataset of prompts used for training (referred to as "Two Million Instructions" in the manuscript but One Million Instructions on HuggingFace): https://huggingface.co/datasets/wentingzhao/one-million-instructions

Here is a sample command for training a language model inverter:

python vec2text/run.py --per_device_train_batch_size 16 --per_device_eval_batch_size 16 --max_seq_length 128 --num_train_epochs 100 --max_eval_samples 1000 --eval_steps 25000 --warmup_steps 100000 --learning_rate 0.0002 --dataset_name one_million_instructions --model_name_or_path t5-base --use_wandb=0 --embedder_model_name gpt2 --experiment inversion_from_logits_emb --bf16=1 --embedder_torch_dtype float16 --lr_scheduler_type constant_with_warmup --use_frozen_embeddings_as_input 1 --mock_embedder 0

Pre-trained models

The models used for our Language Model Inversion paper are available for download from HuggingFace. Here is the LLAMA-2 base inverter and the LLAMA-2 chat inverter. Those models can also be pre-trained from scratch using this repository (everything you need should be downloaded automatically from HuggingFace).

The training dataset of 2.33M prompts is available here: https://huggingface.co/datasets/wentingzhao/one-million-instructions
Our Private Prompts synthetic evaluation data is available here: https://huggingface.co/datasets/jxm/private_prompts

Example

Here's an example of how to evaluate on the Python-Alpaca dataset:

from vec2text import analyze_utils
experiment, trainer = analyze_utils.load_experiment_and_trainer_from_pretrained(
    "jxm/t5-base__llama-7b__one-million-instructions__emb"
)
trainer.model.use_frozen_embeddings_as_input = False
trainer.args.per_device_eval_batch_size = 16
trainer.evaluate(
    eval_dataset=trainer.eval_dataset["python_code_alpaca"].remove_columns("frozen_embeddings").select(range(200))
)

Citations

If you benefit from the code or the research, please cite our papers!

This repository includes code for two papers:

Text Embeddings Reveal (Almost) As Much As Text (EMNLP 2023)

@misc{morris2023text,
      title={Text Embeddings Reveal (Almost) As Much As Text},
      author={John X. Morris and Volodymyr Kuleshov and Vitaly Shmatikov and Alexander M. Rush},
      year={2023},
      eprint={2310.06816},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Language Model Inversion (ICLR 2024)

@misc{morris2023language,
      title={Language Model Inversion}, 
      author={John X. Morris and Wenting Zhao and Justin T. Chiu and Vitaly Shmatikov and Alexander M. Rush},
      year={2023},
      eprint={2311.13647},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

vec2text's People

Contributors

arvinzhuang, braceal, jxmorris12, lbertge, nmontanabrown, sw241395, zanussbaum


vec2text's Issues

Missing important refs in the paper.

Hello John!
Congratulations on your nice work inverting OpenAI embeddings back to text. I would just like to point out that there is a missing reference in your manuscript. The work below is probably one of the earliest to study the privacy risks of language model embeddings (it is also cited by [Song et al., 2020]).

Pan X, Zhang M, Ji S, et al. Privacy risks of general-purpose language models[C]//2020 IEEE Symposium on Security and Privacy (SP). IEEE, 2020: 1314-1331.

Hope the information can help.

Morino

Support new text-embedding-3 small and large models from OpenAI

Hey there, interesting project!

Since OpenAI recently introduced new and even more powerful embedding models, text-embedding-3-small and text-embedding-3-large, I was wondering whether support for them could also be added.

I'm looking into testing this out with both the original ada-2 and the better embedding-3 models.

Edit: the intended application domain would not be adversarial privacy attacks but rather sentence translation, where getting the exact words right is not so crucial as long as the general idea carries over.

Edit 2: I already did a few quick tests using your provided Colab demo. I put in some Chinese sentences, converted them to embedding vectors using ada-2, and then recovered text in English using the default pretrained models. It performed "okay-ish" with Chinese, but surprisingly, using Swedish or Finnish as the origin language produced weird, non-English results, as if your inversion model also included non-English tokens as training data. Or maybe the sentences I tried in those languages were out-of-distribution relative to the pre-trained inversion and correction models; in any case, it was really just a quick test.

I think ideally, each destination language would need to have a pretrained inversion model using a dataset which only includes training samples in the destination language, and we'd just switch around the decoder model depending on what the output language is supposed to be, right?

You could also go super granular, in that you could pretrain a domain-specific inversion model, or even a customer-specific one, if say you are serving a large client with their very specific internal jargon... no?

Importing Bug in PIP version of library

Thank you for publishing your code! I might've found a bug:

In the pip version of this library the __init__.py file in the models directory differs from the one in the git repository.
In the pip version the line from .inversion_from_logits_emb import InversionFromLogitsEmbModel is missing, leading to an import error.

Repository:

from .corrector_encoder import CorrectorEncoderModel  # noqa: F401
from .corrector_encoder_from_logits import CorrectorEncoderFromLogitsModel  # noqa: F401
from .inversion import InversionModel  # noqa: F401
from .inversion_bow import InversionModelBagOfWords  # noqa: F401
from .inversion_decoder import InversionModelDecoderOnly  # noqa: F401
from .inversion_from_logits import InversionFromLogitsModel  # noqa: F401
from .inversion_from_logits_emb import InversionFromLogitsEmbModel  # noqa: F401
from .inversion_na import InversionModelNonAutoregressive  # noqa: F401
from .model_utils import (  # noqa: F401
    EMBEDDER_MODEL_NAMES,
    EMBEDDING_TRANSFORM_STRATEGIES,
    FREEZE_STRATEGIES,
    load_embedder_and_tokenizer,
    load_encoder_decoder,
)

Pip:

from .corrector_encoder import CorrectorEncoderModel  # noqa: F401
from .corrector_encoder_from_logits import CorrectorEncoderFromLogitsModel  # noqa: F401
from .inversion import InversionModel  # noqa: F401
from .inversion_bow import InversionModelBagOfWords  # noqa: F401
from .inversion_decoder import InversionModelDecoderOnly  # noqa: F401
from .inversion_from_logits import InversionFromLogitsModel  # noqa: F401
from .inversion_na import InversionModelNonAutoregressive  # noqa: F401
from .model_utils import (  # noqa: F401
    EMBEDDER_MODEL_NAMES,
    EMBEDDING_TRANSFORM_STRATEGIES,
    FREEZE_STRATEGIES,
    load_embedder_and_tokenizer,
    load_encoder_decoder,
)

Document gtr-base support

Love this great piece of work. It was literally what I was looking for.

In the README it's mentioned that gtr-base weights are available on request, but they're actually already on HF -- just under a somewhat cryptic name.

elif embedder == "gtr-base":

In addition, it would be great if we could point to inverse models directly (e.g. full HF model name), so it'll become easier for people to contribute models.

ValueError while training corrector

Hi,

I am training a corrector following the steps in README. However, I got this error

File "/home/xxx/vec2text/vec2text/experiments.py", line 759, in load_trainer
   ) = vec2text.aliases.load_experiment_and_trainer_from_alias(
 File "/home/xxx/vec2text/vec2text/aliases.py", line 68, in load_experiment_and_trainer_from_alias
   experiment, trainer = vec2text.analyze_utils.load_experiment_and_trainer(
 File "/home/xxx/vec2text/vec2text/analyze_utils.py", line 111, in load_experiment_and_trainer
   trainer._load_from_checkpoint(checkpoint)
 File "/home/xxx/vec2text/vec2text/trainers/base.py", line 539, in _load_from_checkpoint
   raise ValueError(
ValueError: Can't find a valid checkpoint at /home/xxx/vec2text/saves/inverters/bert/checkpoint-58000

This might be a problem with transformers. Which version of transformers are you using?

Thank you!

Missing Pooling Layer in sentence transformers

This is partially an issue for sentence_transformer or huggingface. Sentence Transformers loaded from huggingface hub only return the encoder and tokenizer. For example, you have to rely on this function to apply a final mean pooling layer to generate the embeddings:

def get_gtr_embeddings(text_list,
                       encoder: PreTrainedModel,
                       tokenizer: PreTrainedTokenizer) -> torch.Tensor:

    inputs = tokenizer(text_list,
                       return_tensors="pt",
                       max_length=128,
                       truncation=True,
                       padding="max_length",).to("cuda")

    with torch.no_grad():
        model_output = encoder(input_ids=inputs['input_ids'], attention_mask=inputs['attention_mask'])
        hidden_state = model_output.last_hidden_state
        embeddings = vec2text.models.model_utils.mean_pool(hidden_state, inputs['attention_mask'])

    return embeddings

But in GTR this is actually not the embedding that comes out of the sentence-transformers library, because a further Dense layer is applied. While the Dense layer is stored in the Hugging Face repository, it's not clear that you can get it from the transformers library:

In [30]: from sentence_transformers import SentenceTransformer
    ...: sentences = ["This is an example sentence", "Each sentence is converted"]
    ...:
    ...: stmodel = SentenceTransformer('sentence-transformers/gtr-t5-base')

In [31]: stmodel
Out[31]:
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: T5EncoderModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Dense({'in_features': 768, 'out_features': 768, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
  (3): Normalize()
)

Just a suggestion: would it be better to use the sentence-transformers class by default, and fall back to transformers only when necessary?
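For reference, one way to check how much the extra Dense (and Normalize) modules matter is to compare the two pipelines directly. This is a sketch for illustration; it assumes the encoder, tokenizer, and get_gtr_embeddings from the README snippet above are already in scope, and both calls are standard sentence-transformers / transformers usage:

import torch
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

# Full sentence-transformers pipeline: Transformer -> mean pooling -> Dense -> Normalize
st_model = SentenceTransformer("sentence-transformers/gtr-t5-base")
st_embeddings = st_model.encode(sentences, convert_to_tensor=True)

# Manual pipeline from the README: encoder + mean pooling only
manual_embeddings = get_gtr_embeddings(sentences, encoder, tokenizer)

# Cosine similarity between the two versions of each sentence embedding
cos = torch.nn.functional.cosine_similarity(
    st_embeddings.to(manual_embeddings.device), manual_embeddings
)
print(cos)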

Using Initial Inversion Model Only

If I just wanted to invert text with the initial inversion model, without the corrector, would it be sufficient to simply pass that model as the corrector itself to the API? Thanks!

Request for GTR Inverters Checkpoint

I am a student at Wuhan University in China, specializing in NLP research. I have been following your work on vec2text and I find it extremely valuable for my research. I kindly request your assistance in obtaining the GTR Inverters Checkpoint for academic and research purposes. I assure you that I will strictly use it for scientific study and will not employ it for any other intentions. Thank you for your consideration. Please feel free to reach me at [email protected].

llama-2-7b inversion

Very cool project, and I've been having a fun time exploring the repo. I'd like to run some additional examples, but I'm having difficulty reproducing any results using your llama models. It is also a bit difficult to follow how to load these models appropriately if I only want to invert text embeddings as you outline for the other models (as opposed to running experiments or evaluating the models).

For example, I have your t5-base__llama-7b__one-million-instructions__emb model downloaded along with the meta-llama/Llama-2-7b-hf model in my local cache (though it is not clear if this is needed). Using analyze_utils.load_experiment_and_trainer_from_pretrained keeps throwing errors on my machine. What would be the easiest way to simply load these model(s) and test embedding and inversion when the models are stored locally (something analogous to the vec2text.invert_embeddings and vec2text.invert_strings pipeline)?

Were the gtr models trained without normalization?

For the gtr set of models (jxm/gtr__nq__32, jxm/gtr__nq__32__correct), were the models trained on sentence-transformers/gtr-t5-base embeddings without normalization? In the readme example, the function used to compute embeddings does not contain a normalization:

def get_gtr_embeddings(text_list,
                       encoder: PreTrainedModel,
                       tokenizer: PreTrainedTokenizer) -> torch.Tensor:

    inputs = tokenizer(text_list,
                       return_tensors="pt",
                       max_length=128,
                       truncation=True,
                       padding="max_length",).to("cuda")

    with torch.no_grad():
        model_output = encoder(input_ids=inputs['input_ids'], attention_mask=inputs['attention_mask'])
        hidden_state = model_output.last_hidden_state
        embeddings = vec2text.models.model_utils.mean_pool(hidden_state, inputs['attention_mask'])

    return embeddings

This is weird to me because I was under the impression from the paper that the optimization was to minimize cosine distance which only cares about direction and is invariant to normalization. However, it seems that when I normalize the embeddings passed to the corrector, the results degrade.

PicklingError while running corrector training

Hi, sorry for the long error output below; dataset_map_multi_worker seems to work fine in other places, so I am not sure what the issue is here.

Thanks for the help in advance!

train() called – resume-from_checkpoint = None
	[None] Saving hypotheses to path saves/.cache/inversion/ad2e50a2989171a5_hypotheses.cache
Precomputing hypotheses for data (num_proc=6):   0%|          | 0/392702 [00:11<?, ? examples/s]
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/xxx/xxx//vec2text/run.py", line 16, in <module>
    main()
  File "/home/xxx/xxx//vec2text/run.py", line 12, in main
    experiment.run()
  File "/home/xxx/xxx//vec2text/experiments.py", line 152, in run
    self.train()
  File "/home/xxx/xxx//vec2text/experiments.py", line 185, in train
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/xxx.local/lib/python3.10/site-packages/transformers/trainer.py", line 1555, in train
    return inner_training_loop(
  File "/home/xxx/xxx//vec2text/trainers/corrector.py", line 227, in _inner_training_loop
    self.precompute_hypotheses()
  File "/home/xxx/xxx//vec2text/trainers/corrector.py", line 212, in precompute_hypotheses
    self.train_dataset, train_cache_path = self._preprocess_dataset_hypotheses(
  File "/home/xxx/xxx//vec2text/trainers/corrector.py", line 166, in _preprocess_dataset_hypotheses
    dataset = dataset_map_multi_worker(
  File "/home/xxx/xxx//vec2text/utils/utils.py", line 119, in dataset_map_multi_worker
    return dataset.map(map_fn, *args, **kwargs)
  File "/home/xxx.local/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 591, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/xxx.local/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 556, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/xxx.local/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3181, in map
    for rank, done, content in iflatmap_unordered(
  File "/home/xxx.local/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 1417, in iflatmap_unordered
    [async_result.get(timeout=0.05) for async_result in async_results]
  File "/home/xxx.local/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 1417, in <listcomp>
    [async_result.get(timeout=0.05) for async_result in async_results]
  File "/home/xxx.local/lib/python3.10/site-packages/multiprocess/pool.py", line 774, in get
    raise self._value
  File "/home/xxx.local/lib/python3.10/site-packages/multiprocess/pool.py", line 540, in _handle_tasks
    put(task)
  File "/home/xxx.local/lib/python3.10/site-packages/multiprocess/connection.py", line 209, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/home/xxx.local/lib/python3.10/site-packages/multiprocess/reduction.py", line 54, in dumps
    cls(buf, protocol, *args, **kwds).dump(obj)
  File "/home/xxx.local/lib/python3.10/site-packages/dill/_dill.py", line 418, in dump
    StockPickler.dump(self, obj)
  File "/usr/lib/python3.10/pickle.py", line 487, in dump
    self.save(obj)
  File "/home/xxx.local/lib/python3.10/site-packages/dill/_dill.py", line 412, in save
    StockPickler.save(self, obj, save_persistent_id)
  File "/usr/lib/python3.10/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/usr/lib/python3.10/pickle.py", line 902, in save_tuple
    save(element)
  File "/home/xxx.local/lib/python3.10/site-packages/dill/_dill.py", line 412, in save
    StockPickler.save(self, obj, save_persistent_id)
  File "/usr/lib/python3.10/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/usr/lib/python3.10/pickle.py", line 887, in save_tuple
    save(element)
  File "/home/xxx.local/lib/python3.10/site-packages/dill/_dill.py", line 412, in save
    StockPickler.save(self, obj, save_persistent_id)
  File "/usr/lib/python3.10/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/home/xxx.local/lib/python3.10/site-packages/dill/_dill.py", line 1212, in save_module_dict
    StockPickler.save_dict(pickler, obj)
  File "/usr/lib/python3.10/pickle.py", line 972, in save_dict
    self._batch_setitems(obj.items())
  File "/usr/lib/python3.10/pickle.py", line 998, in _batch_setitems
    save(v)
  File "/home/xxx.local/lib/python3.10/site-packages/dill/_dill.py", line 412, in save
    StockPickler.save(self, obj, save_persistent_id)
  File "/usr/lib/python3.10/pickle.py", line 603, in save
    self.save_reduce(obj=obj, *rv)
  File "/usr/lib/python3.10/pickle.py", line 692, in save_reduce
    save(args)
  File "/home/xxx.local/lib/python3.10/site-packages/dill/_dill.py", line 412, in save
    StockPickler.save(self, obj, save_persistent_id)
  File "/usr/lib/python3.10/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/usr/lib/python3.10/pickle.py", line 887, in save_tuple
    save(element)
  File "/home/xxx.local/lib/python3.10/site-packages/dill/_dill.py", line 412, in save
    StockPickler.save(self, obj, save_persistent_id)
  File "/usr/lib/python3.10/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/home/xxx.local/lib/python3.10/site-packages/dill/_dill.py", line 1453, in save_instancemethod0
    pickler.save_reduce(MethodType, (obj.__func__, obj.__self__), obj=obj)
  File "/usr/lib/python3.10/pickle.py", line 692, in save_reduce
    save(args)
  File "/home/xxx.local/lib/python3.10/site-packages/dill/_dill.py", line 412, in save
    StockPickler.save(self, obj, save_persistent_id)
  File "/usr/lib/python3.10/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/usr/lib/python3.10/pickle.py", line 887, in save_tuple
    save(element)
  File "/home/xxx.local/lib/python3.10/site-packages/dill/_dill.py", line 412, in save
    StockPickler.save(self, obj, save_persistent_id)
  File "/usr/lib/python3.10/pickle.py", line 603, in save
    self.save_reduce(obj=obj, *rv)
  File "/usr/lib/python3.10/pickle.py", line 717, in save_reduce
    save(state)
  File "/home/xxx.local/lib/python3.10/site-packages/dill/_dill.py", line 412, in save
    StockPickler.save(self, obj, save_persistent_id)
  File "/usr/lib/python3.10/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/home/xxx.local/lib/python3.10/site-packages/dill/_dill.py", line 1212, in save_module_dict
    StockPickler.save_dict(pickler, obj)
  File "/usr/lib/python3.10/pickle.py", line 972, in save_dict
    self._batch_setitems(obj.items())
  File "/usr/lib/python3.10/pickle.py", line 998, in _batch_setitems
    save(v)
  File "/home/xxx.local/lib/python3.10/site-packages/dill/_dill.py", line 412, in save
    StockPickler.save(self, obj, save_persistent_id)
  File "/usr/lib/python3.10/pickle.py", line 603, in save
    self.save_reduce(obj=obj, *rv)
  File "/usr/lib/python3.10/pickle.py", line 717, in save_reduce
    save(state)
  File "/home/xxx.local/lib/python3.10/site-packages/dill/_dill.py", line 412, in save
    StockPickler.save(self, obj, save_persistent_id)
  File "/usr/lib/python3.10/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/home/xxx.local/lib/python3.10/site-packages/dill/_dill.py", line 1212, in save_module_dict
    StockPickler.save_dict(pickler, obj)
  File "/usr/lib/python3.10/pickle.py", line 972, in save_dict
    self._batch_setitems(obj.items())
  File "/usr/lib/python3.10/pickle.py", line 998, in _batch_setitems
    save(v)
  File "/home/xxx.local/lib/python3.10/site-packages/dill/_dill.py", line 412, in save
    StockPickler.save(self, obj, save_persistent_id)
  File "/usr/lib/python3.10/pickle.py", line 603, in save
    self.save_reduce(obj=obj, *rv)
  File "/usr/lib/python3.10/pickle.py", line 717, in save_reduce
    save(state)
  File "/home/xxx.local/lib/python3.10/site-packages/dill/_dill.py", line 412, in save
    StockPickler.save(self, obj, save_persistent_id)
  File "/usr/lib/python3.10/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/home/xxx.local/lib/python3.10/site-packages/dill/_dill.py", line 1212, in save_module_dict
    StockPickler.save_dict(pickler, obj)
  File "/usr/lib/python3.10/pickle.py", line 972, in save_dict
    self._batch_setitems(obj.items())
  File "/usr/lib/python3.10/pickle.py", line 998, in _batch_setitems
    save(v)
  File "/home/xxx.local/lib/python3.10/site-packages/dill/_dill.py", line 412, in save
    StockPickler.save(self, obj, save_persistent_id)
  File "/usr/lib/python3.10/pickle.py", line 603, in save
    self.save_reduce(obj=obj, *rv)
  File "/usr/lib/python3.10/pickle.py", line 713, in save_reduce
    self._batch_setitems(dictitems)
  File "/usr/lib/python3.10/pickle.py", line 998, in _batch_setitems
    save(v)
  File "/home/xxx.local/lib/python3.10/site-packages/dill/_dill.py", line 412, in save
    StockPickler.save(self, obj, save_persistent_id)
  File "/usr/lib/python3.10/pickle.py", line 603, in save
    self.save_reduce(obj=obj, *rv)
  File "/usr/lib/python3.10/pickle.py", line 687, in save_reduce
    save(cls)
  File "/home/xxx.local/lib/python3.10/site-packages/dill/_dill.py", line 412, in save
    StockPickler.save(self, obj, save_persistent_id)
  File "/usr/lib/python3.10/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/home/xxx.local/lib/python3.10/site-packages/dill/_dill.py", line 1812, in save_type
    _save_with_postproc(pickler, (_create_type, (
  File "/home/xxx.local/lib/python3.10/site-packages/dill/_dill.py", line 1093, in _save_with_postproc
    pickler.save_reduce(*reduction, obj=obj)
  File "/usr/lib/python3.10/pickle.py", line 692, in save_reduce
    save(args)
  File "/home/xxx.local/lib/python3.10/site-packages/dill/_dill.py", line 412, in save
    StockPickler.save(self, obj, save_persistent_id)
  File "/usr/lib/python3.10/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/usr/lib/python3.10/pickle.py", line 902, in save_tuple
    save(element)
  File "/home/xxx.local/lib/python3.10/site-packages/dill/_dill.py", line 412, in save
    StockPickler.save(self, obj, save_persistent_id)
  File "/usr/lib/python3.10/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/home/xxx.local/lib/python3.10/site-packages/dill/_dill.py", line 1212, in save_module_dict
    StockPickler.save_dict(pickler, obj)
  File "/usr/lib/python3.10/pickle.py", line 972, in save_dict
    self._batch_setitems(obj.items())
  File "/usr/lib/python3.10/pickle.py", line 998, in _batch_setitems
    save(v)
  File "/home/xxx.local/lib/python3.10/site-packages/dill/_dill.py", line 412, in save
    StockPickler.save(self, obj, save_persistent_id)
  File "/usr/lib/python3.10/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/home/xxx.local/lib/python3.10/site-packages/dill/_dill.py", line 1965, in save_function
    _save_with_postproc(pickler, (_create_function, (
  File "/home/xxx.local/lib/python3.10/site-packages/dill/_dill.py", line 1093, in _save_with_postproc
    pickler.save_reduce(*reduction, obj=obj)
  File "/usr/lib/python3.10/pickle.py", line 692, in save_reduce
    save(args)
  File "/home/xxx.local/lib/python3.10/site-packages/dill/_dill.py", line 412, in save
    StockPickler.save(self, obj, save_persistent_id)
  File "/usr/lib/python3.10/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/usr/lib/python3.10/pickle.py", line 902, in save_tuple
    save(element)
  File "/home/xxx.local/lib/python3.10/site-packages/dill/_dill.py", line 412, in save
    StockPickler.save(self, obj, save_persistent_id)
  File "/usr/lib/python3.10/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/usr/lib/python3.10/pickle.py", line 902, in save_tuple
    save(element)
  File "/home/xxx.local/lib/python3.10/site-packages/dill/_dill.py", line 412, in save
    StockPickler.save(self, obj, save_persistent_id)
  File "/usr/lib/python3.10/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/home/xxx.local/lib/python3.10/site-packages/dill/_dill.py", line 1812, in save_type
    _save_with_postproc(pickler, (_create_type, (
  File "/home/xxx.local/lib/python3.10/site-packages/dill/_dill.py", line 1093, in _save_with_postproc
    pickler.save_reduce(*reduction, obj=obj)
  File "/usr/lib/python3.10/pickle.py", line 692, in save_reduce
    save(args)
  File "/home/xxx.local/lib/python3.10/site-packages/dill/_dill.py", line 412, in save
    StockPickler.save(self, obj, save_persistent_id)
  File "/usr/lib/python3.10/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/usr/lib/python3.10/pickle.py", line 902, in save_tuple
    save(element)
  File "/home/xxx.local/lib/python3.10/site-packages/dill/_dill.py", line 412, in save
    StockPickler.save(self, obj, save_persistent_id)
  File "/usr/lib/python3.10/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/home/xxx.local/lib/python3.10/site-packages/dill/_dill.py", line 1212, in save_module_dict
    StockPickler.save_dict(pickler, obj)
  File "/usr/lib/python3.10/pickle.py", line 972, in save_dict
    self._batch_setitems(obj.items())
  File "/usr/lib/python3.10/pickle.py", line 998, in _batch_setitems
    save(v)
  File "/home/xxx.local/lib/python3.10/site-packages/dill/_dill.py", line 412, in save
    StockPickler.save(self, obj, save_persistent_id)
  File "/usr/lib/python3.10/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/home/xxx.local/lib/python3.10/site-packages/dill/_dill.py", line 1965, in save_function
    _save_with_postproc(pickler, (_create_function, (
  File "/home/xxx.local/lib/python3.10/site-packages/dill/_dill.py", line 1107, in _save_with_postproc
    pickler._batch_setitems(iter(source.items()))
  File "/usr/lib/python3.10/pickle.py", line 998, in _batch_setitems
    save(v)
  File "/home/xxx.local/lib/python3.10/site-packages/dill/_dill.py", line 412, in save
    StockPickler.save(self, obj, save_persistent_id)
  File "/usr/lib/python3.10/pickle.py", line 589, in save
    self.save_global(obj, rv)
  File "/usr/lib/python3.10/pickle.py", line 1071, in save_global
    raise PicklingError(
_pickle.PicklingError: Can't pickle <built-in function clear_profiler_hooks>: it's not found as torch._C._dynamo.eval_frame.clear_profiler_hooks

DDP is not working

Hi,
I would love to make DDP work for training, but I immediately got this warning when running the program:

12/11/2023 16:32:07 - WARNING - vec2text.experiments - Process rank: 0, device: cuda:0, n_gpu: 1, fp16 training: False, bf16 training: False

The following is my bash script for running DDP in the cloud with a Singularity container. Is there something I should have set up but didn't? Any help would be appreciated.

#!/bin/bash

LANG=$1
MODEL=$2
EMBEDDER=$3
DATASET=$4
EXP_GROUP_NAME=$5
EPOCH=$6
BATCH_SIZE=$7
MAX_LENGTH=$8

export NCCL_P2P_LEVEL=NVL
echo "language $LANG"
echo "model $MODEL"
echo "embedder $EMBEDDER"
echo "dataset $DATASET"
echo "exp_group_name $EXP_GROUP_NAME"
echo "epochs $EPOCH"
echo "batch size $BATCH_SIZE"
echo "max length $MAX_LENGTH"

echo "nvidia"
nvidia-smi

torchrun -m vec2text.run --overwrite_output_dir --per_device_train_batch_size ${BATCH_SIZE} --per_device_eval_batch_size ${BATCH_SIZE} --max_seq_length ${MAX_LENGTH} --model_name_or_path ${MODEL} --dataset_name ${DATASET} --embedder_model_name ${EMBEDDER} --num_repeat_tokens 16 --embedder_no_grad True --num_train_epochs ${EPOCH} --max_eval_samples 500 --eval_steps 20000 --warmup_steps 10000 --use_frozen_embeddings_as_input True --experiment inversion --lr_scheduler_type constant_with_warmup --exp_group_name ${EXP_GROUP_NAME} --learning_rate 0.001 --output_dir ./saves/inverters/${DATASET}_${LANG} --save_steps 2000 --use_wandb 1 --ddp_find_unused_parameters True

--gres=gpu:a40:2 is set using singularity container.

LLAMA-2 inversion models

Hi Jack,

Thanks for the great work. Do you plan to release the inversion models that were trained to invert LLAMA-2 7b embeddings?

It would be very helpful for me, if you could.

Sample to Invert embeddings not working (anymore)

I tried to test the "Invert embeddings with invert_embeddings" example, but it doesn't work in Colab; it seems OpenAI has changed the CreateEmbeddingResponse a bit?

I then tried a sample of my own with the newer OpenAI code; my goal is to calculate the difference between two embeddings and then try to get some text from this "Delta-Embedding". The code looks like this, but an assertion error is thrown in inversion.py:
+++++
CODE:

import torch

from openai import OpenAI
client = OpenAI()

def get_embedding(text, model="text-embedding-ada-002") -> torch.Tensor:
    text = text.replace("\n", " ")
    return torch.tensor(client.embeddings.create(input=[text], model=model).data[0].embedding)

test1 = "The king is dead"
test2 = "The queen is dead"

outputEmbedding1 = get_embedding(test1, model="text-embedding-ada-002")
outputEmbedding2 = get_embedding(test2, model="text-embedding-ada-002")

difference = torch.sub(outputEmbedding1, outputEmbedding2)

print(test1, test2)
print(outputEmbedding1)
print(outputEmbedding2)
print(difference)

corrector = vec2text.load_pretrained_corrector("text-embedding-ada-002")

vec2text.invert_embeddings(
    embeddings=difference,
    corrector=corrector)

++++
Output (done in Colab):

The king is dead The queen is dead
tensor([-0.0026, -0.0026, -0.0074, ..., -0.0007, 0.0128, -0.0038])
tensor([-0.0150, 0.0011, -0.0189, ..., 0.0103, 0.0027, -0.0165])
tensor([ 0.0124, -0.0038, 0.0114, ..., -0.0110, 0.0101, 0.0127])

/usr/local/lib/python3.10/dist-packages/transformers/models/t5/tokenization_t5_fast.py:160: FutureWarning: This tokenizer was incorrectly instantiated with a model max length of 512 which will be corrected in Transformers v5.
For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with truncation is True.

  • Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
  • If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with model_max_length or pass max_length when encoding/padding.
  • To avoid this warning, please instantiate this tokenizer with model_max_length set to your preferred value.
    warnings.warn(

AssertionError Traceback (most recent call last)

in <cell line: 25>()
23 corrector = vec2text.load_pretrained_corrector("text-embedding-ada-002")
24
---> 25 vec2text.invert_embeddings(
26 embeddings=difference,
27 corrector=corrector)

3 frames

/usr/local/lib/python3.10/dist-packages/vec2text/models/inversion.py in embed_and_project(self, embedder_input_ids, embedder_attention_mask, frozen_embeddings)
225 if frozen_embeddings is not None:
226 embeddings = frozen_embeddings
--> 227 assert len(embeddings.shape) == 2 # batch by d
228 elif self.embedder_no_grad:
229 with torch.no_grad():

AssertionError:

Reproducing results from paper

Hi Jack,

Thanks for the great work and sharing the code! I am trying to reproduce results from the paper and want to confirm if I am doing it correctly.

Specifically I ran the below code

from vec2text import analyze_utils

experiment, trainer = analyze_utils.load_experiment_and_trainer_from_pretrained(
     "jxm/gtr__nq__32__correct",
    use_less_data=-1 # use all data

)

train_datasets = experiment._load_train_dataset_uncached(
    model=trainer.model,
    tokenizer=trainer.tokenizer,
    embedder_tokenizer=trainer.embedder_tokenizer
)


trainer.args.per_device_eval_batch_size = 16
trainer.sequence_beam_width = 1
trainer.num_gen_recursive_steps = 20
trainer.evaluate(
    eval_dataset=train_datasets["validation"]
)

And got the below results
{'eval_loss': 0.6015774011611938, 'eval_pred_num_tokens': 31.0, 'eval_true_num_tokens': 32.0, 'eval_token_set_precision': 0.9518449167645596, 'eval_token_set_recall': 0.9564611513833035, 'eval_token_set_f1': 0.9538292487776809, 'eval_token_set_f1_sem': 0.004178347129611342, 'eval_n_ngrams_match_1': 23.128, 'eval_n_ngrams_match_2': 20.244, 'eval_n_ngrams_match_3': 18.212, 'eval_num_true_words': 24.308, 'eval_num_pred_words': 24.286, 'eval_bleu_score': 83.32868888524891, 'eval_bleu_score_sem': 1.1145241315071208, 'eval_rouge_score': 0.9550079258714326, 'eval_exact_match': 0.578, 'eval_exact_match_sem': 0.022109039310618563, 'eval_emb_cos_sim': 0.9910151958465576, 'eval_emb_cos_sim_sem': 0.0038230661302804947, 'eval_emb_top1_equal': 0.75, 'eval_emb_top1_equal_sem': 0.11180339753627777, 'eval_runtime': 253.6454, 'eval_samples_per_second': 1.971, 'eval_steps_per_second': 0.126}

Are the numbers here supposed to correspond to "GTR - NQ - Vec2Text [20 steps]" in Table 1 (row 7)? I think most of the numbers are close, except for exact match, for which I got a higher number (57.8 vs. 40.2 in the paper).

Thanks again!

Sentence Transformer Embeddings

I have a question regarding the embedding with distiluse-base-multilingual-cased-v1

I used the following code but I got an error:

model = SentenceTransformer("distiluse-base-multilingual-cased-v1")

query_embedding = model.encode(["How big is London?"], convert_to_tensor=True)

corrector = vec2text.load_corrector("gtr-base")
output = vec2text.invert_embeddings(
    embeddings=query_embedding,
    corrector=corrector,
    num_steps=20,
)
print(output)

error:

RuntimeError: mat1 and mat2 shapes cannot be multiplied (1x512 and 768x768)

How can I invert the embedding of this model ?

Examining hypothesis embeddings

Hello,

I would like to look at the hypothesis embeddings, for example, to see how the cosine similarity changes per iteration. It looks to me like invert_embeddings() only returns the final string. Is there an easy way to do this/would you accept a PR that returns the intermediary embeddings? Thank you!

Question about embedding_transform

Hi

Thank you for presenting your research. I have a question regarding the embedding_transform in inversion.py. As per my understanding, this function corresponds to the MLP model described in your publication, responsible for transforming logits into pseudo-embeddings, or what is referred to as the 'zero-step' model. Could you elaborate on how this MLP model was trained to ensure it generates meaningful predictions? The paper and readme seem to lack detailed information on this aspect, and any additional insights you could provide would be greatly appreciated. Thank you.

Why is the training not done in an iterative manner like the inference?

Hi jxmorris,

Great work. I have a question about your training code.

I've observed that during the training of the corrector model, the input and output consistently follow the pattern (e, e^{0}, x^{0}) and x, where x is a text, e is its ground-truth embedding, and e^{0} and x^{0} are the initial hypothesis embedding and text. Is this correct? However, in the inference stage, you choose to iteratively recover the e into x.

I am just wondering why you do not also train in an iterative manner like training diffusion models. Due to efficiency? In other words, for each training sample (e, x), now you only train with (e, e^{0}, x^{0}, x), but actually we can train with more iterative samples like (e, e^{1}, x^{1}, x), (e, e^{2}, x^{2}, x), etc. I think this will improve the model performance.

Moreover, do you have some intuitions why just training with (e, e^{0}, x^{0}, x) can make it work for iterative inference? I think maybe it is because learning from (e, e^{0}, x^{0}) is harder than learning from (e, e^{n}, x^{n}), so the model can still achieve good performance by just learning from the hardest one?

Thanks!

Question on frozen embeddings

Hi @jxmorris12, Hopefully a quick question: When training an InversionModel have you found it helpful to use the Embedder neural network directly, or should one always prefer frozen embeddings if possible? It seems like the frozen embeddings would be much faster to use in the overall training process, but is there anything extra gained from using the model directly?

The relevant section of code: https://github.com/jxmorris12/vec2text/blob/master/vec2text/models/inversion.py#L226

if frozen_embeddings is not None:
    embeddings = frozen_embeddings
    assert len(embeddings.shape) == 2  # batch by d
elif self.embedder_no_grad:
    with torch.no_grad():
        embeddings = self.call_embedding_model(
             input_ids=embedder_input_ids,
             attention_mask=embedder_attention_mask,
        )
else:
    embeddings = self.call_embedding_model(
        input_ids=embedder_input_ids,
        attention_mask=embedder_attention_mask,
    )

Any thoughts you may have on this would be really helpful. Thanks!

What's happening in the example?

I didn't find the example very clear, but I have a guess at what's happening. It might be worth spelling out that you are trying to map text to an embedding and then right back to the original text, or perhaps to a smattering of points in the pre-image of the embedding; I'm not really sure, because it's not clear from what's written.

My two cents.

Query about the training parameters

Hi jxmorris,

I have a query regarding the training parameters, specifically the batch size, epochs, and learning rate.

Upon reviewing your provided scripts, I observed that you employed per_device_train_batch_size=128 and num_train_epochs=100 for training the inversion model. Conversely, for the corrector model, you utilized per_device_train_batch_size=32 and num_train_epochs=100. In your published paper, you stated a batch size of 128 and mentioned using 4 A6000 GPUs for training over a maximum duration of 2 days.

In my experimental setup, I am attempting to train the inversion model using 4 A100 40G GPUs. However, I have noticed that it takes approximately a week to train the inversion model alone. Consequently, I am inquiring whether per_device_train_batch_size should be 128 or 32 for inversion model training.

Furthermore, there seems to be a disparity in the learning rates between your scripts, where you used 1e-3, and the information presented in your paper, indicating a learning rate of 2e-4. Could you please advise on the recommended learning rate for optimal results in my experiments?

Thanks a lot!

Languages supported?

Hello! Thanks for your great paper and sharing the codes. I have a question about the languages it supports. Does it apply to other languages, such as Chinese? Thanks!

Vector to Text inversion using MPS/CPU

Hi, thanks for the great work. I have realized that inverting vectors to text is only possible using CUDA. Is there any way I can do this using MPS/CPU? I am using a MacBook (Intel). Thanks.

The error I get is:

RuntimeError: Placeholder storage has not been allocated on MPS device!

Parallelize trainer base evaluation in DDP setting

When training models, the bulk of evaluation is done on the main worker. When we train with 8 GPUs, we should get around an 8x speedup on eval, which would make a difference with large evaluation sets.

The main culprit is this method: https://github.com/jxmorris12/vec2text/blob/master/vec2text/trainers/base.py#L363C5-L365C27 and the subsequent call to _get_decoded_sequences in the Base trainer class. We explicitly enumerate over an eval dataloader of the first n samples which (I think) will happen once in each worker. Instead, we should split the work among multiple GPUs.

openai api key cannot pass to vec2text package

Hello, I just tried your vec2text demo (ipynb file) and changed the OpenAI API key to my own key; I tested it with a simple chat and it works. However, when I run "import vec2text", it fails and says "OpenAIError: The api_key client option must be set either by passing api_key to the client or by setting the OPENAI_API_KEY environment variable"

Sentence Transformer training support

I ran the following training script

python run.py --per_device_train_batch_size 8 --per_device_eval_batch_size 8 --max_seq_length 128
--model_name_or_path t5-small --dataset_name msmarco --embedder_model_name sentence-transformers/all-MiniLM-L6-v2 --num_repeat_tokens 16 --embedder_no_grad True --num_train_epochs 1 --max_eval_samples 500 --eval_steps 20000 --warmup_steps 10000 --use_frozen_embeddings_as_input True --experiment inversion --lr_scheduler_type constant_with_warmup --learning_rate 0.001 --output_dir ./saves/gtr-1

And I got the following traceback:

File "/Users/sciencecw/Repos/references/vec2text/.venv/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3470, in _map_single
batch = apply_function_on_filtered_inputs(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/sciencecw/Repos/references/vec2text/.venv/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3349, in apply_function_on_filtered_inputs
processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/sciencecw/Repos/references/vec2text/vec2text/tokenize_data.py", line 130, in embed_dataset_batch
batch["frozen_embeddings"] = model.call_embedding_model(**emb_input_ids)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: InversionModel.call_embedding_model() got an unexpected keyword argument 'token_type_ids'
[2024-01-23 14:19:47,988] torch._dynamo.utils: [INFO] TorchDynamo compilation metrics:
[2024-01-23 14:19:47,988] torch._dynamo.utils: [INFO] Function Runtimes (s)
[2024-01-23 14:19:47,988] torch._dynamo.utils: [INFO] ---------- --------------
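
A possible workaround (a sketch of a local patch I have not confirmed upstream) is to drop the keys the embedding model does not accept before the call in vec2text/tokenize_data.py; the helper name below is made up:

def strip_unused_tokenizer_keys(emb_input_ids: dict) -> dict:
    # BERT-style tokenizers such as all-MiniLM-L6-v2 emit token_type_ids,
    # which InversionModel.call_embedding_model() does not accept.
    emb_input_ids.pop("token_type_ids", None)
    return emb_input_ids

# In embed_dataset_batch (vec2text/tokenize_data.py), roughly:
#   batch["frozen_embeddings"] = model.call_embedding_model(
#       **strip_unused_tokenizer_keys(emb_input_ids)
#   )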

Access to GTR model

In the README and the code, I see that the GTR inverter is available by request. I didn't find another way to contact the authors, so could you please let me know how I can get this second inverter? Thank you.

PicklingError when running corrector training example.

Hi @jxmorris12, I got the error below when I ran the example in the README:

Traceback (most recent call last):
  File "vec2text/run.py", line 16, in <module>
    main()
  File "vec2text/run.py", line 12, in main
    experiment.run()
  File "/xxx/vec2text/experiments.py", line 143, in run
    self.train()
  File "xxx/vec2text/experiments.py", line 164, in train
    torch.save(
  File "xxx/conda_envs/vec2text/lib/python3.8/site-packages/torch/serialization.py", line 619, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol, _disable_byteorder_record)
  File "xxx/conda_envs/vec2text/lib/python3.8/site-packages/torch/serialization.py", line 831, in _save
    pickler.dump(obj)
_pickle.PicklingError: Can't pickle <class 'run_args.DataArguments'>: it's not the same object as run_args.DataArguments

I found that the error can be fixed by commenting out these lines in analyze_utils.py:

import sys
import vec2text.run_args as run_args
sys.modules["run_args"] = run_args
print("run_args:", run_args)

Any ideas?

DDP error when running training script on Mac

I tried to run the trainer on an M1 Mac, and it gives me errors related to distributed computing, which I am not familiar with. Any help would be appreciated!

python run.py --per_device_train_batch_size 128 --per_device_eval_batch_size 128 --max_seq_length 128 --model_name_or_path t5-base --dataset_name msmarco --embedder_model_name gtr_base --num_repeat_tokens 16 --embedder_no_grad True --num_train_epochs 100 --max_eval_samples 500 --eval_steps 20000 --warmup_steps 10000 --use_frozen_embeddings_as_input True --experiment inversion --lr_scheduler_type constant_with_warmup  --learning_rate 0.001 --output_dir ./saves/gtr-1
Set num workers to 0
Experiment output_dir = ./saves/gtr-1
01/22/2024 23:31:59 - WARNING - experiments - Process rank: 0, device: mps, n_gpu: 1, fp16 training: False, bf16 training: False
/Users/sciencecw/Repos/references/vec2text/.venv/lib/python3.11/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
/Users/sciencecw/Repos/references/vec2text/.venv/lib/python3.11/site-packages/transformers/models/t5/tokenization_t5_fast.py:160: FutureWarning: This tokenizer was incorrectly instantiated with a model max length of 512 which will be corrected in Transformers v5.
For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
- To avoid this warning, please instantiate this tokenizer with `model_max_length` set to your preferred value.
  warnings.warn(
Loading datasets with TOKENIZERS_PARALLELISM = False
>> using fast tokenizers: True True
Traceback (most recent call last):
  File "/Users/sciencecw/Repos/references/vec2text/vec2text/utils/utils.py", line 111, in dataset_map_multi_worker
    rank = torch.distributed.get_rank()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sciencecw/Repos/references/vec2text/.venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1469, in get_rank
    default_pg = _get_default_group()
                 ^^^^^^^^^^^^^^^^^^^^
  File "/Users/sciencecw/Repos/references/vec2text/.venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 940, in _get_default_group
    raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/sciencecw/Repos/references/vec2text/vec2text/run.py", line 16, in <module>
    main()
  File "/Users/sciencecw/Repos/references/vec2text/vec2text/run.py", line 12, in main
    experiment.run()
  File "/Users/sciencecw/Repos/references/vec2text/vec2text/experiments.py", line 158, in run
    self.train()
  File "/Users/sciencecw/Repos/references/vec2text/vec2text/experiments.py", line 175, in train
    trainer = self.load_trainer()
              ^^^^^^^^^^^^^^^^^^^
  File "/Users/sciencecw/Repos/references/vec2text/vec2text/experiments.py", line 631, in load_trainer
    train_dataset, eval_dataset = self.load_train_and_val_datasets(
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sciencecw/Repos/references/vec2text/vec2text/experiments.py", line 570, in load_train_and_val_datasets
    train_datasets = self._load_train_dataset_uncached(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sciencecw/Repos/references/vec2text/vec2text/experiments.py", line 396, in _load_train_dataset_uncached
    raw_datasets[key] = dataset_map_multi_worker(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sciencecw/Repos/references/vec2text/vec2text/utils/utils.py", line 118, in dataset_map_multi_worker
    kwargs["num_proc"] = kwargs.get("num_proc", len(os.sched_getaffinity(0)))
                                                    ^^^^^^^^^^^^^^^^^^^^
AttributeError: module 'os' has no attribute 'sched_getaffinity'
[2024-01-22 23:32:03,873] torch._dynamo.utils: [INFO] TorchDynamo compilation metrics:
[2024-01-22 23:32:03,873] torch._dynamo.utils: [INFO] Function    Runtimes (s)
[2024-01-22 23:32:03,873] torch._dynamo.utils: [INFO] ----------  --------------
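
For what it's worth, the last failure looks platform-specific: os.sched_getaffinity only exists on Linux. A sketch of a portable fallback for dataset_map_multi_worker (the helper name is made up):

import os

def available_cpu_count() -> int:
    # os.sched_getaffinity respects CPU affinity but only exists on Linux;
    # fall back to os.cpu_count() on macOS and Windows.
    try:
        return len(os.sched_getaffinity(0))
    except AttributeError:
        return os.cpu_count() or 1

# In vec2text/utils/utils.py::dataset_map_multi_worker, roughly:
#   kwargs["num_proc"] = kwargs.get("num_proc", available_cpu_count())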

Request for quick-start guidance

Hi, first, thank you for bringing such an interesting and insightful paper to the community. I want to play with the code, but I found it really difficult to get started because there are many scripts and little documentation. Could you please add a subsection to your README that guides us through starting training with local LLM models? I am sure doing so would make the repo even more popular. Many thanks in advance!

Requirements file has no versions pinned

I have been trying to use your example code. vec2text loads with the new openai library (>0.28) but not the old one; however, the get_embeddings_openai method requires the old openai version (<=0.28), and with that version the vec2text imports and model loading no longer work.
This is the case everywhere I try it: locally, on AWS EC2, and even in the Colab demo linked from your README.

I was going to check your requirements.txt file for clues but there are no versions pinned to any of the libraries...

Can you please share the versions of all the required libraries that make them work together?
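
Until versions are pinned, the only partial workaround I have found is to pin the old OpenAI client explicitly before importing vec2text (a guess based on the constraint above, not a verified combination):

pip install "openai<=0.28"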

Support other embeddings

You say: "Currently we only support models for inverting OpenAI text-embedding-ada-002 embeddings but are hoping to add more soon. (We can provide the GTR inverters used in the paper upon request.)".

I want to use a different embedding model (in my case, BAAI/bge-m3, with 1024 dimensions). Does the training procedure work the same way?

Or would I need to adjust it / use different base models?

And could you share your code for training step 0?
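
For context, this is the kind of command I would try, simply swapping the embedder name into the inversion example above (a sketch; whether the embedder-loading code accepts BAAI/bge-m3 out of the box is exactly my question):

python run.py --per_device_train_batch_size 128 --per_device_eval_batch_size 128 --max_seq_length 128 --model_name_or_path t5-base --dataset_name msmarco --embedder_model_name BAAI/bge-m3 --num_repeat_tokens 16 --embedder_no_grad True --num_train_epochs 100 --max_eval_samples 500 --eval_steps 20000 --warmup_steps 10000 --use_frozen_embeddings_as_input True --experiment inversion --lr_scheduler_type constant_with_warmup --learning_rate 0.001 --output_dir ./saves/bge-m3-inversion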

Question about storage path

Hi @jxmorris12, thanks for the great work and for sharing the code.

  1. I wonder where the results of your code (data, tokenized results, embedding values, ...) are stored.
  2. And if I want to change the storage path, which line in the code should I modify?

While reproducing results from the paper, I got a "No space left on device" error at experiments.py line 403 (in _load_train_dataset_uncached):

for key in raw_datasets:
    raw_datasets[key] = dataset_map_multi_worker(
        dataset=raw_datasets[key],
        map_fn=tokenize_fn(
            tokenizer,
            embedder_tokenizer,
            ...
    )

So I tried to change 'DATASET_CACHE_PATH' in utils.py and experiments.py as below.

DATASET_CACHE_PATH = os.environ.get(
    # original: "VEC2TEXT_CACHE", os.path.expanduser("~/.cache/inversion") 
    "VEC2TEXT_CACHE", os.path.expanduser("target_path")
)

However, for some reason, the tokenized results are not stored in 'target_path'; they still end up in '.cache/inversion'.

Is it correct that all the data and embedding values... are stored in the .cache/inversion folder?
If so, are there any additional modifications that need to be made in order to store them in the path I specified?
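
Since the path is read from the VEC2TEXT_CACHE environment variable, one alternative to editing the source is setting that variable before any vec2text modules are imported (a sketch; I am assuming the constant is only evaluated at import time):

import os
os.environ["VEC2TEXT_CACHE"] = "/path/to/target_path"  # hypothetical target path

import vec2text  # DATASET_CACHE_PATH now resolves to the path above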

Thanks again!
