
retro-pytorch's Introduction

RETRO - Pytorch

Implementation of RETRO, DeepMind's retrieval-based attention net, in PyTorch. This implementation deviates from the paper slightly, using rotary embeddings for relative positional encoding, as well as the Faiss library instead of ScaNN.

This library leverages autofaiss for building the index and calculating the k-nearest neighbors for all chunks.

Jay Alammar explanatory blogpost

The selling point of this retrieval approach is reaching GPT-3 performance with 10x fewer parameters. This area certainly deserves more research.

I have also included the features necessary to scale the retrieval transformer to 1000 layers, if the claims of the DeepNet paper are to be believed.

Update: Someone on Reddit has gifted me a Gold Award. Not sure what it is, but thank you! 🙏

Update: DeepNorm has been validated at scale in a 130B model out of Tsinghua. It is now recommended that you train with use_deepnet set to True.

Install

$ pip install retro-pytorch

Usage

import torch
from retro_pytorch import RETRO

retro = RETRO(
    chunk_size = 64,                         # the chunk size that is indexed and retrieved (needed for proper relative positions as well as causal chunked cross attention)
    max_seq_len = 2048,                      # max sequence length
    enc_dim = 896,                           # encoder model dim
    enc_depth = 2,                           # encoder depth
    dec_dim = 796,                           # decoder model dim
    dec_depth = 12,                          # decoder depth
    dec_cross_attn_layers = (3, 6, 9, 12),   # decoder cross attention layers (with causal chunk cross attention)
    heads = 8,                               # attention heads
    dim_head = 64,                           # dimension per head
    dec_attn_dropout = 0.25,                 # decoder attention dropout
    dec_ff_dropout = 0.25,                   # decoder feedforward dropout
    use_deepnet = True                       # turn on post-normalization with DeepNet residual scaling and initialization, for scaling to 1000 layers
)

seq = torch.randint(0, 20000, (2, 2048 + 1))      # plus one since it is split into input and labels for training
retrieved = torch.randint(0, 20000, (2, 32, 2, 128)) # retrieved tokens - (batch, num chunks, num retrieved neighbors, retrieved chunk with continuation)

loss = retro(seq, retrieved, return_loss = True)
loss.backward()

# do above for many steps
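
For reference, here is a minimal sketch of the above wrapped in a training loop, using a plain Adam optimizer on random data; the optimizer choice and learning rate are placeholders, not the settings from the paper.

import torch
from retro_pytorch import RETRO

retro = RETRO(
    chunk_size = 64,
    max_seq_len = 2048,
    enc_dim = 896,
    enc_depth = 2,
    dec_dim = 796,
    dec_depth = 12,
    dec_cross_attn_layers = (3, 6, 9, 12),
    heads = 8,
    dim_head = 64
)

# placeholder optimizer and learning rate, not the paper's settings
optim = torch.optim.Adam(retro.parameters(), lr = 3e-4)

for _ in range(100):
    # replace the random tensors below with real sequences and retrieved neighbors
    seq = torch.randint(0, 20000, (2, 2048 + 1))
    retrieved = torch.randint(0, 20000, (2, 32, 2, 128))

    loss = retro(seq, retrieved, return_loss = True)
    loss.backward()

    optim.step()
    optim.zero_grad()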

RETRO Training Wrapper

The aim of the TrainingWrapper is to process a folder of text documents into the necessary memmapped numpy arrays to begin training RETRO.

import torch
from retro_pytorch import RETRO, TrainingWrapper

# instantiate RETRO, fit it into the TrainingWrapper with correct settings

retro = RETRO(
    max_seq_len = 2048,                      # max sequence length
    enc_dim = 896,                           # encoder model dimension
    enc_depth = 3,                           # encoder depth
    dec_dim = 768,                           # decoder model dimensions
    dec_depth = 12,                          # decoder depth
    dec_cross_attn_layers = (1, 3, 6, 9),    # decoder cross attention layers (with causal chunk cross attention)
    heads = 8,                               # attention heads
    dim_head = 64,                           # dimension per head
    dec_attn_dropout = 0.25,                 # decoder attention dropout
    dec_ff_dropout = 0.25                    # decoder feedforward dropout
).cuda()

wrapper = TrainingWrapper(
    retro = retro,                                 # the RETRO instance from above
    knn = 2,                                       # knn (2 in paper was sufficient)
    chunk_size = 64,                               # chunk size (64 in paper)
    documents_path = './text_folder',              # path to folder of text
    glob = '**/*.txt',                             # text glob
    chunks_memmap_path = './train.chunks.dat',     # path to chunks
    seqs_memmap_path = './train.seq.dat',          # path to sequence data
    doc_ids_memmap_path = './train.doc_ids.dat',   # path to document ids per chunk (used for filtering neighbors belonging to same document)
    max_chunks = 1_000_000,                        # maximum cap to chunks
    max_seqs = 100_000,                            # maximum seqs
    knn_extra_neighbors = 100,                     # num extra neighbors to fetch
    max_index_memory_usage = '100m',
    current_memory_available = '1G'
)

# get the dataloader and optimizer (AdamW with all the correct settings)

train_dl = iter(wrapper.get_dataloader(batch_size = 2, shuffle = True))
optim = wrapper.get_optimizer(lr = 3e-4, wd = 0.01)

# now do your training
# ex. one gradient step

seq, retrieved = map(lambda t: t.cuda(), next(train_dl))

# seq       - (2, 2049)         - 1 extra token since split by seq[:, :-1], seq[:, 1:]
# retrieved - (2, 32, 2, 128)   - 128 since chunk + continuation, each 64 tokens

loss = retro(
    seq,
    retrieved,
    return_loss = True
)

# one gradient step

loss.backward()
optim.step()
optim.zero_grad()

# do above for many steps, then ...

# topk sampling with retrieval at chunk boundaries

sampled = wrapper.generate(filter_thres = 0.9, temperature = 1.0) # (1, <2049) terminates early if all <eos>

# or you can generate with a prompt, knn retrieval for initial chunks all taken care of

prompt = torch.randint(0, 1000, (1, 128))  # start with two chunks worth of sequence
sampled = wrapper.generate(prompt, filter_thres = 0.9, temperature = 1.0) # (1, <2049) terminates early if all <eos>

If you wish to force a reprocessing of the training data, simply run your script with the REPROCESS=1 environment flag, like so

$ REPROCESS=1 python train.py
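
If you prefer to set the flag from inside Python (e.g. in a notebook), the sketch below should be equivalent, under the assumption that the flag only needs to be present in the environment before the TrainingWrapper processes the documents.

import os

# assumption: REPROCESS is read from the environment at processing time,
# so it must be set before the TrainingWrapper is constructed
os.environ['REPROCESS'] = '1'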

RETRO Datasets

The RETRODataset class accepts paths to a number of memmapped numpy arrays containing the chunks, the index of the first chunk in the sequence to be trained on (in the RETRO decoder), and the pre-calculated indices of the k-nearest neighbors per chunk.

You can use this to easily assemble the data for RETRO training, if you do not wish to use the TrainingWrapper from above.

Furthermore, all the functions needed to create the necessary memmapped data are in the sections that follow.

import torch
from torch.utils.data import DataLoader
from retro_pytorch import RETRO, RETRODataset

# mock data constants

import numpy as np

NUM_CHUNKS = 1000
CHUNK_SIZE = 64
NUM_SEQS = 100
NUM_NEIGHBORS = 2

def save_memmap(path, tensor):
    f = np.memmap(path, dtype = tensor.dtype, mode = 'w+', shape = tensor.shape)
    f[:] = tensor
    del f

# generate mock chunk data

save_memmap(
    './train.chunks.dat',
    np.int32(np.random.randint(0, 8192, size = (NUM_CHUNKS, CHUNK_SIZE + 1)))
)

# generate nearest neighbors for each chunk

save_memmap(
    './train.chunks.knn.dat',
    np.int32(np.random.randint(0, 1000, size = (NUM_CHUNKS, NUM_NEIGHBORS)))
)

# generate seq data

save_memmap(
    './train.seq.dat',
    np.int32(np.random.randint(0, 128, size = (NUM_SEQS,)))
)

# instantiate dataset class
# which constructs the sequence and neighbors from memmapped chunk and neighbor information

train_ds = RETRODataset(
    num_sequences = NUM_SEQS,
    num_chunks = NUM_CHUNKS,
    num_neighbors = NUM_NEIGHBORS,
    chunk_size = CHUNK_SIZE,
    seq_len = 2048,
    chunk_memmap_path = './train.chunks.dat',
    chunk_nn_memmap_path = './train.chunks.knn.dat',
    seq_memmap_path = './train.seq.dat'
)

train_dl = iter(DataLoader(train_ds, batch_size = 2))

# one forwards and backwards

retro = RETRO(
    max_seq_len = 2048,                      # max sequence length
    enc_dim = 896,                           # encoder model dimension
    enc_depth = 3,                           # encoder depth
    dec_dim = 768,                           # decoder model dimensions
    dec_depth = 12,                          # decoder depth
    dec_cross_attn_layers = (1, 3, 6, 9),    # decoder cross attention layers (with causal chunk cross attention)
    heads = 8,                               # attention heads
    dim_head = 64,                           # dimension per head
    dec_attn_dropout = 0.25,                 # decoder attention dropout
    dec_ff_dropout = 0.25                    # decoder feedforward dropout
).cuda()

seq, retrieved = map(lambda t: t.cuda(), next(train_dl))

# seq       - (2, 2049)         - 1 extra token since split by seq[:, :-1], seq[:, 1:]
# retrieved - (2, 32, 2, 128)   - 128 since chunk + continuation, each 64 tokens

loss = retro(
    seq,
    retrieved,
    return_loss = True
)

loss.backward()

Retrieval related tools

This repository will use the default tokenizer (sentencepiece) for the cased version of BERT. Embeddings will be fetched from the vanilla BERT, and can be either the masked mean pooled representation or the CLS token.

ex. masked mean pooled representation

from retro_pytorch.retrieval import bert_embed, tokenize

ids = tokenize([
    'hello world',
    'foo bar'
])

embeds = bert_embed(ids) # (2, 768) - 768 is hidden dimension of BERT

ex. CLS token representation

from retro_pytorch.retrieval import bert_embed, tokenize

ids = tokenize([
    'hello world',
    'foo bar'
])

embeds = bert_embed(ids, return_cls_repr = True) # (2, 768)

Create your chunks and chunk start indices (for calculating sequence ranges for autoregressive training) using text_folder_to_chunks_

from retro_pytorch.retrieval import text_folder_to_chunks_

stats = text_folder_to_chunks_(
    folder = './text_folder',
    glob = '**/*.txt',
    chunks_memmap_path = './train.chunks.dat',
    seqs_memmap_path = './train.seq.dat',
    doc_ids_memmap_path = './train.doc_ids.dat',  # document ids are needed for filtering out neighbors belonging to same document appropriately during computation of nearest neighbors
    chunk_size = 64,
    seq_len = 2048,
    max_chunks = 1_000_000,
    max_seqs = 100_000
)

# {'chunks': <number of chunks>, 'docs': <number of documents>, 'seqs': <number of sequences>}

Fetching Nearest Neighbors

You can turn your memmapped chunks numpy array into embeddings and a faiss index with one command

from retro_pytorch.retrieval import chunks_to_index_and_embed

index, embeddings = chunks_to_index_and_embed(
    num_chunks = 1000,
    chunk_size = 64,
    chunk_memmap_path = './train.chunks.dat'
)

query_vector = embeddings[:1]                   # use first embedding as query
_, indices = index.search(query_vector, k = 2)  # fetch 2 neighbors, first indices should be self

neighbor_embeddings = embeddings[indices]       # (1, 2, 768)
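
As an optional sanity check (continuing the snippet above, and assuming embeddings is a plain numpy array), you can compare the query embedding against its retrieved neighbors with cosine similarity; the first neighbor should be the query itself, with similarity close to 1.

import numpy as np

query = embeddings[0]                           # (768,)
neighbors = embeddings[indices[0]]              # (2, 768)

# cosine similarity between the query and each retrieved neighbor
cosine = (neighbors @ query) / (
    np.linalg.norm(neighbors, axis = -1) * np.linalg.norm(query)
)

print(cosine)   # first value should be ~1.0 (the query itself)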

You can also directly calculate the nearest neighbor file necessary for training with the chunks_to_precalculated_knn_ command

from retro_pytorch.retrieval import chunks_to_precalculated_knn_

chunks_to_precalculated_knn_(
    num_chunks = 1000,
    chunk_size = 64,
    chunk_memmap_path = './train.chunks.dat',    # path to main chunks dataset
    doc_ids_memmap_path = './train.doc_ids.dat', # path to document ids created by text_folder_to_chunks_, used for filtering out neighbors that belong to the same document
    num_nearest_neighbors = 2,                   # number of nearest neighbors you'd like to use
    num_extra_neighbors = 10                     # fetch 10 extra neighbors, in the case that fetched neighbors are frequently from same document (filtered out)
)

# nearest neighbor info saved to ./train.chunks.knn.dat
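
To inspect the result, the saved neighbor file can be opened as a memmap. The dtype and shape below are assumptions based on the mock data format shown earlier (int32, one row of neighbor chunk ids per chunk).

import numpy as np

num_chunks, num_nearest_neighbors = 1000, 2   # must match the values used above

# assumption: the file layout mirrors the mock knn memmap from the RETRO Datasets section
knn = np.memmap(
    './train.chunks.knn.dat',
    dtype = np.int32,
    mode = 'r',
    shape = (num_chunks, num_nearest_neighbors)
)

print(knn[0])   # neighbor chunk indices for chunk 0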

Citations

@misc{borgeaud2022improving,
    title   = {Improving language models by retrieving from trillions of tokens}, 
    author  = {Sebastian Borgeaud and Arthur Mensch and Jordan Hoffmann and Trevor Cai and Eliza Rutherford and Katie Millican and George van den Driessche and Jean-Baptiste Lespiau and Bogdan Damoc and Aidan Clark and Diego de Las Casas and Aurelia Guy and Jacob Menick and Roman Ring and Tom Hennigan and Saffron Huang and Loren Maggiore and Chris Jones and Albin Cassirer and Andy Brock and Michela Paganini and Geoffrey Irving and Oriol Vinyals and Simon Osindero and Karen Simonyan and Jack W. Rae and Erich Elsen and Laurent Sifre},
    year  = {2022},
    eprint = {2112.04426},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL}
}
@misc{su2021roformer,
    title   = {RoFormer: Enhanced Transformer with Rotary Position Embedding},
    author  = {Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu},
    year    = {2021},
    eprint  = {2104.09864},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL}
}
@article{Wang2022DeepNetST,
    title   = {DeepNet: Scaling Transformers to 1,000 Layers},
    author  = {Hongyu Wang and Shuming Ma and Li Dong and Shaohan Huang and Dongdong Zhang and Furu Wei},
    journal = {ArXiv},
    year    = {2022},
    volume  = {abs/2203.00555}
}
@misc{zhang2021sparse,
    title   = {Sparse Attention with Linear Units},
    author  = {Biao Zhang and Ivan Titov and Rico Sennrich},
    year    = {2021},
    eprint  = {2104.07012},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL}
}

I consider always the adult life to be the continuous retrieval of childhood. - Umberto Eco

retro-pytorch's People

Contributors

edoost, hi-archers, josephcappadona, lucidrains, mitchellgordon95, ncoop57, soheeyang


retro-pytorch's Issues

RuntimeError: Error in void faiss::gpu::GpuIndexIVFPQ::verifySettings_()

This error occurs when trying to use TrainingWrapper. If the training data is 1 megabyte in total, no error occurs; on larger data this error appears.

Apparently the script tries to process all the data at once rather than in batches, which exhausts system resources.

RAM: 12 GB
VRAM: 12 GB

import torch
from retro_pytorch import RETRO, TrainingWrapper

retro = RETRO(
    chunk_size = 64,                         # the chunk size that is indexed and retrieved (needed for proper relative positions as well as causal chunked cross attention)
    max_seq_len = 2048,                      # max sequence length
    enc_dim = 896,                           # encoder model dim
    enc_depth = 2,                           # encoder depth
    dec_dim = 796,                           # decoder model dim
    dec_depth = 12,                          # decoder depth
    dec_cross_attn_layers = (3, 6, 9, 12),   # decoder cross attention layers (with causal chunk cross attention)
    heads = 8,                               # attention heads
    dim_head = 64,                           # dimension per head
    dec_attn_dropout = 0.25,                 # decoder attention dropout
    dec_ff_dropout = 0.25,                   # decoder feedforward dropout
    use_deepnet = True                       # turn on post-normalization with DeepNet residual scaling and initialization, for scaling to 1000 layers
).cuda()

wrapper = TrainingWrapper(
    retro = retro,                                 # path to retro instance
    knn = 2,                                       # knn (2 in paper was sufficient)
    chunk_size = 64,                               # chunk size (64 in paper)
    documents_path = '/content/text/',              # path to folder of text
    glob = '**/*.txt',                             # text glob
    chunks_memmap_path = './train.chunks.dat',     # path to chunks
    seqs_memmap_path = './train.seq.dat',          # path to sequence data
    doc_ids_memmap_path = './train.doc_ids.dat',   # path to document ids per chunk (used for filtering neighbors belonging to same document)
    max_chunks = 500_000,                        # maximum cap to chunks
    max_seqs = 100_000,                            # maximum seqs
    knn_extra_neighbors = 100,                     # num extra neighbors to fetch
    max_index_memory_usage = '100m',
    current_memory_available = '10G'
)

Out:

processing /content/text/kxaa.txt
Downloading: "https://github.com/huggingface/pytorch-transformers/archive/main.zip" to /root/.cache/torch/hub/main.zip
Downloading: 100%
29.0/29.0 [00:00<00:00, 662B/s]
Downloading: 100%
570/570 [00:00<00:00, 14.6kB/s]
Downloading: 100%
208k/208k [00:00<00:00, 2.26MB/s]
Downloading: 100%
426k/426k [00:00<00:00, 4.60MB/s]
Token indices sequence length is longer than the specified maximum sequence length for this model (3449121 > 512). Running this sequence through the model will result in indexing errors
Using cache found in /root/.cache/torch/hub/huggingface_pytorch-transformers_main
Downloading: 100%
416M/416M [00:09<00:00, 50.3MB/s]
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

embedded XXXXX / 53893
saved .tmp/embeddings/XXXXX.npy
2022-05-17 02:34:09,316 [INFO]: Using 2 omp threads (processes), consider increasing --nb_cores if you have more
2022-05-17 02:34:09,317 [INFO]: Launching the whole pipeline 05/17/2022, 02:34:09
2022-05-17 02:34:09,321 [INFO]: Reading total number of vectors and dimension 05/17/2022, 02:34:09
100%|██████████| 108/108 [00:00<00:00, 5336.89it/s]
2022-05-17 02:34:09,465 [INFO]: There are 53893 embeddings of dim 768
2022-05-17 02:34:09,466 [INFO]: >>> Finished "Reading total number of vectors and dimension" in 0.1405 secs
2022-05-17 02:34:09,471 [INFO]: 	Compute estimated construction time of the index 05/17/2022, 02:34:09
2022-05-17 02:34:09,474 [INFO]: 		-> Train: 16.7 minutes
2022-05-17 02:34:09,478 [INFO]: 		-> Add: 0.5 seconds
2022-05-17 02:34:09,480 [INFO]: 		Total: 16.7 minutes
2022-05-17 02:34:09,481 [INFO]: 	>>> Finished "Compute estimated construction time of the index" in 0.0070 secs
2022-05-17 02:34:09,484 [INFO]: 	Checking that your have enough memory available to create the index 05/17/2022, 02:34:09
2022-05-17 02:34:09,487 [INFO]: 541.5MB of memory will be needed to build the index (more might be used if you have more)
2022-05-17 02:34:09,488 [INFO]: 	>>> Finished "Checking that your have enough memory available to create the index" in 0.0025 secs
2022-05-17 02:34:09,489 [INFO]: 	Selecting most promising index types given data characteristics 05/17/2022, 02:34:09
2022-05-17 02:34:09,490 [INFO]: 	>>> Finished "Selecting most promising index types given data characteristics" in 0.0002 secs
2022-05-17 02:34:09,499 [INFO]: 	Creating the index 05/17/2022, 02:34:09
2022-05-17 02:34:09,500 [INFO]: 		-> Instanciate the index OPQ256_1024,IVF1024_HNSW32,PQ256x8 05/17/2022, 02:34:09
2022-05-17 02:34:09,509 [INFO]: 		>>> Finished "-> Instanciate the index OPQ256_1024,IVF1024_HNSW32,PQ256x8" in 0.0089 secs
2022-05-17 02:34:09,510 [INFO]: The index size will be approximately 18.2MB
2022-05-17 02:34:09,512 [INFO]: 		-> Extract training vectors 05/17/2022, 02:34:09
2022-05-17 02:34:09,513 [INFO]: Will use 53893 vectors to train the index, that will use 903.8MB of memory
 99%|█████████▉| 107/108 [00:00<00:00, 521.43it/s]
2022-05-17 02:34:09,732 [INFO]: 		>>> Finished "-> Extract training vectors" in 0.2194 secs
2022-05-17 02:34:10,226 [INFO]: 	>>> Finished "Creating the index" in 0.7267 secs
2022-05-17 02:34:10,228 [INFO]: >>> Finished "Launching the whole pipeline" in 0.9070 secs
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-6-d42557af9f46> in <module>()
     13     knn_extra_neighbors = 100,                     # num extra neighbors to fetch
     14     max_index_memory_usage = '100m',
---> 15     current_memory_available = '10G'
     16 )

6 frames
/usr/local/lib/python3.7/dist-packages/faiss/swigfaiss.py in index_cpu_to_gpu(provider, device, index, options)
  10273 def index_cpu_to_gpu(provider, device, index, options=None):
  10274     r""" converts any CPU index that can be converted to GPU"""
> 10275     return _swigfaiss.index_cpu_to_gpu(provider, device, index, options)
  10276 
  10277 def index_cpu_to_gpu_multiple(provider, devices, index, options=None):

RuntimeError: Error in void faiss::gpu::GpuIndexIVFPQ::verifySettings_() const at /project/faiss/faiss/gpu/GpuIndexIVFPQ.cu:428: Error: 'ivfpqConfig_.interleavedLayout || IVFPQ::isSupportedPQCodeLength(subQuantizers_)' failed: Number of bytes per encoded vector / sub-quantizers (256) is not supported

TrainingWrapper does not support line breaks

Notebook
When training RETRO with the standard methods, TrainingWrapper does not preserve line breaks in the dataset. This can have a bad effect on many NLP tasks.

Input *.txt:

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.

Second Citizen:
Would you proceed especially against Caius Marcius?

All:
Against him first: he's a very dog to the commonalty.

Model output after training:

some - - on my head, were even so salts to death strike That which may bet with tears I have found to life, which sweeter than now to dony : be known betwixcombed oaths ring yet in Corioli turnseth from him Dear life redeems doth thinkment for faith ; Or shall be slack than death within this face, PETRUCHIO : Now, wind and house or free thee better now. KATHARINA : Now, in mine honourable fellow : in your chat with me to be it, alive, I think, If to use than my wife, if this rebellious earth Have you will break out The strange s of yours cro
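
One possible workaround, sketched here only as an illustration (it is not part of the repository), is to pre-process the text files so that line breaks are replaced with an explicit marker before they are handed to TrainingWrapper. The marker and file names below are placeholders, and the marker would also need to be handled when decoding generations.

import os

os.makedirs('./text_folder_marked', exist_ok = True)

# hypothetical preprocessing: keep the line structure by marking newlines explicitly
with open('./text_folder/input.txt') as f:
    text = f.read()

with open('./text_folder_marked/input.txt', 'w') as f:
    f.write(text.replace('\n', ' <nl> '))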

Question about the right position to encode `retrieved`

Hi, I am currently reading through the code and got confused when I reached this line:

retrieved = self.encoder(retrieved, mask = encoder_retrieved_mask, chunked_seq = embed_as_context)

[screenshot of Algorithm 1 from the RETRO paper]
According to Algorithm 1 in the paper (the screenshot above), doesn't this line need to go inside the decoder, under this line?

if exists(cross_attn) and exists(retrieved):

This is an example of how I think the code of decoder.forward should be.

def forward(self, x, *, context_mask = None, retrieved = None):
  encoded = False  # flag to know if p = min(P) (in the algorithm)
  ...
    if exists(cross_attn) and exists(retrieved):
      if not encoded:
        ...
        # use x (H at layer p where p = min(P)), not embed (Emb(X))
        x_as_context = repeat(x[:, :seq_index], 'b (k n) d -> (b k r) n d', n = self.chunk_size, r = num_neighbors)
        retrieved = self.encoder(retrieved, mask = encoder_retrieved_mask, chunked_seq = x_as_context)
        encoded = True

'NoneType' object is not callable

When I run the "RETRO Datasets" example, I get a TypeError:

Traceback (most recent call last):
  File "/home/fgq/all/RETRO/fuxian_2.py", line 58, in <module>
    retro = RETRO(
  File "/home/fgq/all/RETRO/retro_pytorch/retro_pytorch.py", line 507, in __init__
    self.encoder = Encoder(
  File "/home/fgq/all/RETRO/retro_pytorch/retro_pytorch.py", line 337, in __init__
    wrapper(Attention(dim = dim, dim_head = dim_head, heads = heads, dropout = attn_dropout, causal = causal)),
  File "/home/fgq/all/RETRO/retro_pytorch/retro_pytorch.py", line 73, in __init__
    self.norm = norm_klass(dim)
TypeError: 'NoneType' object is not callable

code

save_memmap(
    './train.chunks.dat',
    np.int32(np.random.randint(0, 8192, size = (NUM_CHUNKS, CHUNK_SIZE + 1)))
)

# generate nearest neighbors for each chunk

save_memmap(
    './train.chunks.knn.dat',
    np.int32(np.random.randint(0, 1000, size = (NUM_CHUNKS, NUM_NEIGHBORS)))
)

# generate seq data

save_memmap(
    './train.seq.dat',
    np.int32(np.random.randint(0, 128, size = (NUM_SEQS,)))
)

# instantiate dataset class

train_ds = RETRODataset(
    num_sequences = NUM_SEQS,
    num_chunks = NUM_CHUNKS,
    num_neighbors = NUM_NEIGHBORS,
    chunk_size = CHUNK_SIZE,
    seq_len = 2048,
    chunk_memmap_path = './train.chunks.dat',
    chunk_nn_memmap_path = './train.chunks.knn.dat',
    seq_memmap_path = './train.seq.dat'
)

Question-Answer Dataset Format ?

Hello, may I ask: if you use the model for question answering, what format should the dataset be in?

I am using the retro model, and the dataset I created is:

question: Where is the Alberta Basin located?
answer: It is located in western Canada, between latitudes 49° to 60°.

question: What is the area of the Alberta Basin?
answer: The area is approximately 748,889 square kilometers.

question: What type of basin is the Alberta Basin?
answer: It is a foreland basin.

But there seems to be an error in the following generation:
prefix = 'Where is the Alberta Basin'
prompt = torch.LongTensor(tokenizer.encode(prefix, add_special_tokens=False)).unsqueeze(0)
sampled = wrapper.generate(prompt, filter_thres = 0.1, temperature = 0.1) # (1, <2049) terminates early if all
print(sampled)
print('#######')
print(tokenizer.decode(sampled.squeeze(), skip_special_tokens=True))

The output is garbled: (Where is the Alberta Basin: 。 question : 。 question : ? answer : : 。 question : 。 question : 。 question : : 。 question : 。 question : : 。 question : 。 question : : 。 question : 。 question : : 。 question :stion : 。 question : 。 question :ion : : : 。 question : 。 question : : 。 question : : 的 ? answer : 。 question : : 。 question : 。 question : 。 question : 。 question : 。 question : : : 。 question : 。 question :stion : 。 question : 。 question : 。)
Thank you!

Model training

Thanks for your awesome work @lucidrains !

Are you aware of any efforts on reproducing the actual model training?

Double [CLS] token in the first doc chunk

I noticed when we tokenize, we set add_special_tokens to True here:

https://github.com/lucidrains/RETRO-pytorch/blob/main/retro_pytorch/retrieval.py#L72

which adds a [CLS] token to the beginning of the doc tokens. But when we embed the chunks with BERT, we also add a CLS token to the beginning of the chunk:

https://github.com/lucidrains/RETRO-pytorch/blob/main/retro_pytorch/retrieval.py#L240

So for some chunks (the first chunk in every doc) we will have two [CLS] tokens at the beginning of the chunk. I think the solution here is just to turn off add_special_tokens when going from text -> chunks? Is that correct?
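
To illustrate the concern with plain Hugging Face tokenizer calls (this is not the repository's code, just a demonstration of the add_special_tokens behavior):

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')

# with special tokens, the document already begins with [CLS]
ids = tokenizer.encode('hello world', add_special_tokens = True)
print(tokenizer.convert_ids_to_tokens(ids))    # ['[CLS]', 'hello', 'world', '[SEP]']

# without special tokens, the [CLS] can be added once, at embedding time only
ids = tokenizer.encode('hello world', add_special_tokens = False)
print(tokenizer.convert_ids_to_tokens(ids))    # ['hello', 'world']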

Error # could not open .tmp/.index/knn.index for reading: No such file or directory

When I ran the training wrapper as below, it gave me an error.

wrapper = TrainingWrapper(
    retro = retro,                                  # path to retro instance
    knn = 2,                                        # knn (2 in paper was sufficient)
    chunk_size = 64,                                # chunk size (64 in paper)
    documents_path = "/home/AD/siddgarg/Retro",     # path to folder of text
    glob = '**/train_text_admission_v2.txt',
    chunks_memmap_path = './train.chunks.dat',
    seqs_memmap_path = './train.seq.dat',
    doc_ids_memmap_path = './train.doc_ids.dat',
    max_chunks = 1_000_000,                         # maximum cap to chunks
    max_seqs = 100_000,                             # maximum seqs
    knn_extra_neighbors = 100,                      # num extra neighbors to fetch
    max_index_memory_usage = '100m',
    processed_stats_json_path = './processed-stats3.json',
    current_memory_available = '1G'
)

could not open .tmp/.index/knn.index for reading: No such file or directory

Clarification on Architecture

Reading the original paper, I took it that RETRO was a standard transformer (i.e. 12 layer encoder, 12 layer decoder) augmented with a DB retrieval system that included a second, smaller (2 layer) encoder for the frozen BERT-encoded neighbors, where the 2 layer encoder was sort of a translator between the BERT model and the main transformer.

Looking at the model here, it looks like there is only the 2 layer retrieval encoder and not a full-size main encoder. Is that correct?

Going back and re-reading the paper it doesn't seem to explicitly say one way or the other. It seems odd to me that the model would only have the 2 layer retrieval encoder. Not only would this mean that the encoder is only 2 layers but it also means that most decoder layers have no standard cross attention to the encoder, only layers 6, 9, 12 with the new CCA setup.

Has anyone trained the model from this repo and demonstrated that it can produce the results from the original paper?

Autoregressivity

I had a question about Figure 2 and equation 3 from the paper. How does the last token of each chunk C_u being able to attend to the retrieved content E_u not break autoregressivity?

Retro-fitting a pretrained model

Hey,

Thank you for your implementation!
Is it possible to use your library to "retro-fit" a pretrained model?

I guess it would mean freezing the model during training, only fine-tuning the retrieval and cross-attention?
How would you recommend doing that?

Thanks!
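
One rough sketch of how this could be attempted (not an endorsed recipe): freeze every parameter, then unfreeze only the cross attention weights. The 'cross_attn' substring filter is an assumption about how the modules are named inside RETRO, so check retro.named_parameters() for the actual names.

import torch

# assumes `retro` is a RETRO instance; the name filter below is a guess, not a documented API
for name, param in retro.named_parameters():
    param.requires_grad = 'cross_attn' in name

trainable = [p for p in retro.parameters() if p.requires_grad]
optim = torch.optim.Adam(trainable, lr = 1e-4)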

How to give Prompt to trained RETRO Model?

I am following the instructions on the RETRO-pytorch GitHub repo. After training my model, how do I go about using it to generate responses?

retro = RETRO(
    chunk_size = 64,                         # the chunk size that is indexed and retrieved (needed for proper relative positions as well as causal chunked cross attention)
    max_seq_len = 2048,                      # max sequence length
    enc_dim = 896,                           # encoder model dim
    enc_depth = 2,                           # encoder depth
    dec_dim = 796,                           # decoder model dim
    dec_depth = 12,                          # decoder depth
    dec_cross_attn_layers = (3, 6, 9, 12),   # decoder cross attention layers (with causal chunk cross attention)
    heads = 8,                               # attention heads
    dim_head = 64,                           # dimension per head
    dec_attn_dropout = 0.25,                 # decoder attention dropout
    dec_ff_dropout = 0.25,                   # decoder feedforward dropout
    use_deepnet = True                       # turn on post-normalization with DeepNet residual scaling and initialization, for scaling to 1000 layers
)

seq = torch.randint(0, 20000, (2, 2048 + 1))      # plus one since it is split into input and labels for training
retrieved = torch.randint(0, 20000, (2, 32, 2, 128)) # retrieved tokens - (batch, num chunks, num retrieved neighbors, retrieved chunk with continuation)

loss = retro(seq, retrieved, return_loss = True)
loss.backward()

wrapper = TrainingWrapper(
    retro = retro,                                 # path to retro instance
    knn = 2,                                       # knn (2 in paper was sufficient)
    chunk_size = 64,                               # chunk size (64 in paper)
    documents_path = './retro_training_set/',              # path to folder of text
    glob = '**/*.txt',                             # text glob
    chunks_memmap_path = './train.chunks.dat',     # path to chunks
    seqs_memmap_path = './train.seq.dat',          # path to sequence data
    doc_ids_memmap_path = './train.doc_ids.dat',   # path to document ids per chunk (used for filtering neighbors belonging to same document)
    max_chunks = 1_000_000,                        # maximum cap to chunks
    max_seqs = 100_000,                            # maximum seqs
    knn_extra_neighbors = 100,                     # num extra neighbors to fetch
    max_index_memory_usage = '100m',
    current_memory_available = '1G'    
)

Now when I want to give this model a text input (any prompt), how would I go about doing that? Which method or function would I use? Which model/tokenizer should I use to encode the input prompt and then decode the model output tensor? Is there a method for that?

Example Prompt:
"The movie Dune was released in"

rotary embedding question

I have two questions about the rotary embedding implementation.

1. Dividing the d-dimensional space into d/2 sub-spaces

In rotary embedding, head_dim is divided by 2 to utilize the conjugate space with sin and cos.

from rotary_embedding_torch import RotaryEmbedding

head_dim = 64
rotary_emb = RotaryEmbedding(dim=head_dim)
class RotaryEmbedding(nn.Module):
    def __init__(
        self,
        dim,
        custom_freqs = None,
        freqs_for = 'lang',
        theta = 10000,
        max_freq = 10,
        num_freqs = 1,
        learned_freq = False
    ):
        super().__init__()
        if exists(custom_freqs):
            freqs = custom_freqs
        elif freqs_for == 'lang':
            # freqs.shape == (head_dim // 2)
            freqs = 1. / (theta ** (torch.arange(0, dim, 2)[:(dim // 2)].float() / dim))
        ...

But the freqs of the rotary embedding in RETRO seem a bit odd. The rotary embedding in RETRO's Encoder and Decoder divides head_dim by 2 in advance and passes it as input.

rotary_emb_dim = max(dim_head // 2, MIN_DIM_HEAD)
self.rotary_pos_emb = RotaryEmbedding(rotary_emb_dim)

And the frequency dimension is divided by 2 once again in the initializer, as shown below.

class RotaryEmbedding(nn.Module):
    def __init__(self, dim):
        super().__init__()
        inv_freq = 1. / (10000 ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer('inv_freq', inv_freq)

In this way, when head_dim=48, freqs ends up covering only half the intended dimensions.

Because the apply_rotary_emb function concats the part of the tensor that exceeds rot_dim, the shape of the resulting tensor is the same, but the rotary positions do not seem to be fully applied.

Hence, I think you need to modify the two lines of code as below.

>>> ASIS
            freqs = 1. / (theta ** (torch.arange(0, dim, 2)[:(dim // 2)].float() / dim))
<<< TOBE
            freqs = 1. / (theta ** (torch.arange(0, dim, 2).float() / dim))

  • https://github.com/lucidrains/RETRO-pytorch/blob/main/retro_pytorch/retro_pytorch.py#L95

    As shown in the confirmation code below, the above modification is equivalent to the existing rotary embedding implementation.

    import torch
    dim1 = hid_dim // n_heads
    dim2 = (hid_dim // n_heads) // 2
    freqs1 = 1. / (10000 ** (torch.arange(0, dim1, 2).float() / dim1))
    freqs2 = 1. / (10000 ** (torch.arange(0, dim2, 1).float() / dim2))
    assert torch.equal(freqs1, freqs2)

>>> ASIS
        inv_freq = 1. / (10000 ** (torch.arange(0, dim, 2).float() / dim))
<<< TOBE
        inv_freq = 1. / (10000 ** (torch.arange(0, dim, 1).float() / dim))

2. rotate_half function

The rotate_half implementations of RETRO-pytorch and rotary-embedding-torch are slightly different.

# In rotary-embedding-torch
# https://github.com/lucidrains/rotary-embedding-torch/blob/517ee2cfeb10602032ef9d282c19851e19dd8943/rotary_embedding_torch/rotary_embedding_torch.py#L34
def rotate_half(x):
    x = rearrange(x, '... (d r) -> ... d r', r = 2)
    x1, x2 = x.unbind(dim = -1)
    x = torch.stack((-x2, x1), dim = -1)
    return rearrange(x, '... d r -> ... (d r)')
# In RETRO-pytorch
# https://github.com/lucidrains/RETRO-pytorch/blob/4f99e316458fb13a5e4f881586f8436458cf4ead/retro_pytorch/retro_pytorch.py#L104
def rotate_half(x):
    x = rearrange(x, '... (j d) -> ... j d', j = 2)
    x1, x2 = x.unbind(dim = -2)
    return torch.cat((-x2, x1), dim = -1)

In rotary-embedding-torch, the halves are interleaved as [0 1 0 1 0 1 0 1], while in RETRO they are stacked as [0 0 0 0 1 1 1 1].

  • [0 0 0 0] is the pre-half
  • [1 1 1 1] is the post-half

I wonder why it was implemented with this change! (Just curious.)

I am studying your implementation and matching it against the paper. Thank you always :)

AttributeError: module 'faiss' has no attribute 'GpuParameterSpace'

While running this part of the code (python==3.10.12 & faiss-gpu==1.7.2):

from retro_pytorch.retrieval import chunks_to_index_and_embed

index, embeddings = chunks_to_index_and_embed(
    num_chunks = 1000,
    chunk_size = 64,
    chunk_memmap_path = './train.chunks.dat'
)

query_vector = embeddings[:1]                   # use first embedding as query
_, indices = index.search(query_vector, k = 2)  # fetch 2 neighbors, first indices should be self

neighbor_embeddings = embeddings[indices]       # (1, 2, 768)


Convert embedded tokens to English


The embeddings were produced with BERT, so I tried to convert them back to text with a BERT tokenizer, but the conversion did not work. How should I do this?

[Full Code]

import torch
from torch.utils.data import DataLoader
from retro_pytorch import RETRO, RETRODataset

# mock data constants

import numpy as np

NUM_CHUNKS = 1000
CHUNK_SIZE = 64
NUM_SEQS = 100
NUM_NEIGHBORS = 2

def save_memmap(path, tensor):
    f = np.memmap(path, dtype = tensor.dtype, mode = 'w+', shape = tensor.shape)
    f[:] = tensor
    del f

# generate mock chunk data

save_memmap(
    './train.chunks.dat',
    np.int32(np.random.randint(0, 8192, size = (NUM_CHUNKS, CHUNK_SIZE + 1)))
)

# generate nearest neighbors for each chunk

save_memmap(
    './train.chunks.knn.dat',
    np.int32(np.random.randint(0, 1000, size = (NUM_CHUNKS, NUM_NEIGHBORS)))
)

# generate seq data

save_memmap(
    './train.seq.dat',
    np.int32(np.random.randint(0, 128, size = (NUM_SEQS,)))
)

# instantiate dataset class
# which constructs the sequence and neighbors from memmapped chunk and neighbor information

train_ds = RETRODataset(
    num_sequences = NUM_SEQS,
    num_chunks = NUM_CHUNKS,
    num_neighbors = NUM_NEIGHBORS,
    chunk_size = CHUNK_SIZE,
    seq_len = 2048,
    chunk_memmap_path = './train.chunks.dat',
    chunk_nn_memmap_path = './train.chunks.knn.dat',
    seq_memmap_path = './train.seq.dat'
)

# use a smaller batch size to avoid out-of-memory issues

batch_size = 1  # or any smaller value

# create a DataLoader with the specified batch size

train_dl = DataLoader(train_ds, batch_size = batch_size, shuffle = True)

# instantiate RETRO model

retro = RETRO(
    max_seq_len = 2048,
    enc_dim = 896,
    enc_depth = 3,
    dec_dim = 768,
    dec_depth = 12,
    dec_cross_attn_layers = (1, 3, 6, 9),
    heads = 8,
    dim_head = 64,
    dec_attn_dropout = 0.25,
    dec_ff_dropout = 0.25
).cuda()

for i, batch in enumerate(train_dl):
    # move data to GPU
    seq, retrieved = map(lambda t: t.cuda(), batch)

    # forward pass
    loss = retro(seq, retrieved, return_loss = True)

    # backward pass
    loss.backward()

from retro_pytorch.retrieval import bert_embed, tokenize

# tokenize input text

input_texts = ['hello world', 'foo bar']
input_ids = tokenize(input_texts).cuda()

# compute BERT embeddings on the GPU

embeds = bert_embed(input_ids, return_cls_repr = True)  # (2, 768)

# print or use the generated sequence (convert back to text if necessary)

print("Bert Embeddings:", embeds)

from transformers import BertTokenizer

# load the BERT tokenizer

tokenizer2 = BertTokenizer.from_pretrained('LilaBoualili/bert-vanilla')

# example BERT embeddings

bert_embeddings = embeds

# decoding text

print(tokenizer2.decode(bert_embeddings[0], skip_special_tokens = True))

Scann vs faiss

Could you elaborate on the decision to use Faiss instead of ScaNN? In theory ScaNN is open source too, but I'm wondering if you found it easier to get the needed performance from Faiss instead.

Confusions about cross attentions in encoder

In your code

x = self.norm(x)
kv_input = default(context, x)
q = self.to_q(x)
k, v = self.to_kv(kv_input).chunk(2, dim = -1)

When this class is called by the Encoder, x is the retrieved chunks. In the attention mechanism it produces the q matrix, but I think it should produce the k, v matrices. In the encoder, the input sequence should just guide attention over the retrieved chunks' tokens.

for attn, cross_attn, ff in self.layers:
    x = attn(x, mask = mask, pos_emb = q_pos_emb) + x

    if exists(cross_attn):
        x = cross_attn(x, context = chunked_seq, pos_emb = (q_pos_emb, k_pos_emb)) + x

    x = ff(x) + x

I am revising the model to solve QA task..

Hi, I am working on your code to solve a QA task.
I have a question.

Currently my dataset consists of context, question, and answer (each question has a paired answer and context).
The questions are very short (mostly 6 to 7 tokens after tokenizing with 'bert-base-multilingual'), so I add padding tokens to make the model run, because the default model setting requires 'chunk_size = 64'.

Here is my question.
I think the padding token is only there to fill out empty space; logically it means nothing.
For training, I made two chunks and put those in. The first chunk is made from "question: blahblahblah" and the second chunk from "answer: blahblah" (those sentences are too short, as I mentioned earlier).

Once I put an input such as "question: blahblahblah? answer: " into the checkpointed model, the model spits out [CLS] question : blahblahblah? answer: [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD].
I have no clue why my model stays dumb...

Question: residual connect after `ChunkedCrossAttention`?

At the end of ChunkedCrossAttention, the output is padded with 0s:

# pad back to original, with 0s at the beginning (which will be added to the residual and be fine)
out = F.pad(out, (0, 0, causal_padding, -causal_padding + seq_remain_len), value = 0.)
return out

The comment says it will be added to the residual, but what follows is the ffn; I don't find a residual connection (in Decoder.forward):

x = cross_attn(
    x,
    context = retrieved,
    context_mask = context_mask,
    pos_emb = cross_attn_pos_emb
)

x = ff(x)

Did I miss something?

Error Reconstructing FAISS Index

Hiya! Thanks for making this library out in the open!

I've been trying to get your training wrapper working, but when it tries to generate the index, I get the following error:

RuntimeError: Error in virtual void faiss::Index::reconstruct(faiss::Index::idx_t, float*) const at /project/faiss/faiss/Index.cpp:48: reconstruct not implemented for this type of index

To reproduce, you can use this google colab: https://colab.research.google.com/drive/1BcOtBpWBGmXX_tOC7WKcHOa9SukWEKpf?usp=sharing

Any help with this would be greatly appreciated!

Confusions about cross attentions in encoder and decoder

Hi, thanks for this wonderful repo. But I got confusions about the cross attention modules in encoder and decoder.
In retro_pytorch.py
has_cross_attn = not exists(cross_attn_layers) or layer_num in cross_attn_layers
at line 272 and
has_cross_attn = not exists(cross_attn_layers) or layer_num in cross_attn_layers
at line 321: when cross_attn_layers is None, it'll introduce extra attention ops in the encoder and unused cross attention modules in the decoder. Is this a bug? Or did I miss something?
In my understanding, it should be
has_cross_attn = exists(cross_attn_layers) and (layer_num in cross_attn_layers)
Am I right?

Why are there so many position embeddings?

Hi! Thanks for your great work, it's very helpful for my project! I was just curious why there are so many position embeddings. Essentially it looks like a (1 to n) positional embedding is added to the sequence initially in the RETRO class, and then rotary embeddings are applied again in each attention module. I thought just the two in Attention and CCA would be quite enough. Thanks in advance!

Extra layer encoder_output_to_decoder_dim cause issue with distributed training

Hiya, hope Ice Cream is doing well, as well as you of course!

I've been trying to get distributed training working with your library and I discovered this additional Linear layer, encoder_output_to_decoder_dim, not being used anywhere:

https://github.com/lucidrains/RETRO-pytorch/blob/main/retro_pytorch/retro_pytorch.py#L491

It seems to be a copy of the layer defined right above it to_decoder_model_dim, which does get used. Having this extra layer that is not part of the loss calculation causes the following error with data parallelism:

[RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one.](https://github.com/pytorch/pytorch/issues/43259#)

Not sure if this layer is supposed to be there and it just didn't get used or if it is there by accident, so wanted to ask 🤓

Huggingface model

Any plan to release a Hugging Face wrapper for the RETRO-pytorch model?
Thanks!
