Hi,
Thank you for sharing the SemDeDup implementation. After reading the source code and the paper, I have a question about how the document embeddings are generated.
The paper states:

> To perform SemDeDup, we pass documents through the open-sourced pre-trained 125M OPT model and save the last layer embedding for the last token in the document.
In `compute_pretrained_embeddings.py`, however, the model output `encodings` is saved to `emd_memmap` directly. Could you please explain how the last-layer embedding of the last token is extracted here?
```python
with torch.no_grad():
    for data_batch, paths_batch, batch_indices in tqdm(dataloader):
        data_batch = data_batch.to(device)
        encodings = model(data_batch)
        emd_memmap[batch_indices] = normalize(encodings, dim=1)
```
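For context, here is a minimal sketch of the extraction I was expecting, assuming the Hugging Face `OPTModel` (the base model without the LM head) is used; the variable names are my own, not from the repo:

```python
import torch
from torch.nn.functional import normalize
from transformers import AutoTokenizer, OPTModel

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
model = OPTModel.from_pretrained("facebook/opt-125m").eval()

texts = ["An example document.", "Another, somewhat longer example document."]
inputs = tokenizer(texts, padding=True, return_tensors="pt")

with torch.no_grad():
    # OPTModel returns the final layer's hidden states:
    # last_hidden_state has shape (batch, seq_len, hidden_dim).
    outputs = model(**inputs)
    # Index of the last non-padding token in each sequence
    # (assumes right padding, the tokenizer's default).
    last_idx = inputs["attention_mask"].sum(dim=1) - 1
    last_token_emb = outputs.last_hidden_state[torch.arange(len(texts)), last_idx]
    embeddings = normalize(last_token_emb, dim=1)  # one vector per document
```

Is something like this happening inside `model(data_batch)` in your code, or is a different pooling applied?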
Additionally, how do you handle documents longer than the model's context window? I can think of two possible approaches:
- Truncate the document to a maximum sequence length of 2048 during tokenization, which aligns with the OPT model's sequence limit.
- Divide the document into smaller chunks, feed each chunk into the model, and average the resulting embeddings (a rough sketch follows below).
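For concreteness, here is a rough sketch of both options as I currently understand them; `MAX_LEN` and `chunked_embedding` are my own hypothetical names, not from the repo:

```python
import torch
from torch.nn.functional import normalize

MAX_LEN = 2048  # OPT's maximum sequence length

# Option 1: simple truncation at tokenization time.
# inputs = tokenizer(text, truncation=True, max_length=MAX_LEN, return_tensors="pt")

def chunked_embedding(text, tokenizer, model, max_len=MAX_LEN):
    """Option 2: split a long document into chunks of at most max_len tokens,
    embed each chunk via its last token, and average the chunk embeddings."""
    ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
    embs = []
    with torch.no_grad():
        # Note: chunks after the first lack a BOS token; this is a
        # simplification in the sketch.
        for chunk in ids.split(max_len):
            out = model(chunk.unsqueeze(0))
            embs.append(out.last_hidden_state[0, -1, :])  # last token of chunk
    return normalize(torch.stack(embs).mean(dim=0), dim=0)
```

Which of these (if either) matches what was done for the paper's experiments?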
Thanks in advance for your help!