Hi,
Thank you for sharing the SemDeDup implementation. After reading the source code and the paper, I have a question about how the document embeddings are generated.
The paper states:

> To perform SemDeDup, we pass documents through the open-sourced pre-trained 125M OPT model and save the last layer embedding for the last token in the document.
In `compute_pretrained_embeddings.py`, however, the model output `encodings` is saved to `emd_memmap` directly. Could you please explain how the last-layer embedding of the last token is extracted here?
```python
with torch.no_grad():
    for data_batch, paths_batch, batch_indices in tqdm(dataloader):
        data_batch = data_batch.to(device)
        encodings = model(data_batch)
        emd_memmap[batch_indices] = normalize(encodings, dim=1)
```
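For context, here is a minimal sketch of the extraction I was expecting, assuming the Hugging Face `OPTModel` (the base model without the LM head) is used; the variable names are my own, not from the repo:

```python
import torch
from torch.nn.functional import normalize
from transformers import AutoTokenizer, OPTModel

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
model = OPTModel.from_pretrained("facebook/opt-125m").eval()

texts = ["An example document.", "Another, somewhat longer example document."]
inputs = tokenizer(texts, padding=True, return_tensors="pt")

with torch.no_grad():
    # OPTModel returns the final layer's hidden states:
    # last_hidden_state has shape (batch, seq_len, hidden_dim).
    outputs = model(**inputs)
    # Index of the last non-padding token in each sequence
    # (assumes right padding, the tokenizer's default).
    last_idx = inputs["attention_mask"].sum(dim=1) - 1
    last_token_emb = outputs.last_hidden_state[torch.arange(len(texts)), last_idx]
    embeddings = normalize(last_token_emb, dim=1)  # one vector per document
```

Is something like this happening inside `model(data_batch)` in your code, or is a different pooling applied?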
Additionally, how do you handle documents longer than the model's context window? I can think of two possible approaches:
- Truncate the document to a maximum sequence length of 2048 during tokenization, which aligns with the OPT model's sequence limit.
- Divide the document into smaller chunks, feed each chunk into the model, and average the resulting embeddings (a rough sketch follows below).
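For concreteness, here is a rough sketch of both options as I currently understand them; `MAX_LEN` and `chunked_embedding` are my own hypothetical names, not from the repo:

```python
import torch
from torch.nn.functional import normalize

MAX_LEN = 2048  # OPT's maximum sequence length

# Option 1: simple truncation at tokenization time.
# inputs = tokenizer(text, truncation=True, max_length=MAX_LEN, return_tensors="pt")

def chunked_embedding(text, tokenizer, model, max_len=MAX_LEN):
    """Option 2: split a long document into chunks of at most max_len tokens,
    embed each chunk via its last token, and average the chunk embeddings."""
    ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
    embs = []
    with torch.no_grad():
        # Note: chunks after the first lack a BOS token; this is a
        # simplification in the sketch.
        for chunk in ids.split(max_len):
            out = model(chunk.unsqueeze(0))
            embs.append(out.last_hidden_state[0, -1, :])  # last token of chunk
    return normalize(torch.stack(embs).mean(dim=0), dim=0)
```

Which of these (if either) matches what was done for the paper's experiments?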
Thanks in advance for your help!