
semdedup's Issues

How can I create embeddings?

I see the "template":

import numpy as np

from compute_pretrained_embeddings import get_embeddings

model = ...
dataloader = ...

path_str_type = ...
emb_memory_loc = ...
paths_memory_loc = ...
dataset_size = ...
emb_size = ...

# The embedding memmap and the paths memmap must point at different files.
emb_array = np.memmap(emb_memory_loc, dtype='float32', mode='w+', shape=(dataset_size, emb_size))
path_array = np.memmap(paths_memory_loc, dtype=path_str_type, mode='w+', shape=(dataset_size,))

get_embeddings(model, dataloader, emb_array, path_array)

How do I initialize these variables? Are there any prerequisites you haven't mentioned (such as installing the "transformers" library)? Where do the values for path_str_type, emb_memory_loc, and so on come from? Is there a complete sample script?

To clarify: I have a text file and I want to semdedup the sentences in it. How do I get through this first step?
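For what it's worth, here is a minimal, self-contained sketch of how those variables fit together, with a dummy embedding function standing in for a real model (everything besides the np.memmap calls themselves is illustrative, not from the repo):

```python
import os
import tempfile

import numpy as np

# Illustrative settings: dataset_size is the number of sentences,
# emb_size is the encoder's hidden dimension.
dataset_size = 4
emb_size = 8
path_str_type = "U32"  # fixed-width unicode strings for sentence ids

tmpdir = tempfile.mkdtemp()
emb_memory_loc = os.path.join(tmpdir, "embeddings.npy")
paths_memory_loc = os.path.join(tmpdir, "paths.npy")

# Two separate memmap files: one for embeddings, one for ids/paths.
emb_array = np.memmap(emb_memory_loc, dtype="float32", mode="w+",
                      shape=(dataset_size, emb_size))
path_array = np.memmap(paths_memory_loc, dtype=path_str_type, mode="w+",
                       shape=(dataset_size,))

sentences = ["first", "second", "third", "fourth"]

def dummy_embed(text):
    # Stand-in for a real encoder: a deterministic (per run) unit vector.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(emb_size).astype("float32")
    return v / np.linalg.norm(v)

for i, sent in enumerate(sentences):
    emb_array[i] = dummy_embed(sent)
    path_array[i] = f"sentence_{i}"

emb_array.flush()
path_array.flush()
```

In practice, `dummy_embed` would be replaced by a real sentence encoder, and the loop by the repo's `get_embeddings(model, dataloader, ...)` call, which fills the same two memmaps batch by batch.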

In advance_semdedup.py, why not a lower triangular matrix?

In the sort_clusters.py stage, all points are sorted by distance to the cluster centroid, from far to near.

If we take the upper triangular matrix, then for each kept element the column index is greater than the row index, so we end up removing the point that is farther from the center, right?

Wouldn't it make more sense to keep the points farther from the center?
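For concreteness, here is a toy sketch of the upper-triangular screening step being discussed (illustrative code, not the repo's, assuming rows are sorted far-to-near from the centroid as described above):

```python
import numpy as np

# Toy cluster of 4 unit vectors, rows already sorted by distance
# to the centroid from far to near.
emb = np.array([
    [1.0, 0.0],
    [0.99, 0.14],   # near-duplicate of row 0
    [0.0, 1.0],
    [0.1, 0.99],    # near-duplicate of row 2
], dtype="float32")
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

sim = emb @ emb.T

# Upper triangle (diagonal excluded): row i only sees columns j > i,
# i.e. points that sort *after* it (closer to the centroid).
triu = np.triu(sim, k=1)
max_sim = triu.max(axis=1)

threshold = 0.95
removed = max_sim > threshold  # → [True, False, True, False]
```

With this ordering, each point is compared only against the closer members of the cluster, so the farther member of each duplicate pair is the one flagged; a lower triangle under the same ordering would flag the closer member instead. Which member should be kept is exactly the design choice the question is about.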

Missing 'submit_semdedup_job.py' Script in README Pipeline Instructions

First of all, I'd like to express my gratitude for the excellent work you've done on this project; it's greatly appreciated.

While following the pipeline instructions in the README.md, specifically at the fourth step, "Run SemDeDup," I encountered a potential issue. The instructions mention the necessity of using a file named "submit_semdedup_job.py." However, upon checking the repository, it seems this file is not included, or I might have overlooked it.

Could you please clarify whether this file is supposed to be part of the repository? If it's missing, could you provide guidance on where to find it or how to proceed without it?

Thank you for your assistance and for the work you've put into this project.

Inquiry on Generating Embeddings for Long Documents using 125M OPT Model

Hi,

Appreciate your sharing of the SemDedup implementation. After reading the source code and paper, I've encountered an issue concerning the generation of embeddings.

To perform SemDeDup, we pass documents through the open-sourced pre-trained 125M OPT model and save the last layer embedding for the last token in the document.

In compute_pretrained_embeddings.py, the model output encodings are saved to emd_memmap directly. Could you please provide more information about how the last-layer embedding for the last token is implemented?

with torch.no_grad():
    for data_batch, paths_batch, batch_indices in tqdm(dataloader):
        data_batch = data_batch.to(device)
        encodings = model(data_batch)
        emd_memmap[batch_indices] = normalize(encodings, dim=1)

Additionally, how do you handle lengthy documents? Currently, I presume there are two approaches:

  1. Truncate the document to a maximum sequence length of 2048 during tokenization, which aligns with the OPT model's sequence limit.
  2. Divide the document into smaller chunks and feed them into the model, then compute the mean embeddings.

Thanks in advance for your assistance!
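Not from the repo, but "last layer embedding for the last token" typically means indexing the final hidden states at the last non-padded position of each sequence. A self-contained sketch with random arrays standing in for real model outputs (all shapes and names illustrative):

```python
import numpy as np

# Stand-in for a model's last hidden states: (batch, seq_len, hidden).
batch, seq_len, hidden = 2, 5, 4
rng = np.random.default_rng(0)
last_hidden = rng.standard_normal((batch, seq_len, hidden)).astype("float32")

# attention_mask marks real tokens (1) vs padding (0), as a tokenizer
# called with truncation=True and the model's max length would produce
# (approach 1 above: truncate to the 2048-token limit).
attention_mask = np.array([
    [1, 1, 1, 0, 0],   # 3 real tokens, then padding
    [1, 1, 1, 1, 1],   # full length
])

# Index of the last non-padded token in each sequence.
last_idx = attention_mask.sum(axis=1) - 1          # [2, 4]
doc_emb = last_hidden[np.arange(batch), last_idx]  # (batch, hidden)

# L2-normalize, matching the normalize(...) call in the quoted snippet.
doc_emb /= np.linalg.norm(doc_emb, axis=1, keepdims=True)
```

The same indexing works on a real model's last hidden state tensor; for approach 2 (chunking), `doc_emb` would instead be the mean of the per-chunk vectors.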
