
deezymatch's People

Contributors

dependabot[bot], fedenanni, kasra-hosseini, mcollardanuy


deezymatch's Issues

csv_sep is currently fixed to \t. Make this more general.

@mcollardanuy wrote:

Hi, I'm afraid in some scenarios this may potentially discard many rows if csv_sep is, for example, a comma, as it is not uncommon that the comma is part of an entity name (e.g. "Smith, John" if we try to link person names). Our solution at the moment is not sensitive to quoted text (the original code was sensitive to quoted text, but we had that strange parsing bug). That's why I was suggesting tab as the only accepted delimiter for now, because we'd rarely expect a tab to be part of a query or candidate. What do you think?
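One way to make csv_sep configurable without reintroducing the quoting bug is to lean on Python's csv module, which keeps a quoted field such as "Smith, John" together even when the separator is a comma. A minimal sketch (not the current DeezyMatch parsing code):

```python
import csv
import io

def read_rows(text, csv_sep=","):
    """Parse delimited rows; quoted fields may contain the separator.

    Unlike a plain str.split(csv_sep), csv.reader keeps
    '"Smith, John"' together as a single field.
    """
    reader = csv.reader(io.StringIO(text), delimiter=csv_sep, quotechar='"')
    return [row for row in reader]

# The comma inside the quoted entity name is preserved.
rows = read_rows('"Smith, John",0,false\nTrawden,0,false\n')
```

This would let csv_sep default to tab while still supporting comma-separated files safely.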

Implement early stopping

Currently, we run training for the full number of epochs given in the input file. This can be wasteful when the model converges after a few epochs.
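The usual patience-based scheme could be sketched as follows; epoch_losses stands in for whatever per-epoch validation losses the training loop yields, and patience/min_delta are assumed new input-file options:

```python
def early_stop_epoch(epoch_losses, patience=3, min_delta=0.0):
    """Return the 1-based epoch at which training would stop.

    Stops when the validation loss has not improved by more than
    min_delta for `patience` consecutive epochs.
    """
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(epoch_losses, start=1):
        if loss < best - min_delta:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return len(epoch_losses)

# Loss plateaus after epoch 4, so training stops at epoch 7 (patience=3).
stop = early_stop_epoch([1.0, 0.6, 0.4, 0.35, 0.35, 0.35, 0.35])
```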

Ranking without prediction

One of the comments at EMNLP was to add ranking using only FAISS / cosine (and not prediction). This can significantly speed up the ranking.
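A sketch of ranking by vector similarity alone, with no classifier forward pass. FAISS with an inner-product index over L2-normalised vectors would give the same ordering at scale; plain NumPy is used here to keep the example self-contained:

```python
import numpy as np

def rank_by_cosine(query_vec, candidate_vecs, top_n=3):
    """Rank candidates by cosine similarity to the query.

    candidate_vecs: (n_candidates, dim) array of precomputed vectors.
    Returns indices of the top_n most similar candidates, best first.
    """
    q = query_vec / np.linalg.norm(query_vec)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    sims = c @ q
    return np.argsort(-sims)[:top_n]

# Toy 2-D vectors: candidate 0 matches the query, 2 is close, 1 is orthogonal.
query = np.array([1.0, 0.0])
candidates = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
order = rank_by_cosine(query, candidates)
```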

Fine-tuning trained models

Allow further tuning of an already trained model on a specific dataset. This should be an option of the main DeezyMatch code.

Add a catch for when -t is out of range 0-1 (for cosine and score)

To reproduce the current problem:
DeezyMatch --deezy_mode candidate_ranker -comb ./combined/test -rm cosine -t 5 -n 5 -sz 4 -o test_candidates_deezymatch -mp ./models/finetuned_test001/finetuned_test001.model -v ./models/finetuned_test001/finetuned_test001.vocab -tn 20

cosine requires -t to be between 0 and 1.

Additionally, we need a catch for when no candidate is found.
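A minimal sketch of the requested catch. The metric names follow the -rm flag in the command above; treating the prediction score as 0-1 bounded is an assumption here:

```python
def validate_threshold(ranking_metric, t):
    """Reject thresholds that can never match for bounded metrics.

    For cosine similarity (and a 0-1 prediction score) the threshold
    must lie in [0, 1]; e.g. `-t 5` with `-rm cosine` can never
    return a candidate, so fail early with a clear message.
    """
    if ranking_metric.lower() != "faiss" and not (0.0 <= t <= 1.0):
        raise ValueError(
            f"-t must be between 0 and 1 for {ranking_metric}, got {t}"
        )
    return t

validated = validate_threshold("cosine", 0.8)
```

A similar early check on an empty result set would cover the "no candidate found" case.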

candidateFinder outputs

@fedenanni wrote:

The fact that we return the values of "dl_match" and "cosine_sim" even if we use, for instance, "faiss_dist" to rank, I find a bit confusing (see the example Mariona shared on Slack). Maybe we could have a dictionary that associates ranking_metric.lower() with the correct dictionary to fill; we fill only that one, so the others would be empty.

This way we avoid confusing the user: if they selected "faiss", we return the ordered dict only for faiss.
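The suggestion above could be sketched like this; build_output is a hypothetical helper, and the column names follow the ones quoted in the comment:

```python
def build_output(ranking_metric, faiss_dist=None, cosine_sim=None,
                 dl_match=None):
    """Return only the column for the metric actually used to rank.

    Maps ranking_metric.lower() onto the corresponding result
    dictionary; the other metrics' columns are left empty.
    """
    key = ranking_metric.lower()
    if key == "faiss":
        key = "faiss_dist"
    results = {
        "faiss_dist": faiss_dist or {},
        "cosine_sim": cosine_sim or {},
        "dl_match": dl_match or {},
    }
    return {k: (v if k == key else {}) for k, v in results.items()}

out = build_output("faiss", faiss_dist={"a": 0.1}, cosine_sim={"a": 0.9})
```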

Make sure that the code is aligned with the input file parameters

  1. Use of device during model inference:
    • This is an example of command we use for modelInference.py:

      python ./modelInference.py -m ../models/wgboth/wgboth.model -v ../models/wgboth/wgboth.vocab -i ../models/wgboth/input_ft.yaml -d ./query_sets/BNA-FMP.txt -mode generate_vectors -qc q -sc bna-fmp_twenty
      

      Here, -i specifies the location of the input file, and -sc the path where we will store the vectors, which will be in inference_candidate_finder/candidates/ if -qc is c, or in inference_candidate_finder/queries/ if -qc is q.

      Through this command, we will be copying the input file to the output folder as well.

    • This is an example of command we use for combineVecs.py:

      python combineVecs.py -qc q -sc bna-fmp_twenty -p fwd -combs bna-fmp_twenty
      

      I think through this command we should also be copying the input file from the respective folder in inference_candidate_finder/queries/ or inference_candidate_finder/candidates/ (in this case bna-fmp_twenty), and copy it to the output directory where the combined vectors are stored (i.e. inference_candidate_finder/combined/bna-fmp_twenty, specified through -combs).

    • At the moment, if we want to use CPU, the device must be hard-coded, both in combineVecs.py and candidateFinder.py. Instead, the code should read the device from the input file (from the corresponding folder in inference_candidate_finder/queries/ or inference_candidate_finder/candidates/ for combineVecs.py, and from inference_candidate_finder/combined/ for candidateFinder.py).
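Reading the device from the copied input file could look roughly like this. A real implementation would load the full YAML file; this sketch just picks out a hypothetical `device:` key so the example stays dependency-free:

```python
def read_device(input_file_text, default="cpu"):
    """Pick up the compute device from a copied input file.

    A minimal parse of a `device: cuda` / `device: cpu` line,
    instead of hard-coding the device in combineVecs.py and
    candidateFinder.py.
    """
    for line in input_file_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop inline comments
        if line.startswith("device:"):
            return line.split(":", 1)[1].strip()
    return default

# Minimal stand-in for the copied input file's device setting.
sample = "general:\n  device: cuda\n"
device = read_device(sample)
```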

Update requirements.txt

pip install jupyter

and then

python -m ipykernel install --user --name py37deezy --display-name "Python (py37deezy)"

One flag to read model/vocab in FT

@fedenanni wrote:

DeezyMatch training would create a folder with model+vocab if I remember correctly. Maybe, in the future in another PR, we could have a distinction similar to the one we have with -f in fine-tuning, where the default is that you point to a folder with model and vocab, but you could also point to them with separate paths (but this would not be default).
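The single-flag behaviour could be resolved along these lines; resolve_model_paths is a hypothetical helper, and the `models/<name>/<name>.model|.vocab` layout is the one described in the comment above:

```python
import os
import tempfile

def resolve_model_paths(model_path, vocab_path=None):
    """Resolve model/vocab locations from one flag.

    If model_path is a directory (the default layout created by
    training), assume <name>.model and <name>.vocab live inside it;
    otherwise treat model_path/vocab_path as explicit files.
    """
    if os.path.isdir(model_path):
        name = os.path.basename(os.path.normpath(model_path))
        return (os.path.join(model_path, name + ".model"),
                os.path.join(model_path, name + ".vocab"))
    if vocab_path is None:
        raise ValueError("vocab path required when model path is a file")
    return model_path, vocab_path

# Directory layout as produced by training: models/<name>/<name>.model|.vocab
folder = os.path.join(tempfile.mkdtemp(), "test001")
os.makedirs(folder)
paths = resolve_model_paths(folder)
```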

Alias detection on the fly

@mcollardanuy wrote:

Specify a model and a candidate scenario (where candidate vectors are already generated); you can then take a query as input, and it outputs the aliases.

This includes:

  • Generate the vector query on the fly
  • Candidate queries are already there, and the model as well

Input file for inference/candidate ranking

@mcollardanuy wrote:

Currently DeezyMatch requires this input for the candidates and query files:

Mach Loop	0	false
North Wick	0	false
Trawden	0	false
Mugswell	0	false

You get an error if one of the columns is missing, even though the second and third columns are actually dummy columns in the inference step.

It would make it more user-friendly if the user could just input a file that has one string per line, as in:

Mach Loop
North Wick
Trawden
Mugswell
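One way to accept both formats is to pad bare one-column lines with the dummy label columns before parsing, so the rest of the pipeline sees the format it expects. A sketch (normalise_query_file is a hypothetical helper):

```python
def normalise_query_file(lines, csv_sep="\t"):
    """Accept either the full three-column format or one string per line.

    Rows that already have three fields pass through unchanged; bare
    strings get the dummy label columns ("0", "false") appended.
    """
    rows = []
    for line in lines:
        fields = line.rstrip("\n").split(csv_sep)
        if len(fields) == 1:
            fields = [fields[0], "0", "false"]
        rows.append(fields)
    return rows

rows = normalise_query_file(["Mach Loop", "Trawden\t0\tfalse"])
```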

Column 3 accepts (case-insensitive): [true, false, 0, 1]; extend this to other cases: "Correct", "Wrong"

Extend list of accepted values for positive matches.

Change data_processing.py (see in particular lines 37-43, but you may have to make other changes in subsequent lines) so it also accepts positive, negative, correct, and wrong:

for i in range(len(df_list)):
  tmp_split_row = df_list[i].split(csv_sep)
  if str(tmp_split_row[2]).strip().lower() not in ["true", "false", "1", "0"]:
    print(f"SKIP: {df_list[i]}")
    # change the label to remove_me,
    # we drop the rows with no true|false in the label column
    tmp_split_row = f"X{csv_sep}X{csv_sep}remove_me".split(csv_sep)
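The extended check could map each accepted spelling onto the existing true/false scheme, for example via a lookup table. A sketch (LABEL_MAP and normalise_label are hypothetical names, not existing DeezyMatch code):

```python
# Hypothetical mapping of extra label spellings onto the existing scheme.
LABEL_MAP = {
    "true": "true", "1": "true", "positive": "true", "correct": "true",
    "false": "false", "0": "false", "negative": "false", "wrong": "false",
}

def normalise_label(raw_label):
    """Return "true"/"false" for any accepted spelling, else None (skip row)."""
    return LABEL_MAP.get(str(raw_label).strip().lower())

label = normalise_label("Correct")
```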

Train/Valid/Test splits, revisit

Maybe we should write out test_dc (in CSV format) and use it in inference (as a separate step)? The user could also specify 0% for test_dc in case the dataset is already divided beforehand.
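The split with an optional empty test set could be sketched as follows (split_dataset and the proportion arguments are hypothetical; the actual input-file parameter names may differ):

```python
import random

def split_dataset(rows, train_prop=0.7, val_prop=0.2, test_prop=0.1, seed=42):
    """Shuffle and split rows into train/valid/test.

    With test_prop=0 (dataset already divided beforehand) the test
    split is empty, and nothing would be written out for inference.
    """
    assert abs(train_prop + val_prop + test_prop - 1.0) < 1e-9
    rows = rows[:]
    random.Random(seed).shuffle(rows)  # deterministic for a fixed seed
    n_train = int(len(rows) * train_prop)
    n_val = int(len(rows) * val_prop)
    return (rows[:n_train],
            rows[n_train:n_train + n_val],
            rows[n_train + n_val:])

train, valid, test = split_dataset(list(range(10)))
```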

Define word token separators in the input file

At the moment, when the "word" tokenization mode is selected in the input file, words are tokenized using the .split() function. It would be very useful to allow the user to specify which characters should be considered when tokenizing (e.g. "Brough-Ferry" is now tokenized as ["Brough-Ferry"], instead of ["Brough","Ferry"], which would be the case if "-" was specified as a word delimiter as well).

Allow the user to define word token separators in the input file.

You will need to change the following code or files:

  1. string_split function in utils.py (see here):

     # ------------------- string_split --------------------
     def string_split(x, tokenize=["char"], min_gram=1, max_gram=3):
       """
       Split a string using various methods.
       min_gram and max_gram are used only if "ngram" is in tokenize
       """
       tokenized_str = []
       if "char" in tokenize:
         tokenized_str += [sub_x for sub_x in x]

       if "ngram" in tokenize:
         for ngram in range(min_gram, max_gram+1):
           tokenized_str += [x[i:i+ngram] for i in range(len(x)-ngram+1)]

       if "word" in tokenize:
         tokenized_str += x.split()

       return tokenized_str
  2. data_processing.py, lines 105-113 (see here):

    cprint('[INFO]', bc.dgreen, "-- create vocabulary")
    dataset_split["s1_unicode"] = dataset_split["s1_unicode"].apply(lambda x: string_split(x, tokenize=mode["tokenize"], min_gram=mode["min_gram"], max_gram=mode["max_gram"]))
    dataset_split["s2_unicode"] = dataset_split["s2_unicode"].apply(lambda x: string_split(x, tokenize=mode["tokenize"], min_gram=mode["min_gram"], max_gram=mode["max_gram"]))
  3. The mode section in the input file (see here):

    mode:    # Tokenization mode
      # choices: "char", "ngram", "word"
      # for example: tokenize: ["char", "ngram", "word"] or ["char", "word"]
      tokenize: ["char"]
      # ONLY if "ngram" is selected in tokenize, the following args will be used:
      min_gram: 2
      max_gram: 3
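Putting the three changes together, the "word" branch of string_split could take user-defined separators; word_separators is the assumed new input-file option, with None preserving the current .split() behaviour:

```python
import re

def string_split(x, tokenize=["char"], min_gram=1, max_gram=3,
                 word_separators=None):
    """string_split extended with user-defined word separators (a sketch).

    word_separators, e.g. [" ", "-"], would come from a new option in
    the mode section of the input file.
    """
    tokenized_str = []
    if "char" in tokenize:
        tokenized_str += [sub_x for sub_x in x]
    if "ngram" in tokenize:
        for ngram in range(min_gram, max_gram + 1):
            tokenized_str += [x[i:i+ngram] for i in range(len(x)-ngram+1)]
    if "word" in tokenize:
        if word_separators:
            pattern = "|".join(re.escape(s) for s in word_separators)
            tokenized_str += [t for t in re.split(pattern, x) if t]
        else:
            tokenized_str += x.split()
    return tokenized_str

tokens = string_split("Brough-Ferry", tokenize=["word"],
                      word_separators=[" ", "-"])
```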

Freeze layers for fine-tuning

Currently, we do not freeze any layers. We need to:

  1. output all the layers to the user
  2. have an option in the input file to specify which layers should be frozen for fine-tuning.
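In PyTorch, freezing comes down to switching off requires_grad for the selected parameters. A sketch, assuming the layer names are matched against the ones printed to the user in step 1:

```python
import torch.nn as nn

def freeze_layers(model, layers_to_freeze):
    """Freeze named (sub)modules so fine-tuning only updates the rest.

    layers_to_freeze would come from the new input-file option,
    matched as prefixes against model.named_parameters() names.
    """
    for name, param in model.named_parameters():
        if any(name.startswith(layer) for layer in layers_to_freeze):
            param.requires_grad = False
    # Report what the optimizer will actually update.
    return [n for n, p in model.named_parameters() if p.requires_grad]

# A stand-in two-layer model; freeze the first layer ("0.*") only.
model = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 2))
trainable = freeze_layers(model, ["0"])
```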

Candidate finder

A trained model in DeezyMatch can detect whether two strings are similar. After this step, we need a candidate finder: given one input string and a list of strings, find the most relevant/similar strings in that list.

Branch: feature/2-candidate-finder
