living-with-machines / DeezyMatch
A Flexible Deep Learning Approach to Fuzzy String Matching
Home Page: https://living-with-machines.github.io/DeezyMatch/
License: Other
@mcollardanuy wrote:
Hi, I'm afraid in some scenarios this may potentially discard many rows if csv_sep is, for example, a comma, as it is not uncommon that the comma is part of an entity name (e.g. "Smith, John" if we try to link person names). Our solution at the moment is not sensitive to quoted text (the original code was sensitive to quoted text, but we had that strange parsing bug). That's why I was suggesting tab as the only accepted delimiter for now, because we'd rarely expect a tab to be part of a query or candidate. What do you think?
If it is an issue with tqdm, this could be the solution: https://stackoverflow.com/questions/41707229/tqdm-printing-to-newline
Use a dimensionality-reduction method, e.g., t-SNE, to visualize the outputs of the candidate finder.
Currently, we run the code for the full number of epochs specified in the input. This can be wasteful in cases where the model converges after a few epochs.
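A common remedy is early stopping: monitor the validation loss and stop once it has not improved for a set number of epochs. A minimal sketch (generic, not tied to DeezyMatch's trainer; all names here are illustrative):

```python
class EarlyStopper:
    """Stop training when the monitored loss has not improved
    for `patience` consecutive epochs (generic sketch)."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True to stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

The training loop would then break as soon as `step()` returns True, instead of always running to the configured epoch count.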
One of the comments at EMNLP was to add ranking using only FAISS / cosine (and not prediction). This can significantly speed up the ranking.
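Ranking by cosine similarity alone is cheap because it skips the classifier forward pass. A minimal pure-Python sketch of such a ranker (function and variable names are illustrative, not DeezyMatch's API):

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank_by_cosine(query_vec, candidate_vecs, num_candidates=5):
    """Rank candidates by cosine similarity only, without running
    the (more expensive) match-prediction step."""
    sims = [(name, cosine_sim(query_vec, vec))
            for name, vec in candidate_vecs.items()]
    sims.sort(key=lambda pair: pair[1], reverse=True)
    return sims[:num_candidates]
```

In practice the same idea is what a FAISS inner-product index does at scale; this sketch only shows the ranking logic.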
Currently, we do not plot MAP in log_plotter.
Ideas:
Allow further tuning of an already trained model on a specific dataset. It should be an option of the main DeezyMatch code.
To reproduce the current problem:
DeezyMatch --deezy_mode candidate_ranker -comb ./combined/test -rm cosine -t 5 -n 5 -sz 4 -o test_candidates_deezymatch -mp ./models/finetuned_test001/finetuned_test001.model -v ./models/finetuned_test001/finetuned_test001.vocab -tn 20
cosine needs -t between 0 and 1.
Additionally, we need to catch the case where no candidate is found.
@fedenanni wrote:
The fact that we return the values of "dl_match" and "cosine_sim" even if we use, for instance, "faiss_dist" to rank, I find a bit confusing (see the example Mariona shared on Slack). Maybe we could have a dictionary that associates ranking_metric.lower() with the correct dictionary to fill; we fill only that one, and the others would be empty.
This way we avoid confusing the user: if they selected "faiss", we return the ordered dict only for faiss.
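The suggestion above could look like this (a sketch; the key names and the empty-dict convention for unused metrics are assumptions):

```python
def build_output(ranking_metric, faiss_dist, cosine_sim, dl_match):
    """Fill only the output dict that matches the chosen ranking
    metric; the other metrics stay empty (hypothetical helper)."""
    outputs = {"faiss": {}, "cosine": {}, "conf": {}}
    computed = {"faiss": faiss_dist, "cosine": cosine_sim, "conf": dl_match}
    metric = ranking_metric.lower()
    if metric not in computed:
        raise ValueError(f"unsupported ranking metric: {ranking_metric}")
    outputs[metric] = computed[metric]
    return outputs
```

With this, a user who ranked by "faiss" gets an ordered dict only for faiss, and empty dicts for the metrics that were never computed.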
This is an example of a command we use for modelInference.py:
python ./modelInference.py -m ../models/wgboth/wgboth.model -v ../models/wgboth/wgboth.vocab -i ../models/wgboth/input_ft.yaml -d ./query_sets/BNA-FMP.txt -mode generate_vectors -qc q -sc bna-fmp_twenty
Here, -i specifies the location of the input file, and -sc the path where we will store the vectors, which will be in inference_candidate_finder/candidates/ if -qc is c, or in inference_candidate_finder/queries/ if -qc is q. Through this command, we also copy the input file to the output folder.
This is an example of a command we use for combineVecs.py:
python combineVecs.py -qc q -sc bna-fmp_twenty -p fwd -combs bna-fmp_twenty
I think through this command we should also copy the input file from the respective folder in inference_candidate_finder/queries/ or inference_candidate_finder/candidates/ (in this case bna-fmp_twenty) to the output directory where the combined vectors are stored (i.e. inference_candidate_finder/combined/bna-fmp_twenty, specified through -combs).
At the moment, if we want to use CPU, the device must be hard-coded, both in combineVecs.py and candidateFinder.py. Instead, each script should read the device from the input file (from the corresponding folder in inference_candidate_finder/queries/ or inference_candidate_finder/candidates/ for combineVecs.py, and from inference_candidate_finder/combined/ for candidateFinder.py).
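Reading the device from the parsed input file could be as simple as the following sketch (the `device` key, the `"auto"` default, and the function name are assumptions, not DeezyMatch's current config schema):

```python
def resolve_device(input_config, cuda_available):
    """Pick the compute device from the parsed input file instead
    of hard-coding it. `input_config` is a dict parsed from the
    YAML input file; `cuda_available` would come from
    torch.cuda.is_available() in the real scripts."""
    requested = input_config.get("device", "auto")
    if requested == "auto":
        return "cuda" if cuda_available else "cpu"
    return requested
```

Both combineVecs.py and candidateFinder.py could then call this once at startup, each pointing at the input file copied into its respective folder.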
Currently, we do the model inference separately (using modelInference.py), while training and fine-tuning are done in DeezyMatch. We need to merge these two files so we have one access point for the three main tasks.
pip install jupyter
and then
python -m ipykernel install --user --name py37deezy --display-name "Python (py37deezy)"
@fedenanni wrote:
DeezyMatch training would create a folder with model+vocab if I remember correctly. Maybe, in the future in another PR, we could have a distinction similar to the one we have with -f in fine-tuning, where the default is that you point to a folder with model and vocab, but you could also point to them with separate paths (but this would not be default).
Replace cosine_similarity with cosine_distance = 1 - cosine_similarity, and confidence with 1 - confidence. This way, 0.0 will be the best match in all supported metrics.
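The proposed convention can be sketched as a small normalization helper (metric names here follow the ones mentioned in these notes; the helper itself is hypothetical):

```python
def to_distance(metric_name, value):
    """Normalize similarity-style scores to distances so that 0.0
    is always the best match across all supported metrics."""
    if metric_name == "cosine_similarity":
        return 1.0 - value   # cosine_distance
    if metric_name == "confidence":
        return 1.0 - value   # high confidence -> small distance
    # faiss distances are already distances: smaller is better
    return value
```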
It will be interesting to see how triplet loss performs when training the feature vectors from the fully connected layers directly, instead of using a sigmoid unit.
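For reference, the triplet margin loss on raw feature vectors is max(0, d(a, p) - d(a, n) + margin): it pulls an anchor toward a positive example and pushes it away from a negative one. A minimal pure-Python sketch (in practice this would be torch.nn.TripletMarginLoss over batches):

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet margin loss: zero once the negative is at least
    `margin` farther from the anchor than the positive is."""
    return max(0.0, euclidean(anchor, positive)
                    - euclidean(anchor, negative) + margin)
```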
@mcollardanuy wrote:
Specify a model and a candidate scenario (where candidate vectors are already generated); then you can take a query as input, and it outputs the aliases.
This includes:
@mcollardanuy wrote:
Currently DeezyMatch requires this input for the candidates and query files:
Mach Loop 0 false
North Wick 0 false
Trawden 0 false
Mugswell 0 false
You get an error if one of the columns is missing, but the second and third columns are actually dummy columns in the inference step.
It would be more user-friendly if the user could just input a file that has one string per line, as in:
Mach Loop
North Wick
Trawden
Mugswell
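One way to support both formats at inference time is to pad the dummy columns when a line contains only a string. A sketch (the function name and the dummy values "0"/"false" mirror the example above, but the exact values DeezyMatch expects should be checked):

```python
def normalize_inference_row(line, sep="\t"):
    """Accept either the full three-column format or a bare string
    per line, padding the dummy label columns for inference."""
    parts = line.rstrip("\n").split(sep)
    if len(parts) == 1:
        # single-string input: pad the dummy columns expected downstream
        parts = [parts[0], "0", "false"]
    return parts
```

With this, both "Mach Loop" and "Mach Loop\t0\tfalse" parse to the same three columns, so the inference code paths stay unchanged.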
Extend the list of accepted values for positive matches. Change data_processing.py (see in particular lines 37-43, but you may have to make other changes in subsequent lines) so it also accepts positive, negative, correct, and wrong:
for i in range(len(df_list)):
    tmp_split_row = df_list[i].split(csv_sep)
    if str(tmp_split_row[2]).strip().lower() not in ["true", "false", "1", "0"]:
        print(f"SKIP: {df_list[i]}")
        # change the label to remove_me,
        # we drop the rows with no true|false in the label column
        tmp_split_row = f"X{csv_sep}X{csv_sep}remove_me".split(csv_sep)
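The extension could look like this sketch (not the actual patch; the helper name and the mapping to "true"/"false" are assumptions about how downstream code consumes the label):

```python
# accepted spellings for each class, all compared lowercased
TRUE_SET = {"true", "1", "positive", "correct"}
FALSE_SET = {"false", "0", "negative", "wrong"}

def normalize_label(raw_label):
    """Map any accepted label spelling to 'true'/'false';
    return None for unknown labels so the row can be dropped."""
    label = str(raw_label).strip().lower()
    if label in TRUE_SET:
        return "true"
    if label in FALSE_SET:
        return "false"
    return None
```

The loop above would then call normalize_label on tmp_split_row[2] and only mark the row remove_me when it returns None.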
Maybe we should write out test_dc (in CSV format) and use it in inference (as a separate step)? The user could also specify 0% for test_dc in case the dataset is already divided beforehand.
At the moment, when the "word" tokenization mode is selected in the input file, words are tokenized using the .split() function. It would be very useful to allow the user to specify which characters should be considered when tokenizing (e.g. "Brough-Ferry" is currently tokenized as ["Brough-Ferry"] instead of ["Brough", "Ferry"], which would be the case if "-" were specified as a word delimiter as well).
Allow the user to define word token separators in the input file.
You will need to change the following code or files: the string_split function in utils.py (see here):
# ------------------- string_split --------------------
def string_split(x, tokenize=["char"], min_gram=1, max_gram=3):
    """
    Split a string using various methods.
    min_gram and max_gram are used only if "ngram" is in tokenize
    """
    tokenized_str = []
    if "char" in tokenize:
        tokenized_str += [sub_x for sub_x in x]
    if "ngram" in tokenize:
        for ngram in range(min_gram, max_gram+1):
            tokenized_str += [x[i:i+ngram] for i in range(len(x)-ngram+1)]
    if "word" in tokenize:
        tokenized_str += x.split()
    return tokenized_str
data_processing.py, lines 105-113 (see here):
cprint('[INFO]', bc.dgreen, "-- create vocabulary")
dataset_split["s1_unicode"] = dataset_split["s1_unicode"].apply(lambda x: string_split(x, tokenize=mode["tokenize"], min_gram=mode["min_gram"], max_gram=mode["max_gram"]))
dataset_split["s2_unicode"] = dataset_split["s2_unicode"].apply(lambda x: string_split(x, tokenize=mode["tokenize"], min_gram=mode["min_gram"], max_gram=mode["max_gram"]))
The mode section in the input file (see here):
mode: # Tokenization mode
  # choices: "char", "ngram", "word"
  # for example: tokenize: ["char", "ngram", "word"] or ["char", "word"]
  tokenize: ["char"]
  # ONLY if "ngram" is selected in tokenize, the following args will be used:
  min_gram: 2
  max_gram: 3
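The requested word-delimiter feature could be sketched as follows (the `word_seps` parameter name and the idea of exposing it as an input-file option are assumptions):

```python
import re

def word_tokenize(x, word_seps=" -"):
    """Split on any of the user-specified separator characters
    instead of only whitespace, as .split() does today."""
    pattern = "[" + re.escape(word_seps) + "]+"
    # drop empty tokens produced by leading/trailing separators
    return [tok for tok in re.split(pattern, x) if tok]
```

Inside string_split, the "word" branch would then call word_tokenize(x, mode["word_seps"]) instead of x.split(), with " " as the default so existing configs behave unchanged.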
Currently, we do not freeze any layers. We need to:
A trained DeezyMatch model can detect whether two strings are similar or not. After this step, we need a component that, given one input string and a list of strings, returns the most relevant/similar strings in that list. We call this the candidate finder.
Branch: feature/2-candidate-finder