living-with-machines / DeezyMatch
A Flexible Deep Learning Approach to Fuzzy String Matching
Home Page: https://living-with-machines.github.io/DeezyMatch/
License: Other
@mcollardanuy wrote:
Hi, I'm afraid in some scenarios this may potentially discard many rows if csv_sep is, for example, a comma, as it is not uncommon that the comma is part of an entity name (e.g. "Smith, John" if we try to link person names). Our solution at the moment is not sensitive to quoted text (the original code was sensitive to quoted text, but we had that strange parsing bug). That's why I was suggesting tab as the only accepted delimiter for now, because we'd rarely expect a tab to be part of a query or candidate. What do you think?
If it is an issue with tqdm, this could be the solution: https://stackoverflow.com/questions/41707229/tqdm-printing-to-newline
Use a dimensionality-reduction method, e.g., t-SNE, to visualize the outputs of the candidate finder.
Currently, we run the code for the full number of epochs specified in the input. This can be wasteful in cases where the model converges after a few epochs.
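A common remedy is early stopping: monitor the validation loss and stop once it has not improved for a set number of epochs. A minimal sketch (generic, not tied to DeezyMatch's trainer; all names here are illustrative):

```python
class EarlyStopper:
    """Stop training when the monitored loss has not improved
    for `patience` consecutive epochs (generic sketch)."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True to stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

The training loop would then break as soon as `step()` returns True, instead of always running to the configured epoch count.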
One of the comments at EMNLP was to add ranking using only FAISS / cosine (and not prediction). This can significantly speed up the ranking.
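Ranking by cosine similarity alone is cheap because it skips the classifier forward pass. A minimal pure-Python sketch of such a ranker (function and variable names are illustrative, not DeezyMatch's API):

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank_by_cosine(query_vec, candidate_vecs, num_candidates=5):
    """Rank candidates by cosine similarity only, without running
    the (more expensive) match-prediction step."""
    sims = [(name, cosine_sim(query_vec, vec))
            for name, vec in candidate_vecs.items()]
    sims.sort(key=lambda pair: pair[1], reverse=True)
    return sims[:num_candidates]
```

In practice the same idea is what a FAISS inner-product index does at scale; this sketch only shows the ranking logic.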
Currently, we do not plot MAP in log_plotter.
Ideas:
Allow further tuning of an already trained model on a specific dataset. It should be an option of the main DeezyMatch code.
To reproduce the current problem:
DeezyMatch --deezy_mode candidate_ranker -comb ./combined/test -rm cosine -t 5 -n 5 -sz 4 -o test_candidates_deezymatch -mp ./models/finetuned_test001/finetuned_test001.model -v ./models/finetuned_test001/finetuned_test001.vocab -tn 20
cosine needs -t between 0 and 1.
Additionally, we need to catch the case where no candidate is found.
@fedenanni wrote:
The fact that we return the values of "dl_match" and "cosine_sim" even if we use, for instance, "faiss_dist" to rank, I find a bit confusing (see the example Mariona shared on Slack). Maybe we could have a dictionary that associates ranking_metric.lower() with the correct dictionary to fill; we fill only that one, and the others would be empty.
This way we avoid confusing the user: if they selected "faiss", we return the ordered dict only for faiss.
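The suggestion above could look like this (a sketch; the key names and the empty-dict convention for unused metrics are assumptions):

```python
def build_output(ranking_metric, faiss_dist, cosine_sim, dl_match):
    """Fill only the output dict that matches the chosen ranking
    metric; the other metrics stay empty (hypothetical helper)."""
    outputs = {"faiss": {}, "cosine": {}, "conf": {}}
    computed = {"faiss": faiss_dist, "cosine": cosine_sim, "conf": dl_match}
    metric = ranking_metric.lower()
    if metric not in computed:
        raise ValueError(f"unsupported ranking metric: {ranking_metric}")
    outputs[metric] = computed[metric]
    return outputs
```

With this, a user who ranked by "faiss" gets an ordered dict only for faiss, and empty dicts for the metrics that were never computed.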
This is an example of a command we use for modelInference.py:
python ./modelInference.py -m ../models/wgboth/wgboth.model -v ../models/wgboth/wgboth.vocab -i ../models/wgboth/input_ft.yaml -d ./query_sets/BNA-FMP.txt -mode generate_vectors -qc q -sc bna-fmp_twenty
Here, -i specifies the location of the input file, and -sc the path where we will store the vectors, which will be in inference_candidate_finder/candidates/ if -qc is c, or in inference_candidate_finder/queries/ if -qc is q. Through this command, we also copy the input file to the output folder.
This is an example of a command we use for combineVecs.py:
python combineVecs.py -qc q -sc bna-fmp_twenty -p fwd -combs bna-fmp_twenty
I think through this command we should also copy the input file from the respective folder in inference_candidate_finder/queries/ or inference_candidate_finder/candidates/ (in this case bna-fmp_twenty) to the output directory where the combined vectors are stored (i.e. inference_candidate_finder/combined/bna-fmp_twenty, specified through -combs).
At the moment, if we want to use CPU, the device must be hard-coded, both in combineVecs.py and candidateFinder.py. Instead, each script should read the device from the input file (from the corresponding folder in inference_candidate_finder/queries/ or inference_candidate_finder/candidates/ for combineVecs.py, and from inference_candidate_finder/combined/ for candidateFinder.py).
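Reading the device from the parsed input file could be as simple as the following sketch (the `device` key, the `"auto"` default, and the function name are assumptions, not DeezyMatch's current config schema):

```python
def resolve_device(input_config, cuda_available):
    """Pick the compute device from the parsed input file instead
    of hard-coding it. `input_config` is a dict parsed from the
    YAML input file; `cuda_available` would come from
    torch.cuda.is_available() in the real scripts."""
    requested = input_config.get("device", "auto")
    if requested == "auto":
        return "cuda" if cuda_available else "cpu"
    return requested
```

Both combineVecs.py and candidateFinder.py could then call this once at startup, each pointing at the input file copied into its respective folder.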
Currently, we do the model inference separately (using modelInference.py), while training and fine-tuning are done in DeezyMatch. We need to merge these two files so we have one access point for the three main tasks.
pip install jupyter
and then
python -m ipykernel install --user --name py37deezy --display-name "Python (py37deezy)"
@fedenanni wrote:
DeezyMatch training would create a folder with model+vocab if I remember correctly. Maybe, in the future in another PR, we could have a distinction similar to the one we have with -f in fine-tuning, where the default is that you point to a folder with model and vocab, but you could also point to them with separate paths (but this would not be default).
Replace cosine_similarity with cosine_distance = 1 - cosine_similarity, and confidence with 1 - confidence. This way, 0.0 will be the best match in all supported metrics.
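The proposed convention can be sketched as a small normalization helper (metric names here follow the ones mentioned in these notes; the helper itself is hypothetical):

```python
def to_distance(metric_name, value):
    """Normalize similarity-style scores to distances so that 0.0
    is always the best match across all supported metrics."""
    if metric_name == "cosine_similarity":
        return 1.0 - value   # cosine_distance
    if metric_name == "confidence":
        return 1.0 - value   # high confidence -> small distance
    # faiss distances are already distances: smaller is better
    return value
```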
It will be interesting to see how triplet loss performs when training the feature vectors from the fully connected layers directly, instead of using a sigmoid unit.
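For reference, the triplet margin loss on raw feature vectors is max(0, d(a, p) - d(a, n) + margin): it pulls an anchor toward a positive example and pushes it away from a negative one. A minimal pure-Python sketch (in practice this would be torch.nn.TripletMarginLoss over batches):

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet margin loss: zero once the negative is at least
    `margin` farther from the anchor than the positive is."""
    return max(0.0, euclidean(anchor, positive)
                    - euclidean(anchor, negative) + margin)
```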
@mcollardanuy wrote:
Specify a model and a candidate scenario (where candidate vectors are already generated); then you can take a query as input, and it outputs the aliases.
This includes:
@mcollardanuy wrote:
Currently DeezyMatch requires this input for the candidates and query files:
Mach Loop 0 false
North Wick 0 false
Trawden 0 false
Mugswell 0 false
You get an error if one of the columns is missing, but the second and third columns are actually dummy columns in the inference step.
It would be more user-friendly if the user could just input a file that has one string per line, as in:
Mach Loop
North Wick
Trawden
Mugswell
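One way to support both formats at inference time is to pad the dummy columns when a line contains only a string. A sketch (the function name and the dummy values "0"/"false" mirror the example above, but the exact values DeezyMatch expects should be checked):

```python
def normalize_inference_row(line, sep="\t"):
    """Accept either the full three-column format or a bare string
    per line, padding the dummy label columns for inference."""
    parts = line.rstrip("\n").split(sep)
    if len(parts) == 1:
        # single-string input: pad the dummy columns expected downstream
        parts = [parts[0], "0", "false"]
    return parts
```

With this, both "Mach Loop" and "Mach Loop\t0\tfalse" parse to the same three columns, so the inference code paths stay unchanged.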
Extend the list of accepted values for positive matches. Change data_processing.py (see in particular lines 37-43, but you may have to make other changes in subsequent lines) so it also accepts positive, negative, correct, and wrong:
for i in range(len(df_list)):
    tmp_split_row = df_list[i].split(csv_sep)
    if str(tmp_split_row[2]).strip().lower() not in ["true", "false", "1", "0"]:
        print(f"SKIP: {df_list[i]}")
        # change the label to remove_me,
        # we drop the rows with no true|false in the label column
        tmp_split_row = f"X{csv_sep}X{csv_sep}remove_me".split(csv_sep)
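The extension could look like this sketch (not the actual patch; the helper name and the mapping to "true"/"false" are assumptions about how downstream code consumes the label):

```python
# accepted spellings for each class, all compared lowercased
TRUE_SET = {"true", "1", "positive", "correct"}
FALSE_SET = {"false", "0", "negative", "wrong"}

def normalize_label(raw_label):
    """Map any accepted label spelling to 'true'/'false';
    return None for unknown labels so the row can be dropped."""
    label = str(raw_label).strip().lower()
    if label in TRUE_SET:
        return "true"
    if label in FALSE_SET:
        return "false"
    return None
```

The loop above would then call normalize_label on tmp_split_row[2] and only mark the row remove_me when it returns None.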
Maybe we should write out test_dc (in CSV format) and use it in inference (as a separate step)? The user could also specify 0% for test_dc in case the dataset is already divided beforehand.
At the moment, when the "word" tokenization mode is selected in the input file, words are tokenized using the .split() function. It would be very useful to allow the user to specify which characters should be considered when tokenizing (e.g. "Brough-Ferry" is currently tokenized as ["Brough-Ferry"] instead of ["Brough", "Ferry"], which would be the case if "-" were specified as a word delimiter as well).
Allow the user to define word token separators in the input file.
You will need to change the following code or files: the string_split function in utils.py (see here):
# ------------------- string_split --------------------
def string_split(x, tokenize=["char"], min_gram=1, max_gram=3):
    """
    Split a string using various methods.
    min_gram and max_gram are used only if "ngram" is in tokenize
    """
    tokenized_str = []
    if "char" in tokenize:
        tokenized_str += [sub_x for sub_x in x]
    if "ngram" in tokenize:
        for ngram in range(min_gram, max_gram+1):
            tokenized_str += [x[i:i+ngram] for i in range(len(x)-ngram+1)]
    if "word" in tokenize:
        tokenized_str += x.split()
    return tokenized_str
data_processing.py, lines 105-113 (see here):
cprint('[INFO]', bc.dgreen, "-- create vocabulary")
dataset_split["s1_unicode"] = dataset_split["s1_unicode"].apply(lambda x: string_split(x, tokenize=mode["tokenize"], min_gram=mode["min_gram"], max_gram=mode["max_gram"]))
dataset_split["s2_unicode"] = dataset_split["s2_unicode"].apply(lambda x: string_split(x, tokenize=mode["tokenize"], min_gram=mode["min_gram"], max_gram=mode["max_gram"]))
The mode section in the input file (see here):
mode: # Tokenization mode
  # choices: "char", "ngram", "word"
  # for example: tokenize: ["char", "ngram", "word"] or ["char", "word"]
  tokenize: ["char"]
  # ONLY if "ngram" is selected in tokenize, the following args will be used:
  min_gram: 2
  max_gram: 3
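The requested word-delimiter feature could be sketched as follows (the `word_seps` parameter name and the idea of exposing it as an input-file option are assumptions):

```python
import re

def word_tokenize(x, word_seps=" -"):
    """Split on any of the user-specified separator characters
    instead of only whitespace, as .split() does today."""
    pattern = "[" + re.escape(word_seps) + "]+"
    # drop empty tokens produced by leading/trailing separators
    return [tok for tok in re.split(pattern, x) if tok]
```

Inside string_split, the "word" branch would then call word_tokenize(x, mode["word_seps"]) instead of x.split(), with " " as the default so existing configs behave unchanged.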
Currently, we do not freeze any layers. We need to:
A trained DeezyMatch model can detect whether two strings are similar or not. After this step, we need a component that, given one input string and a list of strings, returns the most relevant/similar strings in that list. We call this the candidate finder.
Branch: feature/2-candidate-finder