Comments (9)
Hi and thanks so much @YuhengHuang42 for reporting these. We will address both issues/suggestions by this Friday. (And sorry for the delay, I just got back from holiday).
from deezymatch.
Hi @YuhengHuang42, thank you again for reporting the issue with n-grams and for your suggestions. It took us a bit longer than planned, but we just released v1.3.0 in which:
- BUGFIX: the issue with n-grams (#109)
- To generate vectors, we don't need three-column inputs anymore. We can have one column (or three columns for backward compatibility)
- Define word token separators in the input file (#78)
- Prefix/suffix parameter moved as part of the mode, not preprocessing, as it applied to subword tokenization
- Add specific datasets for each DeezyMatch functionality + Edit the README file.
- normalizeString and string_split functions are reviewed
- Improve documentation
- Several tests are added
@YuhengHuang42 Could you please take a look at the new version and let us know if there are any other issues? Thank you!
Hi, thanks for following up on this issue. I have tested the code on my server, and everything seems to work now, except for two problems:
- In the config file (.yaml):
gru_lstm:
    main_architecture: "gru" # rnn, gru, lstm
    mode: # Tokenization mode
        token_sep: "default"
        prefix_suffix: ["|", "|"]
We need to add these two lines; otherwise, there might be a KeyError. But I think this is expected behavior.
- For one_column_inp:
It seems that, for now, the one-column insertion is done in the DeezyMatch/data_processing.py file. This is done by:
tmp_split_row.insert(1, "tmp")
However, there might be some special cases where "t", "m", and "p" are not in the vocabulary, so there might still be some problems in the end. But if the target NLP task is in English, I think this is also OK.
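The placeholder concern can be illustrated with a minimal sketch. Only the `insert(1, "tmp")` call comes from this thread; the function names `pad_one_column_row` and `missing_chars` and the two toy vocabularies are hypothetical and are not part of DeezyMatch.

```python
def pad_one_column_row(row, placeholder="tmp"):
    """Insert a placeholder second column into a one-column row.

    Mirrors the insert(1, "tmp") call quoted above. Rows that already
    have more than one column are returned unchanged (as a copy).
    """
    padded = list(row)
    if len(padded) == 1:
        padded.insert(1, placeholder)
    return padded


def missing_chars(text, vocabulary):
    """Return the characters of `text` that are absent from `vocabulary`."""
    return [ch for ch in text if ch not in vocabulary]


# Toy character vocabularies (assumptions, for illustration only).
latin_vocab = set("abcdefghijklmnopqrstuvwxyz ")
greek_vocab = set("αβγδεζηθικλμνξοπρστυφχψω ")

row = pad_one_column_row(["αθηνα"])
print(row)                                  # the row with "tmp" inserted
print(missing_chars(row[1], latin_vocab))   # nothing missing for Latin text
print(missing_chars(row[1], greek_vocab))   # "t", "m", "p" are all missing

# One possible workaround (an assumption, not necessarily what the
# maintainers chose): reuse the query itself as the placeholder, so the
# placeholder never introduces characters the query does not contain.
safe_row = pad_one_column_row(["αθηνα"], placeholder="αθηνα")
print(missing_chars(safe_row[1], greek_vocab))
```

With a Latin vocabulary the hardcoded placeholder is harmless, which matches the remark that English tasks are unaffected; with a non-Latin vocabulary every placeholder character is out of vocabulary.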
Hi, thanks for your quick test and for your comments.
We need to add these two lines, otherwise, there might be KeyError. But I think this is expected behavior.
- That is correct. We changed the input file, and we now expect those two lines in the gru_lstm section. Some example input files are here: https://github.com/Living-with-machines/DeezyMatch/tree/master/inputs
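For anyone upgrading an older input file, a small pre-flight check along the following lines could surface the missing keys before a KeyError does. This is only a sketch: it assumes the two lines in question are `token_sep` and `prefix_suffix` nested under `gru_lstm` / `mode`, it uses a plain dict to stand in for the parsed YAML file, and `missing_mode_keys` is a hypothetical helper, not a DeezyMatch function.

```python
# Keys assumed to be required under gru_lstm -> mode since v1.3.0.
REQUIRED_MODE_KEYS = ("token_sep", "prefix_suffix")


def missing_mode_keys(config):
    """Return the required keys absent from config["gru_lstm"]["mode"].

    Tolerates a missing or empty (None) section, which is what an empty
    YAML mapping such as "mode:" parses to.
    """
    gru_lstm = config.get("gru_lstm") or {}
    mode = gru_lstm.get("mode") or {}
    return [key for key in REQUIRED_MODE_KEYS if key not in mode]


# A config written before the change: no keys under "mode".
old_style = {"gru_lstm": {"main_architecture": "gru", "mode": None}}

# A config with the two lines added.
new_style = {
    "gru_lstm": {
        "main_architecture": "gru",
        "mode": {"token_sep": "default", "prefix_suffix": ["|", "|"]},
    }
}

print(missing_mode_keys(old_style))  # both keys reported as missing
print(missing_mode_keys(new_style))  # nothing missing
```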
there might be some special cases that "t", "m", "p" are not in the vocabulary. So, in the end, there might still be some problems. But if the target NLP task is in English, I think this is also OK.
- Correct, and thank you for spotting this. We need to change this as we are testing DeezyMatch on non-Latin-alphabet corpora. I will make a PR soon.
@YuhengHuang42 What do you think about this solution: #118
Looks good to me :)
Solved in v1.3.1. I am closing this, but of course, please feel free to re-open it or open a new issue if needed.