Hi, first of all, thanks for your great work! During using this tool

Test Tokenizer does not handle n-gram about deezymatch HOT 9 CLOSED

YuhengHuang42 commented on July 25, 2024

Test Tokenizer does not handle n-gram

from deezymatch.

Comments (9)

kasra-hosseini commented on July 25, 2024

Hi and thanks so much @YuhengHuang42 for reporting these. We will address both issues/suggestions by this Friday. (And sorry for the delay, I just got back from holiday).


kasra-hosseini commented on July 25, 2024

Hi @YuhengHuang42, thank you again for reporting the issue with n-grams and for your suggestions. It took us a bit longer than planned, but we just released v1.3.0, in which:

BUGFIX: the issue with n-grams (#109)
To generate vectors, we don't need three-column inputs anymore. We can have one column (or three columns for backward compatibility)
Define word token separators in the input file (#78)
Prefix/suffix parameter moved to the mode section instead of preprocessing, as it applies to subword tokenization
Add specific datasets for each DeezyMatch functionality + Edit the README file.
normalizeString and string_split functions are reviewed
Improve documentation
Several tests are added


kasra-hosseini commented on July 25, 2024

@YuhengHuang42 Could you please take a look at the new version and let us know if there are any other issues? Thank you!


YuhengHuang42 commented on July 25, 2024

@YuhengHuang42 Could you please take a look at the new version and let us know if there are any other issues? Thank you!

Hi, thanks for following up on this issue. I have tested the code on my server, and everything seems to work now, except for two problems:

In the config file (.yaml):

gru_lstm:
  main_architecture: "gru"    # rnn, gru, lstm
  mode:    # Tokenization mode
    token_sep: "default"
    prefix_suffix: ["|", "|"]

We need to add these two lines; otherwise, there might be a KeyError. But I think this is expected behavior.
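To illustrate the point above, here is a minimal sketch of a pre-flight check on an already-parsed config. The helper `missing_mode_keys` and the constant `REQUIRED_MODE_KEYS` are hypothetical names for illustration only, not part of DeezyMatch; the key names are taken from the YAML snippet above.

```python
# Hypothetical helper (not a DeezyMatch function): report which of the
# tokenization-mode keys shown in the YAML snippet are absent from a config
# that has already been parsed into a plain dict.
REQUIRED_MODE_KEYS = ("token_sep", "prefix_suffix")


def missing_mode_keys(config):
    """Return the required keys absent from config["gru_lstm"]["mode"]."""
    # "or {}" also covers a bare "mode:" line, which parses to None.
    mode = config.get("gru_lstm", {}).get("mode") or {}
    return [key for key in REQUIRED_MODE_KEYS if key not in mode]


# A config without the mode block, and one that includes it.
old_style = {"gru_lstm": {"main_architecture": "gru"}}
new_style = {
    "gru_lstm": {
        "main_architecture": "gru",
        "mode": {"token_sep": "default", "prefix_suffix": ["|", "|"]},
    }
}

print(missing_mode_keys(old_style))  # ['token_sep', 'prefix_suffix']
print(missing_mode_keys(new_style))  # []
```

Running a check like this before training would turn a bare KeyError into a message that names the missing keys.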

For one_column_inp:

It seems that, for now, the one-column insertion is done in the DeezyMatch/data_processing.py file. This is done by:

tmp_split_row.insert(1, "tmp")

However, there might be special cases where "t", "m", and "p" are not in the vocabulary. So, in the end, there might still be some problems. But if the target NLP task is in English, I think this is also OK.
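The concern can be reproduced with a small sketch. Only the `insert(1, "tmp")` call comes from the line quoted above; the functions `pad_one_column_row` and `chars_outside_vocab`, the tab separator, and the Greek vocabulary are assumptions made for illustration and do not reflect the actual DeezyMatch code.

```python
# Illustrative sketch only: the helper names, the tab separator and the
# sample vocabulary are assumptions, not taken from DeezyMatch.
def pad_one_column_row(row, placeholder="tmp", sep="\t"):
    """Split a row; if it has a single column, insert a placeholder second column."""
    columns = row.rstrip("\n").split(sep)
    if len(columns) == 1:
        columns.insert(1, placeholder)  # mirrors tmp_split_row.insert(1, "tmp")
    return columns


def chars_outside_vocab(text, vocab):
    """Return the sorted characters of text that the vocabulary does not contain."""
    return sorted(set(text) - set(vocab))


# A vocabulary built from a non-Latin alphabet (lowercase Greek letters).
greek_vocab = set("αβγδεζηθικλμνξοπρστυφχψω")

columns = pad_one_column_row("αθηνα\n")
print(columns)                                       # ['αθηνα', 'tmp']
print(chars_outside_vocab(columns[1], greek_vocab))  # ['m', 'p', 't']
```

One possible workaround, which is not necessarily the one the maintainers chose, is to build the placeholder from characters that are known to be in the vocabulary.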


kasra-hosseini commented on July 25, 2024

Hi, thanks for your quick test and for your comments.

We need to add these two lines; otherwise, there might be a KeyError. But I think this is expected behavior.

That is correct. We changed the input file format, and we now expect those two lines in the gru_lstm section. Some example input files are here: https://github.com/Living-with-machines/DeezyMatch/tree/master/inputs

there might be special cases where "t", "m", and "p" are not in the vocabulary. So, in the end, there might still be some problems. But if the target NLP task is in English, I think this is also OK.

Correct, and thank you for spotting this. We need to change this, as we are testing DeezyMatch on corpora in non-Latin alphabets. I will make a PR soon.


kasra-hosseini commented on July 25, 2024

@YuhengHuang42 What do you think about this solution: #118


YuhengHuang42 commented on July 25, 2024

@YuhengHuang42 What do you think about this solution: #118

Looks good to me :)


kasra-hosseini commented on July 25, 2024

👍 Great. We will do some more tests today and will merge the PR.


kasra-hosseini commented on July 25, 2024

Solved in v1.3.1. I am closing this, but of course, please feel free to re-open it or open a new issue if needed.

