Hello, Could you please tell me, given a corpus, how do we decide on the number of

Number of merge operations about subword-nmt HOT 2 CLOSED

rsennrich commented on May 18, 2024

Comments (2)

rsennrich commented on May 18, 2024

The empirically best number of merge operations likely depends on your dataset and language pair. We have some numbers by @bhaddow on WMT 2018 datasets for EN<->CS,ET,FI, and BLEU differences between 30,000 and 90,000 merge operations are small (although the effect on rare words might be larger), so 30,000 is a good starting point. If your dataset is relatively small (less than 1 million sentence pairs), I'd recommend that you also try out a smaller number of merge operations, since the model is unlikely to learn a useful representation of a subword that only occurs a few times.
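To make the point about rarely seen subwords concrete, here is a minimal toy sketch of BPE merging in plain Python. It is not subword-nmt's implementation, and the tiny corpus and the "occurs at most twice" threshold are assumptions chosen only for illustration. It counts how many subword types remain rare as the number of merge operations grows.

```python
from collections import Counter


def learn_and_segment(words, num_merges):
    """Toy BPE: apply up to num_merges merges, return subword frequencies."""
    # Each word type becomes a tuple of characters plus an end-of-word marker.
    vocab = {tuple(w) + ("</w>",): f for w, f in Counter(words).items()}
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merged = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        vocab = merged
    # Frequency of every subword in the segmented corpus.
    counts = Counter()
    for symbols, freq in vocab.items():
        for s in symbols:
            counts[s] += freq
    return counts


# Hypothetical toy corpus; real experiments would use the training data.
corpus = "low low low lower lower newest newest newest widest".split()
for n in (0, 5, 20):
    counts = learn_and_segment(corpus, n)
    rare = sum(1 for c in counts.values() if c <= 2)
    print(f"merges={n} subword types={len(counts)} rare types={rare}")
```

On real data the same kind of count, run over the output of the actual BPE tool, can help judge whether a given number of merge operations leaves too many subwords with only a handful of occurrences.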

If you have related languages, it is preferable to learn BPE units jointly on the source and target language (see https://github.com/rsennrich/subword-nmt#best-practice-advice-for-byte-pair-encoding-in-nmt). For language pairs such as English-Chinese, the number of operations need not be the same for the two languages.
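As a rough sketch of the two setups, the snippet below only builds and prints the command lines; it does not run them. The file names and the 20,000 figure are hypothetical, and the option names reflect the subword-nmt README as best understood here, so it is worth checking `subword-nmt learn-bpe --help` against the installed version.

```python
import shlex


def joint_bpe_command(src_path, tgt_path, num_merges, codes_path="codes.joint"):
    """Joint BPE for related languages: one set of merges learned on both sides."""
    return [
        "subword-nmt", "learn-joint-bpe-and-vocab",
        "--input", src_path, tgt_path,
        "-s", str(num_merges),
        "-o", codes_path,
        # Hypothetical output names for the per-language vocabularies.
        "--write-vocabulary", src_path + ".vocab", tgt_path + ".vocab",
    ]


def separate_bpe_commands(src_path, tgt_path, src_merges, tgt_merges):
    """Separate BPE per language; the two merge counts may differ."""
    return [
        ["subword-nmt", "learn-bpe", "-s", str(n),
         "--input", path, "--output", path + ".codes"]
        for path, n in ((src_path, src_merges), (tgt_path, tgt_merges))
    ]


# Related pair (EN-CS): joint codes with the suggested starting point of 30,000.
print(shlex.join(joint_bpe_command("train.en", "train.cs", 30000)))

# Unrelated pair (EN-ZH): separate codes; 20,000 is an arbitrary illustrative value.
for cmd in separate_bpe_commands("train.en", "train.zh", 30000, 20000):
    print(shlex.join(cmd))
```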

Anupama94 commented on May 18, 2024

Thank you for your prompt, detailed answer. Yes, my dataset is much smaller. So basically it is a value that is decided empirically, isn't it? I'll try different values. Thank you.
