When I try to decode a file that was not part of the training/testing set, the following


Issue when trying to decode a file that was not part of the training about tensor2tensor HOT 6 CLOSED

tensorflow commented on May 13, 2024

Issue when trying to decode a file that was not part of the training

from tensor2tensor.

Comments (6)

mehmedes commented on May 13, 2024

Now, I got it. The problem was that the new file to be translated had not been preprocessed with the .bpe file used for training. After running the .bpe over it, it's working!
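One way to catch this kind of mismatch before decoding is to scan the new file for tokens that are missing from the training vocabulary. The following is a minimal sketch, assuming whitespace-separated tokens and a one-token-per-line vocabulary; `load_vocab` and `find_oov_tokens` are hypothetical helpers, not part of tensor2tensor:

```python
from collections import Counter


def load_vocab(lines):
    """Builds a set of known tokens from one-token-per-line strings."""
    return {line.strip() for line in lines if line.strip()}


def find_oov_tokens(sentences, vocab):
    """Counts whitespace-separated tokens that are missing from `vocab`."""
    oov = Counter()
    for sentence in sentences:
        for tok in sentence.strip().split():
            if tok not in vocab:
                oov[tok] += 1
    return oov


# Toy vocabulary (assumed); in practice this would be read from the vocab file.
vocab = load_vocab(["the", "trans@@", "lation", "UNK"])

# The second sentence was not segmented with the training BPE codes,
# so "translation" shows up as an out-of-vocabulary token.
report = find_oov_tokens(["the trans@@ lation", "the translation"], vocab)
print(dict(report))
```

A non-empty report suggests the file was not preprocessed the same way as the training data.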


lukaszkaiser commented on May 13, 2024

Thanks for finding the fix!


cshanbo commented on May 13, 2024

Hi @lukaszkaiser, @mehmedes
I ran into this "issue" too. I understand that it is not a bug, because you may be able to guarantee that the processed training data contains no OOV words. But I think it would be more flexible to do one of the following (as I said in this issue):

A more detailed README, so that users can run experiments without triggering this exception.
Or a small change to the code, like:

def encode(self, sentence):
    """Converts a space-separated string of tokens to a list of ids."""
    # Map out-of-vocabulary tokens to the 'UNK' id instead of raising KeyError.
    unk_id = self._token_to_id['UNK']
    ret = [self._token_to_id.get(tok, unk_id) for tok in sentence.strip().split()]
    return ret[::-1] if self._reverse else ret

to avoid this exception.
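To try the proposed fallback in isolation, here is a self-contained sketch. `SimpleTokenEncoder` is a hypothetical stand-in written for illustration, not the tensor2tensor encoder class; it assumes the unknown token is named `UNK`, as in the snippet above, and adds it to the vocabulary if it is missing:

```python
class SimpleTokenEncoder:
    """Hypothetical vocabulary-based encoder with an UNK fallback."""

    def __init__(self, tokens, unk_token="UNK", reverse=False):
        self._token_to_id = {tok: i for i, tok in enumerate(tokens)}
        # Reserve an id for the unknown token if the vocabulary lacks one.
        if unk_token not in self._token_to_id:
            self._token_to_id[unk_token] = len(self._token_to_id)
        self._unk_id = self._token_to_id[unk_token]
        self._reverse = reverse

    def encode(self, sentence):
        """Converts a space-separated string of tokens to a list of ids."""
        ids = [self._token_to_id.get(tok, self._unk_id)
               for tok in sentence.strip().split()]
        return ids[::-1] if self._reverse else ids


encoder = SimpleTokenEncoder(["hello", "world"])
print(encoder.encode("hello unseen world"))  # [0, 2, 1]
```

One trade-off of this design: mapping unknown tokens to `UNK` silently can hide a preprocessing mismatch like the one reported at the top of this thread, so logging the OOV tokens may still be worthwhile.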

Thank you so much


lukaszkaiser commented on May 13, 2024

Is this BPE-only or does it happen with tokens_32k too? We would prefer not to build any special support for external BPE as it's hard to maintain (needs perl scripts) and isn't invertible.


cshanbo commented on May 13, 2024

Hi @lukaszkaiser
Thank you.
If I run the built-in problems such as tokens_32k or bpe_32k, everything works without errors.

What I want to say is that if a user wants to use his/her own vocabulary, and the training data contains OOVs, it might raise this exception.

I haven't tried running the tokens_32k experiment with my own vocabulary, but I think the same problem might occur there, too. I'll try it and let you know if I make any progress.


lukaszkaiser commented on May 13, 2024

We try to include every character in the vocabulary in "tokens_XX" -- so hopefully it should handle at least all words composed of characters that appear in the training data. Please reopen if you see the problem again!
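The effect of keeping every training character in the vocabulary can be illustrated with a greedy longest-match segmenter. This is a simplified sketch, not tensor2tensor's actual subword encoder; `segment` and the toy vocabulary are assumptions made for illustration:

```python
def segment(word, vocab):
    """Greedily splits `word` into the longest pieces found in `vocab`.

    Raises ValueError if a character is not covered by any vocabulary entry.
    """
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, then shrink one character at a time.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            raise ValueError("character %r is not in the vocabulary" % word[start])
    return pieces


training_words = ["decode", "decoder", "file"]
# Whole words plus every single character seen in the training data.
vocab = set(training_words) | {ch for w in training_words for ch in w}

print(segment("decoded", vocab))  # ['decode', 'd']
print(segment("field", vocab))    # ['f', 'i', 'e', 'l', 'd']
```

Unseen words still get an encoding as long as all of their characters occurred in training; a word with a new character (for example `"zip"` here) would still fail.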
