
Comments (16)

vthorsteinsson commented on May 12, 2024

@rsepassi The intent of the PR was precisely (a) to avoid the danger of SubwordTextEncoder producing wordpieces that are split in the middle of UTF-8 byte sequences, and (b) to make the tokenizer and the vocabulary work correctly and efficiently with all Unicode punctuation and separators, for instance the different kinds of double quotes and em dashes. The implementation guarantees that wordpieces are always split on whole Unicode characters. It also ensures that all characters seen by the tokenizer during the vocabulary-building phase are included in the vocabulary as single-character subtokens, so all text using that alphabet is guaranteed to be representable and encodable, with invertibility. The tradeoff on invertibility is that Unicode characters that do NOT appear during the vocabulary-building phase, but are subsequently seen in training*) or inference data, are mapped to the Unicode REPLACEMENT CHARACTER. If this happens, it may be an indicator that the vocabulary is being built from too small a subset of the training data.

All that being said, it may well make sense to have two SubwordTextEncoders, one Latin-1/UTF-8 friendly and ALWAYS invertible, and another one that is Unicode friendly but may not be invertible in borderline cases.

*) Note that currently T2T builds subword vocabularies using only a part of the training data files; it cuts off after a certain number of bytes have been read from each file.
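
To make the fallback behaviour described above concrete, here is a minimal sketch (not the actual tensor2tensor code) of what happens when an out-of-alphabet character is encoded:

    REPLACEMENT = u'\ufffd'  # the Unicode REPLACEMENT CHARACTER

    def encode_chars(text, alphabet):
        # Characters seen while building the vocabulary encode to themselves;
        # anything else maps to U+FFFD, so exact invertibility is lost.
        return [c if c in alphabet else REPLACEMENT for c in text]

    alphabet = set(u'abcdefghijklmnopqrstuvwxyz ')
    print(u''.join(encode_chars(u'hola se\xf1or', alphabet)))
    # -> u'hola se\ufffdor'  (the 'ñ' was never seen while the alphabet was built)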

vthorsteinsson commented on May 12, 2024

In fact, it might be possible to get the advantages of (b) as described above even in a SubwordTextEncoder that yields UTF-8 (and decodes it back). If the tokenizer works in Unicode but the wordpieces are converted to UTF-8 after generation, and all byte values 0..255 are included in the vocabulary, that could be a good compromise. Hmm...
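
A rough sketch of that compromise (an illustration only, not the eventual implementation): tokenize in Unicode, store each wordpiece as UTF-8 bytes, and fall back to single-byte subtokens, all 256 of which are assumed to be vocabulary entries, for anything unknown:

    def byte_fallback(wordpiece, vocab):
        # Wordpieces are generated by a Unicode-aware tokenizer but stored
        # as UTF-8 byte strings.
        encoded = wordpiece.encode('utf-8')
        if encoded in vocab:
            return [encoded]
        # Unknown wordpiece: fall back to its raw bytes. Because every byte
        # value 0..255 is assumed to be in the vocabulary, any input remains
        # representable, and concatenating the bytes decodes back to the
        # original text.
        return [encoded[i:i + 1] for i in range(len(encoded))]

    vocab = {b'hello_', b'wor', b'ld_'}
    print(byte_fallback(u'h\xe9llo_', vocab))
    # -> [b'h', b'\xc3', b'\xa9', b'l', b'l', b'o', b'_']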

lukaszkaiser commented on May 12, 2024

I like that idea more than having two encoders. Having two means people will mistakenly use one instead of the other, and a lot of chaos follows. If it's possible to have one that's always right, 256 extra symbols in the token vocabulary seem a small price to pay, I think.

vthorsteinsson commented on May 12, 2024

OK, I'll have a look at it and report back!

nshazeer commented on May 12, 2024

vthorsteinsson commented on May 12, 2024

@nshazeer The vocabulary is already sorted by decreasing frequency, except that the single-character subtokens always come at the end (and UNICODE_REPLACEMENT_CHARACTER always last).

Sounds to me like a bytes-based UTF-8 encoder/decoder on top of a Unicode-aware tokenizer would be simpler than what you propose, but let's see how it goes.
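
If it helps, the ordering described above amounts to something like the following sort key (the counts here are made up, purely to illustrate):

    REPLACEMENT = u'\ufffd'

    def vocab_order(counts):
        # Multi-character subtokens by decreasing frequency, then the
        # single-character subtokens, with the replacement character last.
        def key(item):
            subtoken, count = item
            return (subtoken == REPLACEMENT, len(subtoken) == 1, -count)
        return [s for s, _ in sorted(counts.items(), key=key)]

    counts = {u'the_': 900, u'ing_': 700, u'a': 500, u'x': 3, REPLACEMENT: 1}
    print(vocab_order(counts))
    # -> [u'the_', u'ing_', u'a', u'x', u'\ufffd']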

vthorsteinsson commented on May 12, 2024

@rsepassi But just so we are clear, the problem with the pure bytes-based UTF-8 encoding as it was originally implemented is illustrated here (around line 360 in text_encoder.py):

        for start in starts:
          for end in xrange(start + 1, len(escaped_token) + 1):
            subtoken_string = escaped_token[start:end]
            counts[subtoken_string] += count

The code proceeds byte by byte, adding subtoken strings of increasing length to the counts dictionary. This is done without considering embedded UTF-8 sequences, so subtokens are created that end in partial UTF-8 sequences. A UTF-8 sequence may be up to 4 bytes long.
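
A small illustration of that failure mode (not code from the repository): slicing a UTF-8 byte string at arbitrary byte offsets produces fragments that are not valid UTF-8 on their own:

    token = u'd\xeda'.encode('utf-8')   # u'día': the 'í' occupies two bytes
    for end in range(1, len(token) + 1):
        fragment = token[:end]
        try:
            fragment.decode('utf-8')
            status = 'valid UTF-8'
        except UnicodeDecodeError:
            status = 'trailing partial UTF-8 sequence'
        print(repr(fragment), status)
    # b'd'          valid UTF-8
    # b'd\xc3'      trailing partial UTF-8 sequence  <- split inside a character
    # b'd\xc3\xad'  valid UTF-8
    # b'd\xc3\xada' valid UTF-8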

lukaszkaiser commented on May 12, 2024

I pushed out a correction; hopefully it's OK now (at head and in 1.0.10). I somehow couldn't update PyPI yet, but everywhere else it should be right now :). Take a look.

vthorsteinsson commented on May 12, 2024

Read through it. Looks very good! I like the three-way split of 1) proper subword tokens (len > 1); 2) in-alphabet Unicode characters (len = 1); and 3) out-of-alphabet Unicode characters that are reversibly transcoded. Only one remaining thought: the _unescape_token() function could possibly be made more efficient by special-casing the very rare event of an out-of-alphabet Unicode character being present in the string.
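
That speed-up could look roughly like this (a sketch only; '\' is assumed here to be the escape marker, which may not match the actual scheme in text_encoder.py):

    def unescape_token_fast(escaped_token, full_unescape):
        # Common case: no escape marker present, so no out-of-alphabet
        # characters were transcoded and the token can be returned as-is.
        if u'\\' not in escaped_token:
            return escaped_token
        # Rare case: fall back to the full character-by-character parser.
        return full_unescape(escaped_token)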

stefan-it commented on May 12, 2024

FYI: With my own training data, the following error message appears:

INFO:tensorflow:Iteration 3
INFO:tensorflow:vocab_size = 6101
[452, 2264, 6100, 51, 1128, 1787, 27, 44, 3, 1141, 6000, 1196, 6001, 974, 6013, 6010, 3232, 711, 1234, 4]
[u'This_', u'sentence', u'\ufffd', u'was_', u'enc', u'od', u'ed_', u'by_', u'the_', u'Su', u'b', u'wor', u'd', u'Te', u'x', u't', u'En', u'co', u'der_', u'._']
This sentence�was encoded by the SubwordTextEncoder.
Traceback (most recent call last):
  File "/tmp/anaconda2/envs/t2t2/bin/t2t-datagen", line 378, in <module>
    tf.app.run()
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/tmp/anaconda2/envs/t2t2/bin/t2t-datagen", line 361, in main
    training_gen(), FLAGS.problem + UNSHUFFLED_SUFFIX + "-train",
  File "/tmp/anaconda2/envs/t2t2/bin/t2t-datagen", line 151, in <lambda>
    lambda: wmt.ende_wordpiece_token_generator(FLAGS.tmp_dir, True, 2**15),
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/wmt.py", line 221, in ende_wordpiece_token_generator
    tmp_dir, "tokens.vocab.%d" % vocab_size, vocab_size)
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/generator_utils.py", line 243, in get_or_generate_vocab
    vocab_size, tokenizer.token_counts, 1, 1e3)
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/text_encoder.py", line 329, in build_to_target_size
    return bisect(min_val, max_val)
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/text_encoder.py", line 323, in bisect
    other_subtokenizer = bisect(min_val, present_count - 1)
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/text_encoder.py", line 323, in bisect
    other_subtokenizer = bisect(min_val, present_count - 1)
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/text_encoder.py", line 323, in bisect
    other_subtokenizer = bisect(min_val, present_count - 1)
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/text_encoder.py", line 323, in bisect
    other_subtokenizer = bisect(min_val, present_count - 1)
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/text_encoder.py", line 323, in bisect
    other_subtokenizer = bisect(min_val, present_count - 1)
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/text_encoder.py", line 323, in bisect
    other_subtokenizer = bisect(min_val, present_count - 1)
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/text_encoder.py", line 317, in bisect
    present_count, num_iterations)
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/text_encoder.py", line 412, in build_from_token_counts
    assert decoded == original
AssertionError

Switching back to 1.0.6 solves it :)

nshazeer commented on May 12, 2024

stefan-it commented on May 12, 2024

I used 1.0.8, but I'll try 1.0.10 now!

stefan-it commented on May 12, 2024

With the latest version from git (e4fe66c) I get the following error message:

/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/tokenizer.py:82: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  is_alnum = [c in self._ALPHANUMERIC_CHAR_SET for c in text]
/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/tokenizer.py:86: UnicodeWarning: Unicode unequal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  if token != u" " or token_start == 0:
INFO:tensorflow:Reading file: train.en
INFO:tensorflow:Trying min_count 500
INFO:tensorflow:Iteration 0
Traceback (most recent call last):
  File "/tmp/anaconda2/envs/t2t2/bin/t2t-datagen", line 378, in <module>
    tf.app.run()
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/tmp/anaconda2/envs/t2t2/bin/t2t-datagen", line 361, in main
    training_gen(), FLAGS.problem + UNSHUFFLED_SUFFIX + "-train",
  File "/tmp/anaconda2/envs/t2t2/bin/t2t-datagen", line 151, in <lambda>
    lambda: wmt.ende_wordpiece_token_generator(FLAGS.tmp_dir, True, 2**15),
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/wmt.py", line 221, in ende_wordpiece_token_generator
    tmp_dir, "tokens.vocab.%d" % vocab_size, vocab_size)
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/generator_utils.py", line 241, in get_or_generate_vocab
    vocab_size, tokenizer.token_counts, 1, 1e3)
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/text_encoder.py", line 343, in build_to_target_size
    return bisect(min_val, max_val)
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/text_encoder.py", line 329, in bisect
    present_count, num_iterations)
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/text_encoder.py", line 377, in build_from_token_counts
    escaped_token = self._escape_token(token)
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/text_encoder.py", line 476, in _escape_token
    ret += c
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)
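
For what it's worth, that traceback looks like the classic Python 2 str/unicode coercion failure: a byte string containing non-ASCII UTF-8 bytes reaches a code path that expects unicode, and the implicit ascii decode fails (the UnicodeWarnings above point the same way). A minimal Python 2 reproduction, assuming the corpus lines reach the tokenizer undecoded; this is a guess at the cause, not a verified diagnosis:

    raw = '\xd0\x94\xd0\xbe\xd0\xbc'   # UTF-8 bytes for u'Дом', as read from a file
    accum = u''                        # a unicode accumulator, as in _escape_token

    try:
        accum += raw                   # Python 2 coerces raw via the ascii codec
    except UnicodeDecodeError as e:
        print(e)                       # 'ascii' codec can't decode byte 0xd0 in position 0 ...

    accum += raw.decode('utf-8')       # decoding the input first avoids the coercion
    print(accum)                       # u'Дом'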

nshazeer commented on May 12, 2024

lukaszkaiser commented on May 12, 2024

This is hopefully corrected in 1.0.11; please give it a try. I'm closing this for now; please reopen if you still see the issue.

Danysolism commented on May 12, 2024

Hi,
I'm using SubwordTextEncoder and I'm running into problems when I want to use t2t-decoder. I get the following error:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xa3 in position 0: invalid start byte

I've already checked and I have the latest version of tensor2tensor. I was wondering if I'm missing something, or if I should take something else into account when defining the problem to avoid getting a Unicode error.

Thank you,
Daniela
