
Comments (16)

vthorsteinsson commented on May 12, 2024

@rsepassi The intent of the PR was precisely (a) to avoid the danger of SubwordTextEncoder producing wordpieces that are split in the middle of UTF-8 byte sequences, and (b) to make the tokenizer and the vocabulary work correctly and efficiently with all Unicode punctuation and separators, for instance the different kinds of double quotes and em dashes. The implementation guarantees that wordpieces are always split on whole Unicode characters. It also ensures that all characters seen by the tokenizer during the vocabulary-building phase are included in the vocabulary as single-character subtokens, so all text using that alphabet is guaranteed to be representable and encodable, with invertibility. The tradeoff on invertibility is that Unicode characters that do NOT appear during the vocabulary-building phase, but are subsequently seen in training*) or inference data, are mapped to the Unicode REPLACEMENT CHARACTER. If this happens, it may be an indicator that the vocabulary is being built from too small a subset of the training data.

All that being said, it may well make sense to have two SubwordTextEncoders, one Latin-1/UTF-8 friendly and ALWAYS invertible, and another one that is Unicode friendly but may not be invertible in borderline cases.

*) Note that currently T2T builds subword vocabularies using only a part of the training data files; it cuts off after a certain number of bytes have been read from each file.
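
To make the fallback behaviour described above concrete, here is a minimal sketch (not the actual tensor2tensor code) of what happens when an out-of-alphabet character is encoded:

    REPLACEMENT = u'\ufffd'  # the Unicode REPLACEMENT CHARACTER

    def encode_chars(text, alphabet):
        # Characters seen while building the vocabulary encode to themselves;
        # anything else maps to U+FFFD, so exact invertibility is lost.
        return [c if c in alphabet else REPLACEMENT for c in text]

    alphabet = set(u'abcdefghijklmnopqrstuvwxyz ')
    print(u''.join(encode_chars(u'hola se\xf1or', alphabet)))
    # -> u'hola se\ufffdor'  (the 'ñ' was never seen while the alphabet was built)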

vthorsteinsson commented on May 12, 2024

In fact, it might be possible to get the advantages of (b) as described above even in a SubwordTextEncoder that yields UTF-8 (and decodes it back). If the tokenizer works in Unicode but the wordpieces are converted to UTF-8 after generation, and all byte values 0..255 are included in the vocabulary, that could be a good compromise. Hmm...
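
A rough sketch of that compromise (an illustration only, not the eventual implementation): tokenize in Unicode, store each wordpiece as UTF-8 bytes, and fall back to single-byte subtokens, all 256 of which are assumed to be vocabulary entries, for anything unknown:

    def byte_fallback(wordpiece, vocab):
        # Wordpieces are generated by a Unicode-aware tokenizer but stored
        # as UTF-8 byte strings.
        encoded = wordpiece.encode('utf-8')
        if encoded in vocab:
            return [encoded]
        # Unknown wordpiece: fall back to its raw bytes. Because every byte
        # value 0..255 is assumed to be in the vocabulary, any input remains
        # representable, and concatenating the bytes decodes back to the
        # original text.
        return [encoded[i:i + 1] for i in range(len(encoded))]

    vocab = {b'hello_', b'wor', b'ld_'}
    print(byte_fallback(u'h\xe9llo_', vocab))
    # -> [b'h', b'\xc3', b'\xa9', b'l', b'l', b'o', b'_']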

lukaszkaiser commented on May 12, 2024

I like that idea more than having two encoders. Having two means people will mistakenly use one instead of the other, and a lot of chaos follows. If it's possible to have one that's always right, 256 extra symbols in the token vocabulary seem a small price to pay, I think.

vthorsteinsson commented on May 12, 2024

OK, I'll have a look at it and report back!

nshazeer commented on May 12, 2024

vthorsteinsson commented on May 12, 2024

@nshazeer The vocabulary is already sorted by decreasing frequency, except that the single-character subtokens always come at the end (and UNICODE_REPLACEMENT_CHARACTER always last).

Sounds to me like a bytes-based UTF-8 encoder/decoder on top of a Unicode-aware tokenizer would be simpler than what you propose, but let's see how it goes.
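
If it helps, the ordering described above amounts to something like the following sort key (the counts here are made up, purely to illustrate):

    REPLACEMENT = u'\ufffd'

    def vocab_order(counts):
        # Multi-character subtokens by decreasing frequency, then the
        # single-character subtokens, with the replacement character last.
        def key(item):
            subtoken, count = item
            return (subtoken == REPLACEMENT, len(subtoken) == 1, -count)
        return [s for s, _ in sorted(counts.items(), key=key)]

    counts = {u'the_': 900, u'ing_': 700, u'a': 500, u'x': 3, REPLACEMENT: 1}
    print(vocab_order(counts))
    # -> [u'the_', u'ing_', u'a', u'x', u'\ufffd']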

vthorsteinsson commented on May 12, 2024

@rsepassi But just so we are clear, the problem with the pure bytes-based UTF-8 encoding as it was originally implemented is illustrated here (around line 360 in text_encoder.py):

        for start in starts:
          for end in xrange(start + 1, len(escaped_token) + 1):
            subtoken_string = escaped_token[start:end]
            counts[subtoken_string] += count

The code proceeds byte by byte, adding subtoken strings of increasing length to the counts dictionary. This is done without considering embedded UTF-8 sequences, so subtokens are created that end in partial UTF-8 sequences. A UTF-8 sequence may be up to 4 bytes long.
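
A small illustration of that failure mode (not code from the repository): slicing a UTF-8 byte string at arbitrary byte offsets produces fragments that are not valid UTF-8 on their own:

    token = u'd\xeda'.encode('utf-8')   # u'día': the 'í' occupies two bytes
    for end in range(1, len(token) + 1):
        fragment = token[:end]
        try:
            fragment.decode('utf-8')
            status = 'valid UTF-8'
        except UnicodeDecodeError:
            status = 'trailing partial UTF-8 sequence'
        print(repr(fragment), status)
    # b'd'          valid UTF-8
    # b'd\xc3'      trailing partial UTF-8 sequence  <- split inside a character
    # b'd\xc3\xad'  valid UTF-8
    # b'd\xc3\xada' valid UTF-8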

lukaszkaiser commented on May 12, 2024

I pushed out a correction; hopefully it's OK now (at head and in 1.0.10). I somehow couldn't update PyPI yet, but everywhere else it should be right now :). Take a look.

vthorsteinsson commented on May 12, 2024

Read through it. Looks very good! I like the three-way split of 1) proper subword tokens (len > 1); 2) in-alphabet Unicode characters (len = 1); and 3) out-of-alphabet Unicode characters that are reversibly transcoded. Only one remaining thought: the _unescape_token() function could possibly be made more efficient by special-casing the very rare event of an out-of-alphabet Unicode character being present in the string.
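
That speed-up could look roughly like this (a sketch only; '\' is assumed here to be the escape marker, which may not match the actual scheme in text_encoder.py):

    def unescape_token_fast(escaped_token, full_unescape):
        # Common case: no escape marker present, so no out-of-alphabet
        # characters were transcoded and the token can be returned as-is.
        if u'\\' not in escaped_token:
            return escaped_token
        # Rare case: fall back to the full character-by-character parser.
        return full_unescape(escaped_token)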

stefan-it commented on May 12, 2024

FYI: With my own training data, the following error message appears:

INFO:tensorflow:Iteration 3
INFO:tensorflow:vocab_size = 6101
[452, 2264, 6100, 51, 1128, 1787, 27, 44, 3, 1141, 6000, 1196, 6001, 974, 6013, 6010, 3232, 711, 1234, 4]
[u'This_', u'sentence', u'\ufffd', u'was_', u'enc', u'od', u'ed_', u'by_', u'the_', u'Su', u'b', u'wor', u'd', u'Te', u'x', u't', u'En', u'co', u'der_', u'._']
This sentence�was encoded by the SubwordTextEncoder.
Traceback (most recent call last):
  File "/tmp/anaconda2/envs/t2t2/bin/t2t-datagen", line 378, in <module>
    tf.app.run()
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/tmp/anaconda2/envs/t2t2/bin/t2t-datagen", line 361, in main
    training_gen(), FLAGS.problem + UNSHUFFLED_SUFFIX + "-train",
  File "/tmp/anaconda2/envs/t2t2/bin/t2t-datagen", line 151, in <lambda>
    lambda: wmt.ende_wordpiece_token_generator(FLAGS.tmp_dir, True, 2**15),
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/wmt.py", line 221, in ende_wordpiece_token_generator
    tmp_dir, "tokens.vocab.%d" % vocab_size, vocab_size)
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/generator_utils.py", line 243, in get_or_generate_vocab
    vocab_size, tokenizer.token_counts, 1, 1e3)
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/text_encoder.py", line 329, in build_to_target_size
    return bisect(min_val, max_val)
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/text_encoder.py", line 323, in bisect
    other_subtokenizer = bisect(min_val, present_count - 1)
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/text_encoder.py", line 323, in bisect
    other_subtokenizer = bisect(min_val, present_count - 1)
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/text_encoder.py", line 323, in bisect
    other_subtokenizer = bisect(min_val, present_count - 1)
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/text_encoder.py", line 323, in bisect
    other_subtokenizer = bisect(min_val, present_count - 1)
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/text_encoder.py", line 323, in bisect
    other_subtokenizer = bisect(min_val, present_count - 1)
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/text_encoder.py", line 323, in bisect
    other_subtokenizer = bisect(min_val, present_count - 1)
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/text_encoder.py", line 317, in bisect
    present_count, num_iterations)
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/text_encoder.py", line 412, in build_from_token_counts
    assert decoded == original
AssertionError

Switching back to 1.0.6 solves it :)

nshazeer commented on May 12, 2024

stefan-it commented on May 12, 2024

I used 1.0.8, but I'll try 1.0.10 now!

stefan-it commented on May 12, 2024

With the latest version from git (e4fe66c) I get the following error message:

/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/tokenizer.py:82: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  is_alnum = [c in self._ALPHANUMERIC_CHAR_SET for c in text]
/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/tokenizer.py:86: UnicodeWarning: Unicode unequal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  if token != u" " or token_start == 0:
INFO:tensorflow:Reading file: train.en
INFO:tensorflow:Trying min_count 500
INFO:tensorflow:Iteration 0
Traceback (most recent call last):
  File "/tmp/anaconda2/envs/t2t2/bin/t2t-datagen", line 378, in <module>
    tf.app.run()
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/tmp/anaconda2/envs/t2t2/bin/t2t-datagen", line 361, in main
    training_gen(), FLAGS.problem + UNSHUFFLED_SUFFIX + "-train",
  File "/tmp/anaconda2/envs/t2t2/bin/t2t-datagen", line 151, in <lambda>
    lambda: wmt.ende_wordpiece_token_generator(FLAGS.tmp_dir, True, 2**15),
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/wmt.py", line 221, in ende_wordpiece_token_generator
    tmp_dir, "tokens.vocab.%d" % vocab_size, vocab_size)
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/generator_utils.py", line 241, in get_or_generate_vocab
    vocab_size, tokenizer.token_counts, 1, 1e3)
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/text_encoder.py", line 343, in build_to_target_size
    return bisect(min_val, max_val)
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/text_encoder.py", line 329, in bisect
    present_count, num_iterations)
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/text_encoder.py", line 377, in build_from_token_counts
    escaped_token = self._escape_token(token)
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/text_encoder.py", line 476, in _escape_token
    ret += c
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)
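
For what it's worth, that traceback looks like the classic Python 2 str/unicode coercion failure: a byte string containing non-ASCII UTF-8 bytes reaches a code path that expects unicode, and the implicit ascii decode fails (the UnicodeWarnings above point the same way). A minimal Python 2 reproduction, assuming the corpus lines reach the tokenizer undecoded; this is a guess at the cause, not a verified diagnosis:

    raw = '\xd0\x94\xd0\xbe\xd0\xbc'   # UTF-8 bytes for u'Дом', as read from a file
    accum = u''                        # a unicode accumulator, as in _escape_token

    try:
        accum += raw                   # Python 2 coerces raw via the ascii codec
    except UnicodeDecodeError as e:
        print(e)                       # 'ascii' codec can't decode byte 0xd0 in position 0 ...

    accum += raw.decode('utf-8')       # decoding the input first avoids the coercion
    print(accum)                       # u'Дом'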

nshazeer commented on May 12, 2024

lukaszkaiser commented on May 12, 2024

This is hopefully corrected in 1.0.11; please give it a try. I'm closing this for now; please reopen if you still see the issue.

Danysolism commented on May 12, 2024

Hi,
I'm using SubwordTextEncoder and I'm running into problems when I want to use t2t-decoder. I get the following error:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xa3 in position 0: invalid start byte

I've already checked and I have the latest version of tensor2tensor. I was wondering if I'm missing something, or if I should take something else into account when defining the problem to avoid getting a Unicode error.

Thank you,
Daniela
