Comments (16)
@rsepassi The intent of the PR was precisely (a) to avoid the danger of SubwordTextEncoder producing wordpieces that are split in the middle of UTF-8 byte sequences, and (b) to make the tokenizer and the vocabulary work correctly and efficiently with all Unicode punctuation and separators, for instance the different types of double quotes and em dashes. The implementation guarantees that wordpieces are always split on whole Unicode characters. It also ensures that all characters seen by the tokenizer in the vocabulary-building phase are included in the vocabulary as single-character subtokens, so all text using that alphabet is guaranteed to be representable and encodable, with invertibility. The tradeoff on invertibility is that Unicode characters that do NOT appear in the vocabulary-building phase, but are subsequently seen in training*) or inference data, are mapped to the Unicode REPLACEMENT_CHARACTER. If this happens, it may be an indicator that the vocabulary is being built from too small a subset of the training data.
All that being said, it may well make sense to have two SubwordTextEncoders, one Latin-1/UTF-8 friendly and ALWAYS invertible, and another one that is Unicode friendly but may not be invertible in borderline cases.
*) Note that currently T2T builds subword vocabularies using only a part of the training data files; it cuts off after a certain number of bytes have been read from each file.
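To make the invertibility tradeoff concrete, here is a minimal sketch of the alphabet/fallback behavior described above (the names are illustrative, not the actual T2T code):

REPLACEMENT_CHARACTER = u"\ufffd"

def build_alphabet(corpus_tokens):
    # Every character seen while building the vocabulary; each of these
    # gets a single-character subtoken, so any text over this alphabet
    # is representable and invertible.
    return set(c for token in corpus_tokens for c in token)

def project_to_alphabet(text, alphabet):
    # Characters never seen during vocabulary building cannot be encoded
    # and are irreversibly mapped to U+FFFD.
    return u"".join(c if c in alphabet else REPLACEMENT_CHARACTER
                    for c in text)

alphabet = build_alphabet([u"hello", u"world"])
print(repr(project_to_alphabet(u"hello world", alphabet)))
# u'hello\ufffdworld' -- the space never appeared during vocab building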
In fact it might be possible to get the advantages of (b) as described above even in a SubwordTextEncoder that yields UTF-8 (and decodes it back). If the tokenizer works in Unicode but the wordpieces are converted to UTF-8 after generation, and all byte values 0..255 are included in the vocabulary, that could be a good compromise. Hmm...
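A rough sketch of that compromise (the greedy longest-match and the helper names are illustrative assumptions, not T2T's actual API):

def initial_byte_vocab():
    # One single-byte subtoken per possible byte value 0..255, so any
    # UTF-8 string stays representable no matter what the training data.
    return [bytes(bytearray([b])) for b in range(256)]

def encode_token(token, vocab_set):
    # Greedy longest-match over the UTF-8 encoding of a Unicode token;
    # the single-byte fallback guarantees this never fails.
    data = token.encode("utf-8")
    pieces, start = [], 0
    while start < len(data):
        end = len(data)
        while data[start:end] not in vocab_set:
            end -= 1
        pieces.append(data[start:end])
        start = end
    return pieces

vocab = set(initial_byte_vocab())
pieces = encode_token(u"caf\u00e9", vocab)
print(pieces)                            # falls back to single bytes
print(b"".join(pieces).decode("utf-8"))  # round-trips to the original token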
I like that idea more than having two encoders. Having two means people will mistakenly use one instead of the other, and a lot of chaos ensues. If it's possible to have one that's always right, 256 extra symbols in the token vocabulary seem a small price to pay, I think.
OK, I'll have a look at it and report back!
@nshazeer The vocabulary is already sorted by decreasing frequency, except that the single-character subtokens always come at the end (and UNICODE_REPLACEMENT_CHARACTER always last).
Sounds to me like a bytes-based UTF-8 encoder/decoder on top of a Unicode-aware tokenizer would be simpler than what you propose, but let's see how it goes.
@rsepassi But just so we are clear, the problem with the pure bytes-based UTF-8 encoding as it was originally implemented is illustrated here (around line 360 in text_encoder.py):
for start in starts:
  for end in xrange(start + 1, len(escaped_token) + 1):
    subtoken_string = escaped_token[start:end]
    counts[subtoken_string] += count
The code proceeds byte by byte, adding subtoken strings of increasing length to the counts dictionary. This is done without considering embedded UTF-8 sequences, so subtokens are created that end in partial UTF-8 sequences (which may be up to 4 bytes long).
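For comparison, a character-boundary-aware version of that loop would decode to Unicode before slicing, so start and end can only fall on whole characters. A sketch (with the start positions simplified to every character offset; not necessarily the fix that was committed):

from collections import defaultdict

def count_subtokens(escaped_token, count, counts):
    # Decode once so the indices below address Unicode characters rather
    # than raw bytes; every slice is then a valid character sequence.
    token = escaped_token.decode("utf-8")
    for start in range(len(token)):
        for end in range(start + 1, len(token) + 1):
            subtoken_string = token[start:end].encode("utf-8")
            counts[subtoken_string] += count

counts = defaultdict(int)
count_subtokens(u"f\u00fcr_".encode("utf-8"), 3, counts)
# no key in counts ends in a dangling UTF-8 continuation byte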
I pushed out a correction, hopefully it's ok now (on head and in 1.0.10). I somehow couldn't update pypi yet, but everywhere else it's hopefully right now :). Take a look.
Read through it. Looks very good! I like the trisection of 1) proper subword tokens (len > 1); 2) in-alphabet Unicode characters (len = 1); and 3) out-of-alphabet Unicode characters that are reversibly transcoded. Only a single remaining thought: possibly the _unescape_token() function can be made more efficient by special-casing the very rare event of an out-of-alphabet Unicode character being present in the string.
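For illustration, a sketch of that special-casing (the helper names are hypothetical, and the regex is my reading of the escape scheme in text_encoder.py, where every escape begins with a backslash):

import re

# Assumed escape scheme: "\u" encodes "_", "\\" encodes "\", and
# "\<decimal ordinal>;" encodes an out-of-alphabet character.
_UNESCAPE_REGEX = re.compile(r"\\u|\\\\|\\([0-9]+);")

def _unescape_match(m):
    if m.group(1) is not None:
        return unichr(int(m.group(1)))  # out-of-alphabet char (Python 2)
    return u"_" if m.group(0) == u"\\u" else u"\\"

def unescape_token_fast(escaped_token):
    # Fast path for the overwhelmingly common case: no backslash means
    # no escapes at all, so nothing has to be scanned or rebuilt.
    if u"\\" not in escaped_token:
        return escaped_token
    return _UNESCAPE_REGEX.sub(_unescape_match, escaped_token)

print(unescape_token_fast(u"hello"))      # fast path, returned unchanged
print(unescape_token_fast(u"caf\\233;"))  # 233 == ord(u'\xe9')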
FYI: With my own training data, the following error message appears:
INFO:tensorflow:Iteration 3
INFO:tensorflow:vocab_size = 6101
[452, 2264, 6100, 51, 1128, 1787, 27, 44, 3, 1141, 6000, 1196, 6001, 974, 6013, 6010, 3232, 711, 1234, 4]
[u'This_', u'sentence', u'\ufffd', u'was_', u'enc', u'od', u'ed_', u'by_', u'the_', u'Su', u'b', u'wor', u'd', u'Te', u'x', u't', u'En', u'co', u'der_', u'._']
This sentence�was encoded by the SubwordTextEncoder.
Traceback (most recent call last):
  File "/tmp/anaconda2/envs/t2t2/bin/t2t-datagen", line 378, in <module>
    tf.app.run()
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/tmp/anaconda2/envs/t2t2/bin/t2t-datagen", line 361, in main
    training_gen(), FLAGS.problem + UNSHUFFLED_SUFFIX + "-train",
  File "/tmp/anaconda2/envs/t2t2/bin/t2t-datagen", line 151, in <lambda>
    lambda: wmt.ende_wordpiece_token_generator(FLAGS.tmp_dir, True, 2**15),
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/wmt.py", line 221, in ende_wordpiece_token_generator
    tmp_dir, "tokens.vocab.%d" % vocab_size, vocab_size)
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/generator_utils.py", line 243, in get_or_generate_vocab
    vocab_size, tokenizer.token_counts, 1, 1e3)
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/text_encoder.py", line 329, in build_to_target_size
    return bisect(min_val, max_val)
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/text_encoder.py", line 323, in bisect
    other_subtokenizer = bisect(min_val, present_count - 1)
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/text_encoder.py", line 323, in bisect
    other_subtokenizer = bisect(min_val, present_count - 1)
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/text_encoder.py", line 323, in bisect
    other_subtokenizer = bisect(min_val, present_count - 1)
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/text_encoder.py", line 323, in bisect
    other_subtokenizer = bisect(min_val, present_count - 1)
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/text_encoder.py", line 323, in bisect
    other_subtokenizer = bisect(min_val, present_count - 1)
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/text_encoder.py", line 323, in bisect
    other_subtokenizer = bisect(min_val, present_count - 1)
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/text_encoder.py", line 317, in bisect
    present_count, num_iterations)
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/text_encoder.py", line 412, in build_from_token_counts
    assert decoded == original
AssertionError
Switching back to 1.0.6 solves it :)
I used 1.0.8, but I'll try 1.0.10 now!
With the latest version from git (e4fe66c) I get the following error:
/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/tokenizer.py:82: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  is_alnum = [c in self._ALPHANUMERIC_CHAR_SET for c in text]
/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/tokenizer.py:86: UnicodeWarning: Unicode unequal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  if token != u" " or token_start == 0:
INFO:tensorflow:Reading file: train.en
INFO:tensorflow:Trying min_count 500
INFO:tensorflow:Iteration 0
Traceback (most recent call last):
  File "/tmp/anaconda2/envs/t2t2/bin/t2t-datagen", line 378, in <module>
    tf.app.run()
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/tmp/anaconda2/envs/t2t2/bin/t2t-datagen", line 361, in main
    training_gen(), FLAGS.problem + UNSHUFFLED_SUFFIX + "-train",
  File "/tmp/anaconda2/envs/t2t2/bin/t2t-datagen", line 151, in <lambda>
    lambda: wmt.ende_wordpiece_token_generator(FLAGS.tmp_dir, True, 2**15),
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/wmt.py", line 221, in ende_wordpiece_token_generator
    tmp_dir, "tokens.vocab.%d" % vocab_size, vocab_size)
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/generator_utils.py", line 241, in get_or_generate_vocab
    vocab_size, tokenizer.token_counts, 1, 1e3)
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/text_encoder.py", line 343, in build_to_target_size
    return bisect(min_val, max_val)
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/text_encoder.py", line 329, in bisect
    present_count, num_iterations)
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/text_encoder.py", line 377, in build_from_token_counts
    escaped_token = self._escape_token(token)
  File "/tmp/anaconda2/envs/t2t2/lib/python2.7/site-packages/tensor2tensor/data_generators/text_encoder.py", line 476, in _escape_token
    ret += c
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)
This is hopefully corrected in 1.0.11; please give it a try. I'm closing for now, please reopen if you still see the issue.
Hi,
I'm using SubwordTextEncoder and I'm running into problems when I want to use t2t-decoder. I get the following error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa3 in position 0: invalid start byte
I've already checked and I have the latest version of tensor2tensor. I was wondering if I'm missing something, or if I should take something else into account when defining the problem to avoid getting a Unicode error.
Thank you,
Daniela