ttt's Introduction





TTT: Fine-tuning Transformers with TPU or GPU acceleration, written in TensorFlow 2.0+

TTT (or Triple T) is short for a package for fine-tuning 🤗 Transformers with TPUs, written in TensorFlow 2.0+. It was motivated by bugs I found tricky to solve when using the xla library with PyTorch. As a newcomer to the TF world, I am keen to learn more from the community, and hence the package is open-sourced here.

Update (2020-11-4):

Demo

Open In Colab

The following demonstrates an example of fine-tuning T5-small on sst2 (example_t5.py).

Features

  • Switch between TPUs and GPUs easily.
  • Stable training on TPUs.
  • Customize datasets or load from HF's datasets library.
  • Use pretrained TensorFlow weights from the open-source 🤗 Transformers library.
  • Fine-tune BERT-like transformers (DistilBERT, ALBERT, ELECTRA, RoBERTa) using the Keras high-level API.
  • Fine-tune T5-like transformers using a customized training loop, written in TensorFlow 2.0.
  • Supported tasks include single-sequence classification (both BERT-like models and T5), as well as translation, QA, and summarization (T5), as long as each example is characterized by {"source": "....", "target": "...."} (see the illustration below).
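
For illustration, a single line in such a jsonl data file might look like the following (a minimal sketch; the actual content, field names, and any task prefix depend on your dataset and args):

{"source": "this film is wonderful", "target": "positive"}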

Quickstart

Install

pip install pytriplet

or if you want to get the latest updates:

git clone https://github.com/wangcongcong123/ttt.git
cd ttt
pip install -e .
  • Make sure transformers>=3.1.0 is installed; if not, upgrade via pip install transformers -U (a quick check is shown below).
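
A quick way to verify the installed transformers version (a convenience check, not part of the package):

import transformers
print(transformers.__version__)  # make sure this prints 3.1.0 or newer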

Update (2020-09-13): Example generation for the T5 pretraining objective

from ttt import iid_denoise_text
text="ttt is short for a package for fine-tuning 🤗 Transformers with TPUs, written in Tensorflow2.0"
# here the text is split on whitespace into tokens; you can use Hugging Face's T5Tokenizer to tokenize as well
original, source, target=iid_denoise_text(text.split(), span_length=3, corrupt_ratio=0.25)

# original: ['ttt', 'is', 'short', 'for', 'a', 'package', 'for', 'fine-tuning', '🤗', 'Transformers', 'with', 'TPUs,', 'written', 'in', 'Tensorflow2.0']
# source: ['ttt', '<extra_id_0>', 'a', 'package', 'for', 'fine-tuning', '🤗', 'Transformers', 'with', '<extra_id_1>', '<extra_id_2>']
# target: ['<extra_id_0>', 'is', 'short', 'for', '<extra_id_1>', 'TPUs,', 'written', 'in', 'Tensorflow2.0']
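
If you prefer subword tokenization, a rough sketch of encoding the corrupted text with Hugging Face's T5Tokenizer could look like this (illustrative only; the <extra_id_*> sentinels are part of the T5 vocabulary, and source/target come from the snippet above):

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
# join the word-level lists back into strings before encoding
source_ids = tokenizer(" ".join(source), return_tensors="tf").input_ids
target_ids = tokenizer(" ".join(target), return_tensors="tf").input_ids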

Update (2020-10-15): Example of fine-tuning T5 for translation (example_trans_t5.py)

Fine-tuning: no boilerplate code changes (the same as example_t5) except for the following args:

# any one from MODELS_SUPPORT (check:ttt/args.py)
args.model_select = "t5-small"
# the path to the translation dataset; each line is an example in jsonl format like: {"source": "...", "target": "..."}
# it will be downloaded automatically the first time from: https://s3.amazonaws.com/datasets.huggingface.co/translation/wmt_en_ro.tar.gz
args.data_path = "data/wmt_en_ro"
# any one from TASKS_SUPPORT (check:ttt/args.py)
args.task = "translation"
args.max_src_length=128
args.max_tgt_length=128
args.source_field_name="source"
args.target_field_name="target"
args.eval_on="bleu" #this refers to sacrebleu as used in T5 paper

** On a TPUv3-8, the BLEU score achieved by t5-base is 27.9 (very close to the 28 reported in the T5 paper); the fine-tuning args are here and the training log is here.
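
For reference, a sacrebleu score can be computed independently along these lines (a sketch of the metric itself, not necessarily how ttt computes it internally; predictions and references are hypothetical lists of decoded strings):

import sacrebleu

predictions = ["Acesta este un test."]  # hypothetical model outputs
references = ["Acesta este un test."]   # hypothetical gold targets
bleu = sacrebleu.corpus_bleu(predictions, [references])
print(bleu.score)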

Example of fine-tuning BERT for sst2 (example_bert.py)

from ttt import *

if __name__ == '__main__':
    args = get_args()
    # check what args are available
    logger.info(f"args: {json.dumps(args.__dict__, indent=2)}")
    ############### customize args
    # args.use_gpu = True
    args.use_tpu = True
    args.do_train = True
    args.use_tb = True
    # any one from MODELS_SUPPORT (check:ttt/args.py)
    args.model_select = "bert-base-uncased"
    # point to a dataset in jsonl format, where the text field name is "text" and the label field name is "label"
    args.data_path = "data/glue/sst2"
    # any one from TASKS_SUPPORT (check:ttt/args.py)
    args.task = "single-label-cls"
    args.log_steps = 400
    # any one from LR_SCHEDULER_SUPPORT (check:ttt/args.py)
    args.scheduler="warmuplinear"
    # set do_eval = False if your data does not contain a validation set; in that case, patience and early_stop will have no effect
    args.do_eval = True
    args.tpu_address = "x.x.x.x" # replace with yours
    ############### end customize args
    # to have a sanity check for the args
    sanity_check(args)
    # seed everything, make deterministic
    set_seed(args.seed)
    tokenizer = get_tokenizer(args)
    inputs = get_inputs(tokenizer, args)
    model, _ = create_model(args, logger, get_model)
    # start training; here we use the Keras high-level API
    training_history = model.fit(
        inputs["x_train"],
        inputs["y_train"],
        epochs=args.num_epochs_train,
        verbose=2,
        batch_size=args.per_device_train_batch_size*args.num_replicas_in_sync,
        callbacks=get_callbacks(args, inputs, logger, get_evaluator),
    )
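
After training, the fine-tuned weights can be saved in the usual Hugging Face way, assuming get_model returns a standard 🤗 Transformers TF model (a hedged sketch; the package may also provide its own saving utilities):

# save weights and tokenizer so they can be reloaded later with from_pretrained
model.save_pretrained("tmp/bert-base-uncased-sst2")
tokenizer.save_pretrained("tmp/bert-base-uncased-sst2")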

So far the package supports the following values for args.model_select, args.task and args.scheduler (args.py).

# these have been tested and work fine. more can be added to this list to test
MODELS_SUPPORT = ["distilbert-base-cased","bert-base-uncased", "bert-large-uncased", "google/electra-base-discriminator",
                  "google/electra-large-discriminator", "albert-base-v2", "roberta-base",
                  "t5-small","t5-base"]
# if using t5 models, the task has to be a t2t* one
TASKS_SUPPORT = ["single-label-cls", "t2t"]
# in the future, more schedulers will be added, such as warmupconstant, warmupcosine, etc.
LR_SCHEDULER_SUPPORT = ["warmuplinear", "warmupconstant", "constant"]

Command lines (suited to GCP)

These commands have to be run in a Google GCP VM instance since the tpu_address is an internal Google IP (or change --use_tpu to --use_gpu if you have enough GPUs). The --tpu_address flag should be replaced with yours. Note: these runs use a set of reasonable-looking hyper-parameters that were not exhaustively tuned.

Experiment BERT on sst2 using TPUv2-8

C-1-1:

python3 run.py --model_select bert-base-uncased --data_path data/glue/sst2 --task single-label-cls --per_device_train_batch_size 8 --num_epochs_train 6 --max_seq_length 128 --lr 5e-5 --schedule warmuplinear --do_train --do_eval --do_test --use_tpu --tpu_address x.x.x.x

C-1-2:

python3 run.py --model_select bert-large-uncased --data_path data/glue/sst2 --task single-label-cls --per_device_train_batch_size 8 --num_epochs_train 6 --max_seq_length 128 --lr 5e-5 --schedule warmuplinear --do_train --do_eval --do_test --use_tpu --tpu_address x.x.x.x

** In addition, experiments with larger batch sizes were also conducted on a TPUv2-8. For example, when per_device_train_batch_size is 128 (global batch size = 8*128 = 1024), the first epoch takes around 1 minute and each subsequent epoch takes only ~15 seconds! That is fast, but the sst2 accuracy drops significantly.

Results

| Model | sst2 test acc. (here) | BERT paper reproduction (here) | Command | Time spent on an n1-standard-8 * |
|---|---|---|---|---|
| bert-base-uncased (110M) | 93.36 | 93.5 | C-1-1 | 16 minutes |
| bert-large-uncased (340M) | 94.45 | 94.9 | C-1-2 | 37 minutes |

  • * refers to the estimated time including training, evaluation every 400 steps, and a final evaluation on the test set.
  • Looks good: the results are close to the originally reported results.

Experiment T5 on sst2 using TPUv2-8

C-2-1:

python3 run.py --model_select t5-small --data_path data/glue/sst2 --task t2t --per_device_train_batch_size 8 --num_epochs_train 6 --max_seq_length 128 --lr 5e-5 --schedule warmuplinear --do_train --do_eval --do_test --use_tpu --tpu_address x.x.x.x

C-2-2:

python3 run.py --model_select t5-base --data_path data/glue/sst2 --task t2t --per_device_train_batch_size 8 --num_epochs_train 6 --max_seq_length 128 --lr 5e-5 --schedule warmuplinear --do_train --do_eval --do_test --use_tpu --tpu_address x.x.x.x

C-2-3:

python3 run.py --model_select t5-large --data_path data/glue/sst2 --task t2t --per_device_train_batch_size 2 --eval_batch_size 8 --num_epochs_train 6 --max_seq_length 128 --lr 5e-5 --schedule warmuplinear --do_train --do_eval --do_test --use_tpu --tpu_address x.x.x.x 

** This run failed (out-of-memory) even with per_device_train_batch_size=2. Does a TPUv2-8 not have enough memory to fine-tune a t5-large model? I was looking for solutions to fine-tune t5-large. Update: later on I was lucky to get a TPUv3-8 (128 GB), on which this run succeeds.

Results

| Model | sst2 test acc. (here) | T5 paper reproduction (here) | Command | Time spent on an n1-standard-8 |
|---|---|---|---|---|
| t5-small (60M) | 90.12 | 91.8 | C-2-1 | 20 minutes * |
| t5-base (220M) | 94.18 | 95.2 | C-2-2 | 36 minutes * |
| t5-large (770M) | 95.77 | 96.3 | C-2-3 | 4.5 hours ** |

  • * refers to the estimated time including training, evaluation every 400 steps, and a final evaluation on the test set.
  • ** the same, but with a TPUv3-8 and a smaller batch size (see command C-2-3).
  • Not bad: the results are reasonably close to the originally reported results.

Contributions

  • Contributions are welcome.

Todo ideas

  • Include more language tasks, such as sequence-pair classification, T5 toy pretraining, etc.
  • LR schedulers so far include "warmuplinear", "warmupconstant", "constant", and "constantlinear". The plan is to implement all of those available in optimizer_schedules.
  • Currently all fine-tuning uses Adam as the default optimizer. The plan is to add others such as AdaFactor.
  • Optimizations include: gradient clipping in TF (as clip_grad_norm is used in PyTorch fine-tuning; see the sketch after this list), AMP training, etc.
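
As a rough illustration of the gradient-clipping idea, a generic TensorFlow training step might clip the global gradient norm like this (a sketch under generic assumptions, not the package's actual training loop):

import tensorflow as tf

def train_step(model, optimizer, loss_fn, x, y, max_grad_norm=1.0):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    # clip the global gradient norm, analogous to torch.nn.utils.clip_grad_norm_ in PyTorch
    grads, _ = tf.clip_by_global_norm(grads, max_grad_norm)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss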

Last

I had been looking for PyTorch alternatives that can help train large models with Google's TPUs in a Google GCP VM instance environment. Although the xla library seems good, I gave it up due to some bugs I found hard to fix. Errors like "process terminated with SIGKILL" confused me a lot, took loads of my time, and I eventually failed to solve them after searching all kinds of answers online (ref1, ref2; the community looks not that active in this area). Later on, some clues online suggested the problem is related to memory overloading, and I expect the xla library to become more stable in future releases. It works well with the MNIST example provided on Google's official website but runs into the "memory" problem when tested on big models like transformers (I did not manage to run 🤗 Transformers' xla_spawn.py successfully either).

Hence, I shifted to learning TensorFlow as a newcomer from PyTorch, to make my life easier whenever I need to train a model on TPUs. Thankfully, TensorFlow 2.0 makes this shift not that difficult, although complaints about it never stop. After around three days of researching and coding, I ended up with this simple package. It is made publicly available in the hope of helping whoever runs into the same situation as me. Most of the training code (the so-called boilerplate) in this package is written in a PyTorch-like style due to my old habits. Hopefully, this makes it easier to get to know TensorFlow 2.0 when you come from PyTorch and you need TPUs.

Ack.

Thanks to Google's TFRC Program for providing the TPU credits that made this possible.

ttt's People

Contributors

wangcongcong123, wangcongcongcc


ttt's Issues

Demo colab notebook error: "required positional argument: 'logger'"

I noticed this error first in my own notebook, then was able to replicate in the colab notebook @ https://colab.research.google.com/github/wangcongcong123/ttt/blob/master/ttt_notebook.ipynb

The following error is received when running:

     40   model, strategy = create_model(args, logger, get_model)
     41   # start training, here we customize T2TTrainer to get more control and flexibility
---> 42   trainer = T2TTrainer(args)
     43   trainer.train(model, strategy, tokenizer, inputs)

TypeError: __init__() missing 1 required positional argument: 'logger'

Demo colab notebook error: "cannot import name 'get_config' from 'tensorflow.python.eager.contex'"

The error occured as below when I tried ttt demo notebook at Colab
and execute the [ttt import *] cell.


/usr/local/lib/python3.7/dist-packages/keras/backend.py in ()
35 from tensorflow.python.distribute import distribute_coordinator as dc
36 from tensorflow.python.distribute import distribute_coordinator_context as dc_context
---> 37 from tensorflow.python.eager.context import get_config
38 from tensorflow.python.framework import config
39 from keras import backend_config

ImportError: cannot import name 'get_config' from 'tensorflow.python.eager.context' (/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/context.py)
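
This kind of ImportError usually points to a version mismatch between the standalone keras package and the tensorflow installed on Colab; aligning the two typically resolves it. A hedged suggestion, not verified against this notebook:

pip install -U tensorflow  # upgrade tensorflow so that its pinned keras dependency matches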

Is there a good way to convert CSV to JSON for this?

I've fine-tuned up to T5-large using CSV and SimpleT5 or Happy Transformer, but I really want to fine-tune T5 3B/XL. Both of those support CSV datasets, and I've had no problems. However, neither of those currently support TPU.

So far, yours seems like the simplest T5 fine-tuning I can find that supports TPU, but I can't seem to get my dataset ready no matter how I've tried to convert it to JSON. I've tried over 10 different things.

The errors I keep getting are all like this:

JSONDecodeError: Expecting value: line 1 column 2 (char 1)

Any advice? Thanks so much! Good work!
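
ttt expects jsonl (one JSON object per line), so a JSONDecodeError like this often means the file is a single JSON array or still CSV rather than jsonl. A minimal conversion sketch, assuming hypothetical CSV column names source_col and target_col and a hypothetical output file name:

import csv
import json

with open("data.csv", newline="", encoding="utf-8") as fin, \
        open("train.json", "w", encoding="utf-8") as fout:
    for row in csv.DictReader(fin):
        # map your CSV columns to the "source"/"target" fields described in the README
        fout.write(json.dumps({"source": row["source_col"], "target": row["target_col"]}) + "\n")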

AttributeError: 'TPUStrategyV2' object has no attribute 'experimental_run_v2'

I'm trying to fine-tune in Colab, and I keep getting this error, but I'm not sure how to fix/get around it.

Any advice?

I believe it's related to this:

per_replica_losses = strategy.experimental_run_v2(train_step, args=(x_train, y_train,))

This is the full error output:

2022-03-13 04:56:03.212 INFO - run: args: {}
Output directory (tmp/t5-small_t2t_content-train) already exists and is not empty, you wanna remove it before start training? (y/n)y
2022-03-13 04:57:45.139 INFO inputs - get_with_prepare_func: reading cached data from /content/train/t5-small-data.pkl
2022-03-13 04:57:45.142 WARNING inputs - get_with_prepare_func: if you changed the max_seq_length/max_src_length/max_tgt_length, this may not correctly loaded, since the /content/train/t5-small-data.pkl is pickled based on first time loading
INFO:tensorflow:Deallocate tpu buffers before initializing tpu system.
2022-03-13 04:57:45.222 INFO tpu_strategy_util - initialize_tpu_system: Deallocate tpu buffers before initializing tpu system.
WARNING:tensorflow:TPU system grpc://10.77.192.66 has already been initialized. Reinitializing the TPU can cause previously created variables on TPU to be lost.
2022-03-13 04:57:45.917 WARNING tpu_strategy_util - initialize_tpu_system: TPU system grpc://10.77.192.66 has already been initialized. Reinitializing the TPU can cause previously created variables on TPU to be lost.
INFO:tensorflow:Initializing the TPU system: grpc://10.77.192.66
2022-03-13 04:57:45.926 INFO tpu_strategy_util - initialize_tpu_system: Initializing the TPU system: grpc://10.77.192.66
INFO:tensorflow:Finished initializing TPU system.
2022-03-13 04:57:53.909 INFO tpu_strategy_util - initialize_tpu_system: Finished initializing TPU system.
2022-03-13 04:57:53.914 INFO - create_model: All TPU devices:
2022-03-13 04:57:53.916 INFO - create_model: LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:0', device_type='TPU')
2022-03-13 04:57:53.920 INFO - create_model: LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:1', device_type='TPU')
2022-03-13 04:57:53.922 INFO - create_model: LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:2', device_type='TPU')
2022-03-13 04:57:53.925 INFO - create_model: LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:3', device_type='TPU')
2022-03-13 04:57:53.928 INFO - create_model: LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:4', device_type='TPU')
2022-03-13 04:57:53.930 INFO - create_model: LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:5', device_type='TPU')
2022-03-13 04:57:53.933 INFO - create_model: LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:6', device_type='TPU')
2022-03-13 04:57:53.935 INFO - create_model: LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:7', device_type='TPU')
INFO:tensorflow:Found TPU system:
2022-03-13 04:57:53.938 INFO tpu_system_metadata - _query_tpu_system_metadata: Found TPU system:
INFO:tensorflow:*** Num TPU Cores: 8
2022-03-13 04:57:53.941 INFO tpu_system_metadata - _query_tpu_system_metadata: *** Num TPU Cores: 8
INFO:tensorflow:*** Num TPU Workers: 1
2022-03-13 04:57:53.945 INFO tpu_system_metadata - _query_tpu_system_metadata: *** Num TPU Workers: 1
INFO:tensorflow:*** Num TPU Cores Per Worker: 8
2022-03-13 04:57:53.948 INFO tpu_system_metadata - _query_tpu_system_metadata: *** Num TPU Cores Per Worker: 8
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 0, 0)
2022-03-13 04:57:53.952 INFO tpu_system_metadata - _query_tpu_system_metadata: *** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, 0, 0)
2022-03-13 04:57:53.958 INFO tpu_system_metadata - _query_tpu_system_metadata: *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 0, 0)
2022-03-13 04:57:53.961 INFO tpu_system_metadata - _query_tpu_system_metadata: *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 0, 0)
2022-03-13 04:57:53.965 INFO tpu_system_metadata - _query_tpu_system_metadata: *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 0, 0)
2022-03-13 04:57:53.968 INFO tpu_system_metadata - _query_tpu_system_metadata: *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 0, 0)
2022-03-13 04:57:53.972 INFO tpu_system_metadata - _query_tpu_system_metadata: *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 0, 0)
2022-03-13 04:57:53.975 INFO tpu_system_metadata - _query_tpu_system_metadata: *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 0, 0)
2022-03-13 04:57:53.979 INFO tpu_system_metadata - _query_tpu_system_metadata: *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 0, 0)
2022-03-13 04:57:53.985 INFO tpu_system_metadata - _query_tpu_system_metadata: *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 0, 0)
2022-03-13 04:57:53.988 INFO tpu_system_metadata - _query_tpu_system_metadata: *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 0, 0)
2022-03-13 04:57:53.992 INFO tpu_system_metadata - _query_tpu_system_metadata: *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)
2022-03-13 04:57:53.995 INFO tpu_system_metadata - _query_tpu_system_metadata: *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)
All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at t5-small.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.
Model: "tft5_for_conditional_generation_2"


Layer (type) Output Shape Param #

shared (TFSharedEmbeddings) multiple 16449536

encoder (TFT5MainLayer) multiple 18881280

decoder (TFT5MainLayer) multiple 25175808

=================================================================
Total params: 60,506,624
Trainable params: 60,506,624
Non-trainable params: 0


2022-03-13 04:58:18.877 INFO - create_model: None
/content/ttt/ttt/t2t_trainer.py:56: FutureWarning: Passing inputs as a keyword argument is deprecated. Use train_dataset and eval_dataset instead.
FutureWarning,
2022-03-13 04:58:18.946 INFO t2t_trainer - train: set random seed for everything with 122
2022-03-13 04:58:19.412 INFO utils - write_args_enhance: {
"source_field_name": "source",
"target_field_name": "target",
"use_tpu": true,
"do_train": true,
"use_tb": true,
"model_select": "t5-small",
"data_path": "/content/train",
"task": "t2t",
"log_steps": 400,
"scheduler": "warmuplinear",
"do_eval": false,
"tpu_address": "10.77.192.66",
"output_folder": "t5-small_t2t_content-train",
"output_path": "tmp/t5-small_t2t_content-train",
"is_pretrain": false,
"is_load_from_data_cache": true,
"data_cache_path": "/content/train/t5-small-data.pkl",
"source_sequence_length": 111,
"target_sequence_length": 20,
"num_replicas_in_sync": 8,
"best": -Infinity,
"warmup_steps": 233
}
/usr/local/lib/python3.7/dist-packages/keras/optimizer_v2/adam.py:105: UserWarning: The lr argument is deprecated, use learning_rate instead.
super(Adam, self).init(name, **kwargs)
epochs: 0%| | 0/6 [00:00<?, ?it/s]2022-03-13 04:58:19.433 INFO t2t_trainer - train: start training at epoch = 0
2022-03-13 04:58:19.440 INFO t2t_trainer - train: global train batch size = 64
2022-03-13 04:58:19.442 INFO t2t_trainer - train: using learning rate scheduler: warmuplinear
2022-03-13 04:58:19.446 INFO t2t_trainer - train: num_train_examples: 24867, total_steps: 2334, steps_per_epoch: 389
2022-03-13 04:58:19.454 INFO t2t_trainer - train: warmup_steps:233

0%| | 0/389 [00:00<?, ?it/s]
epochs: 0%| | 0/6 [00:00<?, ?it/s]

AttributeError Traceback (most recent call last)
in ()
----> 1 run()

3 frames
/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/func_graph.py in autograph_handler(*args, **kwargs)
1145 except Exception as e: # pylint:disable=broad-except
1146 if hasattr(e, "ag_error_metadata"):
-> 1147 raise e.ag_error_metadata.to_exception(e)
1148 else:
1149 raise

AttributeError: in user code:

File "/content/ttt/ttt/t2t_trainer.py", line 147, in distributed_train_step  *
    per_replica_losses = strategy.experimental_run_v2(train_step, args=(x_train, y_train,))

AttributeError: 'TPUStrategyV2' object has no attribute 'experimental_run_v2'
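
For what it's worth, experimental_run_v2 was renamed to run in later TensorFlow releases (another issue below shows the same "renamed to `run`" notice), so a likely fix, not verified against this repo, is:

per_replica_losses = strategy.run(train_step, args=(x_train, y_train,))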

TPU error in Google GCP - fixed

The latest commit solved the following bug:

Instructions for updating:
renamed to `run`
  0%|                                                                                              | 0/16 [00:37<?, ?it/s]epochs:   0%|                                                                                       | 0/6 [00:37<?, ?it/s]Traceback (most recent call last):
  File "example_t5.py", line 47, in <module>
    trainer.train(model, strategy, tokenizer, inputs)
  File "/root/ttt/ttt/t2t_trainer.py", line 227, in train
    epoch_total_loss += loss.numpy()
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1063, in numpy
    maybe_arr = self._numpy()  # pylint: disable=protected-access
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1031, in _numpy
    six.raise_from(core._status_to_exception(e.code, e.message), None)  # pylint: disable=protected-access
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.UnavailableError: Socket closed
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/tensorflow/python/distribute/tpu_strategy.py", line 540, in async_wait
    context.async_wait()
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/tensorflow/python/eager/context.py", line 2319, in async_wait
    context().sync_executors()
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/tensorflow/python/eager/context.py", line 658, in sync_executors
    pywrap_tfe.TFE_ContextSyncExecutors(self._context_handle)
tensorflow.python.framework.errors_impl.UnavailableError: 2 root error(s) found.
  (0) Unavailable: Socket closed
  (1) Invalid argument: Unable to find a context_id matching the specified one (14250917626996280268). Perhaps the worker was restarted, or the context was GC'd?
0 successful operations.
0 derived errors ignored.
2020-10-23 19:51:06.239763: W    3876 ./tensorflow/core/distributed_runtime/eager/destroy_tensor_handle_node.h:57] Ignoring an error encountered when deleting remote tensors handles: Invalid argument: Unable to find a context_id matching the specified one (14250917626996280268). Perhaps the worker was restarted, or the context was GC'd?
Additional GRPC error information from remote target /job:worker/replica:0/task:0:
:{"created":"@1603482666.236322988","description":"Error received from peer ipv4:x.x.x.x:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Unable to find a context_id matching the specified one (14250917626996280268). Perhaps the worker was restarted, or the context was GC'd?","grpc_status":3}
2020-10-23 19:51:06.241849: W    3781 tensorflow/core/distributed_runtime/eager/remote_tensor_handle_data.cc:76] Unable to destroy remote tensor handles. If you are running a tf.function, it usually indicates some op in the graph gets an error: Unable to find a context_id matching the specified one (14250917626996280268). Perhaps the worker was restarted, or the context was GC'd?

Summarization/Translation support

Cool lib!

I was wondering if it's possible (or tested) to run translation fine-tuning with this library, for example on the wmt_en_ro data.

Download Instructions

wget https://s3.amazonaws.com/datasets.huggingface.co/translation/wmt_en_ro.tar.gz
tar -xzvf wmt_en_ro.tar.gz

Thanks and great work!
