
gpt-2's People

Contributors

albertwujj, armaanbhullar, biranchi2018, christopherhesse, cookee12, github30, imgntn, jackclarksf, madisonmay, memo, minimaxir, mrene, natemurthy, nshepperd, tenoke, tlkh, webproduktion01, wuthefwasthat


gpt-2's Issues

Training on distributed machine is slow. Using 8 Nvidia V100.

I'm using an AWS p3dn.24xlarge instance to train my data on 8 Nvidia V100 GPUs, but training seems slower than on a single GPU.

This is the config in train-horovod.py:

def train_main(dataset,
               model_name='345M',
               seed=None,
               batch_size=1,
               sample_length=1023,
               sample_num=1,
               sample_every=500,
               run_name='run1',
               restore_from='latest',
               save_every=1000,
               combine=50000):

This is the output; as you can see, each step takes a long time. Trying to increase the batch size results in OOM.

[1 | 13.96] loss=3.12 avg=3.12
[2 | 16.30] loss=22.49 avg=12.85
[3 | 18.51] loss=8.58 avg=11.41
[4 | 20.70] loss=7.58 avg=10.44
[5 | 23.08] loss=7.59 avg=9.86
[6 | 25.48] loss=6.96 avg=9.36
[7 | 27.52] loss=6.34 avg=8.92
[8 | 29.85] loss=6.26 avg=8.58
[9 | 32.30] loss=5.86 avg=8.26
[10 | 34.31] loss=6.00 avg=8.02
[11 | 36.61] loss=5.78 avg=7.81
[12 | 38.94] loss=5.53 avg=7.61
[13 | 41.25] loss=5.32 avg=7.42
[14 | 43.69] loss=5.06 avg=7.24
[15 | 45.94] loss=6.06 avg=7.16
[16 | 48.34] loss=4.94 avg=7.01
[17 | 50.74] loss=5.16 avg=6.89
[18 | 53.10] loss=4.73 avg=6.76
[19 | 55.21] loss=4.54 avg=6.63
[20 | 57.56] loss=5.09 avg=6.55
[21 | 59.75] loss=4.66 avg=6.45
[22 | 62.22] loss=4.44 avg=6.35
[23 | 64.45] loss=4.40 avg=6.25
[24 | 66.68] loss=3.91 avg=6.14
[25 | 69.04] loss=3.79 avg=6.04
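For reference, here is a minimal sketch of the standard Horovod TF1 setup that a script like train-horovod.py is expected to contain (hedged: the names below are placeholders and a toy loss stands in for the GPT-2 graph). When an 8-GPU run comes out slower than a single GPU, it is worth verifying that each of these pieces is present and that every rank really is pinned to its own GPU:

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin each worker process to one GPU; if this is missing, all ranks can end
# up contending for the same device and steps get slower, not faster.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Toy variable/loss standing in for the GPT-2 model built elsewhere.
w = tf.get_variable('w', shape=[10])
loss = tf.reduce_mean(tf.square(w))

opt = tf.train.AdamOptimizer(1e-4 * hvd.size())  # scale the LR by the number of workers
opt = hvd.DistributedOptimizer(opt)              # all-reduce gradients across ranks
train_op = opt.minimize(loss)

# Broadcast rank 0's initial variables so all workers start in sync.
bcast_op = hvd.broadcast_global_variables(0)

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(bcast_op)
    sess.run(train_op)

Note also that Horovod is data parallelism: each step still runs a full forward/backward pass per GPU, so per-step wall time does not shrink; the gain, assuming the setup above, is that each logged step now covers hvd.size() batches instead of one.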

What is the minimum GPU size I need to set batch_size to more than 1 when training the 345M model using train.py?

  1. I am using an ml.p3.2xlarge instance on AWS with one 16 GB V100 GPU and tried to train the 345M model with batch_size 2, and it gets an OOM error. It works for batch_size 1, though.
  2. I am thinking of using a batch_size of 2 to 8. What size of GPU do I need to make this happen? If anyone has experienced this situation, sharing it would be helpful (see the sketch after this list for a hedged gradient-accumulation workaround).
  3. I am using this command to train it.
python train.py --dataset Dataset/data.npz --sample_every 10 --sample_num 3 --batch_size 1 --learning_rate 0.0001 --model_name 345M
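As referenced above, a hedged gradient-accumulation workaround: if your copy of train.py has the --accumulate_gradients flag (an assumption; check `python train.py --help`), gradients from N successive minibatches of size 1 are summed before each weight update, which gives the optimization behaviour of a larger batch without its memory footprint:

python train.py --dataset Dataset/data.npz --sample_every 10 --sample_num 3 --batch_size 1 --accumulate_gradients 8 --learning_rate 0.0001 --model_name 345M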

OOM With Gradient Checkpointing on 1080 Ti

Attempting to seed 345M with a pre-encoded dataset - I have a 1080 Ti, but even after checking that memory saving gradients is enabled, I run into an OOM exception after a few seconds of training:

2019-05-12 14:33:35.460258: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 1 Chunks of size 230633472 totalling 219.95MiB
2019-05-12 14:33:35.463612: I tensorflow/core/common_runtime/bfc_allocator.cc:645] Sum Total of in-use chunks: 8.58GiB
2019-05-12 14:33:35.467832: I tensorflow/core/common_runtime/bfc_allocator.cc:647] Stats:
Limit:                  9218918974
InUse:                  9217796864
MaxInUse:               9217796864
NumAllocs:                    3470
MaxAllocSize:            329412864

2019-05-12 14:33:35.475291: W tensorflow/core/common_runtime/bfc_allocator.cc:271] ****************************************************************************************************
2019-05-12 14:33:35.482129: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at cast_op.cc:109 : Resource exhausted: OOM when allocating tensor with shape[1,16,1024,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File "E:\Anaconda\lib\site-packages\tensorflow\python\client\session.py", line 1334, in _do_call
    return fn(*args)
  File "E:\Anaconda\lib\site-packages\tensorflow\python\client\session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "E:\Anaconda\lib\site-packages\tensorflow\python\client\session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1,16,1024,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node model/h23/attn/truediv_1}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 268, in <module>
    main()
  File "train.py", line 246, in main
    feed_dict={context: sample_batch()})
  File "E:\Anaconda\lib\site-packages\tensorflow\python\client\session.py", line 929, in run
    run_metadata_ptr)
  File "E:\Anaconda\lib\site-packages\tensorflow\python\client\session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "E:\Anaconda\lib\site-packages\tensorflow\python\client\session.py", line 1328, in _do_run
    run_metadata)
  File "E:\Anaconda\lib\site-packages\tensorflow\python\client\session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1,16,1024,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[node model/h23/attn/truediv_1 (defined at e:\NeuralNets\gpt-2-seeded\src\memory_saving_gradients.py:204) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.


Caused by op 'model/h23/attn/truediv_1', defined at:
  File "train.py", line 268, in <module>
    main()
  File "train.py", line 116, in main
    opt_grads = memory_saving_gradients.gradients(loss, train_vars)
  File "e:\NeuralNets\gpt-2-seeded\src\memory_saving_gradients.py", line 204, in gradients
    copied_sgv, info = ge.copy_with_input_replacements(ge.sgv(ops_to_copy), {})
  File "E:\Anaconda\lib\site-packages\tensorflow\contrib\graph_editor\transform.py", line 673, in copy_with_input_replacements
    sgv, dst_graph, dst_scope, src_scope, reuse_dst_scope=reuse_dst_scope)
  File "E:\Anaconda\lib\site-packages\tensorflow\contrib\graph_editor\transform.py", line 453, in __call__
    self._copy_ops(info)
  File "E:\Anaconda\lib\site-packages\tensorflow\contrib\graph_editor\transform.py", line 467, in _copy_ops
    op_, op_outputs_ = self.transform_op_handler(info, op, new_inputs)
  File "E:\Anaconda\lib\site-packages\tensorflow\contrib\graph_editor\transform.py", line 177, in copy_op_handler
    [], input_types_, None, op_def_)
  File "E:\Anaconda\lib\site-packages\tensorflow\python\framework\ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[1,16,1024,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[node model/h23/attn/truediv_1 (defined at e:\NeuralNets\gpt-2-seeded\src\memory_saving_gradients.py:204) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

What am I missing? Is there any way to reduce memory usage even further?
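A hedged suggestion for the "even further" part (the flags below are assumptions about your copy of train.py; several derivatives of this training script expose them, but check `python train.py --help` first): Adam keeps two extra moment buffers per trainable parameter, so plain SGD saves a large slice of memory, and restricting training to the transformer blocks skips the embedding gradients as well:

python train.py --dataset your_data.npz --model_name 345M --batch_size 1 --memory_saving_gradients --optimizer sgd --only_train_transformer_layers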

How to ACTUALLY train 345M on multiple GPUs using train-horovod.py?

What am I doing wrong? Is it impossible to train the 345M model on multiple GPUs? Or are my GPUs just not enough? If that's the case, what GPU size and how many GPUs would work?

Is this the right process?

  1. I am using an ml.p3.8xlarge instance on AWS with 4x V100 GPUs (16 GB each). I am trying to run train-horovod.py to train the 345M model on these 4 GPUs. I am running this command -
mpirun -np 4 -H localhost:4 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x PYTHONPATH=src -mca pml ob1 -mca btl ^openib python /home/ec2-user/SageMaker/gpt-2/train-horovod.py --dataset /home/ec2-user/SageMaker/gpt-2/src/Dataset/data_encoded.npz --model_name /home/ec2-user/SageMaker/gpt-2/src/models/345M --batch_size 1

I am using batch_size == 1. Still, I am getting this error repeatedly.

Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[1,1023,50257] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node strided_slice_1}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

         [[Mean/_5215]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[1,1023,50257] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node strided_slice_1}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.
During handling of the above exception, another exception occurred:
...
...
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[1,1023,50257] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[node strided_slice_1 (defined at /home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
........
........
  2. I also tried installing cuDNN as mentioned in issue #8, using this command -
conda install -c anaconda cudnn --yes
  3. After running the nvidia-smi command, this is the state of the GPUs just before the error occurs.
Sat Jun 20 09:48:49 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1B.0 Off |                    0 |
| N/A   47C    P0    65W / 300W |  15602MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:1C.0 Off |                    0 |
| N/A   49C    P0    66W / 300W |  15626MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:00:1D.0 Off |                    0 |
| N/A   51C    P0    72W / 300W |  15626MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   49C    P0    72W / 300W |  15626MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    115021      C   python                                     15583MiB |
|    1    115022      C   python                                     15583MiB |
|    2    115023      C   python                                     15583MiB |
|    3    115024      C   python                                     15583MiB |
+-----------------------------------------------------------------------------+

If anyone has solved this problem, or has faced it and can share their experience, it would help a great deal. Thanks.

"past" is not used in training

Nice implementation. I'm wondering why the "past" tensor is not used during training? I feel like "past" should also be used if the sentences within a paragraph are coherent during training. (Then sampled chunks might not be usable as training data.)

Sample Length

Has anyone found a way to significantly increase the sample --length without raising an exception?

Error on running Encode.py: "ModuleNotFoundError: No module named 'regex'"

I started a new installation of GPT-2 using this fork of the project, following the directions in https://www.youtube.com/watch?v=4iK-IuvatxI (for training) and https://lambdalabs.com/blog/run-openais-new-gpt-2-text-generator-code-with-your-gpu/ (for setting up the environment). This means I'm following the video instructions for training, but running everything using venv-gpt-2.

When I try
(venv-gpt-2) me@mypc:~/Finetuning/gpt-2/src$ sudo python3 encode.py training.txt training.npz

I'm getting

Traceback (most recent call last):
  File "encode.py", line 9, in <module>
    import encoder
  File "/home/samy/Finetuning/gpt-2/src/encoder.py", line 5, in <module>
    import regex as re
ModuleNotFoundError: No module named 'regex'

However the package is installed, since running
(venv-gpt-2) me@mypc:~/Finetuning/gpt-2/src$ python3 -m pip install regex
returns
Requirement already satisfied: regex in /home/samy/venv-gpt-666/lib/python3.6/site-packages

The packages were installed using
(venv-gpt-2) me@mypc:~/Finetuning/gpt-2$ pip install -r requirements.txt

I can also confirm that I can run GPT-2 normally by executing interactive_conditional_samples.py on venv-gpt-2, as instructed in https://lambdalabs.com/blog/run-openais-new-gpt-2-text-generator-code-with-your-gpu/. Everything runs normally there, I get a prompt, I can give the algorithm a new seed and generate text. The only issue would be on training it with new text.

All help to solve this issue will be deeply appreciated.
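A hedged observation (not from the original thread): sudo usually resets PATH, so `sudo python3` resolves to the system interpreter instead of the one inside venv-gpt-2, and packages installed in the virtualenv become invisible to it. Two things to try (the venv path below is a guess; adjust it to wherever venv-gpt-2 actually lives):

# Compare which interpreter each invocation actually picks up
which python3
sudo which python3

# Run encode.py without sudo, or point sudo at the venv's interpreter explicitly
python3 encode.py training.txt training.npz
sudo ~/venv-gpt-2/bin/python3 encode.py training.txt training.npz

It may also be worth double-checking which virtualenv is active: the pip output above reports regex under venv-gpt-666, not venv-gpt-2, so the two environments may have diverged.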

module 'tensorflow' has no attribute 'sort'

Just reporting that using the gpu Dockerfile I had an error:

Traceback (most recent call last):
  File "train.py", line 305, in <module>
    main()
  File "train.py", line 126, in main
    top_p=top_p)
  File "/root/src/sample.py", line 93, in sample_sequence
    back_prop=False,
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 3291, in while_loop
    return_same_structure)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 3004, in BuildLoop
    pred, body, original_loop_vars, loop_vars, shape_invariants)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2939, in _BuildLoop
    body_result = body(*packed_vars_for_body)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 3260, in <lambda>
    body = lambda i, lv: (i + 1, orig_body(*lv))
  File "/root/src/sample.py", line 67, in body
    logits = top_p_logits(logits, p=top_p)
  File "/root/src/sample.py", line 27, in top_p_logits
    logits_sort = tf.sort(logits, direction='DESCENDING')
AttributeError: module 'tensorflow' has no attribute 'sort'

solved with:

logits_sort = tf.contrib.framework.sort(logits, direction='DESCENDING')
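A version-tolerant variant of the same fix (a hedged sketch: tf.sort is available in newer TensorFlow 1.x releases, while older builds such as the one in this Docker image only have the contrib op):

import tensorflow as tf

# Prefer tf.sort when the installed TensorFlow has it; otherwise fall back to
# the older contrib implementation used in the fix above.
sort_fn = getattr(tf, 'sort', None) or tf.contrib.framework.sort

logits = tf.constant([[0.1, 2.0, -1.0]])  # toy logits for illustration
logits_sort = sort_fn(logits, direction='DESCENDING')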

Question about training with small dataset entries

Hello. As the title suggests, I'm trying to train the model on a set of very small texts (they're actually messages from a chat, just so you understand the type of content), and I'm running into two main problems. First (not sure if it's directly related to the size of the entries): I generated my dataset like so
message <|endoftext|> message
This way a good part of my dataset is actually made up of <|endoftext|>. I would have expected that, since this is a delimiter, it would be ignored during training, but that doesn't seem to be the case, since my samples are full of it:

======== SAMPLE 1 ========
* <|endoftext|> 
ma in realtà lei è poco dai
 <|endoftext|> 
che fine ha fatto anche quell'altro non farei mai nascosti
 <|endoftext|> 
eh ma cosa
 <|endoftext|> 

Also (I might be wrong here, since I don't fully understand how the model works), I would have expected that, since I'm training on very short texts of about 12-15 words, the generated samples would be of a similar size. Instead, I get full-blown 50-line samples (excluding the <|endoftext|>). Is that a "limitation of the model"? Are the separators not working? Or is there a way to force shorter samples?
I should also note that, as of now, I'm about 70,000 steps into training on the latest dataset; maybe that is too few? Also, as you can see from the sample, I'm actually training not in English but in Italian; could that be the cause?

Train.py issues windows 10

So I admittedly am new to deep learning and have a pretty basic understanding of Python.
I've prepped my files and am trying to train my model, but I've been having a fair amount of issues with TensorFlow, most of which I've solved. Currently I'm running TensorFlow 1.14. Whenever I input "python train.py --dataset adorno.txt" I get the output attached below.
I'm using Python 3.7, cuDNN 10.1, and pip3.
C:\Users\alexv\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow\python\framework\dtypes.py:516: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
[... the same FutureWarning repeats for _np_quint8, _np_qint16, _np_quint16, _np_qint32 and np_resource, from both tensorflow\python\framework\dtypes.py and tensorboard\compat\tensorflow_stub\dtypes.py ...]
Traceback (most recent call last):
  File "train.py", line 17, in <module>
    import memory_saving_gradients
  File "C:\Users\alexv\Downloads\gpt-2-master\gpt-2-master\src\memory_saving_gradients.py", line 1, in <module>
    from toposort import toposort
ModuleNotFoundError: No module named 'toposort'
Errrrors for days.txt
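The last lines of that output are the actual failure: train.py imports memory_saving_gradients, which needs the toposort package. A hedged fix (assuming pip installs into the same Python 3.7 environment that runs train.py):

pip3 install toposort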

How to get the model embeddings?

I'm in the process of training GPT-2 and want to save and inspect the word embeddings it produces every n iterations. But I don't know how to obtain these embeddings, as the model is defined not as a class but as a function, and I don't know where the embeddings are stored...
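A hedged sketch of one way to do this, assuming the graph built by this repo's model.py (where the token and position embedding matrices live under the variable names 'model/wte' and 'model/wpe') and reusing the sess and counter that train.py's loop already maintains:

import numpy as np
import tensorflow as tf

graph = tf.get_default_graph()
wte = graph.get_tensor_by_name('model/wte:0')  # token embeddings, shape [n_vocab, n_embd]
wpe = graph.get_tensor_by_name('model/wpe:0')  # position embeddings, shape [n_ctx, n_embd]

# Inside the training loop: dump the tables every 1000 steps.
if counter % 1000 == 0:
    np.save('wte-{}.npy'.format(counter), sess.run(wte))
    np.save('wpe-{}.npy'.format(counter), sess.run(wpe))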

dataset

Just to clarify, how do you make a dataset for training?
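A hedged sketch of the usual workflow with this fork (paths are examples): put your raw text into one or more .txt files, optionally separated by <|endoftext|>, pre-encode them once with encode.py so training doesn't re-tokenize on every run, then point train.py at the result:

PYTHONPATH=src python encode.py my_corpus.txt my_corpus.npz
PYTHONPATH=src python train.py --dataset my_corpus.npz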

Finetuning on the Full Model - OOM 1558M

Hello,

I've been using the only-train-x-layers workaround for over a month on the 774M model (even working on a Colab K80 with 12 GB RAM), and it's been great!

The problem is, I'm now looking into getting 1558M to work, using a Google GCE instance with 30 GB RAM, 8 CPUs, and 1x Nvidia V100 GPU (16 GB).

Reaching OOM at 15 GB.

Now, I'm looking for advice on this, having tried the full CTRL model on 2x P100s and reached the same OOM, also on the V100.

I haven't tried 2x V100s yet; I will.

But has anyone else, for example you, nshepperd, gotten the full model to finetune?

Unicode characters each considered as a token

Hi,

I have retrained the 117M parameter model on a 6 GB Bengali text dataset. Setting the '--length' parameter to 100 in interactive conditional sample generation is supposed to generate 100 tokens, but in my case it generates 100 characters. This makes me believe that for Unicode characters the model treats each character as a token. How do I make it consider each word as a token?

Thanks in advance.
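A hedged way to see what the tokenizer is actually doing (assuming the single-argument get_encoder(model_name) in this repo's src/encoder.py; adjust if your copy also takes a models directory): encode a Bengali word and inspect how many byte-level tokens it becomes, since the released BPE vocabulary was learned mostly on English text and tends to fall back to byte pieces for other scripts.

# Run from the repo root with PYTHONPATH=src
import encoder

enc = encoder.get_encoder('117M')
tokens = enc.encode('উদাহরণ')     # a Bengali word ("example")
print(len(tokens), tokens)        # typically several byte-level tokens, not one per word
print(enc.decode(tokens))         # round-trips back to the original text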

generate_samples() code from gpt2 train.py -- InvalidArgumentError

I'm looking at a gpt2 tensorflow github repository.

I would like to modify the code in the train.py file to fine-tune GPT-2. I want to create samples from the model with a paragraph of my own choosing as the context. For example, if I load my paragraph into the data_sampler using Sample(), and then change the call data_sampler.sample(1) to a number higher than 1 (5, for example), I get an error.

def generate_samples():
    print('Generating samples...')
    context_tokens = data_sampler.sample(5)  ## <<--- change here!!
    all_text = []
    index = 0
    while index < args.sample_num:
        out = sess.run(
            tf_sample,
            feed_dict={context: args.batch_size * [context_tokens]})
        for i in range(min(args.sample_num - index, args.batch_size)):
            text = enc.decode(out[i])
            text = '======== SAMPLE {} ========\n{}\n'.format(
                index + 1, text)
            all_text.append(text)
            index += 1
    print(text)
    maketree(os.path.join(SAMPLE_DIR, args.run_name))
    with open(
            os.path.join(SAMPLE_DIR, args.run_name,
                         'samples-{}').format(counter), 'w') as fp:
        fp.write('\n'.join(all_text))

I get the following error.

InvalidArgumentError (see above for traceback): indices[0,0] = 1024 is not in [0, 1024)
	 [[node sample_sequence/while/model/GatherV2_1 (defined at ../model/tf_gpt2/src/model.py:157) ]]

The full error is below. NOTE: I have copied the original train.py file and renamed it tf_gpt2_train_babi.py. My ultimate goal is to try the babi synthetic data set with gpt2, which explains the name.

Traceback (most recent call last):
  File "/home/dave/.local/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/home/dave/.local/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/dave/.local/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[0,0] = 1024 is not in [0, 1024)
         [[{{node sample_sequence/while/model/GatherV2_1}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./tf_gpt2_train_babi.py", line 370, in <module>
    main()
  File "./tf_gpt2_train_babi.py", line 334, in main
    generate_samples()
  File "./tf_gpt2_train_babi.py", line 268, in generate_samples
    feed_dict={context: args.batch_size * [context_tokens]})
  File "/home/dave/.local/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/home/dave/.local/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/dave/.local/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/home/dave/.local/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[0,0] = 1024 is not in [0, 1024)
         [[node sample_sequence/while/model/GatherV2_1 (defined at ../model/tf_gpt2/src/model.py:157) ]]

Caused by op 'sample_sequence/while/model/GatherV2_1', defined at:
  File "./tf_gpt2_train_babi.py", line 370, in <module>
    main()
  File "./tf_gpt2_train_babi.py", line 159, in main
    top_k=40)
  File "../model/tf_gpt2/src/sample.py", line 76, in sample_sequence
    back_prop=False,
  File "/home/dave/.local/lib/python3.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3556, in while_loop
    return_same_structure)
  File "/home/dave/.local/lib/python3.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3087, in BuildLoop
    pred, body, original_loop_vars, loop_vars, shape_invariants)
  File "/home/dave/.local/lib/python3.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3022, in _BuildLoop
    body_result = body(*packed_vars_for_body)
  File "/home/dave/.local/lib/python3.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3525, in <lambda>
    body = lambda i, lv: (i + 1, orig_body(*lv))
  File "../model/tf_gpt2/src/sample.py", line 50, in body
    next_outputs = step(hparams, prev[:, tf.newaxis], past=past)
  File "../model/tf_gpt2/src/sample.py", line 33, in step
    lm_output = model.model(hparams=hparams, X=tokens, past=past, reuse=tf.AUTO_REUSE)
  File "../model/tf_gpt2/src/model.py", line 157, in model
    h = tf.gather(wte, X) + tf.gather(wpe, positions_for(X, past_length))
  File "/home/dave/.local/lib/python3.7/site-packages/tensorflow/python/util/dispatch.py", line 180, in wrapper
    return target(*args, **kwargs)
  File "/home/dave/.local/lib/python3.7/site-packages/tensorflow/python/ops/array_ops.py", line 3273, in gather
    return gen_array_ops.gather_v2(params, indices, axis, name=name)
  File "/home/dave/.local/lib/python3.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 3748, in gather_v2
    "GatherV2", params=params, indices=indices, axis=axis, name=name)
  File "/home/dave/.local/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/home/dave/.local/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/dave/.local/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/home/dave/.local/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): indices[0,0] = 1024 is not in [0, 1024)
         [[node sample_sequence/while/model/GatherV2_1 (defined at ../model/tf_gpt2/src/model.py:157) ]]
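A hedged reading of the error (an assumption based on the traceback, not a confirmed diagnosis): the position-embedding table wpe only covers indices 0 through 1023, and the default --sample_length of 1023 leaves room for a 1-token context, so a 5-token context pushes generation to position 1024 and the gather on wpe fails. Keeping prompt plus continuation inside the context window should avoid it:

# Keep len(context_tokens) + sample_length <= 1024 (the model's n_ctx),
# e.g. by shrinking the sample length passed to sample.sample_sequence().
context_tokens = data_sampler.sample(5)
sample_length = 1024 - len(context_tokens)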

Confused about vocab and encoder

I'm reading the source code and have two questions about the vocab and the encoder. Please help me with them. Thank you in advance.

  1. In vocab.bpe, take the second row (Ġ t) as an example. I found that "Ġ" appears in many rows (for example the third row), so why isn't it a one-to-one correspondence? (See the byte-mapping sketch after this list.)
  2. Are the items in encoder.json the subtokens produced by BPE? Take "\u0120regress" as an example: why does "\u0120" appear there?
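For what it's worth, both questions come down to the same byte-level trick: before BPE runs, every raw byte is mapped to a printable unicode character (bytes_to_unicode in src/encoder.py), and the space byte 0x20 lands on Ġ (\u0120). So Ġ isn't tied to any particular word; a merge row like "Ġ t" and an entry like "\u0120regress" both just mean "space followed by ...", and the items in encoder.json are indeed the BPE subword units over that byte alphabet. A small sketch of the mapping:

# Mirrors bytes_to_unicode() from src/encoder.py: every raw byte gets a
# printable stand-in, and byte 0x20 (space) maps to 'Ġ' (U+0120).
def bytes_to_unicode():
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(2 ** 8):
        if b not in bs:
            bs.append(b)
            cs.append(2 ** 8 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))

print(bytes_to_unicode()[ord(' ')])  # prints: Ġ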

About Perplexity

Hi, I'm D. Y. Kim, an NLP developer in Korea.
First of all, thank you so much for your project.
It helped me a great deal in building a Korean GPT-2 model.

I have one question about metrics such as perplexity.
In OpenAI's paper, they use perplexity to evaluate their model,
but I can't find perplexity anywhere in your code.
Your code calculates two quantities, v_loss and avg_loss,
so I guess avg_loss or v_val_loss (in validation) is an alternative metric.
Is that right?

If not, is there any method to calculate perplexity?
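A hedged note on the metric: the losses train.py prints (loss, avg, and v_val_loss) are mean per-token cross-entropies in nats, so perplexity is simply the exponential of whichever of them you care about:

import math

def perplexity(mean_cross_entropy_nats):
    # perplexity = exp(mean per-token cross-entropy), assuming natural log,
    # which tf.nn.sparse_softmax_cross_entropy_with_logits uses.
    return math.exp(mean_cross_entropy_nats)

print(perplexity(3.20))  # e.g. a validation loss of 3.20 -> perplexity of about 24.5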

Intermediate Layer Output

Similar to the issue I posted here: openai#148
Is it possible to use the intermediate layer outputs and generate text ignoring the layers on top? Basically, I want to check the quality of generations as we keep adding more layers. What modifications would I have to make in the src/sample.py script for this? Thanks.

Image GPT Training

Hey!
I was looking into re-training the pre-trained models in Image GPT.
Since the project was forked from GPT2, I was wondering if your train.py might work for Image GPT. (Since I can't find any documentation in the Image GPT repo for training.)

Anyone have any insights?
Thanks in advance! :)

Error when calculating the validation loss - indices[0,1200] = 1200 is not in [0, 1024)

Hi! Thanks for your excellent repo and instructions. The model trained well, but I ran into an issue when trying to calculate the validation loss on my own validation data with val_batch_count=4000.

The Error is as follows:

Loading checkpoint models/medium/model-570000
Loading dataset...
Training...
Calculating validation loss...
 37%|██████████████▎                        | 1471/4000 [13:49<20:24,  2.07it/s]Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[0,1200] = 1200 is not in [0, 1024)
	 [[{{node model_1/GatherV2_1}} = GatherV2[Taxis=DT_INT32, Tindices=DT_INT32, Tparams=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](model/wpe/read, model_1/Tile, model_1/h23/attn/range/start)]]

I found a similar issue here: minimaxir/gpt-2-simple#38
It is said this may be caused by a long prefix, but I'm not sure how to solve it.

Does anyone know how?

774M Model running out of memory

2019-08-20 22:56:50.301264: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7f4cfe8e6a00 next 222 of size 256
2019-08-20 22:56:50.301278: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7f4cfe8e6b00 next 224 of size 5120     
2019-08-20 22:56:50.301307: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7f4cfe8e7f00 next 225 of size 256                      
2019-08-20 22:56:50.301330: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7f4cfe8e8000 next 227 of size 20480                     
2019-08-20 22:56:50.301339: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7f4cfe8ed000 next 228 of size 5120     
2019-08-20 22:56:50.301347: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7f4cfe8ee400 next 229 of size 5120     
2019-08-20 22:56:50.301355: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7f4cfe8ef800 next 230 of size 5120     
2019-08-20 22:56:50.301381: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7f4cfe8f0c00 next 231 of size 5120     
2019-08-20 22:56:50.301387: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7f4cfe8f2000 next 232 of size 5120     
2019-08-20 22:56:50.301399: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7f4cfe8f3400 next 233 of size 5120                     
2019-08-20 22:56:50.301408: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7f4cfe8f4800 next 234 of size 256      
2019-08-20 22:56:50.301416: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7f4cfe8f4900 next 236 of size 5120     
2019-08-20 22:56:50.301425: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7f4cfe8f5d00 next 237 of size 256      
2019-08-20 22:56:50.301433: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7f4cfe8f5e00 next 238 of size 15360    
2019-08-20 22:56:50.301442: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7f4cfe8f9a00 next 239 of size 256                                                                            
2019-08-20 22:56:50.301450: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7f4cfe8f9b00 next 242 of size 5120     
2019-08-20 22:56:50.301459: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7f4cfe8faf00 next 243 of size 256                         
2019-08-20 22:56:50.312661: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7f4cfe8fb000 next 18446744073709551615 of size 20480   
2019-08-20 22:56:50.312681: I tensorflow/core/common_runtime/bfc_allocator.cc:809]      Summary of in-use Chunks by size:                           
2019-08-20 22:56:50.312699: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 411 Chunks of size 256 totalling 102.8KiB         
2019-08-20 22:56:50.312710: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 1 Chunks of size 1280 totalling 1.2KiB            
2019-08-20 22:56:50.312720: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 16 Chunks of size 4096 totalling 64.0KiB                          
2019-08-20 22:56:50.312732: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 940 Chunks of size 5120 totalling 4.59MiB         
2019-08-20 22:56:50.312741: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 144 Chunks of size 15360 totalling 2.11MiB                        
2019-08-20 22:56:50.312750: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 144 Chunks of size 20480 totalling 2.81MiB        
2019-08-20 22:56:50.312760: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 8 Chunks of size 81920 totalling 640.0KiB                         
2019-08-20 22:56:50.312770: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 8 Chunks of size 4194304 totalling 32.00MiB       
2019-08-20 22:56:50.312779: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 79 Chunks of size 5242880 totalling 395.00MiB     
2019-08-20 22:56:50.312789: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 1 Chunks of size 5246976 totalling 5.00MiB        
2019-08-20 22:56:50.312798: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 143 Chunks of size 6553600 totalling 893.75MiB    
2019-08-20 22:56:50.312808: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 1 Chunks of size 9134592 totalling 8.71MiB        
2019-08-20 22:56:50.312821: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 141 Chunks of size 19660800 totalling 2.58GiB                     
2019-08-20 22:56:50.312831: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 28 Chunks of size 20971520 totalling 560.00MiB                   
2019-08-20 22:56:50.312841: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 1 Chunks of size 22806528 totalling 21.75MiB                                                                                         
2019-08-20 22:56:50.312850: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 283 Chunks of size 26214400 totalling 6.91GiB     
2019-08-20 22:56:50.312860: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 1 Chunks of size 32243712 totalling 30.75MiB                      
2019-08-20 22:56:50.312869: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 1 Chunks of size 32505856 totalling 31.00MiB      
2019-08-20 22:56:50.312879: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 2 Chunks of size 33554432 totalling 64.00MiB                      
2019-08-20 22:56:50.312888: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 7 Chunks of size 37748736 totalling 252.00MiB                     
2019-08-20 22:56:50.312898: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 1 Chunks of size 38061056 totalling 36.30MiB                      
2019-08-20 22:56:50.312909: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 1 Chunks of size 40894464 totalling 39.00MiB      
2019-08-20 22:56:50.312918: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 1 Chunks of size 50394368 totalling 48.06MiB                                                        
2019-08-20 22:56:50.312927: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 28 Chunks of size 83886080 totalling 2.19GiB                                                                                                                                               
2019-08-20 22:56:50.312937: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 1 Chunks of size 141774848 totalling 135.21MiB    
2019-08-20 22:56:50.312947: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 1 Chunks of size 142606336 totalling 136.00MiB    
2019-08-20 22:56:50.312956: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 1 Chunks of size 257315840 totalling 245.40MiB    
2019-08-20 22:56:50.312966: I tensorflow/core/common_runtime/bfc_allocator.cc:816] Sum Total of in-use chunks: 14.55GiB              
2019-08-20 22:56:50.312975: I tensorflow/core/common_runtime/bfc_allocator.cc:818] total_region_allocated_bytes_: 15652398592 memory_limit_: 15652398695 available bytes: 103 curr_region_allocation_bytes_: 17179869184
2019-08-20 22:56:50.312997: I tensorflow/core/common_runtime/bfc_allocator.cc:824] Stats:                                            
Limit:                 15652398695                                                                                                   
InUse:                 15626903296                                                                                                   
MaxInUse:              15647874816                                                                                                                                                                         
NumAllocs:                    4685                                                                                                                   
MaxAllocSize:            257315840                                                                                                                      
                                                                                                                                     
2019-08-20 22:56:50.313127: W tensorflow/core/common_runtime/bfc_allocator.cc:319] ****************************************************************************************************
2019-08-20 22:56:50.313180: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at cwise_ops_common.cc:70 : Resource exhausted: OOM when allocating tensor with shape[1,20,1024,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):                                                                                                   
  File "/root/miniconda3/envs/tft/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call                            
    return fn(*args)                                                                                                                                  
  File "/root/miniconda3/envs/tft/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1341, in _run_fn            
    options, feed_dict, fetch_list, target_list, run_metadata)                                                                       
  File "/root/miniconda3/envs/tft/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
    run_metadata)                                                                                                                    
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1,20,1024,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node gradients/model/h18/attn/truediv_1_grad/Neg}}]]                                                                                    
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
                                                                                                                                     
                                                                                                                                     
During handling of the above exception, another exception occurred:                                                                  
                                                                                                                                                                                                           
Traceback (most recent call last):                                                            
File "./train.py", line 291, in <module>                                                                                                           
    main()                                                                                                                                            
  File "./train.py", line 269, in main                                                                                               
    feed_dict={context: sample_batch()})                                                                                             
  File "/root/miniconda3/envs/tft/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 950, in run                 
    run_metadata_ptr)                                                                                                                
  File "/root/miniconda3/envs/tft/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1173, in _run               
    feed_dict_tensor, options, run_metadata)                                                                                                         
  File "/root/miniconda3/envs/tft/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run            
    run_metadata)                                                                                                                    
  File "/root/miniconda3/envs/tft/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call           
    raise type(e)(node_def, op, message)                                                                                             
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1,20,1024,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[node gradients/model/h18/attn/truediv_1_grad/Neg (defined at /home/surya/gpt-2/src/memory_saving_gradients.py:216) ]]     
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
                                                                                                                                                     
                                                                                                                                                    
Errors may have originated from an input operation.                                                                                  
Input Source operations connected to node gradients/model/h18/attn/truediv_1_grad/Neg:                                               
 model/h18/attn/Exp_1 (defined at /home/surya/gpt-2/src/memory_saving_gradients.py:204)                                                              
                                                                                                                                     
Original stack trace for 'gradients/model/h18/attn/truediv_1_grad/Neg':                                                                              
  File "./train.py", line 291, in <module>                                                                                           
    main()                                                                                                                                           
  File "./train.py", line 138, in main                                                                                               
    opt_grads = memory_saving_gradients.gradients(loss, train_vars)                                                                  
  File "/home/surya/gpt-2/src/memory_saving_gradients.py", line 216, in gradients                                                    
    dv = tf_gradients(ys=copied_ys, xs=boundary+xs, grad_ys=grad_ys, **kwargs)                                                       
  File "/root/miniconda3/envs/tft/lib/python3.7/site-packages/tensorflow/python/ops/gradients_impl.py", line 158, in gradients       
    unconnected_gradients)                                                                                                                           
  File "/root/miniconda3/envs/tft/lib/python3.7/site-packages/tensorflow/python/ops/gradients_util.py", line 731, in _GradientsHelper               
    lambda: grad_fn(op, *out_grads))                                                                                                                                                                                    
  File "/root/miniconda3/envs/tft/lib/python3.7/site-packages/tensorflow/python/ops/gradients_util.py", line 403, in _MaybeCompile   
    return grad_fn()  # Exit early                                                                                                                   
  File "/root/miniconda3/envs/tft/lib/python3.7/site-packages/tensorflow/python/ops/gradients_util.py", line 731, in <lambda>        
    lambda: grad_fn(op, *out_grads))                                                                                                                 
  File "/root/miniconda3/envs/tft/lib/python3.7/site-packages/tensorflow/python/ops/math_grad.py", line 1147, in _RealDivGrad                        
    grad * math_ops.realdiv(math_ops.realdiv(-x, y), y), ry),                                                                                        
  File "/root/miniconda3/envs/tft/lib/python3.7/site-packages/tensorflow/python/ops/gen_math_ops.py", line 6633, in neg              
    "Neg", x=x, name=name)                                                                                                                                                             
  File "/root/miniconda3/envs/tft/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper                                                                                                                                   
    op_def=op_def)                                                                                                                   
  File "/root/miniconda3/envs/tft/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func          
    return func(*args, **kwargs)                                                                                                     
  File "/root/miniconda3/envs/tft/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op           
    op_def=op_def)                                                                                                                                                                                                      
  File "/root/miniconda3/envs/tft/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__            
    self._traceback = tf_stack.extract_stack()                                                                                       
                                                                                                                                     
...which was originally created as op 'model/h18/attn/truediv_1', defined at:                                                                                                                              
  File "./train.py", line 291, in <module>                                                                                                           
    main()                                                                                                                                              
[elided 0 identical lines from previous traceback]                                                                                   
  File "./train.py", line 138, in main                                                                                                                                                 
    opt_grads = memory_saving_gradients.gradients(loss, train_vars)                                                                                                                                                                                                           
  File "/home/surya/gpt-2/src/memory_saving_gradients.py", line 204, in gradients                                                    
    copied_sgv, info = ge.copy_with_input_replacements(ge.sgv(ops_to_copy), {})                                                      
  File "/root/miniconda3/envs/tft/lib/python3.7/site-packages/tensorflow/contrib/graph_editor/transform.py", line 672, in copy_with_input_replacements
    sgv, dst_graph, dst_scope, src_scope, reuse_dst_scope=reuse_dst_scope)                                                                            
  File "/root/miniconda3/envs/tft/lib/python3.7/site-packages/tensorflow/contrib/graph_editor/transform.py", line 452, in __call__   
    self._copy_ops(info)                                                                                                             
  File "/root/miniconda3/envs/tft/lib/python3.7/site-packages/tensorflow/contrib/graph_editor/transform.py", line 466, in _copy_ops  
    op_, op_outputs_ = self.transform_op_handler(info, op, new_inputs)                                                               
  File "/root/miniconda3/envs/tft/lib/python3.7/site-packages/tensorflow/contrib/graph_editor/transform.py", line 176, in copy_op_handler                                                                  
    [], input_types_, None, op_def_)                                                                                                                 
  File "/root/miniconda3/envs/tft/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__                               
    self._traceback = tf_stack.extract_stack()          

Running on a personal machine with GPUs and everything installed. It worked well for the 345M model, but I'm running into memory issues with 774M.

I made sure memory-saving gradients were on and the batch size was just 1. Any suggestions?

Process gets killed when training

I am training with the smallest GPT-2 (117M parameters).

Loading dataset...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 109.93it/s]
dataset has 42736 tokens
Training...
Killed

However the process gets killed as shown above. Any help is appreciated.
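A hedged guess (not from the original thread): a bare "Killed" with no Python traceback usually means the Linux out-of-memory killer ended the process because system RAM, not GPU memory, ran out while the graph and checkpoint were being built. The kernel log says so explicitly if that's what happened:

dmesg | grep -i -E 'out of memory|killed process'
free -h   # check available RAM and swap before retrying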

Splitting the model across multiple graphics cards

Like many others, I am getting an OOM issue. I think I can resolve this on my own, but am asking this question with the thought of future bigger models that might be created.

Is there a solid way to achieve model parallelism with GPT-2 so that the model can be split across multiple graphics cards? That way any OOM issue could be resolved by scaling up the number of graphics cards.

For my own selfish purposes I have many smaller graphics cards and want to see if that is a possibility before deciding to buy one very expensive one.

[SOLUTION] UnicodeEncodeError: 'charmap' codec can't encode character '\ufffd' in position 29: character maps to <undefined>

In Windows 10 I was constantly getting the following error when training the model on a non-English language (with special characters like čšž), regardless of the model used:

Traceback (most recent call last):
  File "train.py", line 293, in <module>
    main()
  File "train.py", line 258, in main
    generate_samples()
  File "train.py", line 228, in generate_samples
    fp.write('\n'.join(all_text))
  File "C:\Users\6756\AppData\Local\Programs\Python\Python36\lib\encodings\cp1250.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\ufffd' in position 1581: character maps to <undefined>

I fixed the issue in train.py by adding encoding="utf-8" to the open() call in generate_samples():

with open(
        os.path.join(SAMPLE_DIR, args.run_name,
                     'samples-{}').format(counter), 'w', encoding="utf-8") as fp:
    fp.write('\n'.join(all_text))

I thought someone else might be struggling with this issue and might find it useful.

Encode of a new dataset, confused about <|endoftext|> encoding

When encoding a new dataset and using <|endoftext|> as a delimiter, for example:

message <|endoftext|> message

The encode function in "src/encoder.py" transforms "<|endoftext|>" into [27, 91, 437, 1659, 5239, 91, 29] instead of [50256] (50256 is the index of <|endoftext|> in the dictionary).

So I went to check "src/encoder.py" and found that

import regex as re
pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
text = "<|endoftext|> hi."
for token in re.findall(pat, text):
    print(token)

I get:

<|
endoftext
|>
hi
.

Why does it split <|endoftext|> into three parts (which I think leads to the wrong encoding of <|endoftext|>)? Should it rather be:

<|endoftext|>
hi
.
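
Until someone clarifies whether the regex is meant to behave that way, one workaround is to split on the delimiter yourself and insert the special token id manually. A minimal sketch, assuming src/encoder.py from this repo, PYTHONPATH=src, and a downloaded 117M model:

import encoder  # src/encoder.py

enc = encoder.get_encoder('117M')
eot_id = enc.encoder['<|endoftext|>']   # 50256 in the released vocabulary

text = "message one <|endoftext|> message two"
tokens = []
for i, piece in enumerate(text.split('<|endoftext|>')):
    if i > 0:
        tokens.append(eot_id)        # emit the single special token
    tokens.extend(enc.encode(piece))
print(tokens)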

Sampling structure looks weird. Maybe because I'm structuring my data wrongly?

My data, before encoding, looks like this:

Person 1: Something something something.
Person 2: Something something something.
Person 1: Something something something.

<|endoftext|>

Person 1: Something something something.
Person 2: Something something something.
Person 1: Something something something.

<|endoftext|>


But while training I get something that looks like this when sampling:

Generating samples...
======== SAMPLE 2 ========

<|endoftext|>

Person 1: Something something something.
Person 2: Something something something.
Person 1: Something something something.

<|endoftext|>

Or:

Generating samples...
======== SAMPLE 2 ========
Something Something Something Something Something Something Something Something Something Something Something Something Something Something Something Something Something Something Something Something...

<|endoftext|>

Person 1: Something something something.
Person 2: Something something something.
Person 1: Something something something.

<|endoftext|>

As you can see, the sample starts off weirdly, not in the same structure as my original data, but after that it sort of corrects itself. Does somebody know what is causing this?

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[1,12,1024,1024] and type float on /job:localhost/replica:0/task:0/device:GPU

Caused by op 'model/h3/attn/truediv_1', defined at:
  File "train.py", line 293, in <module>
    main()
  File "train.py", line 138, in main
    opt_grads = memory_saving_gradients.gradients(loss, train_vars)
  File "C:\Users\The Atomizer\Desktop\text\gpt2\memory_saving_gradients.py", line 250, in gradients
    copied_sgv, info = ge.copy_with_input_replacements(ge.sgv(ops_to_copy), {})
  File "C:\Users\The Atomizer\Miniconda3\envs\gtext\lib\site-packages\tensorflow\contrib\graph_editor\transform.py", line 673, in copy_with_input_replacements
    sgv, dst_graph, dst_scope, src_scope, reuse_dst_scope=reuse_dst_scope)
  File "C:\Users\The Atomizer\Miniconda3\envs\gtext\lib\site-packages\tensorflow\contrib\graph_editor\transform.py", line 453, in __call__
    self._copy_ops(info)
  File "C:\Users\The Atomizer\Miniconda3\envs\gtext\lib\site-packages\tensorflow\contrib\graph_editor\transform.py", line 467, in _copy_ops
    op_, op_outputs_ = self.transform_op_handler(info, op, new_inputs)
  File "C:\Users\The Atomizer\Miniconda3\envs\gtext\lib\site-packages\tensorflow\contrib\graph_editor\transform.py", line 177, in copy_op_handler
    [], input_types_, None, op_def_)
  File "C:\Users\The Atomizer\Miniconda3\envs\gtext\lib\site-packages\tensorflow\python\framework\ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[1,12,1024,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node model/h3/attn/truediv_1 (defined at C:\Users\The Atomizer\Miniconda3\envs\gtext\lib\site-packages\tensorflow\contrib\graph_editor\transform.py:177) = RealDiv[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](model/h3/attn/Exp_1, model/h3/attn/Sum_1)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Encoding large single text files is not working

I'm trying to encode a 1.7 GB txt file for training purposes. After starting the encode process from cmd I could see resources being drained in Task Manager, but after ~30 minutes everything went back to idle while the console output had not moved past "Reading files 0%". From what I can tell the GPU is working too, with cudart64_101.dll loading.

System spec:
GTX 970
i5-8400
8 GB RAM + NVMe SSD

Please help, scraping this much data was hard.

Later edit:
A second try eventually produced this error:

Traceback (most recent call last):
  File "encode.py", line 31, in <module>
    main()
  File "encode.py", line 25, in main
    chunks = load_dataset(enc, args.in_text, args.combine, encoding=args.encoding)
  File "C:\_stash\openAI\gpt-2\src\load_dataset.py", line 35, in load_dataset
    tokens = np.stack(enc.encode(raw_text))
  File "C:\_stash\openAI\gpt-2\src\encoder.py", line 100, in encode
    bpe_tokens.extend(self.encoder[bpe_token] for bpe_token in self.bpe(token).split(' '))
MemoryError

Later later edit:
Encoding the folder containing the individual text files, without merging them into a single file, worked fine.
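
For what it's worth, a small helper that reproduces that workaround automatically: split the big file into pieces on disk and point encode.py at the resulting folder. Everything here (paths, the 50 MB chunk size) is arbitrary, not something the repo prescribes.

# Split one huge .txt into ~50 MB pieces so they can be encoded one at a time.
import os

def split_file(path, out_dir, chunk_bytes=50 * 1024 * 1024):
    os.makedirs(out_dir, exist_ok=True)
    part, buf, size = 0, [], 0

    def flush():
        out_path = os.path.join(out_dir, 'part_%04d.txt' % part)
        with open(out_path, 'w', encoding='utf-8') as dst:
            dst.writelines(buf)

    with open(path, encoding='utf-8') as src:
        for line in src:
            buf.append(line)
            size += len(line.encode('utf-8'))
            if size >= chunk_bytes:
                flush()
                part, buf, size = part + 1, [], 0
    if buf:
        flush()

split_file('corpus.txt', 'corpus_parts')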

Training from scratch?

I see that you provide code for finetuning the pretrained models -- do you think that this code is also appropriate for training a model from scratch? Or are there other repos that you think would be more appropriate for from-scratch training?

Thanks!

Encoding on GPU

I have tried to encode a file on the GPU, but it is still running on the CPU. I can't encode that file; the process gets killed before it even starts.
python encode.py corpus_final.txt corpus_final.npz --model_name 345M
Reading files
0%| | 0/1 [00:00<?, ?it/s]Killed

This file has 1.9M sentences.

Why is the training label like this?

In the code of train.py I found the loss function:

        loss = tf.reduce_mean(
            tf.nn.sparse_softmax_cross_entropy_with_logits(
                labels=context[:, 1:], logits=output['logits'][:, :-1]))

But why does it use the slice [:, 1:] for the labels and [:, :-1] for the logits? Why are the slices not the same?
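
Not an authoritative answer, but as far as I can tell the two slices just line up each position's prediction with the next token: the logits at position t are scored against the token at position t+1, so the last logit and the first label are dropped. A tiny plain-Python illustration (no TensorFlow):

context = ['A', 'B', 'C', 'D']             # one training window
scored_positions = context[:-1]            # logits[:, :-1]: predictions made at A, B, C
targets = context[1:]                      # labels = context[:, 1:]: B, C, D
print(list(zip(scored_positions, targets)))   # [('A', 'B'), ('B', 'C'), ('C', 'D')]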

Apologies, but HELP

When I try to encode a custom dataset I receive these errors. I am relatively new to Python and am unable to debug this myself. I've been at it for two days now and I still can't figure it out. Please help :(

(base) C:\Users\ThisPC\Downloads\Python\AI\gpt-2-finetuning>python encode.py WordBank.txt WordBank.npz
Traceback (most recent call last):
  File "encode.py", line 31, in <module>
    main()
  File "encode.py", line 23, in main
    enc = encoder.get_encoder(args.model_name)
  File "C:\Users\ThisPC\Downloads\Python\AI\gpt-2-finetuning\encoder.py", line 110, in get_encoder
    encoder = json.load(f)
  File "C:\Users\ThisPC\Anaconda3\lib\json\__init__.py", line 296, in load
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
  File "C:\Users\ThisPC\Anaconda3\lib\json\__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "C:\Users\ThisPC\Anaconda3\lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Users\ThisPC\Anaconda3\lib\json\decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
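
Not a definitive diagnosis, but "Expecting value: line 1 column 1 (char 0)" from json.load usually means the file being parsed is empty or not JSON at all, i.e. models/117M/encoder.json is missing, zero bytes, or was never downloaded with the download script. A quick check, assuming the default models/ layout:

import json, os

path = os.path.join('models', '117M', 'encoder.json')
print('exists:', os.path.exists(path))
if os.path.exists(path):
    print('size in bytes:', os.path.getsize(path))
    with open(path, 'r', encoding='utf-8') as f:
        print('vocab entries:', len(json.load(f)))   # should be 50257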

Freezing layers while finetuning

Are there some pointers on how to finetune only a few layers while freezing the others? Say, I just want to finetune the final layer while leaving the weights of the other layers intact, etc.

A natural extension to this would be gradual unfreezing, wherein for certain epochs you finetune the higher layers while keeping the other layers frozen, then unfreeze more, similar to what ULMFiT (https://arxiv.org/abs/1801.06146) did.
I wanted to know if the above two are possible here, especially the first one. Thanks!
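
As far as I know there is no flag for this beyond only_train_transformer_layers, but in TF 1.x you can freeze layers by passing only the variables you want updated to the optimizer's var_list. A self-contained toy (not GPT-2 itself; the scope names just mimic model/h0, model/h1 from model.py):

import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 4])
with tf.variable_scope('model/h0'):
    h = tf.layers.dense(x, 4)
with tf.variable_scope('model/h1'):
    y = tf.layers.dense(h, 1)
loss = tf.reduce_mean(tf.square(y))

# Only the final block receives gradients; model/h0 stays frozen.
final_block = [v for v in tf.trainable_variables() if v.name.startswith('model/h1/')]
train_op = tf.train.AdamOptimizer(1e-4).minimize(loss, var_list=final_block)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_op, feed_dict={x: np.random.randn(8, 4).astype(np.float32)})

Gradual unfreezing would then just be a schedule that swaps in a larger var_list (or switches between several train ops) as training progresses.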

Zero Division Error

I'm getting the following error:

Traceback (most recent call last):
  File "./train.py", line 293, in <module>
    main()
  File "./train.py", line 271, in main
    feed_dict={context: sample_batch()})
  File "./train.py", line 247, in sample_batch
    return [data_sampler.sample(1024) for _ in range(args.batch_size)]
  File "./train.py", line 247, in <listcomp>
    return [data_sampler.sample(1024) for _ in range(args.batch_size)]
  File "/Users/crigas/gpt-2/src/load_dataset.py", line 74, in sample
    self.chunks
ZeroDivisionError: integer division or modulo by zero

Any help would be appreciated!
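
In case it helps, this error usually means the Sampler ended up with zero chunks, i.e. load_dataset matched no files or produced no tokens (wrong --dataset path, unsupported file type, or an empty file), or the dataset is too small to sample 1024 tokens from. A quick diagnostic sketch, assuming you run it from the repo root with PYTHONPATH=src and substitute your own dataset path:

import encoder
from load_dataset import load_dataset

enc = encoder.get_encoder('117M')
chunks = load_dataset(enc, 'path/to/your/dataset.txt', combine=50000)
print(len(chunks), 'chunks,', sum(len(c) for c in chunks), 'tokens')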

GPT-2: Q/A Training Question

I am curious about how to train GPT-2 on a question/answer dataset. From my understanding, sample_sequence.py will take a corpus, randomly break it into two parts, and the goal of the network is to predict the second part from the first. Is this a correct understanding of the training cycle?

Because then, instead of sampling random spans of text, I would sample question/answer pairs, where the question is the first part and the answer is the second part, correct?
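
As far as I can tell from train.py, it does not split anything into question/answer halves; it samples random 1024-token windows from your encoded text and trains next-token prediction over the whole window (sample_sequence.py is only used for generation). A common approach is therefore to serialize the pairs into plain text and let the model learn the pattern; the field names below are my own, not anything the repo requires:

pairs = [
    ('What is GPT-2?', 'A large transformer language model.'),
    ('Who released the weights?', 'OpenAI.'),
]
with open('qa_dataset.txt', 'w', encoding='utf-8') as fp:
    for question, answer in pairs:
        fp.write('Question: %s\nAnswer: %s\n<|endoftext|>\n' % (question, answer))

At generation time you would then condition on "Question: ...\nAnswer:" and let the model complete the rest.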

Restriction on only training transformer layers?

I see in the training code that:


if args.model_name == '345M':
    args.memory_saving_gradients = True
    args.only_train_transformer_layers = True

Why is the restriction imposed when finetuning for the 345M model? Am I missing something here?

NameError: name 'How' is not defined

Following this notebook I get the below error on the first c.show(). Any idea why this is happening?

You said: "Nice to meet you. What's your name?"
I said: "My name is Pete."
You said: "That's an interesting name. How old are you?"
I said: "I'm 40 years old."
You said: "Can you tell me something about yourself?"
I said: "Ofcourse! I like playing video games and eating cake. "
You said: "I like sweet stuff too. What are your plans for tomorrow?"

--> I said: "I actually have one new project I want to work on that I just can't get rid of. I want to start working on it. Just think of me as a genius, 'cause I'm by far one of the best creative graphic designers in the world :)"
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-43-86df1d1172c0> in <module>()
----> 1 c.show()

<ipython-input-39-469f69714831> in show(self)
    107       party, answer  = self.suggestion
    108       print("--> "+answer)
--> 109     How
    110 

NameError: name 'How' is not defined

Train loss

Hi all,

I can't understand how the loss is computed, in particular what is being compared.

If I print the two tensors which appear in the loss term during execution, I get:

CONTEXT [[290 1526 1636 1526 75 1357 12 11...]...]

and

OUTPUT_LOGITS[[[-36.8163338 -36.7796745 -40.5458221 -39.6132202 -40.1266747 -40.50746...]]...]

Could you please explain how they are related, and how the training happens? Thanks a lot.

gpt2 translation task

I want to fine-tune the model and do machine translation based on GPT-2. I created my dataset according to the GPT-2 paper in this format: 'sentence1 = translation1 \n sentence2 = translation2 \n ...' and did the fine-tuning. After training, I try to translate with 'python interactive_conditional_samples.py --top_k 40', but when I type in my input, it just shows me a paragraph of several sentences (A = B \n B = C ...), not the translation of my input. Is there anything wrong with my dataset or training? How can I do machine translation with GPT-2?
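
For what it's worth, a sketch of that '=' format plus the matching prompt; the sentence pairs are placeholders. Getting one clean translation back usually also means truncating the sample at the first newline yourself, which interactive_conditional_samples.py does not do for you.

pairs = [('good morning', 'buenos dias'), ('thank you', 'gracias')]
with open('translate_train.txt', 'w', encoding='utf-8') as fp:
    for source, target in pairs:
        fp.write('%s = %s\n' % (source, target))

prompt = 'how are you ='   # condition on this at sampling time
# completion = sample_from_model(prompt)            # hypothetical sampling call
# translation = completion.split('\n')[0].strip()   # keep only the first line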

Windows doesn't automatically use UTF-8 encoding

Shouldn't all file operations have encoding="utf-8" added to make the code more portable to systems like Windows? Or is there some global switch that could be applied at the start so it doesn't crash with a "[...]charmap' codec can't encode character[...]" message?
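
There is a global switch in recent Pythons: UTF-8 mode (PEP 540, Python 3.7+), enabled with the PYTHONUTF8=1 environment variable or the -X utf8 interpreter option, makes UTF-8 the default text encoding regardless of the Windows code page. Failing that, the per-call fix looks like this (the sample strings are made up):

samples = ['prva vrstica čšž', 'druga vrstica']
with open('samples.txt', 'w', encoding='utf-8') as fp:
    fp.write('\n'.join(samples))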

Batch size in training GPT-2

I have a question about the batch size in train.py. Does batch_size=1 mean that 1 token is passed to the model per step, or 1024 tokens? Can anyone explain? Thanks a lot.
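
From the sample_batch shown in the ZeroDivisionError traceback above, each batch element is a window of 1024 token ids, so batch_size=1 still feeds 1024 tokens per step; batch_size counts windows, not tokens. A shape-only illustration (numpy placeholder values, not real data):

import numpy as np

batch_size, sample_length = 1, 1024
context = np.zeros([batch_size, sample_length], dtype=np.int32)  # what the 'context' placeholder receives
print(context.shape)   # (1, 1024)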
