Comments (4)
The model was trained without this token; you should add an EOS token and fine-tune the model. We will also release models trained with an EOS token.
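A minimal sketch of what this fix could look like with the Hugging Face API (the fine-tuning loop itself is omitted; the fact that '</s>' sits at id 2 in this vocab is taken from the dumps below):

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

name = "sberbank-ai/rugpt3small_based_on_gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(name)
model = GPT2LMHeadModel.from_pretrained(name)

# Option A: reuse the '</s>' row that already exists in the vocab (id 2),
# then fine-tune with tokenizer.eos_token appended to every training document.
tokenizer.eos_token = "</s>"

# Option B: register a genuinely new token and grow the embedding matrix
# before fine-tuning (the new row starts out untrained).
# tokenizer.add_special_tokens({"eos_token": "<|endoftext|>"})
# model.resize_token_embeddings(len(tokenizer))
```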
from ru-gpts.
We train the models without this token (on plain text only), but the first tokens in the vocab are specials. Which version of transformers do you use?
from ru-gpts.
To reproduce the behaviour above, I used transformers==2.8.0 (from requirements.txt). My overall goal is to make the model generate sentences separated from the context, not just continue the context sentence. For that I use the text generation pipeline from 3.4.0.
Moreover, the raw model tries to generate max_length tokens even if the output is an incomplete sentence. It seems that it doesn't consider </s> during beam search at all. I don't know whether this behaviour is related.
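With the 3.4.0 generate() API, one workaround is to hand beam search a stop id explicitly; since eos_token_id is None for this tokenizer, it otherwise has no stop signal at all. A sketch, where model and input_ids are placeholders and the id 2 for '</s>' comes from the dump below:

```python
# Without a resolvable eos_token_id, beam search always runs to
# max_length. Passing the id of '</s>' (2 in this vocab) lets it
# finish beams early once that token is produced.
output_ids = model.generate(
    input_ids,
    max_length=100,
    num_beams=5,
    eos_token_id=tokenizer.convert_tokens_to_ids("</s>"),  # == 2
    early_stopping=True,
)
```

Note the model was not trained to emit '</s>', so it may rarely produce it without the fine-tuning suggested above.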
The tokenizer attributes in 2.8.0, using generate_transformers.py, are:
<transformers.tokenization_gpt2.GPT2Tokenizer object at 0x7f11f646d9d0>
NO_PAD_TOKEN_FOR_BATCH_MSG = {str} 'No padding token is set for this model, therefore no batch can be made with uneven sequences. Set a padding token or adjust the lengths of the sequences building the batch so that every sequence is of the same length.'
SPECIAL_TOKENS_ATTRIBUTES = {list: 8} ['bos_token', 'eos_token', 'unk_token', 'sep_token', 'pad_token', 'cls_token', 'mask_token', 'additional_special_tokens']
UNEVEN_SEQUENCES_FOR_BATCH_MSG = {str} 'The sequences building the batch are not of the same size, no tensor can be built. Set `pad_to_max_length=True` to pad the smaller sequencesup to the larger sequence\'s length.'
added_tokens_decoder = {dict: 0} {}
added_tokens_encoder = {dict: 0} {}
additional_special_tokens = {list: 0} []
additional_special_tokens_ids = {list: 0} []
all_special_ids = {list: 1} [None]
all_special_tokens = {list: 1} ['<|endoftext|>']
bos_token = {str} '<|endoftext|>'
bos_token_id = {NoneType} None
bpe_ranks = {dict: 49996}
byte_encoder = {dict: 256}
cache = {dict: 0} {}
cls_token = {NoneType} None
cls_token_id = {NoneType} None
decoder = {dict: 50257} {0: '<pad>', 1: '<s>', 2: '</s>', 3: '<unk>', 4: '<mask>', 5: '!', 6: '"', ...
eos_token = {str} '<|endoftext|>'
eos_token_id = {NoneType} None
errors = {str} 'replace'
init_inputs = {tuple: 0} ()
init_kwargs = {dict: 2} {'vocab_file': '/home/superuser/khovrichev/ru-gpts/ckpt/gpt3_medium/vocab.json', 'merges_file': '/home/superuser/khovrichev/ru-gpts/ckpt/gpt3_medium/merges.txt'}
mask_token = {NoneType} None
mask_token_id = {NoneType} None
max_len = {int} 1000000000000
max_len_sentences_pair = {int} 1000000000000
max_len_single_sentence = {int} 1000000000000
max_model_input_sizes = {dict: 5} {'gpt2': 1024, 'gpt2-medium': 1024, 'gpt2-large': 1024, 'gpt2-xl': 1024, 'distilgpt2': 1024}
model_input_names = {list: 2} ['token_type_ids', 'attention_mask']
pad_token = {NoneType} None
pad_token_id = {NoneType} None
pad_token_type_id = {int} 0
padding_side = {str} 'right'
pat = {Pattern} regex.Regex("'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)|\\s+", flags=regex.V0)
pretrained_init_configuration = {dict: 0} {}
pretrained_vocab_files_map = {dict: 2} {'vocab_file': {'gpt2': 'https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json', 'gpt2-medium': 'https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-vocab.json', 'gpt2-large': 'https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-vocab.json', 'gpt2-xl': 'https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-xl-vocab.json', 'distilgpt2': 'https://s3.amazonaws.com/models.huggingface.co/bert/distilgpt2-vocab.json'}, 'merges_file': {'gpt2': 'https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt', 'gpt2-medium': 'https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-merges.txt', 'gpt2-large': 'https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-merges.txt', 'gpt2-xl': 'https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-xl-merges.txt', 'distilgpt2': 'https://s3.amazonaws.com/models.huggingface.co/bert/distilgpt2-merges.txt'}}
sep_token = {NoneType} None
sep_token_id = {NoneType} None
special_tokens_map = {dict: 3} {'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'}
unique_added_tokens_encoder = {set: 1} {'<|endoftext|>'}
unk_token = {str} '<|endoftext|>'
unk_token_id = {NoneType} None
vocab_files_names = {dict: 2} {'vocab_file': 'vocab.json', 'merges_file': 'merges.txt'}
vocab_size = {int} 50257
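The dump above seems to explain the behaviour: the special-token strings are all '<|endoftext|>', but that string is absent from vocab.json (the special rows there are '<pad>'/'<s>'/'</s>'/'<unk>'/'<mask>' at ids 0-4), so none of the special ids can be resolved. A quick sanity check, assuming the same tokenizer object:

```python
# '<|endoftext|>' is configured as bos/eos/unk but has no vocab entry,
# hence all_special_ids == [None] above.
print(tokenizer.convert_tokens_to_ids("<|endoftext|>"))  # None
# The actual special rows live at the start of the vocab:
print(tokenizer.convert_tokens_to_ids("<s>"))   # 1
print(tokenizer.convert_tokens_to_ids("</s>"))  # 2
```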
The tokenizer attributes in 3.4.0, using the pipeline, are:
PreTrainedTokenizer(name_or_path='sberbank-ai/rugpt3small_based_on_gpt2', vocab_size=50257, model_max_len=1000000000000000019884624838656, is_fast=False, padding_side='right', special_tokens={'bos_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'eos_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'unk_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True)})
SPECIAL_TOKENS_ATTRIBUTES = {list: 8} ['bos_token', 'eos_token', 'unk_token', 'sep_token', 'pad_token', 'cls_token', 'mask_token', 'additional_special_tokens']
add_prefix_space = {bool} False
added_tokens_decoder = {dict: 0} {}
added_tokens_encoder = {dict: 0} {}
additional_special_tokens = {list: 0} []
additional_special_tokens_ids = {list: 0} []
all_special_ids = {list: 3} [None, None, None]
all_special_tokens = {list: 3} ['<|endoftext|>', '<|endoftext|>', '<|endoftext|>']
all_special_tokens_extended = {list: 3} [AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True)]
bos_token = {str} '<|endoftext|>'
bos_token_id = {NoneType} None
bpe_ranks = {dict: 49996}
byte_decoder = {dict: 256}
byte_encoder = {dict: 256}
cache = {dict: 2}
cls_token = {NoneType} None
cls_token_id = {NoneType} None
decoder = {dict: 50257} {0: '<pad>', 1: '<s>', 2: '</s>', 3: '<unk>', 4: '<mask>', 5: '!', 6: '"', ...
deprecation_warnings = {dict: 0} {}
encoder = {dict: 50257} {'<pad>': 0, '<s>': 1, '</s>': 2, '<unk>': 3, '<mask>': 4, '!': 5, '"': 6, '#': ...
eos_token = {str} '<|endoftext|>'
eos_token_id = {NoneType} None
errors = {str} 'replace'
init_inputs = {tuple: 0} ()
init_kwargs = {dict: 8} {'errors': 'replace', 'unk_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'bos_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'eos_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'add_prefix_space': False, 'special_tokens_map_file': None, 'tokenizer_file': None, 'name_or_path': 'sberbank-ai/rugpt3small_based_on_gpt2'}
is_fast = {bool} False
mask_token = {NoneType} None
mask_token_id = {NoneType} None
max_len = {int} 1000000000000000019884624838656
max_len_sentences_pair = {int} 1000000000000000019884624838656
max_len_single_sentence = {int} 1000000000000000019884624838656
max_model_input_sizes = {dict: 5} {'gpt2': 1024, 'gpt2-medium': 1024, 'gpt2-large': 1024, 'gpt2-xl': 1024, 'distilgpt2': 1024}
model_input_names = {list: 1} ['attention_mask']
model_max_length = {int} 1000000000000000019884624838656
name_or_path = {str} 'sberbank-ai/rugpt3small_based_on_gpt2'
pad_token = {NoneType} None
pad_token_id = {NoneType} None
pad_token_type_id = {int} 0
padding_side = {str} 'right'
pat = {Pattern} regex.Regex("'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)|\\s+", flags=regex.V0)
pretrained_init_configuration = {dict: 0} {}
pretrained_vocab_files_map = {dict: 2} {'vocab_file': {'gpt2': 'https://huggingface.co/gpt2/resolve/main/vocab.json', 'gpt2-medium': 'https://huggingface.co/gpt2-medium/resolve/main/vocab.json', 'gpt2-large': 'https://huggingface.co/gpt2-large/resolve/main/vocab.json', 'gpt2-xl': 'https://huggingface.co/gpt2-xl/resolve/main/vocab.json', 'distilgpt2': 'https://huggingface.co/distilgpt2/resolve/main/vocab.json'}, 'merges_file': {'gpt2': 'https://huggingface.co/gpt2/resolve/main/merges.txt', 'gpt2-medium': 'https://huggingface.co/gpt2-medium/resolve/main/merges.txt', 'gpt2-large': 'https://huggingface.co/gpt2-large/resolve/main/merges.txt', 'gpt2-xl': 'https://huggingface.co/gpt2-xl/resolve/main/merges.txt', 'distilgpt2': 'https://huggingface.co/distilgpt2/resolve/main/merges.txt'}}
sep_token = {NoneType} None
sep_token_id = {NoneType} None
slow_tokenizer_class = {NoneType} None
special_tokens_map = {dict: 3} {'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'}
special_tokens_map_extended = {dict: 3} {'bos_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'eos_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'unk_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True)}
unique_no_split_tokens = {list: 1} ['<|endoftext|>']
unk_token = {str} '<|endoftext|>'
unk_token_id = {NoneType} None
verbose = {bool} True
vocab_files_names = {dict: 2} {'vocab_file': 'vocab.json', 'merges_file': 'merges.txt'}
vocab_size = {int} 50257
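The 3.4.0 dump shows the same mismatch (all_special_ids is [None, None, None]). One possible way to make the pipeline respect an end-of-sequence token is to remap eos_token onto the '</s>' row when loading the tokenizer; this is a sketch rather than a documented recipe, and the prompt string is a placeholder:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer, pipeline

name = "sberbank-ai/rugpt3small_based_on_gpt2"
# Point eos_token at the '</s>' entry that actually exists in this vocab (id 2).
tokenizer = GPT2Tokenizer.from_pretrained(name, eos_token="</s>")
model = GPT2LMHeadModel.from_pretrained(name)

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(generator("Пример текста", max_length=50, num_beams=5, early_stopping=True))
```

Since the model was trained without an EOS token, generation may still run to max_length until the model is fine-tuned to emit '</s>'.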
from ru-gpts.
Thank you! I'll add the EOS token manually.
from ru-gpts.