
Comments (4)

king-menin commented on July 4, 2024

The model was trained without this token; you should add an eos token and fine-tune the model. We will also release models trained with an eos token.
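
For reference, a minimal sketch (not from the thread) of wiring up an eos token before fine-tuning, assuming the sberbank-ai/rugpt3small_based_on_gpt2 checkpoint mentioned later in this thread; '</s>' already sits at id 2 in the vocab, so no new embedding rows are strictly needed:

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("sberbank-ai/rugpt3small_based_on_gpt2")
model = GPT2LMHeadModel.from_pretrained("sberbank-ai/rugpt3small_based_on_gpt2")

# Point the eos_token attribute at the existing '</s>' entry (id 2 in the dump below).
tokenizer.add_special_tokens({"eos_token": "</s>"})
# No-op here because '</s>' is already in vocab.json, but safe if a genuinely new token is added.
model.resize_token_embeddings(len(tokenizer))

# Append the token to each training document before fine-tuning.
train_text = "training document text" + tokenizer.eos_token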


king-menin commented on July 4, 2024

We trained the models without this token (using only plain text), but the first tokens in the vocab are special tokens. Which version of transformers are you using?
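
A quick check (a sketch, assuming the same checkpoint as in the dumps below) that the first vocab ids are indeed those special tokens:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("sberbank-ai/rugpt3small_based_on_gpt2")
# Per the decoder dump below, this should print ['<pad>', '<s>', '</s>', '<unk>', '<mask>'].
print(tokenizer.convert_ids_to_tokens([0, 1, 2, 3, 4]))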


mikhovr commented on July 4, 2024

To reproduce the behaviour above, I used transformers==2.8.0 (from requirements.txt). My overall goal is to make the model generate sentences that are separate from the context, not just continue the context sentence. For that I use the text-generation pipeline from 3.4.0.

Moreover, the raw model tries to generate max_length tokens even if the output is an incomplete sentence. It seems that it doesn't consider </s> during beam search at all. I don't know whether this behaviour is related.
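
One workaround (a sketch with transformers 3.4.0, not confirmed by the maintainers; the prompt text is illustrative) is to pass the id of '</s>' to generate() explicitly, since eos_token_id is None in the loaded tokenizer. Beam search can then terminate before max_length, although a model trained without the token may rarely emit it, as noted in the first reply:

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("sberbank-ai/rugpt3small_based_on_gpt2")
model = GPT2LMHeadModel.from_pretrained("sberbank-ai/rugpt3small_based_on_gpt2")

eos_id = tokenizer.convert_tokens_to_ids("</s>")  # 2 in this vocab
input_ids = tokenizer.encode("some context sentence", return_tensors="pt")
output = model.generate(
    input_ids,
    max_length=100,
    num_beams=5,
    eos_token_id=eos_id,  # lets beam search stop at '</s>' instead of running to max_length
    pad_token_id=eos_id,  # silences the "no padding token" warning
)
print(tokenizer.decode(output[0]))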

The tokenizer attributes in 2.8.0, when using generate_transformers.py, are:

<transformers.tokenization_gpt2.GPT2Tokenizer object at 0x7f11f646d9d0>
NO_PAD_TOKEN_FOR_BATCH_MSG = {str} 'No padding token is set for this model, therefore no batch can be made with uneven sequences. Set a padding token or adjust the lengths of the sequences building the batch so that every sequence is of the same length.'
SPECIAL_TOKENS_ATTRIBUTES = {list: 8} ['bos_token', 'eos_token', 'unk_token', 'sep_token', 'pad_token', 'cls_token', 'mask_token', 'additional_special_tokens']
UNEVEN_SEQUENCES_FOR_BATCH_MSG = {str} 'The sequences building the batch are not of the same size, no tensor can be built. Set `pad_to_max_length=True` to pad the smaller sequencesup to the larger sequence\'s length.'
added_tokens_decoder = {dict: 0} {}
added_tokens_encoder = {dict: 0} {}
additional_special_tokens = {list: 0} []
additional_special_tokens_ids = {list: 0} []
all_special_ids = {list: 1} [None]
all_special_tokens = {list: 1} ['<|endoftext|>']
bos_token = {str} '<|endoftext|>'
bos_token_id = {NoneType} None
bpe_ranks = {dict: 49996} 
byte_encoder = {dict: 256} 
cache = {dict: 0} {}
cls_token = {NoneType} None
cls_token_id = {NoneType} None
decoder = {dict: 50257} {0: '<pad>', 1: '<s>', 2: '</s>', 3: '<unk>', 4: '<mask>', 5: '!', 6: '"', ...
eos_token = {str} '<|endoftext|>'
eos_token_id = {NoneType} None
errors = {str} 'replace'
init_inputs = {tuple: 0} ()
init_kwargs = {dict: 2} {'vocab_file': '/home/superuser/khovrichev/ru-gpts/ckpt/gpt3_medium/vocab.json', 'merges_file': '/home/superuser/khovrichev/ru-gpts/ckpt/gpt3_medium/merges.txt'}
mask_token = {NoneType} None
mask_token_id = {NoneType} None
max_len = {int} 1000000000000
max_len_sentences_pair = {int} 1000000000000
max_len_single_sentence = {int} 1000000000000
max_model_input_sizes = {dict: 5} {'gpt2': 1024, 'gpt2-medium': 1024, 'gpt2-large': 1024, 'gpt2-xl': 1024, 'distilgpt2': 1024}
model_input_names = {list: 2} ['token_type_ids', 'attention_mask']
pad_token = {NoneType} None
pad_token_id = {NoneType} None
pad_token_type_id = {int} 0
padding_side = {str} 'right'
pat = {Pattern} regex.Regex("'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)|\\s+", flags=regex.V0)
pretrained_init_configuration = {dict: 0} {}
pretrained_vocab_files_map = {dict: 2} {'vocab_file': {'gpt2': 'https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json', 'gpt2-medium': 'https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-vocab.json', 'gpt2-large': 'https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-vocab.json', 'gpt2-xl': 'https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-xl-vocab.json', 'distilgpt2': 'https://s3.amazonaws.com/models.huggingface.co/bert/distilgpt2-vocab.json'}, 'merges_file': {'gpt2': 'https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt', 'gpt2-medium': 'https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-merges.txt', 'gpt2-large': 'https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-merges.txt', 'gpt2-xl': 'https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-xl-merges.txt', 'distilgpt2': 'https://s3.amazonaws.com/models.huggingface.co/bert/distilgpt2-merges.txt'}}
sep_token = {NoneType} None
sep_token_id = {NoneType} None
special_tokens_map = {dict: 3} {'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'}
unique_added_tokens_encoder = {set: 1} {'<|endoftext|>'}
unk_token = {str} '<|endoftext|>'
unk_token_id = {NoneType} None
vocab_files_names = {dict: 2} {'vocab_file': 'vocab.json', 'merges_file': 'merges.txt'}
vocab_size = {int} 50257

The tokenizer attributes in 3.4.0, when using the pipeline, are:

PreTrainedTokenizer(name_or_path='sberbank-ai/rugpt3small_based_on_gpt2', vocab_size=50257, model_max_len=1000000000000000019884624838656, is_fast=False, padding_side='right', special_tokens={'bos_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'eos_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'unk_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True)})

SPECIAL_TOKENS_ATTRIBUTES = {list: 8} ['bos_token', 'eos_token', 'unk_token', 'sep_token', 'pad_token', 'cls_token', 'mask_token', 'additional_special_tokens']
add_prefix_space = {bool} False
added_tokens_decoder = {dict: 0} {}
added_tokens_encoder = {dict: 0} {}
additional_special_tokens = {list: 0} []
additional_special_tokens_ids = {list: 0} []
all_special_ids = {list: 3} [None, None, None]
all_special_tokens = {list: 3} ['<|endoftext|>', '<|endoftext|>', '<|endoftext|>']
all_special_tokens_extended = {list: 3} [AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True)]
bos_token = {str} '<|endoftext|>'
bos_token_id = {NoneType} None
bpe_ranks = {dict: 49996} 
byte_decoder = {dict: 256} 
byte_encoder = {dict: 256} 
cache = {dict: 2} 
cls_token = {NoneType} None
cls_token_id = {NoneType} None
decoder = {dict: 50257} {0: '<pad>', 1: '<s>', 2: '</s>', 3: '<unk>', 4: '<mask>', 5: '!', 6: '"',  ...
deprecation_warnings = {dict: 0} {}
encoder = {dict: 50257} {'<pad>': 0, '<s>': 1, '</s>': 2, '<unk>': 3, '<mask>': 4, '!': 5, '"': 6, '#': ...
eos_token = {str} '<|endoftext|>'
eos_token_id = {NoneType} None
errors = {str} 'replace'
init_inputs = {tuple: 0} ()
init_kwargs = {dict: 8} {'errors': 'replace', 'unk_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'bos_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'eos_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'add_prefix_space': False, 'special_tokens_map_file': None, 'tokenizer_file': None, 'name_or_path': 'sberbank-ai/rugpt3small_based_on_gpt2'}
is_fast = {bool} False
mask_token = {NoneType} None
mask_token_id = {NoneType} None
max_len = {int} 1000000000000000019884624838656
max_len_sentences_pair = {int} 1000000000000000019884624838656
max_len_single_sentence = {int} 1000000000000000019884624838656
max_model_input_sizes = {dict: 5} {'gpt2': 1024, 'gpt2-medium': 1024, 'gpt2-large': 1024, 'gpt2-xl': 1024, 'distilgpt2': 1024}
model_input_names = {list: 1} ['attention_mask']
model_max_length = {int} 1000000000000000019884624838656
name_or_path = {str} 'sberbank-ai/rugpt3small_based_on_gpt2'
pad_token = {NoneType} None
pad_token_id = {NoneType} None
pad_token_type_id = {int} 0
padding_side = {str} 'right'
pat = {Pattern} regex.Regex("'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)|\\s+", flags=regex.V0)
pretrained_init_configuration = {dict: 0} {}
pretrained_vocab_files_map = {dict: 2} {'vocab_file': {'gpt2': 'https://huggingface.co/gpt2/resolve/main/vocab.json', 'gpt2-medium': 'https://huggingface.co/gpt2-medium/resolve/main/vocab.json', 'gpt2-large': 'https://huggingface.co/gpt2-large/resolve/main/vocab.json', 'gpt2-xl': 'https://huggingface.co/gpt2-xl/resolve/main/vocab.json', 'distilgpt2': 'https://huggingface.co/distilgpt2/resolve/main/vocab.json'}, 'merges_file': {'gpt2': 'https://huggingface.co/gpt2/resolve/main/merges.txt', 'gpt2-medium': 'https://huggingface.co/gpt2-medium/resolve/main/merges.txt', 'gpt2-large': 'https://huggingface.co/gpt2-large/resolve/main/merges.txt', 'gpt2-xl': 'https://huggingface.co/gpt2-xl/resolve/main/merges.txt', 'distilgpt2': 'https://huggingface.co/distilgpt2/resolve/main/merges.txt'}}
sep_token = {NoneType} None
sep_token_id = {NoneType} None
slow_tokenizer_class = {NoneType} None
special_tokens_map = {dict: 3} {'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'}
special_tokens_map_extended = {dict: 3} {'bos_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'eos_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'unk_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True)}
unique_no_split_tokens = {list: 1} ['<|endoftext|>']
unk_token = {str} '<|endoftext|>'
unk_token_id = {NoneType} None
verbose = {bool} True
vocab_files_names = {dict: 2} {'vocab_file': 'vocab.json', 'merges_file': 'merges.txt'}
vocab_size = {int} 50257



mikhovr commented on July 4, 2024

Thank you! I'll add the eos token manually.
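
For instance (a sketch of one way to do this with the 3.4.0 pipeline; the prompt text is illustrative), the eos token can be set at load time, since '</s>' already exists in the vocab:

from transformers import pipeline, GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("sberbank-ai/rugpt3small_based_on_gpt2", eos_token="</s>")
model = GPT2LMHeadModel.from_pretrained("sberbank-ai/rugpt3small_based_on_gpt2")

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(generator("some context sentence", max_length=50)[0]["generated_text"])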

