
ericfillion / happy-transformer


Happy Transformer makes it easy to fine-tune and perform inference with NLP Transformer models.

Home Page: http://happytransformer.com

License: Apache License 2.0

Language: Python 100.00%

Topics: language-models, artificial-intelligence, ai, question-answering, bert, roberta, nlp, machine-learning, text-classification, deep-learning, transformers, python, natural-language-processing

happy-transformer's Introduction


Happy Transformer

Documentation and news: happytransformer.com

Join our Discord server: Support Server

HappyTransformer

Happy Transformer makes it easy to fine-tune and perform inference with NLP Transformer models.

3.0.0

  1. DeepSpeed for training
  2. Apple's MPS for training and inference
  3. WandB to track training runs
  4. Data supplied for training is automatically split into training and evaluation portions (see the sketch below)
  5. Push models directly to Hugging Face's Model Hub

Read about the full 3.0.0 update including breaking changes here.
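
For example, a minimal fine-tuning run might look like this (a sketch; the exact training arguments and evaluation-split behaviour are configurable, so check the documentation for your task):

from happytransformer import HappyWordPrediction

happy_wp = HappyWordPrediction()  # distilbert-base-uncased by default
happy_wp.train("train.txt")       # in 3.0.0 the supplied data is split into training and evaluation portions automatically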

Tasks

Each task below supports inference; most also support training:

  • Text Generation
  • Text Classification
  • Word Prediction
  • Question Answering
  • Text-to-Text
  • Next Sentence Prediction
  • Token Classification

Quick Start

pip install happytransformer
from happytransformer import HappyWordPrediction
#--------------------------------------#
happy_wp = HappyWordPrediction()  # default uses distilbert-base-uncased
result = happy_wp.predict_mask("I think therefore I [MASK]")
print(result)  # [WordPredictionResult(token='am', score=0.10172799974679947)]
print(result[0].token)  # am

Maintainers

Tutorials

Text generation with training (GPT-Neo)

Text classification (training)

Text classification (hate speech detection)

Text classification (sentiment analysis)

Word prediction with training (DistilBERT, RoBERTa)

Top T5 Models

Grammar Correction

Fine-tune a Grammar Correction Model

happy-transformer's People

Contributors

adamcyber1, davidcoallier, dependabot[bot], ericfillion, loganroth, sjcantor, superjcd, ted537, ugokalp, ujwal-narayan, willmacd, yuchenlin


happy-transformer's Issues

Fix files failing pylint

pylint reports failures across a number of files. To enable it in the GitHub Actions workflow, we first need to fix all of the pylint errors throughout the files. After this is done, the "pylint" line in the pythonapp.yml file can be uncommented so it runs again.

Fine tuning XLNet

Perform research on how we could add methods to HappyBERT that enable fine-tuning. Post a 200-word paragraph to the report section of our Google Drive about your findings. Include potential resources we could use.

Any plans on other recent models like ALBERT?

Hi Eric,

Thanks for your wonderful work! I was wondering if you plan to integrate some other recent models, like ALBERT, in the near future. That would make experiments using happy-transformer more comprehensive!

Begin the implementation of GleefulTransformer

Use the outputs of the "k option in predict_mask" feature as features for a final model that optimally combines them.

Do research on the structure of the model, including specifics on the features and the output. Ideally the model would also predict the top k options instead of only the top 1.

Turn finetuned model to happytransformer model

The fine-tuning method for language models returns the fine-tuned model.

We need a way of going from the base model object to a Happy Transformer model. model.from_pretrained() requires a directory, but we only have a model object.
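
One possible workaround (a sketch, not part of the library's API): save the fine-tuned model object to a directory with save_pretrained, then load it from that directory. The model and tokenizer names below are placeholders for the objects produced by fine-tuning.

import tempfile
from transformers import BertForMaskedLM, BertTokenizerFast

# Stand-ins for the objects produced by fine-tuning; in practice these come from the training code.
finetuned_model = BertForMaskedLM.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

save_dir = tempfile.mkdtemp()
finetuned_model.save_pretrained(save_dir)  # writes the config and weights to a directory
tokenizer.save_pretrained(save_dir)        # keep the matching tokenizer files alongside the model

reloaded = BertForMaskedLM.from_pretrained(save_dir)  # from_pretrained can now be given a directory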

Fine tuning BERT

Perform research on how we could add methods to HappyBERT that enable fine-tuning. Post a 200-word paragraph to the report section of our Google Drive about your findings. Include potential resources we could use.

XLNet-base-uncased not found

Hi, I tried to run the example script from the README and it gives me an error that xlnet-base-uncased was not found:

Model name 'xlnet-base-uncased' was not found in tokenizers model name list (xlnet-base-cased, xlnet-large-cased). We assumed 'xlnet-base-uncased' was a path, a model identifier, or url to a directory containing vocabulary files named ['spiece.model'] but couldn't find such vocabulary files at this path or url.

Thank you

Clean up predict_mask for all child classes

A lot of code is duplicated across the child classes' predict_mask implementations. Create methods within the parent class that perform the common tasks, so the code in the child classes becomes shorter and easier to understand.

Create a testing class for the Winograd Schema Challenge

Use "predict_mask_with_options" for each child class to generate results for the WSC273. Within the WSCTesting class, create a method that performs the test for each child class. Also create a single method that calls each of these tests at once and outputs the results.

mask_lm_labels should be -100 instead of -1?

Hi,

We were running the fine-tuning MWP examples and hit an error saying "Target -1 is out of bound". After debugging, we found that the docstring of BertForMaskedLM says:

masked_lm_labels (torch.LongTensor of shape (batch_size, sequence_length), optional, defaults to None):
    Labels for computing the masked language modeling loss.
    Indices should be in [-100, 0, ..., config.vocab_size] (see the input_ids docstring).
    Tokens with indices set to -100 are ignored (masked); the loss is only computed for tokens with labels.

Thus, we modified the mask_tokens function with labels[~masked_indices] = -100  # We only compute loss on masked tokens, and it works now. Not sure if it is correct; could you please double-check?
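
For reference, a simplified sketch of a mask_tokens-style function using the -100 convention (the names are illustrative; the real implementation also replaces some selected tokens with random ids or leaves them unchanged):

import torch

def mask_tokens(inputs, tokenizer, mlm_probability=0.15):
    # inputs: LongTensor of token ids; every selected position becomes [MASK] in this sketch.
    labels = inputs.clone()
    probability_matrix = torch.full(labels.shape, mlm_probability)
    masked_indices = torch.bernoulli(probability_matrix).bool()
    labels[~masked_indices] = -100  # ignored by the loss; only masked tokens contribute
    inputs[masked_indices] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token)
    return inputs, labels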

Sentence perplexity for BERT

Create a sentence perplexity method for BERT. Perhaps this method can be placed in the parent class; that way, all children can use it.
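
One way to approximate this (a sketch using the underlying Transformers model rather than an existing Happy Transformer method): mask each token in turn and average the negative log-likelihoods, i.e. a "pseudo-perplexity".

import math
import torch
from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

def pseudo_perplexity(sentence):
    ids = tokenizer.encode(sentence, return_tensors="pt")[0]
    nll, count = 0.0, 0
    for i in range(1, len(ids) - 1):  # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        nll -= torch.log_softmax(logits, dim=-1)[ids[i]].item()
        count += 1
    return math.exp(nll / count)

print(pseudo_perplexity("The cat sat on the mat."))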

Predict k masked words

Create a transformer method to predict the top k candidates for a masked word and return them as a list in descending order of softmax score.
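
In the released library this corresponds to predict_mask's top_k argument; a usage sketch:

from happytransformer import HappyWordPrediction

happy_wp = HappyWordPrediction()
results = happy_wp.predict_mask("I think therefore I [MASK]", top_k=3)
for result in results:  # WordPredictionResult objects, highest score first
    print(result.token, result.score)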

Predict multiple masked words

Is there a way to predict multiple masked words? For example, for the sentence:

"[MASK] have a [MASK] dog and I love [MASK] so much"

Thank you so much!
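
One way to do this with the underlying Transformers model (a sketch; each mask is filled independently, so the predictions do not condition on one another):

import torch
from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

text = "[MASK] have a [MASK] dog and I love [MASK] so much"
inputs = tokenizer(text, return_tensors="pt")
mask_positions = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]

with torch.no_grad():
    logits = model(**inputs).logits[0]

for pos in mask_positions:  # most likely token for each [MASK], left to right
    print(tokenizer.decode([logits[pos].argmax().item()]))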

The sentence separator in TextDataset.__init__ seems wrong

    def __init__(self, tokenizer, file_path, block_size=512):
        assert os.path.isfile(file_path)
        with open(file_path, encoding="utf-8") as f:
            text = f.read()

        tokenized_text = tokenizer.encode(
            text, add_special_tokens=True)  # Get ids from text
        self.examples = []
        # Truncate examples to a max blocksize
        for i in range(0, len(tokenized_text) - block_size + 1, block_size):
            self.examples.append(tokenized_text[i:i + block_size])

Here it seems the file at file_path is split by block_size rather than by \n? As a result, the provided example in the README cannot be trained.
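
If one example per line is the intent, a sketch of the alternative (an assumption about the desired behaviour, not the project's current code) would tokenize each line separately:

self.examples = []
for line in text.splitlines():
    line = line.strip()
    if line:
        ids = tokenizer.encode(line, add_special_tokens=True)
        self.examples.append(ids[:block_size])  # still cap each example at block_size tokens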

lr always 0 when fine-tuning mlm

Hi,

I found that the fine-tuned model did not do better at masked word prediction on the training corpus, so I debugged the code by printing the learning rate in the train function as follows:

if global_step % logging_steps == 0:
    # tb_writer.add_scalar("lr", scheduler.get_lr()[0], global_step)
    print()
    print("\t lr:", scheduler.get_lr()[0])
    print("\t avg loss:", (tr_loss - logging_loss) / logging_steps, global_step)
    logging_loss = tr_loss

However, I found that the lr is always 0.
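
For what it's worth, a zero learning rate from scheduler.get_lr() usually means the scheduler's step budget does not match the training loop. A typical setup looks roughly like this (a sketch with placeholder values, not the project's actual training code):

import torch
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(4, 2)         # placeholder; in practice, the masked LM being fine-tuned
steps_per_epoch, num_epochs = 100, 3  # placeholder loop sizes

optimizer = AdamW(model.parameters(), lr=5e-5)
num_training_steps = steps_per_epoch * num_epochs  # must match how many times scheduler.step() is called
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=num_training_steps)
# If scheduler.step() runs more often than num_training_steps (or warmup is misconfigured),
# get_lr() decays to 0 for the remainder of training.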

Finetuning MWP with custom masks?

Hi Eric,

Thanks for the MWP fine-tuning feature. I am wondering if we can use our own customized masking strategy instead of random sampling?

Simply put, I have a list of sentences where the masks are already in place, and we know the associated word for each mask. Can we fine-tune HappyBERT/HappyROBERTA on them?

Thanks!
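
One way to express pre-made masks outside the random-masking path (a sketch, not part of the library's API; it assumes each answer is a single wordpiece):

import torch
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

def encode_pre_masked(masked_sentence, answers):
    # masked_sentence already contains [MASK] tokens; answers lists the target word for each mask, in order.
    input_ids = tokenizer(masked_sentence, return_tensors="pt")["input_ids"][0]
    labels = torch.full_like(input_ids, -100)  # every position is ignored by the loss by default
    mask_positions = (input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    for pos, word in zip(mask_positions, answers):
        labels[pos] = tokenizer.convert_tokens_to_ids(word)  # loss only on the pre-chosen masks
    return input_ids, labels

ids, labels = encode_pre_masked("I [MASK] to the store yesterday.", ["went"])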

Create a README

Include the purpose of HappyTransformer. Mention our future goals. Finally, write documentation on how to create child classes and use the methods we have completed thus far.

Add some continuous integration

Use one of the various continuous integration (CI) tools to report the build status and test coverage (when there are tests) of the project.

Travis CI seems like a quick and easy option.

Finetuning Debugging

I uploaded finetuning.ipynb on my branch; the error I keep getting is RuntimeError: CUDA error: device-side assert triggered.

I need someone experienced to debug this bad boy.

predict_mask_with_options RoBERTa

Create a predict_mask_with_options method for RoBERTa. If the Fairseq library does not provide enough functionality to accomplish the task, then convert the RoBERTa class to the standard Transformers library implementation.
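
With the Transformers implementation, constraining predictions to a candidate list is roughly what the fill-mask pipeline's targets argument does (a sketch; RoBERTa's byte-level vocabulary usually expects a leading space on target words):

from transformers import pipeline

fill = pipeline("fill-mask", model="roberta-base")
results = fill("I think therefore I <mask>", targets=[" am", " will"])
for result in results:  # one dict per candidate, highest score first
    print(result["token_str"], result["score"])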
