cardiffnlp / tweeteval

Repository for TweetEval

Python 7.47% Jupyter Notebook 92.53%

tweeteval's Issues

Leaderboard clarification + fine-tuning scripts

I don't see the BERTweet results in your paper, only in the GitHub README.

Is this the BERTweet base or large?
Can you push the scripts used for task fine-tuning these models?

Typo in TweetEval Tutorial

tweeteval/TweetEval_Tutorial.ipynb

Line 11 in 3f3bcd3

 "In this notebook we show how to perform tasks such as masked language modeling, computing tweet similarity or tweet classificationo using our Twitter-specific RoBERTa models.\n", 

I think I found a small typo: the word "classification" is written "classificationo" in the second line of the notebook.

OSError: Can't load weights for model

Hello. I tried to run the samples but got an error in this part:

from transformers import AutoModelForSequenceClassification, AutoTokenizer

task='emotion'
MODEL = f"cardiffnlp/twitter-roberta-base-{task}"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)  # <-- the error is raised here

An error:

OSError: Can't load weights for 'cardiffnlp/twitter-roberta-base-emotion'. Make sure that:

'cardiffnlp/twitter-roberta-base-emotion' is a correct model identifier listed on 'https://huggingface.co/models'

or 'cardiffnlp/twitter-roberta-base-emotion' is the correct path to a directory containing a file named one of tf_model.h5, pytorch_model.bin.

It was the same for just 'twitter-roberta-base'. What could be wrong here?
Thanks!

tokenized text has special character prepended

text = "Good night 😊"
encoded_input = tokenizer(text, return_tensors='tf')
print(tokenizer.convert_ids_to_tokens(encoded_input['input_ids'][0]))

This prints the line below. Why are there Ġ characters in the tokenized text?

['<s>', 'Good', 'Ġnight', 'ĠðŁĺ', 'Ĭ', '</s>']
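A minimal sketch of where those characters come from, assuming the tokenizer uses the GPT-2 style byte-level BPE alphabet (as RoBERTa tokenizers do): every byte of the UTF-8 text is mapped to a printable character, so a space becomes Ġ and the four bytes of the emoji become ð, Ł, ĺ and Ĭ. The helper names below are illustrative and not part of the transformers API; the special start and end tokens are left out because they are not byte sequences.

```python
def bytes_to_unicode():
    """Rebuild the byte -> printable character table used by byte-level BPE."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    mapping = {b: chr(b) for b in printable}
    extra = 0
    for b in range(256):
        if b not in mapping:
            # Non-printable bytes (including the space) are shifted past U+0100.
            mapping[b] = chr(256 + extra)
            extra += 1
    return mapping


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def tokens_to_text(tokens):
    """Map each token character back to its byte and decode as UTF-8."""
    raw = bytes(CHAR_TO_BYTE[ch] for tok in tokens for ch in tok)
    return raw.decode("utf-8")


tokens = ["Good", "Ġnight", "ĠðŁĺ", "Ĭ"]
decoded = tokens_to_text(tokens)
print(decoded)
```

In other words, Ġ only marks a leading space, and the odd-looking pieces after it are the emoji's bytes split across two tokens.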

mismatched lengths in hate training dataset

In hate/, there are 9000 tags but only 8993 tweets. Has anyone else run into this issue?

https://raw.githubusercontent.com/cardiffnlp/tweeteval/main/datasets/hate/train_text.txt
https://raw.githubusercontent.com/cardiffnlp/tweeteval/main/datasets/hate/train_labels.txt

Length of tag: 9000, Length of tweet: 8993
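One thing worth ruling out is the way the files are read. As a sketch (the helper name and the sample text are made up for illustration), the snippet below counts lines in two ways: splitting on the newline character only, and with str.splitlines(), which also breaks on characters such as U+2028 that can occur inside tweets. If the two counts differ for a file, the mismatch may come from the reader rather than from the data; this does not claim to be the cause of the discrepancy reported here.

```python
import os
import tempfile


def line_counts(path):
    """Count lines by splitting on '\\n' only and with str.splitlines()."""
    # newline="" disables newline translation so the raw text is inspected.
    with open(path, encoding="utf-8", newline="") as f:
        data = f.read()
    # Ignore one trailing newline so it does not count as an extra empty line.
    body = data[:-1] if data.endswith("\n") else data
    return {
        "split_newline": len(body.split("\n")),
        "splitlines": len(body.splitlines()),
    }


# Made-up sample: the second "tweet" contains a Unicode line separator.
sample = "first tweet\nsecond\u2028tweet\nthird tweet\n"
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", newline="", suffix=".txt", delete=False
) as tmp:
    tmp.write(sample)

counts = line_counts(tmp.name)
os.remove(tmp.name)
print(counts)
```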

Possible privacy leak

As noted in issue #5, many valid user handles are present in the sentiment dataset. To the extent that this can be considered a privacy leak or violation of the Twitter Developer Agreement, it needs to be addressed.

Simply merging my PR would not suffice because the valid user handles could be reconstructed from file history.

If you need to put the genie back in the bottle, you may be able to delete the history for the affected files, but if not, you might have to rebuild the repo. Not being a true git wizard, I cannot make a recommendation without further background research, for which I unfortunately do not have the time at the moment. :/

Missing tokenizer model_max_length

Hi, I'm not sure whether this has to do with your models in particular or is just a limitation of certain huggingface base models, but the tokenizers associated with your models have their model_max_length attribute undefined. This means longer texts will not be truncated to the maximum size of the model (even when passing truncation=True), and the model will then fail with an index-out-of-range error when accessing the embeddings.

Actually, I've just seen that e.g. unitary/multilingual-toxic-xlm-roberta (also based on xlm roberta) doesn't fail in the same way, probably since it does define the model_max_length in its tokenizer_config.json.

Just in case you want to keep this in mind for possible improvements in your model configs... I'd have thought that huggingface's base models would have such parameters set by default when not explicitly stated, but it seems that's not the case :(

For reference, I now automatically "fix" all huggingface pipelines with a version of the below code:

import logging

LOG = logging.getLogger(__name__)

MAX_SEQUENCE_LENGTH = 512
"""For now we don't need texts longer than this, ever."""

def ensure_tokenizer_max_length(tokenizer, model):
    """Ensure tokenizer has a max. length defined (#tokens) at which to truncate.

    Unfortunately many tokenizers don't seem to have this defined by default, which will
    lead to failure when using their resulting non-truncated outputs in a model which does
    have a maximum size.
    """
    max_length = getattr(tokenizer, "model_max_length", None)
    if max_length is None or max_length > MAX_SEQUENCE_LENGTH:
        LOG.warning(f"Tokenizer's model_max_length={max_length} probably wasn't set correctly.")

        default_lengths = getattr(tokenizer, "max_model_input_sizes", {})
        if default_lengths:
            k, v = next(iter(default_lengths.items()))
            LOG.warning(f"Found and will use default max length for model {k}={v}.")
            max_length = v
        else:
            model_len = model.config.to_dict().get("max_position_embeddings")
            if model_len is not None:
                LOG.warning(f"Found no default max length but model defines max_position_embeddings={model_len}")
                if model_len in (514, 130):
                    model_len -= 2
                    LOG.warning(f"Corrected max length to be {model_len}.")
                max_length = model_len
            else:
                LOG.warning(f"Couldn't determine appropriate max length. Will use default of {MAX_SEQUENCE_LENGTH}.")
                max_length = MAX_SEQUENCE_LENGTH

    tokenizer.model_max_length = max_length

Leftover utf-codes in emoji analysis texts

Hi there,

We have used the emoji dataset for a paper, but during our research we found that many sentences in this dataset contain leftover emoji UTF codes, resulting in a dataset that is incompletely masked.

When using the dataset from https://raw.githubusercontent.com/cardiffnlp/tweeteval/main/datasets/emoji/train_text.txt, you can find the utf-code 65039 hidden within the texts. These were once part of a combinatory emoticon depicting for example a couple (see picture of combinations).

Due to these leftovers, NNs trained on this are "surprisingly good" at predicting heart emoticons. With the code below you can detect which lines in the text contain this character.

with open("example.txt", "r", encoding="utf-8") as f:
    for i, line in enumerate(f):
        # 65039 is U+FE0F, the code point left behind by combined emoji
        if any(ord(ch) == 65039 for ch in line):
            print(f"Line {i} has leftover emoji codes: \"{line}\"")
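If the goal is to clean the texts rather than only locate the affected lines, one possible approach is to drop the leftover character, which is U+FE0F (variation selector-16, decimal 65039). This is only a sketch: the function name and the sample sentence are illustrative, and whether other residue (for example zero-width joiners) also needs removing would have to be checked against the data.

```python
VARIATION_SELECTOR_16 = "\ufe0f"  # ord() == 65039


def strip_variation_selectors(text):
    """Return (cleaned_text, number_of_removed_characters)."""
    removed = text.count(VARIATION_SELECTOR_16)
    return text.replace(VARIATION_SELECTOR_16, ""), removed


# Made-up sample with two leftover selectors.
sample = "Love you\ufe0f so much\ufe0f"
cleaned, removed = strip_variation_selectors(sample)
print(repr(cleaned), removed)
```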

For a further explanation of the problem, see this excerpt of our paper:

PR: tweeteval as python package

Hi, I'm Thomas from Graphext, perhaps Victoriano has already alerted you, but we're planning to integrate your benchmarks in our own systems, and it would be a little easier if tweeteval was an actual, installable package, and if it had a library interface.

So I just went ahead, forked your repo, and refactored it into a package: https://github.com/graphext/tweeteval/tree/as_package.

You should be able to install this locally with

> pip install git+https://github.com/graphext/tweeteval.git@as_package#egg=tweeteval

which will then let you do e.g.

> tweeteval all

{'emoji': 0.3155243507716184, 'emotion': 0.7982724123055319, 'hate': 0.5547114323640363, 'irony': 0.6247755834829443, 'offensive': 0.8155092112424851, 'stance': 0.7243628109019552}

from tweeteval import resources, score

task = "emoji"

# This retrieves the predictions from the best model included in this repo
preds = resources.task_preds(task)
score(task, preds)

Also see the updated README in the as_package branch of our fork.

I just wanted to see whether you're interested in merging a PR if I made one? If you're interested, the only thing to iron out would be some doubts regarding the actual data included in the repo.

baseline scripts and predictions

Dear colleagues,

We are wondering if you'd consider releasing the scripts and predictions of the baseline models as well, since the paper does not disclose many details on the baselines. We are attempting to re-implement the baseline models for the hate speech detection subtask for an inferential reproducibility analysis. Thank you!

Kind regards,
Xu

Add correct labels in HF models config

Just a suggestion, but e.g. here: https://huggingface.co/cardiffnlp/twitter-roberta-base-emoji/blob/main/config.json, if id2label and label2id were set up correctly, one wouldn't have to separately load the mappings.txt file from this GitHub repo and do the mapping manually.
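As a sketch of what that would involve, the snippet below builds both dictionaries from mapping text, assuming each line has the form id, TAB, label. The sample labels are placeholders rather than the repository's actual mapping, the function name is made up, and files with extra columns would need adjusting. The resulting dictionaries are the kind of values the id2label and label2id fields of a config.json hold.

```python
def build_label_maps(mapping_text):
    """Parse '<id>\\t<label>' lines into (id2label, label2id)."""
    id2label = {}
    for line in mapping_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        idx, label = line.split("\t", 1)
        id2label[int(idx)] = label.strip()
    label2id = {label: idx for idx, label in id2label.items()}
    return id2label, label2id


# Placeholder mapping text, not the real file contents.
sample_mapping = "0\tlabel_a\n1\tlabel_b\n2\tlabel_c\n"
id2label, label2id = build_label_maps(sample_mapping)
print(id2label)
print(label2id)
```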

How to get the Roberta results

Hi !
I am interested in how you fine-tuned RoBERTa to get the results in the table. As only the data and evaluation script are open source, it would be much easier for other researchers to use your benchmark if you could open-source the fine-tuning code.
Thanks !

Gold labels and predictions on sentiment task don't match

Maybe it has to do with your update from last December, as mentioned in the README, but sentiment/test_labels.txt has 12,284 lines and predictions/sentiment.txt has 11,906.

Two severe problems in sentiment dataset

Problem 1

Hundreds of lines in the datasets/sentiment train and test text files contain multiple labels and tweets appended to a tweet. For example, line 44626 of datasets/sentiment/train_text.txt reads in part:

Batman may of been the better man in our last encounter,   but I'm a man on a mission and it's to expose you for what you really are and --640710805508497415	positive	Heath Ledger in Batman may be the best performance of any actor in any movie of all time640711961252990976	neutral	@[OMITTED_FOR_PRIVACY] I mean, Batman was corny af until this trilogy, the 3rd one was so not Batman.640766902772502528	neutral	@[OMITTED_FOR_PRIVACY] TOM HARDY should play BANE again... He should kill BATMAN when they want 2 replace like in the comics... The 3rd... KNIGHTFALL640820660214730752	negative	Whatever, Batman. You ....

The line continues for hundreds more characters, but has been truncated for the sake of brevity.

Problem 2

Many of the appended tweets contain valid Twitter handles. In the example above, I have deliberately obfuscated the problem by substituting @[OMITTED_FOR_PRIVACY] for @ValidHandle.

Proposed solution

I fixed problem 1 in my fork repo and submitted a pull request. I closed it because it did not contain a fix for problem 2. I have subsequently fixed both problems and will submit the new PR momentarily. To assist your verification, I am listing the code below; since the code will no longer be needed once the problem is fixed, I am not including it in the PR.

from enum import IntEnum
from os import path
import re

class POLARITY(IntEnum):
    negative = 0
    neutral = 1
    positive = 2

DATA_DIR = './data/'
FIXED_DATA_DIR = './fixed/'

TRAIN_TWEETS = 'train_text.txt'
TRAIN_LABELS = 'train_labels.txt' 
TEST_TWEETS = 'test_text.txt' 
TEST_LABELS = 'test_labels.txt' 


def get_tweets(tweet_file):
    "Returns a list of strings representing tweets"
    file_path = path.join(DATA_DIR, tweet_file)
    # tweets can have commas, so we don't want to use pd.read_csv
    with open(file_path, encoding='UTF-8') as t_file:
        tweets = [tw.strip() for tw in t_file]
    return tweets

def get_labels(label_file):
    """Returns a list of ints representing labels"""
    file_path = path.join(DATA_DIR, label_file)
    with open(file_path, encoding='UTF-8') as l_file:
        labels = [int(lab.strip()) for lab in l_file]
    return labels

tweet_id_pattern = re.compile(r"\d{18,}") # A Tweet ID has 18 digits
user_handle_pattern = re.compile(r"@\w+")  # raw string avoids an invalid-escape warning

def correct_data(tweets, labels):
    """Return a tuple of corrected lists: (tweets, labels)"""
    new_tweets, new_labels = [], []
    num_examples = len(tweets)
    assert num_examples == len(labels)
    for i in range(num_examples):
        tweet = tweets[i]
        if tweet_id_pattern.search(tweet) is None:
            continue
        records = tweet_id_pattern.split(tweet)
        tweets[i] = records[0]
        for record in records[1:]:
            parts = record[1:].split('\t')
            if len(parts) != 2:
                continue
            polarity = parts[0]
            tweet = parts[1]
            # anonymize any users
            tweet = user_handle_pattern.sub("@user", tweet)
            new_tweets.append(tweet)
            new_labels.append(POLARITY[polarity].value)
    return tweets + new_tweets, labels + new_labels

def write_data(corrected_data, file_name):
    file_path = path.join(FIXED_DATA_DIR, file_name)    
    with open(file_path, 'w', encoding='UTF-8') as f:
        for item in corrected_data:
            f.write(str(item))
            f.write('\n')

file_names = zip([TRAIN_TWEETS, TEST_TWEETS], [TRAIN_LABELS, TEST_LABELS])
for tweet_file, label_file in file_names:
    tweets = get_tweets(tweet_file)
    labels = get_labels(label_file)
    fixed_tweets, fixed_labels = correct_data(tweets, labels)
    write_data(fixed_tweets, tweet_file)
    write_data(fixed_labels, label_file)

Cannot Load Google Colab Notebook: "Invalid Credentials" Error

I am logged in to Chrome as my gmail user. When I attempt to open the Google Colab notebook, the following error is raised:

There was an error loading this notebook. Ensure that the file is accessible and try again.
Invalid Credentials
GapiError: Invalid Credentials
    at wB.eu [as constructor] (https://colab.research.google.com/v2/external/external_polymer_binary.js?vrz=colab-20201124-085601-RC00_344043294:682:357)
    at new wB (https://colab.research.google.com/v2/external/external_polymer_binary.js?vrz=colab-20201124-085601-RC00_344043294:1244:97)
    at Ca.program_ (https://colab.research.google.com/v2/external/external_polymer_binary.js?vrz=colab-20201124-085601-RC00_344043294:1526:20)
    at Ea (https://colab.research.google.com/v2/external/external_polymer_binary.js?vrz=colab-20201124-085601-RC00_344043294:19:336)
    at Ca.throw_ (https://colab.research.google.com/v2/external/external_polymer_binary.js?vrz=colab-20201124-085601-RC00_344043294:18:402)
    at Ga.throw (https://colab.research.google.com/v2/external/external_polymer_binary.js?vrz=colab-20201124-085601-RC00_344043294:20:248)
    at c (https://colab.research.google.com/v2/external/external_polymer_binary.js?vrz=colab-20201124-085601-RC00_344043294:20:501)

No access to the notebook is permitted after the error message is displayed.

I would be very grateful if you could fix this problem ASAP, as I would like to understand and make use of your code in some research on Black Lives Matter I am conducting under the supervision of YY Ahn at Indiana U. Thanks!

Missing licenses on huggingface model cards

Can you please add licenses to your models on huggingface?

RuntimeError: Error(s) in loading state_dict for RobertaForSequenceClassification:

When loading the sequence classification model

model = AutoModelForSequenceClassification.from_pretrained('cardiffnlp/twitter-roberta-base-sentiment')

RobertaTokenizerFast has an issue when working on mask language modeling where it introduces an extra encoded space before the mask token. See https://github.com/huggingface/transformers/pull/2778 for more information.
Downloading: 100%|██████████| 481/481 [00:00<00:00, 479kB/s]
size mismatch for classifier.out_proj.weight: copying a param with shape torch.Size([3, 768]) from checkpoint, the shape in current model is torch.Size([2, 768]).
size mismatch for classifier.out_proj.bias: copying a param with shape torch.Size([3]) from checkpoint, the shape in current model is torch.Size([2]).

Minor problem in the model card of huggingface

Hi,

I have a problem when I use your sample code in https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment#example-of-classification.

The problem is that the code will fail the second time it runs due to some missing configuration files related to the tokenizer.

I would suggest adding one line as follows to save the metadata of tokenizer.

model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.save_pretrained(MODEL)
tokenizer.save_pretrained(MODEL)  # adding this line

Note that this change should be applied to all your model cards of huggingface.

I hope the description is clear enough for you. Please feel free to contact me for further clarification.

Can't load pre-trained model

Hi, I am trying to load the model as written in:
https://huggingface.co/cardiffnlp/twitter-roberta-base-emotion
But in the line:

MODEL = "cardiffnlp/twitter-roberta-base-emotion"
AutoModelForSequenceClassification.from_pretrained(MODEL)

I get the error:

e-packages/transformers/modeling_utils.py", line 842, in from_pretrained
    raise EnvironmentError(msg)
OSError: Can't load weights for 'cardiffnlp/twitter-roberta-base-emotion'. Make sure that:
- 'cardiffnlp/twitter-roberta-base-emotion' is a correct model identifier listed on 'https://huggingface.co/models'
- or 'cardiffnlp/twitter-roberta-base-emotion' is the correct path to a directory containing a file named one of pytorch_model.bin, tf_model.h5, model.ckpt.

Also tried to put
MODEL = "cardiffnlp/bertweet-base-emotion" but got the same error.
Both models are here: https://huggingface.co/models
Any idea what might be the problem?

Warning when loading the sentiment-analysis model

Hey, I followed the instructions on the huggingface hub page to load the sentiment analysis model,

sentiment_pipeline = pipeline("sentiment-analysis",
                           model="cardiffnlp/twitter-roberta-base-sentiment-latest",
                           tokenizer="cardiffnlp/twitter-roberta-base-sentiment-latest",
                           device=0)
# I can replicate the score
sentiment_pipeline("Covid cases are increasing fast!")
>> [{'label': 'Negative', 'score': 0.7236}]

but I got the warning below:

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

Could you explain why the warning is generated and whether we should worry about it?
