
Comments (6)

timoschick commented on August 17, 2024

Hi @Punchwes, there are two options for limiting the number of unlabeled examples:

  1. You can specify --unlabeled_examples <k> for some natural number <k>, e.g. --unlabeled_examples 40000. If you do so, the entire set of unlabeled examples is shuffled and the first 40,000 examples of the shuffled dataset are chosen. Of course, this does not guarantee an equal number of examples per label.

  2. You can specify --unlabeled_examples <k> --split_examples_evenly for some natural number <k> as above. In this case, if your dataset has <n> labels, then for each label the first <k>/<n> examples found in the (unshuffled) unlabeled dataset are chosen.

For our experiments on AG's News, we chose the second option (that is, --unlabeled_examples 40000 --split_examples_evenly). If you wanted to combine both options (shuffle the dataset and select the same number of examples for each label), you'd have to implement this yourself, but it should not require more than one or two lines of code.
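For illustration, a rough sketch of such a combination (this is not part of PET and the helper name is hypothetical; it only assumes that each loaded example exposes a .label attribute, as PET's InputExample does):

    import random
    from collections import defaultdict

    def sample_evenly(examples, num_examples, num_labels, seed=42):
        """Shuffle, then keep the first num_examples/num_labels examples of each label."""
        rng = random.Random(seed)
        shuffled = list(examples)
        rng.shuffle(shuffled)
        per_label = num_examples // num_labels
        counts, selected = defaultdict(int), []
        for ex in shuffled:
            if counts[ex.label] < per_label:
                selected.append(ex)
                counts[ex.label] += 1
        return selected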

I hope this answers your question!


Punchwes commented on August 17, 2024

Hi @timoschick , thanks for your quick reply.

I think the method you describe in the paper corresponds to the second option. What confuses me is that, in the code, --split_examples_evenly never seems to apply to unlabeled data.

As the assertion in tasks.py shows:

    assert (not set_type == UNLABELED_SET) or (num_examples is not None), \
        "For unlabeled data, 'num_examples_per_label' is not allowed"

and in the example loading part in cli.py:

    train_data = load_examples(
        args.task_name, args.data_dir, TRAIN_SET, num_examples=train_ex, num_examples_per_label=train_ex_per_label)
    eval_data = load_examples(
        args.task_name, args.data_dir, eval_set, num_examples=test_ex, num_examples_per_label=test_ex_per_label)
    unlabeled_data = load_examples(
        args.task_name, args.data_dir, UNLABELED_SET, num_examples=args.unlabeled_examples)

there is no num_examples_per_label parameter passed when loading unlabeled_data. This is why I am confused: it seems that the first option is always used for unlabeled data.

    if args.split_examples_evenly:
        train_ex_per_label = eq_div(args.train_examples, len(args.label_list)) if args.train_examples != -1 else -1
        test_ex_per_label = eq_div(args.test_examples, len(args.label_list)) if args.test_examples != -1 else -1
        train_ex, test_ex = None, None

and, as far as I can see, unlabeled data is not involved in the split_examples_evenly part.

Or have I missed something in the code where --split_examples_evenly can be applied to unlabeled data?


timoschick commented on August 17, 2024

Oh right, my mistake, you are absolutely correct!
For our AG's News results, we used an older version of the code (the corresponding file can still be found here). Back then, examples were always split evenly across all labels, so option (1) from my previous comment was not possible and option (2) was the default. When I wrote the current version of PET, I explicitly removed the num_examples_per_label option for unlabeled data because, in a real-world setting, you of course do not have labels for unlabeled data, so this felt like a sensible choice at the time. But it also means that with the current version of PET, option (2) from my previous comment is no longer possible. So you'd have to either

  1. modify the code by removing the assertion and applying the if args.split_examples_evenly: [...] block to unlabeled examples as well, or
  2. write a script that extracts the first 10,000 examples for each label and writes them to a separate file, and then use this separate file as input (see the sketch below).
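For option (2), a small standalone script along these lines should do. It assumes the AG's News CSV layout (class index in the first column, then title and description) and uses hypothetical file names:

    import csv
    from collections import defaultdict

    K_PER_LABEL = 10_000                      # keep the first 10,000 examples per label
    counts = defaultdict(int)

    with open("train.csv", newline="", encoding="utf-8") as src, \
            open("unlabeled_even.csv", "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            label = row[0]                    # AG's News rows: class index, title, description
            if counts[label] < K_PER_LABEL:
                writer.writerow(row)
                counts[label] += 1

You could then point the unlabeled data loading at this file instead of the full training set.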

Sorry for the confusion!


Punchwes commented on August 17, 2024

Thanks very much for this clarification; it is very helpful, and it makes sense to remove the option for unlabeled data.

One last question I have is about seed. You mentioned in the paper that:

each model is trained three times using different seeds and average results are reported

After checking the code, it seems that the seed parameter passed on the command line (args.seed) is not used to choose the data examples:

    train_data = load_examples(
        args.task_name, args.data_dir, TRAIN_SET, num_examples=train_ex, num_examples_per_label=train_ex_per_label)
    eval_data = load_examples(
        args.task_name, args.data_dir, eval_set, num_examples=test_ex, num_examples_per_label=test_ex_per_label)
    unlabeled_data = load_examples(
        args.task_name, args.data_dir, UNLABELED_SET, num_examples=args.unlabeled_examples)

while the seed in the load_examples function is fixed at 42:

def load_examples(task, data_dir: str, set_type: str, *_, num_examples: int = None,
                  num_examples_per_label: int = None, seed: int = 42) -> List[InputExample]:

So I wonder: when you run the model 3 times with different seeds, do you also change the seed in load_examples() manually?


timoschick commented on August 17, 2024

For our experiments in Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference, we use the same set of examples for all three runs. The different seeds only affect the initialization of model parameters (for regular supervised training), dropout and the shuffling of training examples (i.e., the order in which they are presented to the model), which happens here.
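To make the distinction concrete, here is a rough sketch (standard random/numpy/torch seeding, not PET's exact code) of what the run seed does and does not control:

    import random
    import numpy as np
    import torch

    def set_run_seed(seed: int) -> None:
        """Seed what the run seed actually controls: weight init, dropout, shuffling."""
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)

    # The examples themselves are selected with the fixed seed=42 inside
    # load_examples, so they are identical across runs; only the order in
    # which they are presented to the model changes with the run seed.
    examples = list(range(10))                # stand-in for the loaded training examples
    for run_seed in (1, 2, 3):
        set_run_seed(run_seed)
        print(run_seed, random.sample(examples, len(examples)))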

If you're interested in how different sets of training examples affect performance, you might find Table 6 in this paper useful.


Punchwes commented on August 17, 2024

Thanks very much!

