
Comments (6)

timoschick commented on August 17, 2024

Hi @Punchwes, there are two options for limiting the number of unlabeled examples:

  1. You can specify --unlabeled_examples <k> for some natural number <k>, e.g. --unlabeled_examples 40000. If you do so, the entire set of unlabeled examples is shuffled and the first 40,000 examples of the shuffled dataset are chosen. Of course, this does not guarantee an equal number of examples per label.

  2. You can specify --unlabeled_examples <k> --split_examples_evenly for some natural number <k> as above. In this case, if your dataset has <n> labels, then for each label the first <k>/<n> examples found in the (unshuffled) unlabeled dataset are chosen.

For our experiments on AG's News, we chose the second option (that is, --unlabeled_examples 40000 --split_examples_evenly). If you wanted to combine both options (shuffle the dataset and select the same number of examples for each label), you'd have to implement this yourself, but it should not require more than one or two lines of code.
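For illustration, a rough sketch of such a combination (this is not part of PET and the helper name is hypothetical; it only assumes that each loaded example exposes a .label attribute, as PET's InputExample does):

    import random
    from collections import defaultdict

    def sample_evenly(examples, num_examples, num_labels, seed=42):
        """Shuffle, then keep the first num_examples/num_labels examples of each label."""
        rng = random.Random(seed)
        shuffled = list(examples)
        rng.shuffle(shuffled)
        per_label = num_examples // num_labels
        counts, selected = defaultdict(int), []
        for ex in shuffled:
            if counts[ex.label] < per_label:
                selected.append(ex)
                counts[ex.label] += 1
        return selected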

I hope this answers your question!


Punchwes commented on August 17, 2024

Hi @timoschick , thanks for your quick reply.

I think the method you describe in the paper corresponds to the second option. What confuses me is that, in the code, --split_examples_evenly never seems to apply to unlabeled data.

As the assertion in tasks.py shows:

    assert (not set_type == UNLABELED_SET) or (num_examples is not None), \
        "For unlabeled data, 'num_examples_per_label' is not allowed"

and in the example loading part in cli.py:

    train_data = load_examples(
        args.task_name, args.data_dir, TRAIN_SET, num_examples=train_ex, num_examples_per_label=train_ex_per_label)
    eval_data = load_examples(
        args.task_name, args.data_dir, eval_set, num_examples=test_ex, num_examples_per_label=test_ex_per_label)
    unlabeled_data = load_examples(
        args.task_name, args.data_dir, UNLABELED_SET, num_examples=args.unlabeled_examples)

there is no num_examples_per_label parameter passed when loading unlabeled_data. This is why I am confused: it seems that the first option is always used for unlabeled data.

    if args.split_examples_evenly:
        train_ex_per_label = eq_div(args.train_examples, len(args.label_list)) if args.train_examples != -1 else -1
        test_ex_per_label = eq_div(args.test_examples, len(args.label_list)) if args.test_examples != -1 else -1
        train_ex, test_ex = None, None

and, as far as I can see, unlabeled data is not involved in the split_examples_evenly part.

Or have I missed something in the code where --split_examples_evenly can be applied to unlabeled data?


timoschick commented on August 17, 2024

Oh right, my mistake, you are absolutely correct!
For our AG's News results, we used an older version of the code (the corresponding file can still be found here). Back then, examples were always split evenly across all labels, so option (1) from my previous comment was not possible and option (2) was the default. When I wrote the current version of PET, I explicitly removed the num_examples_per_label option for unlabeled data because, in a real-world setting, you of course do not have labels for unlabeled data, so this felt like a sensible choice at the time. But it also means that with the current version of PET, option (2) from my previous comment is no longer possible. So you'd have to either

  1. modify the code by removing the assertion and applying the if args.split_examples_evenly: [...] block to unlabeled examples as well, or
  2. write a script that extracts the first 10,000 examples for each label and writes them to a separate file, and then use this separate file as input (see the sketch below).
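For option (2), a small standalone script along these lines should do. It assumes the AG's News CSV layout (class index in the first column, then title and description) and uses hypothetical file names:

    import csv
    from collections import defaultdict

    K_PER_LABEL = 10_000                      # keep the first 10,000 examples per label
    counts = defaultdict(int)

    with open("train.csv", newline="", encoding="utf-8") as src, \
            open("unlabeled_even.csv", "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            label = row[0]                    # AG's News rows: class index, title, description
            if counts[label] < K_PER_LABEL:
                writer.writerow(row)
                counts[label] += 1

You could then point the unlabeled data loading at this file instead of the full training set.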

Sorry for the confusion!


Punchwes commented on August 17, 2024

Thanks very much for this clarification; it is very helpful, and it makes sense to remove the option for unlabeled data.

One last question I have is about seed. You mentioned in the paper that:

each model is trained three times using different seeds and average results are reported

After checking the code, it seems that the seed parameter passed on the command line (args.seed) is not used to choose the data examples:

    train_data = load_examples(
        args.task_name, args.data_dir, TRAIN_SET, num_examples=train_ex, num_examples_per_label=train_ex_per_label)
    eval_data = load_examples(
        args.task_name, args.data_dir, eval_set, num_examples=test_ex, num_examples_per_label=test_ex_per_label)
    unlabeled_data = load_examples(
        args.task_name, args.data_dir, UNLABELED_SET, num_examples=args.unlabeled_examples)

while the seed in the load_examples function is fixed at 42:

def load_examples(task, data_dir: str, set_type: str, *_, num_examples: int = None,
                  num_examples_per_label: int = None, seed: int = 42) -> List[InputExample]:

So I wonder: when you run the model 3 times with different seeds, do you also change the seed in load_examples() manually?


timoschick commented on August 17, 2024

For our experiments in Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference, we use the same set of examples for all three runs. The different seeds only affect the initialization of model parameters (for regular supervised training), dropout and the shuffling of training examples (i.e., the order in which they are presented to the model), which happens here.
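To make the distinction concrete, here is a rough sketch (standard random/numpy/torch seeding, not PET's exact code) of what the run seed does and does not control:

    import random
    import numpy as np
    import torch

    def set_run_seed(seed: int) -> None:
        """Seed what the run seed actually controls: weight init, dropout, shuffling."""
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)

    # The examples themselves are selected with the fixed seed=42 inside
    # load_examples, so they are identical across runs; only the order in
    # which they are presented to the model changes with the run seed.
    examples = list(range(10))                # stand-in for the loaded training examples
    for run_seed in (1, 2, 3):
        set_run_seed(run_seed)
        print(run_seed, random.sample(examples, len(examples)))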

If you're interested in how different sets of training examples affect performance, you might find Table 6 in this paper useful.


Punchwes commented on August 17, 2024

Thanks very much!

