
scrolls's People

Contributors

eladsegal, urisha

scrolls's Issues

Lengths of inputs and outputs

  1. From the paper, it seems you truncate inputs to 16,384 tokens for the leaderboard; is that right?
  2. As n-gram metrics are affected by the length of outputs, how do you determine the target length of outputs? I notice that the default max_target_length in baselines/src/run.py is 128 tokens. Do you train your models with an EOS token such that the generated output may terminate much earlier?
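
For context, this is roughly the setup I am asking about, a minimal sketch rather than the actual baseline code (the model name is an assumption; I picked LED only because it accepts long inputs):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Hypothetical illustration: truncate the source to 16,384 tokens and let
# generation stop at EOS, with 128 tokens as an upper bound on the target.
model_name = "allenai/led-base-16384"  # assumption; any long-input seq2seq model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

long_document = "..."  # a full SCROLLS input, e.g. a GovReport document
inputs = tokenizer(long_document, truncation=True, max_length=16384, return_tensors="pt")
summary_ids = model.generate(
    **inputs,
    max_length=128,                       # matches the default max_target_length
    eos_token_id=tokenizer.eos_token_id,  # generation may terminate much earlier
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))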

Evaluating the results

I have trained a baseline model and run prediction on the validation split according to the instructions in the baseline README. However, the command-line output didn't point me to a destination folder containing the generated predictions for the validation dataset. I was hoping to find a JSON file with the validation-split predictions so that I can use it in the evaluator. Is there a way to find the validation-split predictions?

Moreover, is there a way for me to evaluate the results on the test split? I see that the README in the evaluator folder has the following options:

  • Evaluate predictions for a single dataset (validation only)
  • Evaluate predictions for the entire benchmark (validation only)
  • Prepare Submission File
  • Verify Submission File

I want to evaluate the metrics on the test dataset (to see whether the resulting numbers match the paper), but I don't want to generate a submission file since I'm just running the baseline models. Is there a way to do that? Thank you very much!


EDIT: I'm currently only running the QMSum dataset, not the others.
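
For reference, this is the kind of file I am hoping to find or produce, a minimal sketch (the file name and example ids are made up; I am assuming the evaluator expects a JSON object mapping example ids to prediction strings, as the prediction loop in run.py suggests):

import json

# Hypothetical illustration: a predictions file keyed by example id.
id_to_prediction = {
    "example-id-1": "generated summary text ...",
    "example-id-2": "another generated summary ...",
}
with open("qmsum_validation_predictions.json", "w") as f:
    json.dump(id_to_prediction, f)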

Prompts for tasks

For your tasks, is the "input" column the full input passed to the models? Did you add any additional prompting for the models listed on the leaderboard?

For example, for GovReport, did you (or were the teams who submitted allowed to) do something like the following?

Original Text:
<"input" column of GovReport>

Summary:
<"output" column of GovReport / output of model>

Or is there no additional prompting:

<"input" column of GovReport>
<"output" column of GovReport / output of model>

Predict command fails

First I want to thank the authors for this great work! I might find it useful for my research.

I encountered 3 problems:

  1. In evaluator/dataset_evaluator.py, the call to hf_hub_download raised an exception for me because it is written as hf_hub_download(repo_id="datasets/tau/scrolls", filename="metrics/scrolls.py") instead of hf_hub_download(repo_id="tau/scrolls", filename="metrics/scrolls.py", repo_type="dataset"); a sketch of the corrected call appears after the traceback below. I don't know why it worked for you; perhaps there was a breaking change in the datasets library recently. Would you like me to open a PR for that?
  2. The generate script (python scripts/execute.py scripts/commands/generate.py {dataset}_{model}_{split} --checkpoint_path path/to/model/folder) took a very long time, much longer than fine-tuning 256-bart. There was a warning that might be related:

Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector

Edit: I have now noticed that this warning is emitted only when I use more than one GPU. However, generation is still slower than expected.
  3. The script then failed with the following exception:

Traceback (most recent call last):
  File "/home/liranringel/scrolls/baselines/scripts/execute.py", line 53, in <module>
    main(command_dict, unknown)
  File "/home/liranringel/scrolls/baselines/scripts/execute.py", line 33, in main
    runpy.run_module(module_name, run_name="__main__")
  File "/home/liranringel/miniconda3/envs/mem/lib/python3.9/runpy.py", line 228, in run_module
    return _run_code(code, {}, init_globals, run_name, mod_spec)
  File "/home/liranringel/miniconda3/envs/mem/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/liranringel/scrolls/baselines/src/run.py", line 789, in <module>
    main()
  File "/home/liranringel/scrolls/baselines/src/run.py", line 656, in main
    metrics = trainer.evaluate(metric_key_prefix="eval")
  File "/home/liranringel/miniconda3/envs/mem/lib/python3.9/site-packages/transformers/trainer_seq2seq.py", line 131, in evaluate
    eval_preds = self._post_process_function(untokenized_eval_dataset, eval_loop_output.predictions)
  File "/home/liranringel/miniconda3/envs/mem/lib/python3.9/site-packages/transformers/trainer_seq2seq.py", line 326, in _post_process_function
    assert len(untokenized_eval_dataset) == len(self.eval_dataset)
AssertionError
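
For reference, the change I would propose in that PR (for problem 1) would look roughly like this; it is only a sketch against evaluator/dataset_evaluator.py and I have not checked the surrounding code:

# evaluator/dataset_evaluator.py (proposed change for problem 1)
from huggingface_hub import hf_hub_download

# Old call: recent huggingface_hub versions reject the "datasets/" prefix in repo_id.
# scrolls_metric_path = hf_hub_download(repo_id="datasets/tau/scrolls", filename="metrics/scrolls.py")

# New call: use the plain repo id and mark the repository type explicitly.
scrolls_metric_path = hf_hub_download(
    repo_id="tau/scrolls",
    filename="metrics/scrolls.py",
    repo_type="dataset",
)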

QuALITY validation result of LED

I am trying to get the validation results of LED on QuALITY. After running your code, I get the following results:

max input length 1024: 27.9003
max input length 4096: 23.9693
max input length 16384: 20.326

These results seem very low. Are they consistent with what you have?

Thanks.

Issues with downloading metrics from Huggingface Hub

Hi,

I am having a silly issue with the following line in metrics.py:
scrolls_metric_path = hf_hub_download(repo_id="datasets/tau/scrolls", filename="metrics/scrolls.py")

I am getting the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/rsadhukh/anaconda3/envs/llm_faiss/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 110, in _inner_fn
    validate_repo_id(arg_value)
  File "/home/rsadhukh/anaconda3/envs/llm_faiss/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 158, in validate_repo_id
    raise HFValidationError(
huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': 'datasets/tau/scrolls'. Use `repo_type` argument if needed.

Is there any workaround? Thanks in advance.

Never mind. Found a workaround.

Using custom dataset for scrolls

I want to use a custom dataset to fine-tune the models. The dataset is identical to the gov_report dataset, except that the input is filtered by a content-selection algorithm, meaning that the input in the custom dataset is only part of the original input. Since there are commands that need to be run to prepare the dataset, what steps should I follow to run SCROLLS with the custom dataset? The dataset can be found here: https://huggingface.co/datasets/learn3r/gov_report_oreo
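
To be concrete, this is how I am loading the dataset on its own, a minimal sketch (the split and column names are assumptions based on gov_report); my question is how to plug it into the preparation and fine-tuning commands:

from datasets import load_dataset

# Load the custom dataset, which mirrors gov_report with content-selected inputs.
dataset = load_dataset("learn3r/gov_report_oreo")
print(dataset)
# Assuming a "train" split with the same columns as gov_report (e.g. "input"/"output").
print(dataset["train"][0].keys())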

Prediction for Qasper test data fails

Hello,

I'm trying to replicate the fine-tuning results for the Qasper dataset baseline and the 256-bart model.

I see two issues when I try to generate predictions:

  1. When generating predictions on the Qasper validation data, only 984 samples are loaded, instead of the 1,726 stated in the paper and found in the dataset itself. This is the command I'm running:
python scripts/execute.py scripts/commands/generate.py qasper_256-bart_validation --checkpoint_path /home/ubuntu/baselines/outputs/facebook-bart-base_256_1_5e-05_16384_scrolls_qasper_site-wash-14
  2. When generating predictions for the test data, the script errors out, even though there should be 1,399 examples (a quick sanity check of the expected split sizes is sketched after the command below):
  File "/home/ubuntu/baselines/src/run.py", line 689, in main
    id_to_prediction[instance["id"]] = predict_results.predictions[i]
IndexError: index 984 is out of bounds for axis 0 with size 984

This is the command I'm using:

python scripts/execute.py scripts/commands/generate.py qasper_256-bart_test --checkpoint_path /home/ubuntu/baselines/outputs/facebook-bart-base_256_1_5e-05_16384_scrolls_qasper_site-wash-14
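
For what it's worth, this is how I am checking the expected split sizes, a minimal sketch (I am assuming "qasper" is the config name of the tau/scrolls dataset on the hub):

from datasets import load_dataset

# Expected split sizes for Qasper, for comparison with what the generate script loads.
qasper = load_dataset("tau/scrolls", "qasper")
print(len(qasper["validation"]))  # the paper reports 1,726 validation examples
print(len(qasper["test"]))        # the paper reports 1,399 test examples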

Could you please advise?
Thanks!
