
scrolls's People

Contributors

eladsegal, urisha

scrolls's Issues

Lengths of inputs and outputs

  1. From the paper, it seems you truncate inputs to 16,384 tokens for the leaderboard; is that right?
  2. As n-gram metrics are affected by the length of outputs, how do you determine the target length of outputs? I notice that the default max_target_length in baselines/src/run.py is 128 tokens. Do you train your models with an EOS token such that the generated output may terminate much earlier?
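
For context, this is roughly the setup I am asking about, a minimal sketch rather than the actual baseline code (the model name is an assumption; I picked LED only because it accepts long inputs):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Hypothetical illustration: truncate the source to 16,384 tokens and let
# generation stop at EOS, with 128 tokens as an upper bound on the target.
model_name = "allenai/led-base-16384"  # assumption; any long-input seq2seq model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

long_document = "..."  # a full SCROLLS input, e.g. a GovReport document
inputs = tokenizer(long_document, truncation=True, max_length=16384, return_tensors="pt")
summary_ids = model.generate(
    **inputs,
    max_length=128,                       # matches the default max_target_length
    eos_token_id=tokenizer.eos_token_id,  # generation may terminate much earlier
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))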

Evaluating the results

I have trained a baseline model and run prediction on the validation split according to the instructions in the baseline README. However, the command-line output didn't point me to a destination folder containing the generated predictions for the validation dataset. I was hoping to find a JSON file with the validation-split predictions so that I can use it in the evaluator. Is there a way to find the validation-split predictions?

Moreover, is there a way for me to evaluate the results on the test split? I see that the README in the evaluator folder has the following options:

  • Evaluate predictions for a single dataset (validation only)
  • Evaluate predictions for the entire benchmark (validation only)
  • Prepare Submission File
  • Verify Submission File

I want to evaluate the metrics on the test dataset (to see whether the resulting numbers match the paper), but I don't want to generate a submission file since I'm just running the baseline models. Is there a way to do that? Thank you very much!


EDIT: I'm currently only running the QMSum dataset, not the others.
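
For reference, this is the kind of file I am hoping to find or produce, a minimal sketch (the file name and example ids are made up; I am assuming the evaluator expects a JSON object mapping example ids to prediction strings, as the prediction loop in run.py suggests):

import json

# Hypothetical illustration: a predictions file keyed by example id.
id_to_prediction = {
    "example-id-1": "generated summary text ...",
    "example-id-2": "another generated summary ...",
}
with open("qmsum_validation_predictions.json", "w") as f:
    json.dump(id_to_prediction, f)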

Prompts for tasks

For your tasks, is the "input" column the full input passed to the models? Did you add any additional prompting for the models listed on the leaderboard?

For example, for GovReport, did you (or were the teams who submitted allowed to) do something like the following?

Original Text:
<"input" column of GovReport>

Summary:
<"output" column of GovReport / output of model>

Or is there no additional prompting:

<"input" column of GovReport>
<"output" column of GovReport / output of model>

Predict command fails

First I want to thank the authors for this great work! I might find it useful for my research.

I encountered 3 problems:

  1. In evaluator/dataset_evaluator.py, the call to hf_hub_download raised an exception for me because it is written as hf_hub_download(repo_id="datasets/tau/scrolls", filename="metrics/scrolls.py") instead of hf_hub_download(repo_id="tau/scrolls", filename="metrics/scrolls.py", repo_type="dataset"); a sketch of the corrected call appears after the traceback below. I don't know why it worked for you; perhaps there was a breaking change in the datasets library recently. Would you like me to open a PR for that?
  2. The generate script (python scripts/execute.py scripts/commands/generate.py {dataset}_{model}_{split} --checkpoint_path path/to/model/folder) took a very long time, much longer than fine-tuning 256-bart. There was a warning that might be related:

Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector

Edit: I have now noticed that this warning is emitted only when I use more than one GPU. However, generation is still slower than expected.
  3. The script then failed with the following exception:

Traceback (most recent call last):
  File "/home/liranringel/scrolls/baselines/scripts/execute.py", line 53, in <module>
    main(command_dict, unknown)
  File "/home/liranringel/scrolls/baselines/scripts/execute.py", line 33, in main
    runpy.run_module(module_name, run_name="__main__")
  File "/home/liranringel/miniconda3/envs/mem/lib/python3.9/runpy.py", line 228, in run_module
    return _run_code(code, {}, init_globals, run_name, mod_spec)
  File "/home/liranringel/miniconda3/envs/mem/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/liranringel/scrolls/baselines/src/run.py", line 789, in <module>
    main()
  File "/home/liranringel/scrolls/baselines/src/run.py", line 656, in main
    metrics = trainer.evaluate(metric_key_prefix="eval")
  File "/home/liranringel/miniconda3/envs/mem/lib/python3.9/site-packages/transformers/trainer_seq2seq.py", line 131, in evaluate
    eval_preds = self._post_process_function(untokenized_eval_dataset, eval_loop_output.predictions)
  File "/home/liranringel/miniconda3/envs/mem/lib/python3.9/site-packages/transformers/trainer_seq2seq.py", line 326, in _post_process_function
    assert len(untokenized_eval_dataset) == len(self.eval_dataset)
AssertionError
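
For reference, the change I would propose in that PR (for problem 1) would look roughly like this; it is only a sketch against evaluator/dataset_evaluator.py and I have not checked the surrounding code:

# evaluator/dataset_evaluator.py (proposed change for problem 1)
from huggingface_hub import hf_hub_download

# Old call: recent huggingface_hub versions reject the "datasets/" prefix in repo_id.
# scrolls_metric_path = hf_hub_download(repo_id="datasets/tau/scrolls", filename="metrics/scrolls.py")

# New call: use the plain repo id and mark the repository type explicitly.
scrolls_metric_path = hf_hub_download(
    repo_id="tau/scrolls",
    filename="metrics/scrolls.py",
    repo_type="dataset",
)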

QuALITY validation result of LED

I am trying to get the validation results of LED on QuALITY. After running your code, I get the following results:

max input length 1024: 27.9003
max input length 4096: 23.9693
max input length 16384: 20.326

These results seem very low. Are they consistent with what you have?

Thanks.

Issues with downloading metrics from Huggingface Hub

Hi,

I am having a silly issue with the following line in metrics.py:
scrolls_metric_path = hf_hub_download(repo_id="datasets/tau/scrolls", filename="metrics/scrolls.py")

I am getting the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/rsadhukh/anaconda3/envs/llm_faiss/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 110, in _inner_fn
    validate_repo_id(arg_value)
  File "/home/rsadhukh/anaconda3/envs/llm_faiss/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 158, in validate_repo_id
    raise HFValidationError(
huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': 'datasets/tau/scrolls'. Use `repo_type` argument if needed.

Is there any workaround? Thanks in advance.

Never mind. Found a workaround.

Using custom dataset for scrolls

I want to use a custom dataset to fine-tune the models. The dataset is identical to the gov_report dataset, except that the input is filtered by a content-selection algorithm, meaning that the input in the custom dataset is only part of the original input. Since there are commands that need to be run to prepare the dataset, what steps should I follow to run SCROLLS with the custom dataset? The dataset can be found here: https://huggingface.co/datasets/learn3r/gov_report_oreo
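
To be concrete, this is how I am loading the dataset on its own, a minimal sketch (the split and column names are assumptions based on gov_report); my question is how to plug it into the preparation and fine-tuning commands:

from datasets import load_dataset

# Load the custom dataset, which mirrors gov_report with content-selected inputs.
dataset = load_dataset("learn3r/gov_report_oreo")
print(dataset)
# Assuming a "train" split with the same columns as gov_report (e.g. "input"/"output").
print(dataset["train"][0].keys())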

Prediction for Qasper test data fails

Hello,

I'm trying to replicate the fine-tuning results for the Qasper dataset baseline and the 256-bart model.

I see two issues when I try to generate predictions:

  1. When generating predictions on the Qasper validation data, only 984 samples are loaded, instead of the 1,726 stated in the paper and found in the dataset itself. This is the command I'm running:
python scripts/execute.py scripts/commands/generate.py qasper_256-bart_validation --checkpoint_path /home/ubuntu/baselines/outputs/facebook-bart-base_256_1_5e-05_16384_scrolls_qasper_site-wash-14
  2. When generating predictions for the test data, the script errors out, even though there should be 1,399 examples (a quick sanity check of the expected split sizes is sketched after the command below):
  File "/home/ubuntu/baselines/src/run.py", line 689, in main
    id_to_prediction[instance["id"]] = predict_results.predictions[i]
IndexError: index 984 is out of bounds for axis 0 with size 984

This is the command I'm using:

python scripts/execute.py scripts/commands/generate.py qasper_256-bart_test --checkpoint_path /home/ubuntu/baselines/outputs/facebook-bart-base_256_1_5e-05_16384_scrolls_qasper_site-wash-14
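
For what it's worth, this is how I am checking the expected split sizes, a minimal sketch (I am assuming "qasper" is the config name of the tau/scrolls dataset on the hub):

from datasets import load_dataset

# Expected split sizes for Qasper, for comparison with what the generate script loads.
qasper = load_dataset("tau/scrolls", "qasper")
print(len(qasper["validation"]))  # the paper reports 1,726 validation examples
print(len(qasper["test"]))        # the paper reports 1,399 test examples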

Could you please advise?
Thanks!
