

TurkingBench: Challenging AI Agents with Web-based Tasks


This repository maintains a large collection of tasks and their natural language instructions that are grounded in visual information.

Here are two example tasks:

[Screenshot: two example tasks]

You can also see a demo of the oracle model on this page: https://turkingbench.github.io/

Where can I see more tasks? You can see the instructions for each task here. Note that in this visualization, the template variables are not filled in with any values. During evaluation, the variables are filled in with the input instances. We have prepared the necessary scripts for simulating interaction with the templates (see below).

Note: The repository expects Python 3.10 or later.

Background

Why define tasks in natural language? While the current dominant paradigm (supervised learning with task-specific labeled examples) has been successful in building task-specific models, such models can't generalize to unseen tasks; for example, a model supervised to answer questions cannot solve a classification task. We hypothesize that a model equipped with the ability to understand and reason over natural language instructions should be able to generalize to any task that can be defined in terms of natural language.

How is this tied to past work? Most past work focuses on raw-text task instructions. Here, the instructions are multi-modal and grounded in visual information.

How did we collect this data? We collected about XX tasks that were originally created for crowdworkers. Each task comes with an HTML template (template.html) that contains the visual layout and a natural language instruction. Additionally, the templates contain variables to be filled in with the input instances maintained in batch.csv files.

Task schema

Each task consists of the following files:

  • template.html: This is the HTML template that is used to visualize the content of the task, including instructions, variables for visualizing the inputs, and HTML elements for collecting responses.
  • batch.csv: Contains the collection of inputs and outputs. The inputs are placed in the HTML template (a sketch of this substitution appears below). The outputs are used to compute the performance of a model solving these tasks.
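
To make the schema concrete, below is a minimal sketch (not part of the repository) of how a row of batch.csv could be substituted into template.html. The ${...} variable syntax follows the Mechanical Turk template convention; the helper name and the column handling are assumptions.

import csv

def render_task(template_path: str, batch_path: str) -> list[str]:
    # Hypothetical helper: render one HTML page per row of batch.csv.
    with open(template_path) as f:
        template = f.read()
    pages = []
    with open(batch_path) as f:
        for row in csv.DictReader(f):
            page = template
            for column, value in row.items():
                # Replace each ${column} placeholder with the row's value.
                page = page.replace("${" + column + "}", str(value))
            pages.append(page)
    return pages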

How to contribute

We welcome the addition of more tasks to the collection! If interested, please open a pull request with your tasks.

Setting up the evaluation tasks

To facilitate the evaluation of models on this data, we have included the scripts needed to simulate interaction with the templates. Here are the steps you need to follow:

  1. Install the dependencies: pip install -r requirements.txt. Then enter the src/ directory for the rest of the steps.
  2. Create a server for visualizing the tasks: ./1_run_website.sh. This creates a clone of the Turkle server at http://localhost:8000, an engine for simulating Mechanical Turk locally. If you see an error message that "Directory Turkle exists.", remove that directory (rm -rf Turkle) and retry this step. If successful, you should see the Turkle server running at http://localhost:8000 and be able to log in with the username and password you provided. At this point, Turkle will show "No Tasks available at this time"; we will add the tasks in the next two steps.
  3. Create input files for each task by running python 2_generate_input_csv.py. This creates an input.csv file for each task, which will be used for uploading the tasks to Turkle. You might ask why the input.csv files are necessary (they might seem like duplicates of batch.csv). There are two key differences: (1) input.csv files contain only the *inputs* shown to crowdworkers (no labels); (2) input.csv files are a bit shorter than batch.csv files since they contain only the unique inputs (which is what Turkle expects). A sketch of this derivation appears after this list.
  4. Now open another terminal tab and run the script for copying the tasks to the server: python 3_upload_tasks.py. While this script is running, you can go back to Turkle and see that the tasks are indeed being uploaded.
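
As referenced in step 3, here is a minimal sketch of the kind of transformation 2_generate_input_csv.py performs. The "Answer." column prefix for label columns is an assumption borrowed from the Mechanical Turk CSV convention, not necessarily the repository's exact logic.

import pandas as pd

def batch_to_input(batch_path: str, input_path: str) -> None:
    # Keep only input columns (drop the gold labels), then deduplicate
    # rows so each unique input appears once, as Turkle expects.
    df = pd.read_csv(batch_path)
    input_columns = [c for c in df.columns if not c.startswith("Answer.")]
    df[input_columns].drop_duplicates().to_csv(input_path, index=False)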

At this point, you should be able to see the tasks on Turkle. For example, if you open http://localhost:8000/task/4427/assignment/1/ you should be able to see the following visualization:

[Screenshot: a task rendered on Turkle]

Interacting with the tasks and evaluating the oracle baselines

Here we briefly describe the interaction protocol. Specifically, this Python library acts as an intermediary between the AI system and the browser (see the picture). Every time the AI system needs to interact with a task, it calls the Python library, whose commands are then executed in the browser. The Python library then returns the results back to the AI system.
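
The sketch below illustrates this intermediary pattern using Selenium. The real action library lives in src/evaluation/actions.py; the class and method names here are illustrative assumptions, not the repository's actual API.

from selenium import webdriver
from selenium.webdriver.common.by import By

class BrowserActions:
    # Illustrative stand-in for the action library.

    def __init__(self, url: str):
        self.driver = webdriver.Chrome()
        self.driver.get(url)

    def modify_text(self, field_name: str, value: str) -> None:
        # An action requested by the AI system, executed in the browser.
        element = self.driver.find_element(By.NAME, field_name)
        element.clear()
        element.send_keys(value)

    def read_page(self) -> str:
        # Results are returned back to the AI system.
        return self.driver.page_source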

Running the oracle model:

A quick way to see what model execution is expected to look like is to run our oracle baseline, which has access to the ground-truth labels. Run the evaluation script, passing in the names of the tasks:

python3 4_run_evaluation.py --solver_type oracle  --tasks test_easy  --max_instance_count 20

This opens a browser and shows the execution of the oracle model on the test tasks. Under the hood, the oracle model generates a sequence of commands (Python commands from our action library) that ultimately get executed on each task (web page). The picture below shows this idea:

[Screenshot: the oracle model's action commands being executed on a task page]

After this is done, the code dumps a JSON and a CSV file containing the results of the oracle model. If you want to run this faster without the visualization, you can set the --headless flag.

Note: To use Chrome as your webdriver, you need to first download the ChromeDriver executable from the ChromeDriver website and make sure it’s in your system’s PATH.

The existing baseline models and evaluating them

You can find a list of baseline models in baselines.py. You can run these existing baselines by specifying the solver_type argument in the above script. You can also add your own models by adding a new class to the baselines.py file, as sketched below.
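
For instance, a new solver might look roughly like the sketch below. The solve() signature is inferred from the existing code, and get_fields() is a hypothetical helper; check baselines.py for the real interface before copying this.

class TrivialBaseline:
    # Hypothetical solver that fills every field with a fixed string.

    def __init__(self, actions):
        self.actions = actions  # the action library that drives the browser

    def solve(self, instance_index: int, **kwargs):
        action_sequence = []
        for field in self.actions.get_fields():  # hypothetical helper
            action_sequence.append(self.actions.modify_text(field, "N/A"))
        return action_sequence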

Optional: Dumping the input/output data for offline training/evaluation

There are scenarios in which you may want to build a model trained on the input/output pairs of the tasks, but doing this is not efficient if you need to interact with the browser for each task. As a result, we have created functionality that allows you to dump the features of the tasks. The oracle model can be used to dump training features for training other models; all that needs to be done is to set dump_features=True in the above script. You can find our script in src/5_dump_features.py, which dumps the features for all tasks. The dumped features contain both the visual content and the HTML content of the tasks; one can use either source of signal depending on the model. The output of these models will be strings (sequences of Python actions) that will be executed on the browser.
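
As an example of consuming the dumped data, the sketch below assumes one JSON record per line containing the HTML, a screenshot path, and the target action sequence; inspect the actual output of src/5_dump_features.py before relying on this layout.

import json

def load_examples(path: str):
    # Assumed layout: one JSON object per line (JSON Lines).
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            # Either signal (HTML or screenshot) can be used, per the text above.
            yield record["html"], record["screenshot_path"], record["actions"]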

After training a model offline, you can also generate predictions with your model and evaluate those predictions. This functionality is implemented in 4_run_evaluation.py by specifying the solver to be "offline_predictions". In this setting, you will also need to pass in the path of a file that contains your model's predictions.
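
For example, the command mirrors the oracle run above (the exact flag for supplying the predictions file is not shown here; check the argument parser in 4_run_evaluation.py for its name):

python3 4_run_evaluation.py --solver_type offline_predictions --tasks test_easy --max_instance_count 20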

Citation

If you find this data useful, please cite this repository.

@article{turkingbench2024xu,
     title={Tur[k]ingBench: A Challenge Benchmark for Web Agents},
     author={Xu, Kevin and Kordi, Yeganeh and Sanders, Kate and Wang, Yizhong and Byerly, Adam and Zhang, Jack and Van Durme, Benjamin and Khashabi, Daniel},
     year={2024},
     eprint={2403.11905},
     url={https://arxiv.org/abs/2403.11905},
     archivePrefix={arXiv},
}

License

This work is licensed under Apache License 2.0.


turking-bench's Issues

The landing page

Question: if we drop index.html, would the website reflect the main README.md?

"Reading comprehension" URLS

Consider downloading the content of the URLs in the "reading comprehension" tasks so that we don't lose them in the future.

Add to the readme ...

  • How the evaluation works at a high level
  • Where the baseline call happens
  • How to run it with our trivial baseline from the command line

A diagram could also help clarify the process.

`Commonsense Morality-Text Label Validate-Collect-Extended `

@katesanders9 The inputs tended to contain more explicit content, but the responses were often subjective in nature w.r.t. sensitive topics (such as racism, etc.) and could be tricky to include in a dataset where the aim is to emulate human responses without reviewers raising ethical concerns, I think.

In `modify_select`, we need to skip `nan` values

Traceback (most recent call last):
  File "/home/runner/work/turk-instructions/turk-instructions/src/tests.py", line 25, in <module>
    test_evaluation()
  File "/home/runner/work/turk-instructions/turk-instructions/src/tests.py", line 21, in test_evaluation
    evaluation.enumerate_tasks(max_instance_count=1)
  File "/home/runner/work/turk-instructions/turk-instructions/src/4_run_evaluation.py", line 611, in enumerate_tasks
    oracle_action_sequence = self.solver.solve(i, **kwargs)
  File "/home/runner/work/turk-instructions/turk-instructions/src/evaluation/baselines.py", line 111, in solve
    action_sequence.append(self.actions.modify_select(input, answer))
  File "/home/runner/work/turk-instructions/turk-instructions/src/evaluation/actions.py", line 229, in modify_select
    elif ActionUtils.is_float(input_value) and str(int(input_value)) in option_values:
ValueError: cannot convert float NaN to integer
Error: Process completed with exit code 1.

https://github.com/JHU-CLSP/turk-instructions/actions/runs/6180036645/job/16775867745#step:4:2994
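
A minimal sketch of the kind of guard that would avoid this crash, based only on the traceback above (the surrounding modify_select logic is an assumption):

import math

def safe_int_string(input_value):
    # Return str(int(x)) for finite numeric values; None for nan/inf,
    # so callers can skip such values instead of crashing.
    value = float(input_value)
    if math.isnan(value) or math.isinf(value):
        return None
    return str(int(value))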

Statistics

Let's collect statistics on

  • number of tasks
  • number of fields
  • distribution of the field / action types

'ATOMIC - NL Rephrase 16'

When I built Turkle, this task was not populated. We should double-check this and see if there are any issues with the template or the inputs.

mturk.html vs sandbox.html

These two files are almost identical, except in a few places. It might be cleaner to merge them into one file that can represent both.
