

TurkingBench: Challenging AI Agents with Web-based Tasks


This repository maintains a large collection of tasks and their natural language instructions that are grounded in visual information.

Here are two example tasks:

[Screenshot: two example tasks]

You can also see a demo of the oracle model on this page: https://turkingbench.github.io/

Where can I see more tasks? You can see the instructions for each task here. Note that in this visualization, the template variables are not filled in with any values. During evaluation, the variables are filled in with the input instances. We have prepared the necessary scripts for simulating interaction with the templates (see below).

Note: The repository expects Python 3.10 or later.

Background

Why define tasks in natural language? While the current dominant paradigm (supervised learning with task-specific labeled examples) has been successful in building task-specific models, such models can't generalize to unseen tasks; for example, a model supervised to answer questions cannot solve a classification task. We hypothesize that a model equipped with the ability to understand and reason over natural language instructions should be able to generalize to any task that can be defined in terms of natural language.

How is this tied to past work? Most past work focuses on raw-text task instructions. Here, the instructions are multi-modal and grounded in visual information.

How did we collect this data? We collected about XX tasks that were originally created for crowdworkers. Each task comes with an HTML template (template.html) that contains the visual layout and a natural language instruction. Additionally, the templates contain variables to be filled in with the input instances maintained in batch.csv files.

Task schema

Each task consists of the following files:

  • template.html: This is the HTML template that is used to visualize the content of the task, including instructions, variables for visualizing the inputs, and HTML elements for collecting responses.
  • batch.csv: Contains the collection of inputs and outputs. The inputs are placed in the HTML template (a sketch of this substitution appears below). The outputs are used to compute the performance of a model solving these tasks.
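
To make the schema concrete, below is a minimal sketch (not part of the repository) of how a row of batch.csv could be substituted into template.html. The ${...} variable syntax follows the Mechanical Turk template convention; the helper name and the column handling are assumptions.

import csv

def render_task(template_path: str, batch_path: str) -> list[str]:
    # Hypothetical helper: render one HTML page per row of batch.csv.
    with open(template_path) as f:
        template = f.read()
    pages = []
    with open(batch_path) as f:
        for row in csv.DictReader(f):
            page = template
            for column, value in row.items():
                # Replace each ${column} placeholder with the row's value.
                page = page.replace("${" + column + "}", str(value))
            pages.append(page)
    return pages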

How to contribute

We welcome the addition of more tasks to the collection! If interested, please open a pull request with your tasks.

Setting up the evaluation tasks

To facilitate the evaluation of models on this data, we have included the scripts needed to simulate interaction with the templates. Here are the steps you need to follow:

  1. Install the dependencies: pip install -r requirements.txt. Then enter the src/ directory for the rest of the steps.
  2. Create a server for visualizing the tasks: ./1_run_website.sh. This creates a clone of the Turkle server at http://localhost:8000, an engine for simulating Mechanical Turk locally. If you see an error message that "Directory Turkle exists.", remove that directory (rm -rf Turkle) and retry this step. If successful, you should see the Turkle server running at http://localhost:8000 and be able to log in with the username and password you provided. At this point, Turkle will show "No Tasks available at this time"; we will add the tasks in the next two steps.
  3. Create input files for each task by running python 2_generate_input_csv.py. This creates an input.csv file for each task, which will be used for uploading the tasks to Turkle. You might ask why the input.csv files are necessary (they might seem like duplicates of batch.csv). There are two key differences: (1) input.csv files contain only the *inputs* shown to crowdworkers (no labels); (2) input.csv files are a bit shorter than batch.csv files since they contain only the unique inputs (which is what Turkle expects). A sketch of this derivation appears after this list.
  4. Now open another terminal tab and run the script for copying the tasks to the server: python 3_upload_tasks.py. While this script is running, you can go back to Turkle and see that the tasks are indeed being uploaded.
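
As referenced in step 3, here is a minimal sketch of the kind of transformation 2_generate_input_csv.py performs. The "Answer." column prefix for label columns is an assumption borrowed from the Mechanical Turk CSV convention, not necessarily the repository's exact logic.

import pandas as pd

def batch_to_input(batch_path: str, input_path: str) -> None:
    # Keep only input columns (drop the gold labels), then deduplicate
    # rows so each unique input appears once, as Turkle expects.
    df = pd.read_csv(batch_path)
    input_columns = [c for c in df.columns if not c.startswith("Answer.")]
    df[input_columns].drop_duplicates().to_csv(input_path, index=False)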

At this point, you should be able to see the tasks on Turkle. For example, if you open http://localhost:8000/task/4427/assignment/1/ you should be able to see the following visualization:

[Screenshot: a task rendered on Turkle]

Interacting with the tasks and evaluating the oracle baselines

Here we briefly describe the interaction protocol. Specifically, this Python library acts as an intermediary between the AI system and the browser (see the picture). Every time the AI system needs to interact with a task, it calls the Python library, whose commands are then executed in the browser. The Python library then returns the results back to the AI system.
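
The sketch below illustrates this intermediary pattern using Selenium. The real action library lives in src/evaluation/actions.py; the class and method names here are illustrative assumptions, not the repository's actual API.

from selenium import webdriver
from selenium.webdriver.common.by import By

class BrowserActions:
    # Illustrative stand-in for the action library.

    def __init__(self, url: str):
        self.driver = webdriver.Chrome()
        self.driver.get(url)

    def modify_text(self, field_name: str, value: str) -> None:
        # An action requested by the AI system, executed in the browser.
        element = self.driver.find_element(By.NAME, field_name)
        element.clear()
        element.send_keys(value)

    def read_page(self) -> str:
        # Results are returned back to the AI system.
        return self.driver.page_source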

Running the oracle model:

A quick way to see what model execution is expected to look like is to run our oracle baseline, which has access to the ground-truth labels. Run the evaluation script, passing in the names of the tasks:

python3 4_run_evaluation.py --solver_type oracle  --tasks test_easy  --max_instance_count 20

This opens a browser and shows the execution of the oracle model on the test tasks. Under the hood, the oracle model generates a sequence of commands (Python commands from our action library) that ultimately get executed on each task (web page). The picture below shows this idea:

[Screenshot: the oracle model's action commands being executed on a task page]

After this is done, the code dumps a JSON and a CSV file containing the results of the oracle model. If you want to run this faster without the visualization, you can set the --headless flag.

Note: To use Chrome as your webdriver, you need to first download the ChromeDriver executable from the ChromeDriver website and make sure it’s in your system’s PATH.

The existing baseline models and evaluating them

You can find a list of baseline models in baselines.py. You can run these existing baselines by specifying the solver_type argument in the above script. You can also add your own models by adding a new class to the baselines.py file, as sketched below.
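
For instance, a new solver might look roughly like the sketch below. The solve() signature is inferred from the existing code, and get_fields() is a hypothetical helper; check baselines.py for the real interface before copying this.

class TrivialBaseline:
    # Hypothetical solver that fills every field with a fixed string.

    def __init__(self, actions):
        self.actions = actions  # the action library that drives the browser

    def solve(self, instance_index: int, **kwargs):
        action_sequence = []
        for field in self.actions.get_fields():  # hypothetical helper
            action_sequence.append(self.actions.modify_text(field, "N/A"))
        return action_sequence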

Optional: Dumping the input/output data for offline training/evaluation

There are scenarios in which you may want to build a model trained on the input/output pairs of the tasks, but doing this is not efficient if you need to interact with the browser for each task. As a result, we have created functionality that allows you to dump the features of the tasks. The oracle model can be used to dump training features for training other models; all that needs to be done is to set dump_features=True in the above script. You can find our script in src/5_dump_features.py, which dumps the features for all tasks. The dumped features contain both the visual content and the HTML content of the tasks; one can use either source of signal depending on the model. The output of these models will be strings (sequences of Python actions) that will be executed on the browser.
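
As an example of consuming the dumped data, the sketch below assumes one JSON record per line containing the HTML, a screenshot path, and the target action sequence; inspect the actual output of src/5_dump_features.py before relying on this layout.

import json

def load_examples(path: str):
    # Assumed layout: one JSON object per line (JSON Lines).
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            # Either signal (HTML or screenshot) can be used, per the text above.
            yield record["html"], record["screenshot_path"], record["actions"]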

After training a model offline, you can also generate predictions with your model and evaluate those predictions. This functionality is implemented in 4_run_evaluation.py by specifying the solver to be "offline_predictions". In this setting, you will also need to pass in the path of a file that contains your model's predictions.
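
For example, the command mirrors the oracle run above (the exact flag for supplying the predictions file is not shown here; check the argument parser in 4_run_evaluation.py for its name):

python3 4_run_evaluation.py --solver_type offline_predictions --tasks test_easy --max_instance_count 20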

Citation

If you find this data useful, please cite this repository.

@article{turkingbench2024xu,
     title={Tur[k]ingBench: A Challenge Benchmark for Web Agents},
     author={Xu, Kevin and Kordi, Yeganeh and Sanders, Kate and Wang, Yizhong and Byerly, Adam and Zhang, Jack and Van Durme, Benjamin and Khashabi, Daniel},
     year={2024},
     eprint={2403.11905},
     url={https://arxiv.org/abs/2403.11905},
     archivePrefix={arXiv},
}

License

This work is licensed under Apache License 2.0.


turking-bench's Issues

The landing page

Question: if we drop index.html, would the website reflect the main README.md?

"Reading comprehension" URLS

Consider downloading the content of the URLs in the "reading comprehension" tasks so that we don't lose them in the future.

Add to the readme ...

  • How the evaluation works at a high level
  • Where the baseline call happens
  • How to run it with our trivial baseline from the command line

A diagram could also help clarify the process.

`Commonsense Morality-Text Label Validate-Collect-Extended `

@katesanders9 The inputs tended to contain more explicit content, but the responses were often subjective in nature w.r.t. sensitive topics (such as racism, etc.) and could be tricky to include in a dataset where the aim is to emulate human responses without reviewers raising ethical concerns, I think.

In `modify_select`, we need to skip `nan` values

Traceback (most recent call last):
  File "/home/runner/work/turk-instructions/turk-instructions/src/tests.py", line 25, in <module>
    test_evaluation()
  File "/home/runner/work/turk-instructions/turk-instructions/src/tests.py", line 21, in test_evaluation
    evaluation.enumerate_tasks(max_instance_count=1)
  File "/home/runner/work/turk-instructions/turk-instructions/src/4_run_evaluation.py", line 611, in enumerate_tasks
    oracle_action_sequence = self.solver.solve(i, **kwargs)
  File "/home/runner/work/turk-instructions/turk-instructions/src/evaluation/baselines.py", line 111, in solve
    action_sequence.append(self.actions.modify_select(input, answer))
  File "/home/runner/work/turk-instructions/turk-instructions/src/evaluation/actions.py", line 229, in modify_select
    elif ActionUtils.is_float(input_value) and str(int(input_value)) in option_values:
ValueError: cannot convert float NaN to integer
Error: Process completed with exit code 1.

https://github.com/JHU-CLSP/turk-instructions/actions/runs/6180036645/job/16775867745#step:4:2994
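
A minimal sketch of the kind of guard that would avoid this crash, based only on the traceback above (the surrounding modify_select logic is an assumption):

import math

def safe_int_string(input_value):
    # Return str(int(x)) for finite numeric values; None for nan/inf,
    # so callers can skip such values instead of crashing.
    value = float(input_value)
    if math.isnan(value) or math.isinf(value):
        return None
    return str(int(value))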

Statistics

Let's collect statistics on

  • number of tasks
  • number of fields
  • distribution of the field / action types

'ATOMIC - NL Rephrase 16'

When I built Turkle, this task was not populated. We should double-check this and see if there are any issues with the template or the inputs.

mturk.html vs sandbox.html

These two files are almost identical, except in a few places. It might be cleaner to merge them into one file that can represent both.
