web-arena-x / visualwebarena
VisualWebArena is a benchmark for multimodal agents.
Home Page: https://jykoh.com/vwa
License: MIT License
hi,
When there are many images on a webpage, how do you handle the input length limit of CogVLM?
Thanks!
docker compose up --build -d
Can we just download with SFTP or other tools instead?
The screenshot seems problematic when I run the GPT-4V + SoM agent with the following flags:
python run.py \
--instruction_path agent/prompts/jsons/p_som_cot_id_actree_3s.json \
--test_start_idx 0 \
--test_end_idx 1 \
--result_dir <your_result_dir> \
--test_config_base_dir=config_files/test_shopping \
--model gpt-4-vision-preview \
--action_set_tag som --observation_type image_som
Here is part of the render_0.html:
The GPT response also shows that the image sent was empty.
Hi There,
I'm running into test failures when I run the pytest test suite.
Here is my error:
tests/test_browser_env/test_script_browser_env.py s.s.......F
============================================================ FAILURES ============================================================
____________________________________________________ test_click_open_new_tab _____________________________________________________
accessibility_tree_current_viewport_script_browser_env = <browser_env.envs.ScriptBrowserEnv object at 0x7f1e5af406d0>
def test_click_open_new_tab(
    accessibility_tree_current_viewport_script_browser_env: ScriptBrowserEnv,
) -> None:
    env = accessibility_tree_current_viewport_script_browser_env
    env.reset()
    env.step(
        create_playwright_action(
            "page.goto('https://www.w3schools.com/jsref/tryit.asp?filename=tryjsref_win_open')"
        )
    )
    obs, *_, info = env.step(
        create_playwright_action(
            'page.frame_locator("iframe[name=\\"iframeResult\\"]").get_by_role("button", name="Try it").click()'
        )
    )
    print("TP")
    print(info["page"].url)
> assert info["page"].url == "https://www.w3schools.com/"
E AssertionError: assert 'https://www....sref_win_open' == 'https://www.w3schools.com/'
E - https://www.w3schools.com/
E + https://www.w3schools.com/jsref/tryit.asp?filename=tryjsref_win_open
tests/test_browser_env/test_script_browser_env.py:293: AssertionError
------------------------------------------------------ Captured stdout call ------------------------------------------------------
TP
https://www.w3schools.com/jsref/tryit.asp?filename=tryjsref_win_open
I see that there has been some recent activity here, and that this is actually a new test introduced in #23.
How can I resolve this?
Thanks
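In case it helps with debugging, here is a minimal Playwright sketch (not the repo's test code) to check whether clicking "Try it" actually opens a popup with your Playwright/browser versions; if the popup never opens, info["page"] stays on the tryit page, which matches the failure above.

# Hedged diagnostic sketch; assumes a sync Playwright `page` already on the w3schools tryit URL.
with page.expect_popup() as popup_info:
    page.frame_locator('iframe[name="iframeResult"]').get_by_role(
        "button", name="Try it"
    ).click()
print(popup_info.value.url)  # expected: https://www.w3schools.com/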
Does this mean I need to restart the Docker containers?
Following the instructions you provided, the URL for classifieds is set to "http://localhost:9980/". The home page loads correctly, but all the assets such as images, stylesheets, and links are served over "https://" instead of "http://", which results in ERR_CONNECTION_CLOSED. As a result, we get only a blank page because the resources are not retrieved properly, causing the agent to fail the tasks.
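A quick, hedged way to confirm the scheme mismatch (just a diagnostic, not a fix) is to list the asset URLs the homepage emits:

# If these come back as https://..., the site is generating asset links with the
# wrong scheme for a plain-http deployment.
curl -s http://localhost:9980/ | grep -o 'https://[^"]*' | head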
In the paper, you said you truncate the input text to 640 tokens for CogVLM, but this may discard important choices. Could you please share how you construct the prompt for CogVLM?
Thanks for open-sourcing such great work.
May I know how to use Apptainer instead of Docker for starting up the websites?
While running the evaluation for Classifieds (and also for Reddit), I get the error 'Page' object has no attribute 'client'.
The stack trace is shown below (this also happens for config_files/test_classifieds/211.json). Did you face this issue? Any suggestion to fix this is highly appreciated.
[Config file]: config_files/test_classifieds/117.json
[Unhandled Error] AttributeError("'Page' object has no attribute 'client'")
Traceback (most recent call last):
File "/home/pahuja.9/visualwebarena/run.py", line 396, in test
obs, _, terminated, _, info = env.step(action)
File "/home/pahuja.9/visualwebarena/browser_env/envs.py", line 307, in step
observation = self._get_obs()
File "<@beartype(browser_env.envs.ScriptBrowserEnv._get_obs) at 0x7f1a3d76e200>", line 10, in _get_obs
File "/home/pahuja.9/visualwebarena/browser_env/envs.py", line 226, in _get_obs
self.page, self.get_page_client(self.page)
File "<@beartype(browser_env.envs.ScriptBrowserEnv.get_page_client) at 0x7f1a3d76e050>", line 33, in get_page_client
File "/home/pahuja.9/visualwebarena/browser_env/envs.py", line 221, in get_page_client
return page.client # type: ignore
AttributeError: 'Page' object has no attribute 'client'
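For reference, a hedged sketch of a possible workaround (not necessarily the repo's intended fix): the env stores a CDP session on each page as page.client, so a page created outside the env's own setup (e.g. a popup or new tab) may be missing it; attaching one lazily avoids the AttributeError.

# Hedged sketch using the Playwright sync API; `page` is the current page.
client = getattr(page, "client", None)
if client is None:
    # Attach a Chrome DevTools Protocol session and cache it on the page.
    client = page.context.new_cdp_session(page)
    page.client = client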
I would like to modify the sites provided in the tar files by adding my own experiments to them as HTML code. Is there an easy way to do this?
For config_files/test_reddit/69.json, I get the following error from the LLM-based fuzzy match metric.
[Unhandled Error] AssertionError('n/a')
Traceback (most recent call last):
File "/home/pahuja.9/visualwebarena/run.py", line 412, in test
score = evaluator(
File "/home/pahuja.9/visualwebarena/evaluation_harness/evaluators.py", line 626, in __call__
cur_score = evaluator(trajectory, config_file, page, client)
File "<@beartype(evaluation_harness.evaluators.HTMLContentExactEvaluator.__call__) at 0x7f992c464790>", line 115, in __call__
File "/home/pahuja.9/visualwebarena/evaluation_harness/evaluators.py", line 472, in __call__
StringEvaluator.fuzzy_match(
File "<@beartype(evaluation_harness.evaluators.StringEvaluator.fuzzy_match) at 0x7f992c453e20>", line 69, in fuzzy_match
File "/home/pahuja.9/visualwebarena/evaluation_harness/evaluators.py", line 197, in fuzzy_match
return llm_fuzzy_match(pred, ref, intent)
File "<@beartype(evaluation_harness.helper_functions.llm_fuzzy_match) at 0x7f992c452cb0>", line 69, in llm_fuzzy_match
File "/home/pahuja.9/visualwebarena/evaluation_harness/helper_functions.py", line 609, in llm_fuzzy_match
assert "correct" in response, response
AssertionError: n/a
I am using the same LLM for fuzzy match as in the original code.
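For what it's worth, here is a hedged sketch of a more defensive parse (not the repo's exact logic in helper_functions.llm_fuzzy_match): treat any judge reply that doesn't contain "correct" as a failed match instead of raising.

# Hedged sketch: `response` is the judge model's raw reply, e.g. "N/A".
response = response.lower().strip()
if "incorrect" in response:
    score = 0.0
elif "correct" in response:
    score = 1.0
else:
    score = 0.0  # fall back instead of `assert "correct" in response`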
Hi,
A browser opens and then immediately closes, and I get this message:
[Unhandled Error] AssertionError('devicePixelRatio is not 1.0')].
Could you advise?
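A small hedged diagnostic (assuming the assertion comes from the env checking the browser's reported scaling) to see what value Chromium reports on your machine:

from playwright.sync_api import sync_playwright

# The env expects window.devicePixelRatio == 1.0; HiDPI displays or OS display
# scaling can make Chromium report a different value.
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("about:blank")
    print(page.evaluate("window.devicePixelRatio"))
    browser.close()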
Hi team,
Thanks for releasing this interesting work.
I have a question about the unit test file (test_action_functionalities.py).
Ideally, it should parse some text like: "textbox 'Full name'"
But this is what I actually get from create_playwright_action.
This is the corresponding HTML content for this part.
I followed the README for the env setup. I am using Ubuntu 22, Playwright 1.37.0, and Python 3.10.
Do you have any suggestions on this issue? Is it a Playwright version problem or a browser version problem?
Thanks a lot for your help
Hello,
I'm looking to reproduce some of the open-source model results from the VWA paper:
(1) Mixtral-8x7B model as the LLM backbone for Caption-augmented model
(2) CogVLM for the Multimodal Model.
Could someone share with me any flags/commands or instructions to setup these configurations for eval?
What is the tentative schedule for releasing multimodal agents that can operate on arbitrary websites?
Thanks to the authors for releasing the GPT4+SOM trajectories.
However, I do not see any way to tell which traces correspond to succeeding tasks vs. failing tasks. Can this information be released as well?
This was done in the WebArena repository while releasing the GPT execution traces: https://github.com/web-arena-x/webarena/tree/main/resources#1132023-execution-traces-from-our-experiments-v2
The evaluation script https://github.com/web-arena-x/visualwebarena/blob/main/scripts/run_reddit_som.sh goes up to index 208 rather than 209, as it should for 210 total examples. Similarly, https://github.com/web-arena-x/visualwebarena/blob/main/scripts/run_classifieds_som.sh goes up to 232 rather than 233, as it should for 234 examples. It is an easy fix; I just want to confirm.
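To illustrate the off-by-one (purely illustrative; the actual scripts may structure their loops differently): with 210 examples numbered 0-209, an inclusive loop has to end at 209.

# Hypothetical sketch, not the repo's script.
for i in $(seq 0 209); do
  echo "config_files/test_reddit/${i}.json"
done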
Hello,
I hope this message finds you well. I encountered an issue with the download links for the WebArena environment images hosted on Archive.org. The links appear to be broken and display an error message regarding metadata issues.
Affected Links:
Shopping Website Image
Wikipedia Website Image
Could you please look into this and provide updated links or fix the current ones?
Thank you very much for your assistance!
I found some errors in the annotations.
In the classifieds_10:
sites: ['classifieds']
task_id: 10
require_login: True
storage_state: ./.auth/classifieds_state.json
start_url: http://localhost:9980
geolocation: None
intent_template: What is the {{attribute}} of {{item}}?
intent: What is the seat height in inches of the smaller piece of furniture on this page?
image: None
instantiation_dict: {'attribute': 'seat height in inches', 'item': 'the smaller piece of furniture on this page'}
require_reset: False
eval: {'eval_types': ['string_match'], 'reference_answers': {'exact_match': '21'}, 'reference_url': 'http://localhost:9980/index.php?page=item&id=43887', 'program_html': [], 'string_note': '', 'reference_answer_raw_annotation': ''}
reasoning_difficulty: easy
visual_difficulty: easy
overall_difficulty: easy
comments:
intent_template_id: 5
The output is 21 inches, which I think is correct.
In classifieds 142, the agent found the wrong things in the GPT-4V trace, but it is evaluated as correct.
Hello!
I'm having issues with the shopping website where items won't display in search or in the catalog. From the admin panel, I believe this is because one of the indexers is invalid, which in turn is due to OpenSearch/Elasticsearch not working.
Testing Elasticsearch in the admin panel throws the error "Class "" does not exist" even when localhost is running Elasticsearch on 9200, or "No Alive Nodes Found" when either Elasticsearch or OpenSearch is used. I was therefore hoping you could provide more information about how the search feature is configured within VisualWebArena, as well as how one might link the two together.
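As a hedged first check (the container name and port here are assumptions based on the default setup, so adjust them to your deployment), you can see whether a search backend is reachable from inside the shopping container rather than from the host:

# If this fails inside the container, Magento's indexers cannot reach the
# search backend even if the host has one listening on 9200.
docker exec shopping curl -s http://localhost:9200 || echo "no search backend reachable from inside the container"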
Hi, I ran the following commands in the environment readme to install classifieds environment, but encountered an OSClass Error:
unzip classifieds_docker_compose.zip
cd classifieds_docker_compose
vi classifieds_docker_compose/docker-compose.yml # Set CLASSIFIEDS to your site url `http://<your-server-hostname>:9980/`, and change the reset token if required
docker compose up --build -d
# Wait for compose up to finish. This may take a while on the first launch as it downloads several large images from dockerhub.
docker exec classifieds_db mysql -u root -ppassword osclass -e 'source docker-entrypoint-initdb.d/osclass_craigslist.sql' # Populate DB with content
Screenshot:
However, when I ran docker exec classifieds_db mysql -u root -ppassword osclass -e "SHOW TABLES;"
to query the database tables, everything looked fine.
Could you help me resolve this? Thanks!
Hello,
Could you please share some of the configuration settings to reproduce the various model types?
I tried to reproduce the caption-augmented setup (Acc Tree + Caps), but my value was closer to the Multimodal result that also takes the image screenshot as input. I'm hoping to get clarification on how to switch between the 4 modes.
Here are my configurations (a run.py invocation sketch follows the list):
(1) Text-Only
observation_type: accessibility_tree
action_set_tag: id_accessibility_tree
(2) Caption-Augmented
observation_type: accessibility_tree_with_captioner
action_set_tag: id_accessibility_tree
(3) Multimodal
observation_type: ???
action_set_tag: id_accessibility_tree
(4) Multimodal (SoM)
observation_type: image_som
action_set_tag: som
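For reference, a hedged sketch of how settings (2) would translate into a run.py invocation, reusing flags that appear in the GPT-4V SoM command earlier in this page; the model and captioning flags are omitted/assumed and may need adjusting for your setup:

python run.py \
  --observation_type accessibility_tree_with_captioner \
  --action_set_tag id_accessibility_tree \
  --test_config_base_dir config_files/test_classifieds \
  --result_dir <your_result_dir>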
CogVLM only supports a single image as input. How do you evaluate CogVLM when there are multiple images as input? Thank you.
When the dataset is "visualwebarena", it seems that there is no GITLAB in env_config
For example, when searching for "blue kayak" http://ec2-3-13-232-171.us-east-2.compute.amazonaws.com:9980/index.php?page=search&sOrder=dt_pub_date&iOrderType=desc&sPattern=blue+kayak
The results show nothing related to a kayak at all (and none of them are red).
Thank you for your work. I can reach the homepage at http://127.0.0.1:4399 but not at www.homepage.com. Do I have to change the host of my server to navigate to www.homepage.com?
Hello,
I was wondering whether all of the website pages are included in the Google Drive downloads for the VisualWebArena environment setup? For One Stop Shop, it only displays a total of 24 items (and no items under the category tabs), at least for me.
Hi, I am trying to setup the classifieds website as outlined here https://github.com/web-arena-x/visualwebarena/blob/main/environment_docker/README.md#classifieds-website
When I execute docker exec classifieds_db mysql -u root -ppassword osclass -e 'source docker-entrypoint-initdb.d/osclass_craigslist.sql' # Populate DB with content
, I get the error ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2).
Since Docker is not affected by my local environment, it should ideally run fine. Kindly help me resolve this; a possible workaround sketch is included after the compose file below.
My docker-compose.yml is given below:
version: '3.1'
services:
  web:
    image: jykoh/classifieds:latest
    ports:
      - "9980:9980"
    depends_on:
      - db
    container_name: classifieds
    environment:
      - CLASSIFIEDS=http://127.0.0.1:9980/
      - RESET_TOKEN=4b61655535e7ed388f0d40a93600254c
  db:
    image: mysql:8.1
    restart: always
    environment:
      MYSQL_ROOT_PASSWORD: password
      MYSQL_DATABASE: osclass
    volumes:
      - ./mysql:/docker-entrypoint-initdb.d
      - db_data:/var/lib/mysql
    container_name: classifieds_db
volumes:
  db_data: {}
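One hedged workaround for the ERROR 2002 above (assuming the DB container simply hasn't finished initializing when the exec runs) is to wait until mysqld accepts connections before loading the dump:

# Poll mysqld inside the container, then populate the DB.
until docker exec classifieds_db mysqladmin -u root -ppassword ping --silent; do
  sleep 5
done
docker exec classifieds_db mysql -u root -ppassword osclass -e 'source docker-entrypoint-initdb.d/osclass_craigslist.sql'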
'''
Processing config_files/test_classifieds\5.json
2024-03-28 22:25:05,874 - INFO - [Config file]: config_files/test_classifieds\5.json
2024-03-28 22:25:05,875 - INFO - [Intent]: Navigate to my listing of the white car and delete it.
2024-03-28 22:25:06,181 - INFO - [Unhandled Error] InvalidSchema("No connection adapters were found for '127.0.0.1:9980/index.php?page=reset'")]
Processing config_files/test_classifieds\6.json
2024-03-28 22:25:06,182 - INFO - [Config file]: config_files/test_classifieds\6.json
2024-03-28 22:25:06,183 - INFO - [Intent]: Return the links of the 3 most recent motorcycles within $1000 to $2000 that are not orange.
Start testing config_files/test_classifieds\6.json
Finish testing config_files/test_classifieds\6.json
2024-03-28 22:27:15,567 - INFO - [Unhandled Error] LookupError("\n**********************************************************************\n Resource \x1b[93mpunkt\x1b[0m not found.\n Please use the NLTK Downloader to obtain the resource:\n\n \x1b[31m>>> import nltk\n >>> nltk.download('punkt')\n \x1b[0m\n For more information see: https://www.nltk.org/data.html\n\n Attempted to load \x1b[93mtokenizers/punkt/english.pickle\x1b[0m\n\n Searched in:\n - 'C:\\Users\\PS/nltk_data'\n - 'C:\\Users\\PS\\Desktop\\visualwebarena\\venv\\nltk_data'\n - 'C:\\Users\\PS\\Desktop\\visualwebarena\\venv\\share\\nltk_data'\n - 'C:\\Users\\PS\\Desktop\\visualwebarena\\venv\\lib\\nltk_data'\n - 'C:\\Users\\PS\\AppData\\Roaming\\nltk_data'\n - 'C:\\nltk_data'\n - 'D:\\nltk_data'\n - 'E:\\nltk_data'\n - ''\n**********************************************************************\n")]
Processing config_files/test_classifieds\7.json
2024-03-28 22:27:15,570 - INFO - [Config file]: config_files/test_classifieds\7.json
2024-03-28 22:27:15,570 - INFO - [Intent]: Return the links of the 2 most recent items in the "Cell phones" category within $300 to $600 that are white in color.
Start testing config_files/test_classifieds\7.json
Finish testing config_files/test_classifieds\7.json
2024-03-28 22:28:47,585 - INFO - [Unhandled Error] LookupError("\n**********************************************************************\n Resource \x1b[93mpunkt\x1b[0m not found.\n Please use the NLTK Downloader to obtain the resource:\n\n \x1b[31m>>> import nltk\n >>> nltk.download('punkt')\n \x1b[0m\n For more information see: https://www.nltk.org/data.html\n\n Attempted to load \x1b[93mtokenizers/punkt/english.pickle\x1b[0m\n\n Searched in:\n - 'C:\\Users\\PS/nltk_data'\n - 'C:\\Users\\PS\\Desktop\\visualwebarena\\venv\\nltk_data'\n - 'C:\\Users\\PS\\Desktop\\visualwebarena\\venv\\share\\nltk_data'\n - 'C:\\Users\\PS\\Desktop\\visualwebarena\\venv\\lib\\nltk_data'\n - 'C:\\Users\\PS\\AppData\\Roaming\\nltk_data'\n - 'C:\\nltk_data'\n - 'D:\\nltk_data'\n - 'E:\\nltk_data'\n - ''\n**********************************************************************\n")]
'''
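The LookupError in the log above is NLTK asking for its "punkt" tokenizer data; the standard one-time download it suggests should resolve it:

import nltk

# One-time download of the tokenizer data the evaluator needs.
nltk.download('punkt')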
Thank you for your work. I reset Reddit by running
bash ./scripts/reset_reddit.sh
Then I ran prepare.sh, but I always get the following error:
Traceback (most recent call last):
File "code/vwa/browser_env/auto_login.py", line 182, in <module>
main()
File "code/vwa/browser_env/auto_login.py", line 173, in main
assert not future.result(), f"Cookie {cookie_files[i]} expired."
AssertionError: Cookie ./.auth/reddit_state.json expired.
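A hedged suggestion (auto_login.py is the script named in the traceback; I haven't verified its exact flags, so check its argparse options): delete the stale cookie files and regenerate them after resetting Reddit, then re-run prepare.sh.

# Assumes the defaults write the cookies back into ./.auth/.
rm -f ./.auth/*.json
python browser_env/auto_login.py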
I am trying to view one of the trace files, 463.trace.zip. Here are the commands I am using:
unzip 463.trace.zip -d 463_trace
xvfb-run playwright show-trace 463_trace
It has been sitting for several hours. Is this expected, or is there a better way to view the trace?
Thanks!
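In case it's useful: playwright show-trace launches a GUI viewer, so under xvfb-run there is nothing to display and it just waits. On a machine with a display you can pass the zip directly (no unzip needed), or load the zip in the hosted viewer at https://trace.playwright.dev:

playwright show-trace 463.trace.zip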
I've been able to host the shopping website successfully but noticed that running scripts/reset_shopping.sh causes the shopping website to clear all items not on the homepage, i.e. all the categories no longer have items. Specifically, it appears the command
docker exec $CONTAINER_NAME /var/www/magento2/bin/magento setup:store-config:set --base-url="http://localhost:7770" # no trailing slash
is causing this issue.
I was wondering what this command does in the context of the repo, and if it's safe to remove? If not, do you know what about this command could be causing the issue? I am currently hosting the shopping website on http://127.0.0.1:7770.
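For context, a hedged note: setup:store-config:set --base-url rewrites Magento's base URL, so if the site is actually served at http://127.0.0.1:7770, the script's hard-coded http://localhost:7770 would mismatch what the storefront links to. A sketch of pointing it at the actual host and flushing caches (adjust to your deployment; this is not necessarily the repo's intended fix):

docker exec $CONTAINER_NAME /var/www/magento2/bin/magento setup:store-config:set --base-url="http://127.0.0.1:7770"
docker exec $CONTAINER_NAME /var/www/magento2/bin/magento cache:flush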
On the Classifieds and Reddit tasks, there are multiple image links that do not exist. An example of such errors is as follows:
L616 WARNING: cannot identify image file <_io.BytesIO object at 0x7f8b5c2f1ee0>
Why do you have to remove the duplicated content? Does this filter out cases where two elements have the same text?
if content in text_content_text:
    # Remove text_content_elements with content
    text_content_elements = [
        element
        for element in text_content_elements
        if element.strip() != content
    ]
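A small illustrative example (with hypothetical values) of what the filter above does: every element whose stripped text equals content is dropped, so yes, two distinct elements that happen to share the exact same text would both be removed.

# Hypothetical inputs for illustration only.
text_content_elements = ["[1] Add to Cart", "Add to Cart", "Add to Cart", "Checkout"]
text_content_text = " ".join(text_content_elements)
content = "Add to Cart"
if content in text_content_text:
    text_content_elements = [
        element
        for element in text_content_elements
        if element.strip() != content
    ]
print(text_content_elements)  # ['[1] Add to Cart', 'Checkout']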