web-arena-x / visualwebarena
VisualWebArena is a benchmark for multimodal agents.
Home Page: https://jykoh.com/vwa
License: MIT License
hi,
When there are many images on a webpage, how do you handle the input length limit of CogVLM?
Thanks!
docker compose up --build -d
Can we just download with SFTP or other tools instead?
The screenshot seems problematic when I run the GPT-4V + SoM agent with the following flags:
python run.py \
--instruction_path agent/prompts/jsons/p_som_cot_id_actree_3s.json \
--test_start_idx 0 \
--test_end_idx 1 \
--result_dir <your_result_dir> \
--test_config_base_dir=config_files/test_shopping \
--model gpt-4-vision-preview \
--action_set_tag som --observation_type image_som
Here is part of the render_0.html:
The GPT response also shows that the image sent was empty.
Hi There,
I'm running into test failures when I run the pytest test suite.
Here is my error:
tests/test_browser_env/test_script_browser_env.py s.s.......F
============================================================ FAILURES ============================================================
____________________________________________________ test_click_open_new_tab _____________________________________________________
accessibility_tree_current_viewport_script_browser_env = <browser_env.envs.ScriptBrowserEnv object at 0x7f1e5af406d0>
def test_click_open_new_tab(
    accessibility_tree_current_viewport_script_browser_env: ScriptBrowserEnv,
) -> None:
    env = accessibility_tree_current_viewport_script_browser_env
    env.reset()
    env.step(
        create_playwright_action(
            "page.goto('https://www.w3schools.com/jsref/tryit.asp?filename=tryjsref_win_open')"
        )
    )
    obs, *_, info = env.step(
        create_playwright_action(
            'page.frame_locator("iframe[name=\\"iframeResult\\"]").get_by_role("button", name="Try it").click()'
        )
    )
    print("TP")
    print(info["page"].url)
> assert info["page"].url == "https://www.w3schools.com/"
E AssertionError: assert 'https://www....sref_win_open' == 'https://www.w3schools.com/'
E - https://www.w3schools.com/
E + https://www.w3schools.com/jsref/tryit.asp?filename=tryjsref_win_open
tests/test_browser_env/test_script_browser_env.py:293: AssertionError
------------------------------------------------------ Captured stdout call ------------------------------------------------------
TP
https://www.w3schools.com/jsref/tryit.asp?filename=tryjsref_win_open
I see that there has been some recent activity here, and that this is actually a new test introduced in #23.
How can I resolve this?
Thanks
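In case it helps with debugging, here is a minimal Playwright sketch (not the repo's test code) to check whether clicking "Try it" actually opens a popup with your Playwright/browser versions; if the popup never opens, info["page"] stays on the tryit page, which matches the failure above.

# Hedged diagnostic sketch; assumes a sync Playwright `page` already on the w3schools tryit URL.
with page.expect_popup() as popup_info:
    page.frame_locator('iframe[name="iframeResult"]').get_by_role(
        "button", name="Try it"
    ).click()
print(popup_info.value.url)  # expected: https://www.w3schools.com/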
Does this mean I need to restart the Docker containers?
Following the instructions you provided, the URL for classifieds is set to "http://localhost:9980/". The home page loads correctly, but all the assets such as images, stylesheets, and links are served over "https://" instead of "http://", which results in ERR_CONNECTION_CLOSED. As a result, we get only a blank page because the resources are not retrieved properly, causing the agent to fail the tasks.
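A quick, hedged way to confirm the scheme mismatch (just a diagnostic, not a fix) is to list the asset URLs the homepage emits:

# If these come back as https://..., the site is generating asset links with the
# wrong scheme for a plain-http deployment.
curl -s http://localhost:9980/ | grep -o 'https://[^"]*' | head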
In the paper, you said you truncate the input text to 640 tokens for CogVLM, but this may discard important choices. Could you please share how you construct the prompt for CogVLM?
Thanks for open-sourcing such great work.
May I know how to use Apptainer instead of Docker for starting up the websites?
While running the evaluation for Classifieds (and also for Reddit), I get the error 'Page' object has no attribute 'client'.
The stack trace is shown below (this also happens for config_files/test_classifieds/211.json). Did you face this issue? Any suggestion to fix this is highly appreciated.
[Config file]: config_files/test_classifieds/117.json
[Unhandled Error] AttributeError("'Page' object has no attribute 'client'")
Traceback (most recent call last):
File "/home/pahuja.9/visualwebarena/run.py", line 396, in test
obs, _, terminated, _, info = env.step(action)
File "/home/pahuja.9/visualwebarena/browser_env/envs.py", line 307, in step
observation = self._get_obs()
File "<@beartype(browser_env.envs.ScriptBrowserEnv._get_obs) at 0x7f1a3d76e200>", line 10, in _get_obs
File "/home/pahuja.9/visualwebarena/browser_env/envs.py", line 226, in _get_obs
self.page, self.get_page_client(self.page)
File "<@beartype(browser_env.envs.ScriptBrowserEnv.get_page_client) at 0x7f1a3d76e050>", line 33, in get_page_client
File "/home/pahuja.9/visualwebarena/browser_env/envs.py", line 221, in get_page_client
return page.client # type: ignore
AttributeError: 'Page' object has no attribute 'client'
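For reference, a hedged sketch of a possible workaround (not necessarily the repo's intended fix): the env stores a CDP session on each page as page.client, so a page created outside the env's own setup (e.g. a popup or new tab) may be missing it; attaching one lazily avoids the AttributeError.

# Hedged sketch using the Playwright sync API; `page` is the current page.
client = getattr(page, "client", None)
if client is None:
    # Attach a Chrome DevTools Protocol session and cache it on the page.
    client = page.context.new_cdp_session(page)
    page.client = client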
I would like to modify the sites provided in the tar files by adding my own experiments to them as HTML code. Is there an easy way to do this?
For config_files/test_reddit/69.json, I get the following error from the LLM-based fuzzy match metric.
[Unhandled Error] AssertionError('n/a')
Traceback (most recent call last):
File "/home/pahuja.9/visualwebarena/run.py", line 412, in test
score = evaluator(
File "/home/pahuja.9/visualwebarena/evaluation_harness/evaluators.py", line 626, in __call__
cur_score = evaluator(trajectory, config_file, page, client)
File "<@beartype(evaluation_harness.evaluators.HTMLContentExactEvaluator.__call__) at 0x7f992c464790>", line 115, in __call__
File "/home/pahuja.9/visualwebarena/evaluation_harness/evaluators.py", line 472, in __call__
StringEvaluator.fuzzy_match(
File "<@beartype(evaluation_harness.evaluators.StringEvaluator.fuzzy_match) at 0x7f992c453e20>", line 69, in fuzzy_match
File "/home/pahuja.9/visualwebarena/evaluation_harness/evaluators.py", line 197, in fuzzy_match
return llm_fuzzy_match(pred, ref, intent)
File "<@beartype(evaluation_harness.helper_functions.llm_fuzzy_match) at 0x7f992c452cb0>", line 69, in llm_fuzzy_match
File "/home/pahuja.9/visualwebarena/evaluation_harness/helper_functions.py", line 609, in llm_fuzzy_match
assert "correct" in response, response
AssertionError: n/a
I am using the same LLM for fuzzy match as in the original code.
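For what it's worth, here is a hedged sketch of a more defensive parse (not the repo's exact logic in helper_functions.llm_fuzzy_match): treat any judge reply that doesn't contain "correct" as a failed match instead of raising.

# Hedged sketch: `response` is the judge model's raw reply, e.g. "N/A".
response = response.lower().strip()
if "incorrect" in response:
    score = 0.0
elif "correct" in response:
    score = 1.0
else:
    score = 0.0  # fall back instead of `assert "correct" in response`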
Hi,
A browser opens and then immediately closes, and I get this message:
[Unhandled Error] AssertionError('devicePixelRatio is not 1.0')].
Could you advise?
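A small hedged diagnostic (assuming the assertion comes from the env checking the browser's reported scaling) to see what value Chromium reports on your machine:

from playwright.sync_api import sync_playwright

# The env expects window.devicePixelRatio == 1.0; HiDPI displays or OS display
# scaling can make Chromium report a different value.
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("about:blank")
    print(page.evaluate("window.devicePixelRatio"))
    browser.close()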
Hi team,
Thanks for releasing this interesting work.
I have a question about the unit test file (test_action_functionalities.py).
Ideally, it should parse some text like: "textbox 'Full name'"
But this is what I actually get from create_playwright_action.
This is the corresponding HTML content for this part.
I followed the README for the env setup. I am using Ubuntu 22, Playwright 1.37.0, and Python 3.10.
Do you have any suggestions on this issue? Is it a Playwright version problem or a browser version problem?
Thanks a lot for your help
Hello,
I'm looking to reproduce some of the open-source model results from the VWA paper:
(1) Mixtral-8x7B model as the LLM backbone for Caption-augmented model
(2) CogVLM for the Multimodal Model.
Could someone share with me any flags/commands or instructions to setup these configurations for eval?
What is the tentative schedule for releasing multimodal agents that can operate on arbitrary websites?
Thanks to the authors for releasing the GPT4+SOM trajectories.
However, I do not see any way to tell which traces correspond to succeeding tasks vs. failing tasks. Can this information be released as well?
This was done in the WebArena repository while releasing the GPT execution traces: https://github.com/web-arena-x/webarena/tree/main/resources#1132023-execution-traces-from-our-experiments-v2
The evaluation script https://github.com/web-arena-x/visualwebarena/blob/main/scripts/run_reddit_som.sh goes up to index 208 rather than 209, as it should for 210 total examples. Similarly, https://github.com/web-arena-x/visualwebarena/blob/main/scripts/run_classifieds_som.sh goes up to 232 rather than 233, as it should for 234 examples. It is an easy fix; I just want to confirm.
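To illustrate the off-by-one (purely illustrative; the actual scripts may structure their loops differently): with 210 examples numbered 0-209, an inclusive loop has to end at 209.

# Hypothetical sketch, not the repo's script.
for i in $(seq 0 209); do
  echo "config_files/test_reddit/${i}.json"
done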
Hello,
I hope this message finds you well. I encountered an issue with the download links for the WebArena environment images hosted on Archive.org. The links appear to be broken and display an error message regarding metadata issues.
Affected Links:
Shopping Website Image
Wikipedia Website Image
Could you please look into this and provide updated links or fix the current ones?
Thank you very much for your assistance!
I found some errors in the annotations.
In the classifieds_10:
sites: ['classifieds']
task_id: 10
require_login: True
storage_state: ./.auth/classifieds_state.json
start_url: http://localhost:9980
geolocation: None
intent_template: What is the {{attribute}} of {{item}}?
intent: What is the seat height in inches of the smaller piece of furniture on this page?
image: None
instantiation_dict: {'attribute': 'seat height in inches', 'item': 'the smaller piece of furniture on this page'}
require_reset: False
eval: {'eval_types': ['string_match'], 'reference_answers': {'exact_match': '21'}, 'reference_url': 'http://localhost:9980/index.php?page=item&id=43887', 'program_html': [], 'string_note': '', 'reference_answer_raw_annotation': ''}
reasoning_difficulty: easy
visual_difficulty: easy
overall_difficulty: easy
comments:
intent_template_id: 5
The output is 21 inches, which I think is correct.
In classifieds 142, the agent found the wrong things in the GPT-4V trace, but it is evaluated as correct.
Hello!
I'm having issues with the shopping website where items won't display in search or in the catalog. From the admin panel, I believe this is because one of the indexers is invalid, which in turn is due to OpenSearch/Elasticsearch not working.
Testing Elasticsearch in the admin panel throws the error "Class "" does not exist" even when localhost is running Elasticsearch on 9200, or "No Alive Nodes Found" when either Elasticsearch or OpenSearch is used. I was therefore hoping you could provide more information about how the search feature is configured within VisualWebArena, as well as how one might link the two together.
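As a hedged first check (the container name and port here are assumptions based on the default setup, so adjust them to your deployment), you can see whether a search backend is reachable from inside the shopping container rather than from the host:

# If this fails inside the container, Magento's indexers cannot reach the
# search backend even if the host has one listening on 9200.
docker exec shopping curl -s http://localhost:9200 || echo "no search backend reachable from inside the container"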
Hi, I ran the following commands in the environment readme to install classifieds environment, but encountered an OSClass Error:
unzip classifieds_docker_compose.zip
cd classifieds_docker_compose
vi classifieds_docker_compose/docker-compose.yml # Set CLASSIFIEDS to your site url `http://<your-server-hostname>:9980/`, and change the reset token if required
docker compose up --build -d
# Wait for compose up to finish. This may take a while on the first launch as it downloads several large images from dockerhub.
docker exec classifieds_db mysql -u root -ppassword osclass -e 'source docker-entrypoint-initdb.d/osclass_craigslist.sql' # Populate DB with content
Screenshot:
However, when I ran docker exec classifieds_db mysql -u root -ppassword osclass -e "SHOW TABLES;"
to query the database tables, everything looked fine.
Could you help me resolve this? Thanks!
Hello,
Could you please share some of the configuration settings to reproduce the various model types?
I tried to reproduce the caption-augmented setup (Acc Tree + Caps), but my value was closer to the Multimodal result that also takes the image screenshot as input. I'm hoping to get clarification on how to switch between the 4 modes.
Here are my configurations (a run.py invocation sketch follows the list):
(1) Text-Only
observation_type: accessibility_tree
action_set_tag: id_accessibility_tree
(2) Caption-Augmented
observation_type: accessibility_tree_with_captioner
action_set_tag: id_accessibility_tree
(3) Multimodal
observation_type: ???
action_set_tag: id_accessibility_tree
(4) Multimodal (SoM)
observation_type: image_som
action_set_tag: som
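For reference, a hedged sketch of how settings (2) would translate into a run.py invocation, reusing flags that appear in the GPT-4V SoM command earlier in this page; the model and captioning flags are omitted/assumed and may need adjusting for your setup:

python run.py \
  --observation_type accessibility_tree_with_captioner \
  --action_set_tag id_accessibility_tree \
  --test_config_base_dir config_files/test_classifieds \
  --result_dir <your_result_dir>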
CogVLM only supports a single image as input. How do you evaluate CogVLM when there are multiple images as input? Thank you.
When the dataset is "visualwebarena", it seems that there is no GITLAB in env_config
For example, when searching for "blue kayak" http://ec2-3-13-232-171.us-east-2.compute.amazonaws.com:9980/index.php?page=search&sOrder=dt_pub_date&iOrderType=desc&sPattern=blue+kayak
The results show nothing related to a kayak at all (and none of them are red).
Thank you for your work. I can reach the homepage at http://127.0.0.1:4399 but not at www.homepage.com. Do I have to change the host of my server to navigate to www.homepage.com?
Hello,
I was wondering whether all of the website pages are included in the Google Drive downloads for the VisualWebArena environment setup? For One Stop Shop, it only displays a total of 24 items (and no items under the category tabs), at least for me.
Hi, I am trying to setup the classifieds website as outlined here https://github.com/web-arena-x/visualwebarena/blob/main/environment_docker/README.md#classifieds-website
When I execute docker exec classifieds_db mysql -u root -ppassword osclass -e 'source docker-entrypoint-initdb.d/osclass_craigslist.sql' # Populate DB with content
, I get the error ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2).
Since Docker is not affected by my local environment, it should ideally run fine. Kindly help me resolve this; a possible workaround sketch is included after the compose file below.
My docker-compose.yml is given below:
version: '3.1'
services:
  web:
    image: jykoh/classifieds:latest
    ports:
      - "9980:9980"
    depends_on:
      - db
    container_name: classifieds
    environment:
      - CLASSIFIEDS=http://127.0.0.1:9980/
      - RESET_TOKEN=4b61655535e7ed388f0d40a93600254c
  db:
    image: mysql:8.1
    restart: always
    environment:
      MYSQL_ROOT_PASSWORD: password
      MYSQL_DATABASE: osclass
    volumes:
      - ./mysql:/docker-entrypoint-initdb.d
      - db_data:/var/lib/mysql
    container_name: classifieds_db
volumes:
  db_data: {}
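One hedged workaround for the ERROR 2002 above (assuming the DB container simply hasn't finished initializing when the exec runs) is to wait until mysqld accepts connections before loading the dump:

# Poll mysqld inside the container, then populate the DB.
until docker exec classifieds_db mysqladmin -u root -ppassword ping --silent; do
  sleep 5
done
docker exec classifieds_db mysql -u root -ppassword osclass -e 'source docker-entrypoint-initdb.d/osclass_craigslist.sql'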
'''
Processing config_files/test_classifieds\5.json
2024-03-28 22:25:05,874 - INFO - [Config file]: config_files/test_classifieds\5.json
2024-03-28 22:25:05,875 - INFO - [Intent]: Navigate to my listing of the white car and delete it.
2024-03-28 22:25:06,181 - INFO - [Unhandled Error] InvalidSchema("No connection adapters were found for '127.0.0.1:9980/index.php?page=reset'")]
Processing config_files/test_classifieds\6.json
2024-03-28 22:25:06,182 - INFO - [Config file]: config_files/test_classifieds\6.json
2024-03-28 22:25:06,183 - INFO - [Intent]: Return the links of the 3 most recent motorcycles within $1000 to $2000 that are not orange.
Start testing config_files/test_classifieds\6.json
Finish testing config_files/test_classifieds\6.json
2024-03-28 22:27:15,567 - INFO - [Unhandled Error] LookupError("\n**********************************************************************\n Resource \x1b[93mpunkt\x1b[0m not found.\n Please use the NLTK Downloader to obtain the resource:\n\n \x1b[31m>>> import nltk\n >>> nltk.download('punkt')\n \x1b[0m\n For more information see: https://www.nltk.org/data.html\n\n Attempted to load \x1b[93mtokenizers/punkt/english.pickle\x1b[0m\n\n Searched in:\n - 'C:\\Users\\PS/nltk_data'\n - 'C:\\Users\\PS\\Desktop\\visualwebarena\\venv\\nltk_data'\n - 'C:\\Users\\PS\\Desktop\\visualwebarena\\venv\\share\\nltk_data'\n - 'C:\\Users\\PS\\Desktop\\visualwebarena\\venv\\lib\\nltk_data'\n - 'C:\\Users\\PS\\AppData\\Roaming\\nltk_data'\n - 'C:\\nltk_data'\n - 'D:\\nltk_data'\n - 'E:\\nltk_data'\n - ''\n**********************************************************************\n")]
Processing config_files/test_classifieds\7.json
2024-03-28 22:27:15,570 - INFO - [Config file]: config_files/test_classifieds\7.json
2024-03-28 22:27:15,570 - INFO - [Intent]: Return the links of the 2 most recent items in the "Cell phones" category within $300 to $600 that are white in color.
Start testing config_files/test_classifieds\7.json
Finish testing config_files/test_classifieds\7.json
2024-03-28 22:28:47,585 - INFO - [Unhandled Error] LookupError("\n**********************************************************************\n Resource \x1b[93mpunkt\x1b[0m not found.\n Please use the NLTK Downloader to obtain the resource:\n\n \x1b[31m>>> import nltk\n >>> nltk.download('punkt')\n \x1b[0m\n For more information see: https://www.nltk.org/data.html\n\n Attempted to load \x1b[93mtokenizers/punkt/english.pickle\x1b[0m\n\n Searched in:\n - 'C:\\Users\\PS/nltk_data'\n - 'C:\\Users\\PS\\Desktop\\visualwebarena\\venv\\nltk_data'\n - 'C:\\Users\\PS\\Desktop\\visualwebarena\\venv\\share\\nltk_data'\n - 'C:\\Users\\PS\\Desktop\\visualwebarena\\venv\\lib\\nltk_data'\n - 'C:\\Users\\PS\\AppData\\Roaming\\nltk_data'\n - 'C:\\nltk_data'\n - 'D:\\nltk_data'\n - 'E:\\nltk_data'\n - ''\n**********************************************************************\n")]
'''
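The LookupError in the log above is NLTK asking for its "punkt" tokenizer data; the standard one-time download it suggests should resolve it:

import nltk

# One-time download of the tokenizer data the evaluator needs.
nltk.download('punkt')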
Thank you for your work. I reset Reddit by running
bash ./scripts/reset_reddit.sh
Then I ran prepare.sh, but I always get the following error:
Traceback (most recent call last):
File "code/vwa/browser_env/auto_login.py", line 182, in <module>
main()
File "code/vwa/browser_env/auto_login.py", line 173, in main
assert not future.result(), f"Cookie {cookie_files[i]} expired."
AssertionError: Cookie ./.auth/reddit_state.json expired.
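A hedged suggestion (auto_login.py is the script named in the traceback; I haven't verified its exact flags, so check its argparse options): delete the stale cookie files and regenerate them after resetting Reddit, then re-run prepare.sh.

# Assumes the defaults write the cookies back into ./.auth/.
rm -f ./.auth/*.json
python browser_env/auto_login.py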
I am trying to view one of the trace files, 463.trace.zip. Here are the commands I am using:
unzip 463.trace.zip -d 463_trace
xvfb-run playwright show-trace 463_trace
It has been sitting for several hours. Is this expected, or is there a better way to view the trace?
Thanks!
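In case it's useful: playwright show-trace launches a GUI viewer, so under xvfb-run there is nothing to display and it just waits. On a machine with a display you can pass the zip directly (no unzip needed), or load the zip in the hosted viewer at https://trace.playwright.dev:

playwright show-trace 463.trace.zip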
I've been able to host the shopping website successfully but noticed that running scripts/reset_shopping.sh causes the shopping website to clear all items not on the homepage, i.e. all the categories no longer have items. Specifically, it appears the command
docker exec $CONTAINER_NAME /var/www/magento2/bin/magento setup:store-config:set --base-url="http://localhost:7770" # no trailing slash
is causing this issue.
I was wondering what this command does in the context of the repo, and if it's safe to remove? If not, do you know what about this command could be causing the issue? I am currently hosting the shopping website on http://127.0.0.1:7770.
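For context, a hedged note: setup:store-config:set --base-url rewrites Magento's base URL, so if the site is actually served at http://127.0.0.1:7770, the script's hard-coded http://localhost:7770 would mismatch what the storefront links to. A sketch of pointing it at the actual host and flushing caches (adjust to your deployment; this is not necessarily the repo's intended fix):

docker exec $CONTAINER_NAME /var/www/magento2/bin/magento setup:store-config:set --base-url="http://127.0.0.1:7770"
docker exec $CONTAINER_NAME /var/www/magento2/bin/magento cache:flush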
On the Classifieds and Reddit tasks, there are multiple image links that do not exist. An example of such errors is as follows:
L616 WARNING: cannot identify image file <_io.BytesIO object at 0x7f8b5c2f1ee0>
Why do you have to remove the duplicated content? Does this filter out cases where two elements have the same text?
if content in text_content_text:
    # Remove text_content_elements with content
    text_content_elements = [
        element
        for element in text_content_elements
        if element.strip() != content
    ]
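A small illustrative example (with hypothetical values) of what the filter above does: every element whose stripped text equals content is dropped, so yes, two distinct elements that happen to share the exact same text would both be removed.

# Hypothetical inputs for illustration only.
text_content_elements = ["[1] Add to Cart", "Add to Cart", "Add to Cart", "Checkout"]
text_content_text = " ".join(text_content_elements)
content = "Add to Cart"
if content in text_content_text:
    text_content_elements = [
        element
        for element in text_content_elements
        if element.strip() != content
    ]
print(text_content_elements)  # ['[1] Add to Cart', 'Checkout']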