fuji-web's Issues

Prepare demos

  • demo candidates
  • ensure a smooth workflow on the targeted website
  • record video
  • get feedback

Advanced agent control

add new settings:

  • choose to use non-vision mode even when the selected model supports the vision API
  • in vision mode, choose not to include the annotated screenshot (send the clean screenshot only)
  • use different audio input / output

Improve README

  • icon
  • banner
  • one line description
  • cool demo video #58
  • getting started/installation & run
  • roadmap
  • credit

Support scrolling portion of web page

As of now, WebWand doesn't scroll well when the scrollable area is only part of the page (e.g. a dialog with long content) rather than the whole body of the page.

One thought is that we can provide all scrollable elements on the page and ask for an ID when the agent wants to scroll. But it might be tricky to tell the agent what those wrappers are, because unlike buttons and inputs they are often just plain <div> elements styled with CSS.

Maybe we need a subagent just for this. In that case we can send a separate screenshot to indicate where each scrollable portion is. A sketch of how scrollable elements could be enumerated is below.
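A minimal sketch of enumerating scrollable containers from the content script; the heuristic (an overflow style of auto/scroll plus actual overflow) and the function name are assumptions, not the existing implementation:

```typescript
// Collect elements that can actually scroll, so the agent can be offered their IDs.
function findScrollableElements(): HTMLElement[] {
  const result: HTMLElement[] = [];
  for (const el of Array.from(document.querySelectorAll<HTMLElement>("*"))) {
    const { overflowY } = getComputedStyle(el);
    const scrollableStyle = overflowY === "auto" || overflowY === "scroll";
    // Only keep wrappers whose content really overflows their visible box.
    if (scrollableStyle && el.scrollHeight > el.clientHeight) {
      result.push(el);
    }
  }
  return result;
}
```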

Should not click button that's disabled

Describe the bug
When a button is disabled, web-wand still clicks it. This happens when I use GPT-4 Vision. It might be because it cannot distinguish between active and disabled buttons.

To Reproduce
Steps to reproduce the behavior:

  1. Run web-wand and ask it to comment "test2" in any GitHub issue.

Expected behavior
As shown in the screenshot below, the comment button is disabled while the comment is empty, so web-wand should fill in the content before clicking the button. It should capture the information that the button is currently disabled.

Screenshots
(screenshot omitted)

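One way to surface this is to include the disabled state in the element annotations the model sees. A minimal sketch, assuming a hypothetical annotation step (the helper names are illustrative, not the existing code):

```typescript
// Report whether an element is effectively disabled, via the native attribute or ARIA.
function isDisabled(el: HTMLElement): boolean {
  return (
    (el as HTMLButtonElement).disabled === true ||
    el.getAttribute("aria-disabled") === "true"
  );
}

// Append the disabled state to the label so the model can avoid clicking such buttons.
function annotateLabel(el: HTMLElement, label: string): string {
  return isDisabled(el) ? `${label} (disabled)` : label;
}
```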

Audio streaming

Implement Audio Streaming in Webwand Extension

Problem:

The Webwand Chrome extension currently requires generating the full audio blob from website content before playback starts, causing delays and reducing efficiency, particularly with longer texts.

Desired Solution:

Implement real-time audio streaming to allow immediate playback as text is converted to speech, improving user experience by providing instant auditory feedback without waiting for complete text processing.
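A rough sketch of one way to get incremental playback, assuming a hypothetical textToSpeech helper that wraps the existing TTS call and returns an audio Blob; instead of waiting for one big blob, the text is split into sentences and each chunk is synthesized while the previous one plays:

```typescript
// Assumed helper: wraps the existing TTS request and returns an audio Blob for one chunk.
declare function textToSpeech(text: string): Promise<Blob>;

// Play a single audio blob and resolve when it finishes.
function playBlob(blob: Blob): Promise<void> {
  return new Promise((resolve) => {
    const audio = new Audio(URL.createObjectURL(blob));
    audio.onended = () => resolve();
    void audio.play();
  });
}

async function speakIncrementally(text: string): Promise<void> {
  // Naive sentence split so playback can start after the first sentence is ready.
  const chunks = text.match(/[^.!?]+[.!?]*/g) ?? [text];
  let playback: Promise<void> = Promise.resolve();
  for (const chunk of chunks) {
    const blob = await textToSpeech(chunk); // synthesize the next chunk while the previous one plays
    playback = playback.then(() => playBlob(blob));
  }
  await playback;
}
```

True streaming (e.g. consuming a chunked TTS response) would reduce latency further, but the sentence-level queue above is a much smaller change.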

Enhance system prompt

Describe the bug
For a simple task like "draft an email to reply susan and send", even GPT-4 Vision produces a wrongly formatted action. We need to add at least one example for each action in the system prompt.

To Reproduce

  1. Run web wand on Chrome.
  2. Open an email in the browser, then put "draft an email to reply susan and send" in the WebWand input box.

Expected behavior
Web wand should produce a correctly formatted action.

Screenshots
(screenshot omitted)

Desktop:

  • Browser: Chrome

More robust typing implementation

Currently the setValue tool does the following:

  • click on the input (or the label)
  • select all text
  • type

This usually works well, but some web pages are set up to prevent selecting the input.
Instead, we can send the "DeleteToBeginningOfParagraph" and "DeleteToEndOfParagraph" commands to delete in both directions, as sketched below.
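A sketch of that double-delete idea, assuming the tab is already attached via chrome.debugger (as the existing tools are); the exact editing-command names Chrome accepts may differ from the ones quoted above:

```typescript
// Clear the focused field by deleting backward and forward from the caret, so it is
// emptied even when the page blocks "select all".
async function clearFocusedInput(tabId: number): Promise<void> {
  await chrome.debugger.sendCommand({ tabId }, "Input.dispatchKeyEvent", {
    type: "rawKeyDown",
    commands: ["DeleteToBeginningOfParagraph", "DeleteToEndOfParagraph"],
  });
}
```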

[Idea] provide more information on action history

We can try to detect the following and add this information to the prompt:

  • the agent tried to interact with an element that does not exist
  • nothing visibly changed after the last action (by comparing screenshots locally; see the sketch below)
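A minimal sketch of the "nothing visibly changed" check, assuming screenshots are captured as data URLs (e.g. via chrome.tabs.captureVisibleTab); an exact string comparison is the crudest possible diff and could later be swapped for a perceptual hash:

```typescript
// Two identical data URLs mean the visible tab did not change at all.
function nothingVisiblyChanged(before: string, after: string): boolean {
  return before === after;
}

// Append a note to the action history that gets fed back into the prompt.
function annotateHistory(history: string[], before: string, after: string): void {
  if (nothingVisiblyChanged(before, after)) {
    history.push("Note: the page did not visibly change after the last action.");
  }
}
```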

Cleanup: after refactor

  1. src/common and src/state should be under src/shared
  2. most of the current src/shared is not used

bug: Incorrect syntax when using setValue tool

There's a chance the model responds with an incorrect ' inside of setValue, for example:
(screenshot omitted)

This does not throw, but typing stops unexpectedly.
(screenshot omitted)

Possible solutions:

  1. in the prompt, remind the LLM about this issue
  2. catch this error during parsing, then retry (see the sketch below)
  3. use a different response format
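A sketch of option 2, assuming hypothetical callModel and parseAction helpers (the real parsing lives elsewhere in the codebase); on a parse failure the error is fed back and the request is retried:

```typescript
// Assumed helpers: send the prompt to the LLM, and parse its response into an action.
declare function callModel(prompt: string): Promise<string>;
declare function parseAction(response: string): { name: string; args: unknown };

async function getActionWithRetry(prompt: string, maxAttempts = 2) {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const response = await callModel(prompt);
    try {
      return parseAction(response);
    } catch (error) {
      lastError = error;
      // Combine options 1 and 2: remind the model of the formatting problem, then retry.
      prompt +=
        `\nYour previous response could not be parsed (${String(error)}). ` +
        "Escape any quotes inside setValue arguments.";
    }
  }
  throw lastError;
}
```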

Clicks the input but forgets to set the value

Describe the bug
Sometimes webwand clicks the input but forgets to set the value. We need to enforce a (click + setValue) sequence for any input operation.

Expected behavior
If there's an input task, the agent must use setValue, or click + setValue.

Screenshots
(screenshot omitted)

Support more tools for complex tasks

Depends on #53

This would enable interesting workflows like "post a meme picture about workaholic in this slack chat"

New tools like

New memory

  • observed pages and their tab id

Note: before working on this, create individual issues and use this one as the parent.

Support the text agent

Describe the bug
If the user selects a model that does not support vision, like gpt-3.5, then the agent cannot find the element on the website and gets stuck due to "Unable to find element with selector: ".

To Reproduce
Steps to reproduce the behavior:

  1. Start the web wand service and open the Chrome extension.
  2. Start any task.

Expected behavior
A correct selector should be extracted from the action, and getObjectIdBySelector should succeed by finding that element.

Screenshots
(screenshot omitted)

Desktop:

  • Browser: Chrome

Additional context
This is because the old Taxy agent is not fully supported (the current agent is built to support GPT-4 Vision). We need to support the old Taxy agent by retrieving the correct id from the simplified DOM, updating the available-actions prompt and system prompt, updating the label attribute constant, etc.

alert user when the website content hits token limit

Describe the bug
When the website content exceeds the token limit of a text-based GPT model like 3.5 or 4.0, webwand throws an error.

Expected behavior
Show an alert window to let the user know that the token limit was exceeded (see the sketch below).
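A rough sketch of warning before the request is sent, using a crude characters-per-token estimate; a real tokenizer would be more accurate, and the function names here are illustrative:

```typescript
// Very rough rule of thumb for English text; good enough to warn, not to bill.
const APPROX_CHARS_PER_TOKEN = 4;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / APPROX_CHARS_PER_TOKEN);
}

// Warn the user up front instead of letting the API call fail with an opaque error.
function warnIfOverLimit(pageContent: string, modelTokenLimit: number): void {
  if (estimateTokens(pageContent) > modelTokenLimit) {
    alert(
      "The page content likely exceeds the selected model's token limit. " +
        "Try a model with a larger context window.",
    );
  }
}
```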

Add typing speed argument

In some cases we would like to be able to control the speed of typing into a box; for example, it may be desirable to type very long messages faster than short ones. A sketch of such an argument follows.
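A minimal sketch of a per-character delay, assuming a hypothetical typeCharacter helper that sends a single key to the focused element; the parameter name is illustrative:

```typescript
// Assumed helper: dispatch one character to the currently focused element.
declare function typeCharacter(char: string): Promise<void>;

// Type the text with a configurable pause between characters; callers could pass a
// smaller delayMs for very long messages so they finish in a reasonable time.
async function typeText(text: string, delayMs = 50): Promise<void> {
  for (const char of text) {
    await typeCharacter(char);
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
}
```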

lookup interactive elements with CSS

Currently, we query interactive elements using HTML semantics. Some websites do not follow those rules. For example, I've seen this kind of button:

<div class="btn_primary inline-flex cursor-pointer items-center gap-1 text-center font-semibold">New Job</div>

They are currently ignored by WebWand.

However, if we traverse the entire DOM, we can find these elements by checking whether the computed CSS style has cursor: pointer, and assume such elements are also interactive.

Depending on the website, this might get more things done, but this can also sometimes mark some unwanted things as interactive and confuse the agent.

We might make this an option in settings, or allow it as part of a per-website rule. A sketch of the heuristic follows.
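A minimal sketch of the cursor-based heuristic, run from the content script; the selector used to skip elements the semantic pass already finds is an assumption:

```typescript
// Find elements styled as clickable (cursor: pointer) that the semantic query misses.
function findCursorPointerElements(): HTMLElement[] {
  const found: HTMLElement[] = [];
  for (const el of Array.from(document.querySelectorAll<HTMLElement>("*"))) {
    // Skip elements that the existing semantic lookup already treats as interactive.
    if (el.matches("a, button, input, select, textarea, [role='button']")) continue;
    if (getComputedStyle(el).cursor === "pointer") {
      found.push(el);
    }
  }
  return found;
}
```

Since parents of a clickable element often also compute cursor: pointer, a real implementation would probably keep only the innermost matching element.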

Provide more information for long-pending state in UI

In some cases, the program can be waiting for things and nothing will show in the UI. For example:

  • wait for page to be (fully) loaded
  • wait for API response

It doesn't happen often, but when a wait takes more than 5 seconds, we should notify the user in the UI (see the sketch below).
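A minimal sketch of surfacing a long wait, assuming a hypothetical setStatus UI callback; the 5-second threshold follows the note above:

```typescript
// Wrap any long-running promise; if it is still pending after thresholdMs, show a status
// message, and clear it once the promise settles.
async function withPendingNotice<T>(
  label: string,
  task: Promise<T>,
  setStatus: (message: string | null) => void,
  thresholdMs = 5000,
): Promise<T> {
  const timer = setTimeout(() => setStatus(`Still ${label}...`), thresholdMs);
  try {
    return await task;
  } finally {
    clearTimeout(timer);
    setStatus(null);
  }
}
```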

feature: Human-in-the-loop

Allow the agent to proactively pause and wait for the user's input (an additional prompt, or actions taken on the page such as entering a username and password), then continue the task.
For simplicity we can ask the user to click a button to continue, as sketched below.
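A minimal sketch of the "pause until the user clicks continue" idea, assuming a hypothetical showContinueButton UI hook; the agent loop would await this before its next step:

```typescript
// Resolve only after the user clicks the "Continue" button rendered by the UI.
function waitForUserToContinue(
  showContinueButton: (onClick: () => void) => void,
): Promise<void> {
  return new Promise((resolve) => {
    showContinueButton(() => resolve());
  });
}
```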

Settings: more audio options

Current: window.SpeechRecognition for input + OpenAI audio for output

We want to allow customizing each of them and offer different options such as Azure AI Speech.

Add more actions for agent

  • wait (to handle cases when some requests are taking a long time)
  • hover over (some websites put buttons in popups shown only when hovering)
  • scroll directly to top or bottom of page
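A minimal sketch of what these three actions could look like in the content script; the names and grouping are illustrative, not the existing tool definitions:

```typescript
// wait: pause for a given number of milliseconds (e.g. while a slow request finishes).
async function wait(ms: number): Promise<void> {
  await new Promise((resolve) => setTimeout(resolve, ms));
}

// hover: dispatch pointer/mouse events so hover-only popups and menus become visible.
function hover(element: HTMLElement): void {
  for (const type of ["pointerover", "mouseover", "mouseenter"]) {
    element.dispatchEvent(new MouseEvent(type, { bubbles: true }));
  }
}

// scrollToEdge: jump straight to the top or bottom of the page.
function scrollToEdge(edge: "top" | "bottom"): void {
  window.scrollTo({
    top: edge === "top" ? 0 : document.body.scrollHeight,
    behavior: "smooth",
  });
}
```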

Release automation

We should be able to use a script to prepare a new release (a .zip file that contains the extension), and optionally publish it on GitHub. A sketch follows.
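A rough sketch of such a script, assuming the built extension lives in dist/, the zip CLI is available, and the version comes from package.json; all of those details are assumptions:

```typescript
// scripts/build-release.ts -- zip the unpacked extension so it can be attached to a release.
import { execSync } from "node:child_process";
import { readFileSync } from "node:fs";

const { version } = JSON.parse(readFileSync("package.json", "utf8"));
const archive = `fuji-web-${version}.zip`;

// Zip the contents of dist/ (not the folder itself) into the archive at the repo root.
execSync(`cd dist && zip -r ../${archive} .`, { stdio: "inherit" });
console.log(`Created ${archive}`);

// Publishing could then be a separate step, e.g. with the GitHub CLI:
//   gh release create v${version} ${archive}
```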

Add Non-vision model support

Use a simplified DOM string instead of a screenshot in the agent. Currently the old Taxy agent is not fully supported. If the user selects a model that does not support vision, like gpt-3.5, then the agent cannot find the element on the website and gets stuck due to "Unable to find element with selector: ".

We need to support the old Taxy agent by retrieving the correct id from the simplified DOM, updating the available-actions prompt and system prompt, updating the label attribute constant, etc.

Export Actions

Extract a lib for "Web Wand Actions" (click, type, etc.) for other packages to use

Optimize knowledge in UI for easier understanding

  • only show notes instead of entire json
  • temporarily remove copying & creating with json
  • remove regex -- all notes from user settings match all urls under the current host
  • rename notes to "instructions"
  • show "instructions for this website" in main UI

Host knowledge DB on the internet

  • DB host
  • update extension to fetch from host
  • allow extension to fetch and merge multiple knowledge bases
  • UI for updating knowledge + validation
  • Auth
  • UI for forking
  • UI for submitting merge request to main DB

As of now, knowledge is stored here as JSON files: https://github.com/normal-computing/web-wand/tree/main/src/helpers/knowledge

We want to create an independent host for it, so that it can be dynamically updated for every user.
At the same time, we should create a web UI so that users can log in to create and update their own rulesets. They can also request that their ruleset be merged into the base ruleset.

settings view

As we add features to this product, we should create a settings view (instead of just one option to reset the OpenAI key).

The settings view should be used to:

  • manage OpenAI key
  • choose an LLM
  • turn on/off audio mode

In the future it can also be used to:

  • manage Claude key (go with #55)
  • manage knowledge bases (#54)

Custom knowledge base in settings

dependency: #56

motivation

Low-effort workaround for #54 since that is unlikely to be finished before open sourcing.
This is also nice to keep after #54 is done, since it provides a nice way to test things locally.

what

Add a textarea in the settings view. If the data structure is correct, use it to enhance the prompting (a validation sketch follows).
Add a description to guide users.
Optionally: build a UI for that purpose so that people can copy-paste the JSON instead of manually crafting it.
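A rough sketch of validating the pasted JSON before it is used for prompting; the shape of a knowledge entry here is an assumption, not the actual schema:

```typescript
// Assumed shape of one pasted knowledge entry; the real schema lives in src/helpers/knowledge.
interface CustomKnowledgeEntry {
  host: string;     // e.g. "github.com"
  notes: string[];  // free-form instructions for that site
}

// Return the parsed entries, or null when the JSON is invalid or has the wrong shape,
// so the settings view can show a validation hint instead of silently using bad data.
function parseCustomKnowledge(raw: string): CustomKnowledgeEntry[] | null {
  try {
    const data = JSON.parse(raw);
    if (!Array.isArray(data)) return null;
    const valid = data.every(
      (entry) => typeof entry?.host === "string" && Array.isArray(entry?.notes),
    );
    return valid ? (data as CustomKnowledgeEntry[]) : null;
  } catch {
    return null;
  }
}
```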

Handle typing when no input element is focused (expedia fake input)

There are cases where the agent decides to type into an "input" on a page, but it turns out to be a fake one. For example, on Expedia, this "Date" label looks very much like an input.
(screenshot omitted)

However, no input is focused after clicking it. Instead, it opens a dialog that expects button clicks.

(screenshot omitted)

Typing in this case is risky because it can trigger keyboard shortcuts on the website; at minimum, "space" would cause the browser to scroll to the next screen in Chrome.

We need to think of a way to handle this problem. One option is to check what has focus before typing, as sketched below.
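A minimal sketch of guarding the type action from the content script: after the click, only send keystrokes when something editable actually has focus, and otherwise report back to the agent instead of typing:

```typescript
// True when the currently focused element can actually receive typed text.
function hasEditableFocus(): boolean {
  const el = document.activeElement;
  if (!(el instanceof HTMLElement)) return false;
  return (
    el instanceof HTMLInputElement ||
    el instanceof HTMLTextAreaElement ||
    el.isContentEditable
  );
}
```

If hasEditableFocus() is false after the click, the setValue step could be aborted and a note added to the action history (e.g. "clicking this element did not focus an input"), so stray keystrokes never trigger page shortcuts or scrolling.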

Recommend use cases

When no action is in progress, show some recommended use-case buttons, like ChatGPT does.

(screenshot omitted)

Advanced Voice Support

Status Quo:
When in voice mode, webwand reads the instruction and action history out loud.

Expected
Webwand should also read out its observation of the currently active tab when the user explicitly asks it to do so.
