fuji-web's Issues

Prepare demos

  • demo candidates
  • ensure a smooth workflow on the targeted website
  • record video
  • get feedback

Advanced agent control

add new settings:

  • choose to use non-vision mode even when the selected model supports the vision API
  • in vision mode, choose not to include the annotated screenshot (send the clean screenshot only)
  • use different audio input / output

Improve README

  • icon
  • banner
  • one line description
  • cool demo video #58
  • getting started/installation & run
  • roadmap
  • credit

Support scrolling portion of web page

As of now, WebWand doesn't scroll well when the scrollable area is only part of the page (e.g. a dialog with long content) rather than the whole body of the page.

One thought is that we can provide all scrollable elements on the page and ask for an ID when the agent wants to scroll. But it might be tricky to tell the agent what those wrappers are, because unlike buttons and inputs they are often just plain <div> elements styled with CSS.

Maybe we need a subagent just for this. In that case we can send a separate screenshot to indicate where each scrollable portion is. A sketch of how scrollable elements could be enumerated is below.
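A minimal sketch of enumerating scrollable containers from the content script; the heuristic (an overflow style of auto/scroll plus actual overflow) and the function name are assumptions, not the existing implementation:

```typescript
// Collect elements that can actually scroll, so the agent can be offered their IDs.
function findScrollableElements(): HTMLElement[] {
  const result: HTMLElement[] = [];
  for (const el of Array.from(document.querySelectorAll<HTMLElement>("*"))) {
    const { overflowY } = getComputedStyle(el);
    const scrollableStyle = overflowY === "auto" || overflowY === "scroll";
    // Only keep wrappers whose content really overflows their visible box.
    if (scrollableStyle && el.scrollHeight > el.clientHeight) {
      result.push(el);
    }
  }
  return result;
}
```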

Should not click button that's disabled

Describe the bug
When a button is disabled, web-wand still clicks it. This happens when I use GPT-4 Vision. It might be because it cannot distinguish between active and disabled buttons.

To Reproduce
Steps to reproduce the behavior:

  1. Run web-wand and ask it to comment "test2" in any GitHub issue.

Expected behavior
As shown in the screenshot below, the comment button is disabled while the comment is empty, so web-wand should fill in the content before clicking the button. It should capture the information that the button is currently disabled.

Screenshots
(screenshot omitted)

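One way to surface this is to include the disabled state in the element annotations the model sees. A minimal sketch, assuming a hypothetical annotation step (the helper names are illustrative, not the existing code):

```typescript
// Report whether an element is effectively disabled, via the native attribute or ARIA.
function isDisabled(el: HTMLElement): boolean {
  return (
    (el as HTMLButtonElement).disabled === true ||
    el.getAttribute("aria-disabled") === "true"
  );
}

// Append the disabled state to the label so the model can avoid clicking such buttons.
function annotateLabel(el: HTMLElement, label: string): string {
  return isDisabled(el) ? `${label} (disabled)` : label;
}
```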

Audio streaming

Implement Audio Streaming in Webwand Extension

Problem:

The Webwand Chrome extension currently requires generating the full audio blob from website content before playback starts, causing delays and reducing efficiency, particularly with longer texts.

Desired Solution:

Implement real-time audio streaming to allow immediate playback as text is converted to speech, improving user experience by providing instant auditory feedback without waiting for complete text processing.
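A rough sketch of one way to get incremental playback, assuming a hypothetical textToSpeech helper that wraps the existing TTS call and returns an audio Blob; instead of waiting for one big blob, the text is split into sentences and each chunk is synthesized while the previous one plays:

```typescript
// Assumed helper: wraps the existing TTS request and returns an audio Blob for one chunk.
declare function textToSpeech(text: string): Promise<Blob>;

// Play a single audio blob and resolve when it finishes.
function playBlob(blob: Blob): Promise<void> {
  return new Promise((resolve) => {
    const audio = new Audio(URL.createObjectURL(blob));
    audio.onended = () => resolve();
    void audio.play();
  });
}

async function speakIncrementally(text: string): Promise<void> {
  // Naive sentence split so playback can start after the first sentence is ready.
  const chunks = text.match(/[^.!?]+[.!?]*/g) ?? [text];
  let playback: Promise<void> = Promise.resolve();
  for (const chunk of chunks) {
    const blob = await textToSpeech(chunk); // synthesize the next chunk while the previous one plays
    playback = playback.then(() => playBlob(blob));
  }
  await playback;
}
```

True streaming (e.g. consuming a chunked TTS response) would reduce latency further, but the sentence-level queue above is a much smaller change.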

Enhance system prompt

Describe the bug
For a simple task like "draft an email to reply susan and send", even GPT-4 Vision produces a wrongly formatted action. We need to add at least one example for each action in the system prompt.

To Reproduce

  1. Run web wand on Chrome.
  2. Open an email in the browser, then put "draft an email to reply susan and send" in the WebWand input box.

Expected behavior
Web wand should produce a correctly formatted action.

Screenshots
(screenshot omitted)

Desktop:

  • Browser: Chrome

More robust typing implementation

Currently the setValue tool does the following:

  • click on the input (or the label)
  • select all text
  • type

This usually works well, but some web pages are set up to prevent selecting the input.
Instead, we can send the "DeleteToBeginningOfParagraph" and "DeleteToEndOfParagraph" commands to delete in both directions, as sketched below.
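A sketch of that double-delete idea, assuming the tab is already attached via chrome.debugger (as the existing tools are); the exact editing-command names Chrome accepts may differ from the ones quoted above:

```typescript
// Clear the focused field by deleting backward and forward from the caret, so it is
// emptied even when the page blocks "select all".
async function clearFocusedInput(tabId: number): Promise<void> {
  await chrome.debugger.sendCommand({ tabId }, "Input.dispatchKeyEvent", {
    type: "rawKeyDown",
    commands: ["DeleteToBeginningOfParagraph", "DeleteToEndOfParagraph"],
  });
}
```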

[Idea] provide more information on action history

We can try to detect the following and add this information to the prompt:

  • the agent tried to interact with an element that does not exist
  • nothing visibly changed after the last action (by comparing screenshots locally; see the sketch below)
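A minimal sketch of the "nothing visibly changed" check, assuming screenshots are captured as data URLs (e.g. via chrome.tabs.captureVisibleTab); an exact string comparison is the crudest possible diff and could later be swapped for a perceptual hash:

```typescript
// Two identical data URLs mean the visible tab did not change at all.
function nothingVisiblyChanged(before: string, after: string): boolean {
  return before === after;
}

// Append a note to the action history that gets fed back into the prompt.
function annotateHistory(history: string[], before: string, after: string): void {
  if (nothingVisiblyChanged(before, after)) {
    history.push("Note: the page did not visibly change after the last action.");
  }
}
```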

Cleanup: after refactor

  1. src/common and src/state should be under src/shared
  2. most of the current src/shared is not used

bug: Incorrect syntax when using setValue tool

There's a chance the model responds with an incorrect ' inside of setValue, for example:
(screenshot omitted)

This does not throw, but typing stops unexpectedly.
(screenshot omitted)

Possible solutions:

  1. in the prompt, remind the LLM about this issue
  2. catch this error during parsing, then retry (see the sketch below)
  3. use a different response format
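A sketch of option 2, assuming hypothetical callModel and parseAction helpers (the real parsing lives elsewhere in the codebase); on a parse failure the error is fed back and the request is retried:

```typescript
// Assumed helpers: send the prompt to the LLM, and parse its response into an action.
declare function callModel(prompt: string): Promise<string>;
declare function parseAction(response: string): { name: string; args: unknown };

async function getActionWithRetry(prompt: string, maxAttempts = 2) {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const response = await callModel(prompt);
    try {
      return parseAction(response);
    } catch (error) {
      lastError = error;
      // Combine options 1 and 2: remind the model of the formatting problem, then retry.
      prompt +=
        `\nYour previous response could not be parsed (${String(error)}). ` +
        "Escape any quotes inside setValue arguments.";
    }
  }
  throw lastError;
}
```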

Clicks the input but forgets to set the value

Describe the bug
Sometimes webwand clicks the input but forgets to set the value. We need to enforce a (click + setValue) sequence for any input operation.

Expected behavior
If there's an input task, the agent must use setValue, or click + setValue.

Screenshots
(screenshot omitted)

Support more tools for complex tasks

Depends on #53

This would enable interesting workflows like "post a meme picture about workaholic in this slack chat"

New tools like

New memory

  • observed pages and their tab id

Note: before working on this, create individual issues and use this one as the parent.

Support the text agent

Describe the bug
If the user selects a model that does not support vision, like gpt-3.5, then the agent cannot find the element on the website and gets stuck due to "Unable to find element with selector: ".

To Reproduce
Steps to reproduce the behavior:

  1. Start the web wand service and open the Chrome extension.
  2. Start any task.

Expected behavior
A correct selector should be extracted from the action, and getObjectIdBySelector should succeed by finding that element.

Screenshots
(screenshot omitted)

Desktop:

  • Browser: Chrome

Additional context
This is because the old Taxy agent is not fully supported (the current agent is built to support GPT-4 Vision). We need to support the old Taxy agent by retrieving the correct id from the simplified DOM, updating the available-actions prompt and system prompt, updating the label attribute constant, etc.

alert user when the website content hits token limit

Describe the bug
When the website content exceeds the token limit of a text-based GPT model like 3.5 or 4.0, webwand throws an error.

Expected behavior
Show an alert window to let the user know that the token limit was exceeded (see the sketch below).
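A rough sketch of warning before the request is sent, using a crude characters-per-token estimate; a real tokenizer would be more accurate, and the function names here are illustrative:

```typescript
// Very rough rule of thumb for English text; good enough to warn, not to bill.
const APPROX_CHARS_PER_TOKEN = 4;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / APPROX_CHARS_PER_TOKEN);
}

// Warn the user up front instead of letting the API call fail with an opaque error.
function warnIfOverLimit(pageContent: string, modelTokenLimit: number): void {
  if (estimateTokens(pageContent) > modelTokenLimit) {
    alert(
      "The page content likely exceeds the selected model's token limit. " +
        "Try a model with a larger context window.",
    );
  }
}
```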

Add typing speed argument

In some cases we would like to be able to control the speed of typing into a box; for example, it may be desirable to type very long messages faster than short ones. A sketch of such an argument follows.
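A minimal sketch of a per-character delay, assuming a hypothetical typeCharacter helper that sends a single key to the focused element; the parameter name is illustrative:

```typescript
// Assumed helper: dispatch one character to the currently focused element.
declare function typeCharacter(char: string): Promise<void>;

// Type the text with a configurable pause between characters; callers could pass a
// smaller delayMs for very long messages so they finish in a reasonable time.
async function typeText(text: string, delayMs = 50): Promise<void> {
  for (const char of text) {
    await typeCharacter(char);
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
}
```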

lookup interactive elements with CSS

Currently, we query interactive elements using HTML semantics. Some websites do not follow those rules. For example, I've seen this kind of button:

<div class="btn_primary inline-flex cursor-pointer items-center gap-1 text-center font-semibold">New Job</div>

They are currently ignored by WebWand.

However, if we traverse the entire DOM, we can find these elements by checking whether the computed CSS style has cursor: pointer, and assume such elements are also interactive.

Depending on the website, this might get more things done, but this can also sometimes mark some unwanted things as interactive and confuse the agent.

We might make this an option in settings, or allow it as part of a per-website rule. A sketch of the heuristic follows.
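A minimal sketch of the cursor-based heuristic, run from the content script; the selector used to skip elements the semantic pass already finds is an assumption:

```typescript
// Find elements styled as clickable (cursor: pointer) that the semantic query misses.
function findCursorPointerElements(): HTMLElement[] {
  const found: HTMLElement[] = [];
  for (const el of Array.from(document.querySelectorAll<HTMLElement>("*"))) {
    // Skip elements that the existing semantic lookup already treats as interactive.
    if (el.matches("a, button, input, select, textarea, [role='button']")) continue;
    if (getComputedStyle(el).cursor === "pointer") {
      found.push(el);
    }
  }
  return found;
}
```

Since parents of a clickable element often also compute cursor: pointer, a real implementation would probably keep only the innermost matching element.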

Provide more information for long-pending state in UI

In some cases, the program can be waiting for things and nothing will show in the UI. For example:

  • wait for page to be (fully) loaded
  • wait for API response

It doesn't happen often, but when a wait takes more than 5 seconds, we should notify the user in the UI (see the sketch below).
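A minimal sketch of surfacing a long wait, assuming a hypothetical setStatus UI callback; the 5-second threshold follows the note above:

```typescript
// Wrap any long-running promise; if it is still pending after thresholdMs, show a status
// message, and clear it once the promise settles.
async function withPendingNotice<T>(
  label: string,
  task: Promise<T>,
  setStatus: (message: string | null) => void,
  thresholdMs = 5000,
): Promise<T> {
  const timer = setTimeout(() => setStatus(`Still ${label}...`), thresholdMs);
  try {
    return await task;
  } finally {
    clearTimeout(timer);
    setStatus(null);
  }
}
```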

feature: Human-in-the-loop

Allow the agent to proactively pause and wait for the user's input (an additional prompt, or actions taken on the page such as entering a username and password), then continue the task.
For simplicity we can ask the user to click a button to continue, as sketched below.
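A minimal sketch of the "pause until the user clicks continue" idea, assuming a hypothetical showContinueButton UI hook; the agent loop would await this before its next step:

```typescript
// Resolve only after the user clicks the "Continue" button rendered by the UI.
function waitForUserToContinue(
  showContinueButton: (onClick: () => void) => void,
): Promise<void> {
  return new Promise((resolve) => {
    showContinueButton(() => resolve());
  });
}
```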

Settings: more audio options

Current: window.SpeechRecognition for input + OpenAI audio for output

We want to allow customizing each of them and offer different options such as Azure AI Speech.

Add more actions for agent

  • wait (to handle cases when some requests are taking a long time)
  • hover over (some websites put buttons in popups shown only when hovering)
  • scroll directly to top or bottom of page
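A minimal sketch of what these three actions could look like in the content script; the names and grouping are illustrative, not the existing tool definitions:

```typescript
// wait: pause for a given number of milliseconds (e.g. while a slow request finishes).
async function wait(ms: number): Promise<void> {
  await new Promise((resolve) => setTimeout(resolve, ms));
}

// hover: dispatch pointer/mouse events so hover-only popups and menus become visible.
function hover(element: HTMLElement): void {
  for (const type of ["pointerover", "mouseover", "mouseenter"]) {
    element.dispatchEvent(new MouseEvent(type, { bubbles: true }));
  }
}

// scrollToEdge: jump straight to the top or bottom of the page.
function scrollToEdge(edge: "top" | "bottom"): void {
  window.scrollTo({
    top: edge === "top" ? 0 : document.body.scrollHeight,
    behavior: "smooth",
  });
}
```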

Release automation

We should be able to use a script to prepare a new release (a .zip file that contains the extension), and optionally publish it on GitHub. A sketch follows.
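A rough sketch of such a script, assuming the built extension lives in dist/, the zip CLI is available, and the version comes from package.json; all of those details are assumptions:

```typescript
// scripts/build-release.ts -- zip the unpacked extension so it can be attached to a release.
import { execSync } from "node:child_process";
import { readFileSync } from "node:fs";

const { version } = JSON.parse(readFileSync("package.json", "utf8"));
const archive = `fuji-web-${version}.zip`;

// Zip the contents of dist/ (not the folder itself) into the archive at the repo root.
execSync(`cd dist && zip -r ../${archive} .`, { stdio: "inherit" });
console.log(`Created ${archive}`);

// Publishing could then be a separate step, e.g. with the GitHub CLI:
//   gh release create v${version} ${archive}
```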

Add Non-vision model support

Use a simplified DOM string instead of a screenshot in the agent. Currently the old Taxy agent is not fully supported. If the user selects a model that does not support vision, like gpt-3.5, then the agent cannot find the element on the website and gets stuck due to "Unable to find element with selector: ".

We need to support the old Taxy agent by retrieving the correct id from the simplified DOM, updating the available-actions prompt and system prompt, updating the label attribute constant, etc.

Export Actions

Extract a lib for "Web Wand Actions" (click, type, etc.) for other packages to use

Optimize knowledge in UI for easier understanding

  • only show notes instead of entire json
  • temporarily remove copying & creating with json
  • remove regex -- all notes from user settings match all urls under the current host
  • rename notes to "instructions"
  • show "instructions for this website" in main UI

Host knowledge DB on the internet

  • DB host
  • update extension to fetch from host
  • allow extension to fetch and merge multiple knowledge bases
  • UI for updating knowledge + validation
  • Auth
  • UI for forking
  • UI for submitting merge request to main DB

As of now, knowledge is stored here as JSON files: https://github.com/normal-computing/web-wand/tree/main/src/helpers/knowledge

We want to create an independent host for it, so that it can be dynamically updated for every user.
At the same time, we should create a web UI so that users can log in to create and update their own rulesets. They can also request that their ruleset be merged into the base ruleset.

settings view

As we add features to this product, we should create a settings view (instead of just one option to reset the OpenAI key).

The settings view should be used to:

  • manage OpenAI key
  • choose an LLM
  • turn on/off audio mode

In the future it can also be used to:

  • manage Claude key (go with #55)
  • manage knowledge bases (#54)

Custom knowledge base in settings

dependency: #56

motivation

Low-effort workaround for #54 since that is unlikely to be finished before open sourcing.
This is also nice to keep after #54 is done, since it provides a nice way to test things locally.

what

Add a textarea in the settings view. If the data structure is correct, use it to enhance the prompting (a validation sketch follows).
Add a description to guide users.
Optionally: build a UI for that purpose so that people can copy-paste the JSON instead of manually crafting it.
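A rough sketch of validating the pasted JSON before it is used for prompting; the shape of a knowledge entry here is an assumption, not the actual schema:

```typescript
// Assumed shape of one pasted knowledge entry; the real schema lives in src/helpers/knowledge.
interface CustomKnowledgeEntry {
  host: string;     // e.g. "github.com"
  notes: string[];  // free-form instructions for that site
}

// Return the parsed entries, or null when the JSON is invalid or has the wrong shape,
// so the settings view can show a validation hint instead of silently using bad data.
function parseCustomKnowledge(raw: string): CustomKnowledgeEntry[] | null {
  try {
    const data = JSON.parse(raw);
    if (!Array.isArray(data)) return null;
    const valid = data.every(
      (entry) => typeof entry?.host === "string" && Array.isArray(entry?.notes),
    );
    return valid ? (data as CustomKnowledgeEntry[]) : null;
  } catch {
    return null;
  }
}
```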

Handle typing when no input element is focused (expedia fake input)

There are cases where the agent decides to type into an "input" on a page, but it turns out to be a fake one. For example, on Expedia, this "Date" label looks very much like an input.
(screenshot omitted)

However, no input is focused after clicking it. Instead, it opens a dialog that expects button clicks.

(screenshot omitted)

Typing in this case is risky because it can trigger keyboard shortcuts on the website; at minimum, "space" would cause the browser to scroll to the next screen in Chrome.

We need to think of a way to handle this problem. One option is to check what has focus before typing, as sketched below.
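A minimal sketch of guarding the type action from the content script: after the click, only send keystrokes when something editable actually has focus, and otherwise report back to the agent instead of typing:

```typescript
// True when the currently focused element can actually receive typed text.
function hasEditableFocus(): boolean {
  const el = document.activeElement;
  if (!(el instanceof HTMLElement)) return false;
  return (
    el instanceof HTMLInputElement ||
    el instanceof HTMLTextAreaElement ||
    el.isContentEditable
  );
}
```

If hasEditableFocus() is false after the click, the setValue step could be aborted and a note added to the action history (e.g. "clicking this element did not focus an input"), so stray keystrokes never trigger page shortcuts or scrolling.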

Recommend use cases

When no action is in progress, show some recommended use-case buttons, like ChatGPT does.

(screenshot omitted)

Advanced Voice Support

Status Quo:
When in voice mode, webwand reads the instruction and action history out loud.

Expected
Webwand should also read out its observation of the currently active tab when the user explicitly asks it to do so.
