normal-computing / fuji-web
Fuji is an AI agent that lives in your browser's sidepanel. You can now get tasks done online with a single command!
License: Apache License 2.0
https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/WebExtensions/API/find/find
This would be useful on long pages when the task is about finding certain information.
We got these warnings:
Node.js 16 actions are deprecated. Please update the following actions to use Node.js 20: actions/checkout@v3, actions/setup-node@v3, actions/cache@v3, pnpm/action-setup@v2, actions/upload-artifact@v3. For more information see: https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/.
add new settings:
Currently we wait 2 seconds after taking an action (e.g. clicking a button). That might not be enough on a slow network.
Need to find a better way to detect when the loading/page state change is done.
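One possible direction, sketched below, is to replace the fixed wait with a "quiet period" check: the page counts as settled once no activity has been seen for a short window, with a hard cap so the agent never waits forever. The class name, method names, and thresholds are all assumptions for illustration. Timestamps are passed in explicitly so the logic can run without a browser; in the extension, `recordActivity` could be called from a `MutationObserver` callback.

```typescript
// Hypothetical helper: decides when a page has "settled" after an action.
// Names and thresholds are illustrative assumptions, not the project's code.
class QuietPeriodTracker {
  private readonly startedAt: number;
  private readonly quietMs: number;
  private readonly maxWaitMs: number;
  private lastActivityAt: number;

  constructor(startedAt: number, quietMs: number = 500, maxWaitMs: number = 10000) {
    this.startedAt = startedAt;
    this.quietMs = quietMs; // no activity for this long => settled
    this.maxWaitMs = maxWaitMs; // hard cap on the total wait
    this.lastActivityAt = startedAt;
  }

  // Call whenever the page changes (e.g. from a MutationObserver callback).
  recordActivity(now: number): void {
    this.lastActivityAt = now;
  }

  // True once the page has been quiet long enough, or the cap is reached.
  isSettled(now: number): boolean {
    const quietFor = now - this.lastActivityAt;
    const waited = now - this.startedAt;
    return quietFor >= this.quietMs || waited >= this.maxWaitMs;
  }
}
```

A polling loop would call `isSettled(Date.now())` every few tens of milliseconds and proceed as soon as it returns true, so fast pages no longer pay the full 2 seconds and slow pages get more time.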
As of now WebWand doesn't scroll well when the scrollable portion is only part of the page (e.g. a dialog with long content) instead of the whole page body.
One thought is we can provide all scrollable elements on the page and ask for an ID when agent wants to scroll. But it might be tricky to tell the agent what those wrappers are, because unlike buttons and inputs they are often just
Maybe we need a subagent just for this. In that case we can send a separate screenshot to indicate where each scrollable portion is.
Describe the bug
When a button is disabled, web-wand still clicks it. This happens when I use GPT-4 Vision. It might be because it cannot distinguish between active and disabled buttons.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
As shown in the screenshot below, the comment button is disabled when the comment is empty. So web-wand should fill out the content first before clicking the button. It should detect that the button is currently disabled.
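A minimal sketch of how the disabled state could be detected and surfaced to the model, assuming the agent builds a text description for each labeled element. `ButtonLike`, `isEffectivelyDisabled`, `describeForPrompt`, and the label format are hypothetical; the only facts relied on are that the HTML `disabled` attribute is a boolean attribute and that `aria-disabled="true"` is the ARIA equivalent.

```typescript
// Minimal stand-in for the attributes we would read from a DOM element.
interface ButtonLike {
  tagName: string;
  attributes: Record<string, string>;
}

function isEffectivelyDisabled(el: ButtonLike): boolean {
  // `disabled` is a boolean attribute: its mere presence disables the control.
  if ("disabled" in el.attributes) return true;
  // Custom widgets commonly signal the same state with aria-disabled="true".
  return el.attributes["aria-disabled"] === "true";
}

// Builds a text description for the prompt (the format is an assumption).
function describeForPrompt(el: ButtonLike, label: string): string {
  const state = isEffectivelyDisabled(el) ? " (disabled)" : "";
  return `${el.tagName.toLowerCase()} "${label}"${state}`;
}
```

For example, a `BUTTON` carrying an empty `disabled` attribute and the label `Comment` would be described as `button "Comment" (disabled)`, which gives the model a textual cue that a screenshot alone may not convey.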
Desktop (please complete the following information):
Implement Audio Streaming in Webwand Extension
The Webwand Chrome extension currently requires generating the full audio blob from website content before playback starts, causing delays and reducing efficiency, particularly with longer texts.
Implement real-time audio streaming to allow immediate playback as text is converted to speech, improving user experience by providing instant auditory feedback without waiting for complete text processing.
Describe the bug
For a simple task like "draft an email to reply susan and send", even GPT-4-vision produces the wrong action format. We need to add at least one example for each action in the system prompt.
To Reproduce
Expected behavior
Web Wand should produce the correct action.
Desktop (please complete the following information):
Currently the setValue tool does the following:
This usually works well, but some web pages are set up to prevent selecting the input.
We can send the commands "DeleteToBeginningOfParagraph" and "DeleteToEndOfParagraph" to delete the content in both directions.
We can try to detect these cases and add the information to the prompt.
dependency: #56
Depended on #53
This would enable interesting workflows like "post a meme picture about workaholic in this slack chat"
New tools like
New memory
Note: before working on this, create individual issues and use this as parent
Describe the bug
If the user selects a model that does not support vision, like gpt-3.5, then the agent cannot find the element on the website and gets stuck with "Unable to find element with selector: "
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Correct selector should be extracted from action, and getObjectIdBySelector should return true by finding that element.
Screenshots
If applicable, add screenshots to help explain your problem.
Desktop (please complete the following information):
Additional context
This happens because the old Taxy agent is not fully supported (the current agent is built to support gpt4-vision). We need to support the old Taxy agent by retrieving the correct id from the simplified DOM, updating the available action prompt and system prompt, updating the label attribute constant, etc.
Describe the bug
When the website content exceeds the token limit of a text-based GPT model like 3.5 or 4.0, webwand throws an error.
Expected behavior
Webwand should show an alert window that tells the user the token limit was exceeded.
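A rough sketch of a pre-flight check that could produce such a message before the request is sent. It assumes the common rule of thumb of about 4 characters per token for English text, which is only an approximation (a real tokenizer would be more accurate). `estimateTokenCount`, `checkTokenBudget`, and the message wording are hypothetical.

```typescript
// Rough heuristic: about 4 characters per token for English text.
// This is an approximation, not an exact tokenizer.
const APPROX_CHARS_PER_TOKEN = 4;

function estimateTokenCount(text: string): number {
  return Math.ceil(text.length / APPROX_CHARS_PER_TOKEN);
}

type BudgetCheck = { ok: true } | { ok: false; message: string };

// Returns a user-facing message when the page content is likely too large.
function checkTokenBudget(pageText: string, tokenLimit: number): BudgetCheck {
  const estimated = estimateTokenCount(pageText);
  if (estimated <= tokenLimit) return { ok: true };
  return {
    ok: false,
    message: `Page content is about ${estimated} tokens, which exceeds the model limit of ${tokenLimit}.`,
  };
}
```

The caller could show `message` in the UI when `ok` is false instead of letting the API error surface unexplained.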
In some cases we would like to be able to control the typing speed in an input box. For example, it may be desirable to type very long messages faster than short messages.
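One way to get that behavior, sketched here under assumed defaults, is to aim for a roughly fixed total typing time per message and clamp the per-keystroke delay. The function name and all numbers are hypothetical.

```typescript
// Hypothetical policy: spend about `targetTotalMs` on the whole message,
// but keep each keystroke delay within [minDelayMs, maxDelayMs].
function perKeystrokeDelayMs(
  textLength: number,
  targetTotalMs: number = 3000,
  minDelayMs: number = 5,
  maxDelayMs: number = 100,
): number {
  if (textLength <= 0) return 0;
  const ideal = targetTotalMs / textLength;
  return Math.min(maxDelayMs, Math.max(minDelayMs, ideal));
}
```

With these defaults a 10-character message is typed at the 100 ms cap per key, a 100-character message at 30 ms per key, and a 3000-character message at the 5 ms floor.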
Currently, we query interactive elements using HTML semantics. Some websites do not always follow the rules. For example, I've seen this kind of button:
<div class="btn_primary inline-flex cursor-pointer items-center gap-1 text-center font-semibold">New Job</div>
They are currently ignored by WebWand.
However, if we traverse the entire DOM, we can find these elements by checking whether the CSS style has cursor: pointer, and assume they are also interactive.
Depending on the website, this might get more things done, but this can also sometimes mark some unwanted things as interactive and confuse the agent.
We might make this an option in settings, or allow it as part of a per-website rule.
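The traversal could look roughly like the sketch below. It works on a plain tree (`StyledNode`, a hypothetical stand-in whose `cursor` field holds what `getComputedStyle(el).cursor` would return in a browser) so it can run outside the extension. Because `cursor` is an inherited CSS property, only the outermost pointer element is reported; otherwise every descendant of a clickable wrapper would be marked as well. The tag list is illustrative.

```typescript
// Minimal stand-in for a DOM element together with its computed cursor style.
interface StyledNode {
  id: string;
  tag: string;
  cursor: string; // what getComputedStyle(el).cursor would return
  children: StyledNode[];
}

// Tags that a semantic query would already pick up (illustrative list).
const SEMANTIC_TAGS = new Set(["a", "button", "input", "select", "textarea"]);

// Returns ids of non-semantic elements that look clickable via cursor: pointer.
function findPointerCandidates(root: StyledNode): string[] {
  const found: string[] = [];
  const visit = (node: StyledNode, parentIsPointer: boolean): void => {
    const isPointer = node.cursor === "pointer";
    // Report only the outermost pointer element, and skip semantic tags.
    if (isPointer && !parentIsPointer && !SEMANTIC_TAGS.has(node.tag)) {
      found.push(node.id);
    }
    for (const child of node.children) visit(child, isPointer);
  };
  visit(root, false);
  return found;
}
```

Reporting only the outermost element keeps the number of extra labels small, which may limit the risk of confusing the agent on pages that style large regions with a pointer cursor.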
In some cases, the program can be waiting for things and nothing will show in the UI. For example:
It doesn't happen a lot, but when loading takes more than 5 seconds, we should notify the user in the UI.
Allow the agent to proactively pause and wait for the user's input (an additional prompt, or actions on the page such as entering a username and password), and then continue the task.
For simplicity we can ask the user to click a button to continue.
Current: window.SpeechRecognition for input + OpenAI audio for output
We want to allow customizing each of them and offer different options such as Azure AI Speech.
We should be able to use a script to prepare a new release (a .zip file that contains the extension), and optionally put it on GitHub.
Use a simplified DOM string instead of a screenshot in the agent. Currently the old Taxy agent is not fully supported. If the user selects a model that does not support vision, like gpt-3.5, then the agent cannot find the element on the website and gets stuck with "Unable to find element with selector: "
We need to support the old Taxy agent by retrieving the correct id from the simplified DOM, updating the available action prompt and system prompt, updating the label attribute constant, etc.
We currently don't support any kind of dropdown menu well.
Extract a lib for "Web Wand Actions" (click, type, etc.) for other packages to use.
Given a task, use a new tab instead of operating in the current tab when it makes more sense.
Intuitively these should help:
As of now, knowledge is stored here as JSON files: https://github.com/normal-computing/web-wand/tree/main/src/helpers/knowledge
We want to create an independent host for it, so that it can be dynamically updated for every user.
At the same time, we should create a web UI so that users can log in to create and update their own rulesets. They can also request that their ruleset be merged into the base ruleset.
As we add features to this product, we should create a settings view (instead of just one option to reset the OpenAI key).
The settings view should be used to:
In the future it can also be used to:
Write three docs for WebWand release
During one task, allow opening new tabs (for example, to search for information on Google) and navigating between them to complete the task.
dependency: #56
Low-effort workaround for #54 since that is unlikely to be finished before open sourcing.
This is also nice to keep after #54 is done, since it provides a nice way to test things locally.
Add a textarea in the settings view. If the data structure is correct, use it to enhance the prompting.
Add a description to guide users.
Optionally: build a UI for that purpose so that people can copy and paste the JSON instead of manually crafting it.
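The "if the data structure is correct" check could be a small parse-and-validate step, sketched below. The `CustomKnowledge` shape (an object keyed by hostname, each with a `notes` array of strings) is purely an assumption for illustration; the real knowledge schema may differ, and the function name and error messages are hypothetical too.

```typescript
// Hypothetical shape for user-supplied knowledge; the real schema may differ.
interface CustomKnowledge {
  [hostname: string]: { notes: string[] };
}

type ParseResult =
  | { ok: true; value: CustomKnowledge }
  | { ok: false; error: string };

// Parses the textarea content and checks it against the assumed shape.
function parseCustomKnowledge(raw: string): ParseResult {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return { ok: false, error: "Not valid JSON." };
  }
  if (typeof data !== "object" || data === null || Array.isArray(data)) {
    return { ok: false, error: "Expected a JSON object keyed by hostname." };
  }
  for (const [host, entry] of Object.entries(data)) {
    const notes = (entry as { notes?: unknown } | null)?.notes;
    const valid = Array.isArray(notes) && notes.every((n) => typeof n === "string");
    if (!valid) {
      return { ok: false, error: `"${host}" needs a "notes" array of strings.` };
    }
  }
  return { ok: true, value: data as CustomKnowledge };
}
```

Returning an error string instead of throwing lets the settings view show the problem next to the textarea and only use the value for prompting when `ok` is true.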
There are cases when the agent decides to type in an "input" on a page, but it turns out to be a fake one. For example, on Expedia, this "Date" label looks very much like an input.
However, no input is focused after clicking it. Instead, it opens up a dialog expecting button clicking.
Typing in this case is tricky because it can trigger keyboard shortcuts on websites, or at least "space" would cause the browser to scroll to the next screen in Chrome.
We need to think of a way to handle this problem.
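One candidate guard, sketched under assumptions, is to check what is actually focused after the click and refuse to type unless it can accept text. `FocusedElementLike` is a hypothetical stand-in for the fields of `document.activeElement` that matter here. The check is deliberately simplified: for instance, it does not exclude non-text input types such as checkboxes.

```typescript
// Stand-in for the relevant fields of document.activeElement.
interface FocusedElementLike {
  tagName: string;
  isContentEditable: boolean;
  readOnly?: boolean;
}

// True when keystrokes sent now would land in an editable field.
// Simplified: does not inspect the input's `type` attribute.
function acceptsTextInput(active: FocusedElementLike | null): boolean {
  if (active === null) return false;
  if (active.readOnly) return false;
  const tag = active.tagName.toUpperCase();
  return tag === "INPUT" || tag === "TEXTAREA" || active.isContentEditable;
}
```

When the guard returns false (for example, focus is still on `BODY` after clicking the fake "Date" field), the agent could skip typing and report back that the click opened something other than an input.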
candidates
Goals:
Describe the bug
The current design is that if voice mode is on, runTask is triggered automatically when listening stops. This works all the time except when webwand is loaded for the first time.
Expected behavior
The task should also run automatically with voice mode on when webwand is loaded for the first time.
Status Quo:
When in voice mode, webwand reads the instruction and action history out loud.
Expected
Webwand can also read out the observation of the currently active tab when the user explicitly asks it to do so.