
ufo's Introduction

UFO: A UI-Focused Agent for Windows OS Interaction


UFO is a UI-Focused dual-agent framework that fulfills user requests on Windows OS by seamlessly navigating and operating within individual applications or across multiple applications.

🕌 Framework

UFO operates as a dual-agent framework, encompassing:

  • AppAgent 🤖, tasked with choosing an application for fulfilling user requests. This agent may also switch to a different application when a request spans multiple applications and the task has only been partially completed in the preceding application.
  • ActAgent 👾, responsible for iteratively executing actions on the selected applications until the task is successfully concluded within a specific application.
  • Control Interaction 🎮, tasked with translating actions from AppAgent and ActAgent into interactions with the application and its UI controls. The targeted controls must be compatible with the Windows UI Automation API.

Both agents leverage the multi-modal capabilities of GPT-Vision to comprehend the application UI and fulfill the user's request. For more details, please consult our technical report.
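
To make the division of labor concrete, here is a self-contained toy sketch of the control flow described above. It is illustrative only: the function names and the stubbed decisions are hypothetical and do not correspond to UFO's actual code or prompts.

# Toy sketch (Python) of the dual-agent loop: AppAgent picks the application,
# ActAgent iterates actions inside it, and Control Interaction executes each
# action via UI Automation. All names and decisions below are hypothetical.

def app_agent_choose_app(request: str, finished: list[str]) -> str | None:
    """AppAgent: choose the next application needed for the request (stub)."""
    plan = ["Word", "Outlook"]  # pretend the request spans two applications
    remaining = [app for app in plan if app not in finished]
    return remaining[0] if remaining else None

def act_agent_actions(request: str, app: str) -> list[str]:
    """ActAgent: decide the UI actions to execute inside one application (stub)."""
    return [f"focus {app}", f"work on: {request}", f"save and close {app}"]

def control_interaction(action: str) -> None:
    """Control Interaction: translate an action into a UI Automation call (stub)."""
    print(f"  [UIA] {action}")

def fulfill(request: str) -> None:
    finished: list[str] = []
    while (app := app_agent_choose_app(request, finished)) is not None:
        print(f"AppAgent selected application: {app}")
        for action in act_agent_actions(request, app):
            control_interaction(action)
        finished.append(app)  # switch applications for cross-app requests

fulfill("summarize the report in Word and email it via Outlook")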

📢 News

  • 📅 2024-03-25: New Release for v0.0.1! Check out our exciting new features:
    1. You can now create help documents for each Windows application so that UFO becomes an app expert. Check the README for more details!
    2. UFO now supports RAG from offline documents and online Bing search.
    3. You can save the task completion trajectory into its memory for UFO's reference, improving its future success rate!
    4. You can customize different GPT models for AppAgent and ActAgent. Text-only models (e.g., GPT-4) are now supported!
  • 📅 2024-02-14: Our technical report is online!
  • 📅 2024-02-10: UFO is released on GitHub🎈. Happy Chinese New year🐉!

🌐 Media Coverage

UFO sightings have garnered attention from various media outlets, including:

These sources provide insights into the evolving landscape of technology and the implications of UFO phenomena on various platforms.

💥 Highlights

  • First Windows Agent - UFO is the pioneering agent framework capable of translating user requests in natural language into actionable operations on Windows OS.
  • RAG Enhanced - UFO is enhanced by Retrieval Augmented Generation (RAG) from heterogeneous sources, including offline help documents and online search engines, to improve its capabilities.
  • Interactive Mode - UFO facilitates multiple sub-requests from users within the same session, enabling the completion of complex tasks seamlessly.
  • Action Safeguard - UFO incorporates safeguards to prompt user confirmation for sensitive actions, enhancing security and preventing inadvertent operations.
  • Easy Extension - UFO offers extensibility, allowing for the integration of additional functionalities and control types to tackle diverse and intricate tasks with ease.

✨ Getting Started

🛠️ Step 1: Installation

UFO requires Python >= 3.10 running on Windows OS >= 10. It can be installed by running the following command:

# [optional to create conda environment]
# conda create -n ufo python=3.10
# conda activate ufo

# clone the repository
git clone https://github.com/microsoft/UFO.git
cd UFO
# install the requirements
pip install -r requirements.txt

⚙️ Step 2: Configure the LLMs

Before running UFO, you need to provide your LLM configurations individually for AppAgent and ActAgent. You can create your own config file ufo/config/config.yaml by copying ufo/config/config.yaml.template and editing the APP_AGENT and ACTION_AGENT fields as follows:

OpenAI

VISUAL_MODE: True, # Whether to use the visual mode
API_TYPE: "openai" , # The API type, "openai" for the OpenAI API.  
API_BASE: "https://api.openai.com/v1/chat/completions", # The the OpenAI API endpoint.
API_KEY: "sk-",  # The OpenAI API key, begin with sk-
API_VERSION: "2024-02-15-preview", # "2024-02-15-preview" by default
API_MODEL: "gpt-4-vision-preview",  # The only OpenAI model by now that accepts visual input

Azure OpenAI (AOAI)

VISUAL_MODE: True, # Whether to use the visual mode
API_TYPE: "aoai" , # The API type, "aoai" for the Azure OpenAI.  
API_BASE: "YOUR_ENDPOINT", #  The AOAI API address. Format: https://{your-resource-name}.openai.azure.com
API_KEY: "YOUR_KEY",  # The aoai API key
API_VERSION: "2024-02-15-preview", # "2024-02-15-preview" by default
API_MODEL: "gpt-4-vision-preview",  # The only OpenAI model by now that accepts visual input
API_DEPLOYMENT_ID: "YOUR_AOAI_DEPLOYMENT", # The deployment id for the AOAI API

You can also use a non-visual model (e.g., GPT-4) for each agent, and optionally set a backup LLM engine, as described below.

Non-Visual Model Configuration

You can utilize non-visual models (e.g., GPT-4) for each agent by configuring the following settings in the config.yaml file:

  • VISUAL_MODE: False # To enable non-visual mode.
  • Specify the appropriate API_MODEL (OpenAI) and API_DEPLOYMENT_ID (AOAI) for each agent.

Optionally, you can set a backup language model (LLM) engine in the BACKUP_AGENT field to handle cases where the primary engines fail during inference. Ensure you configure these settings accurately to leverage non-visual models effectively.
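
For example, a text-only configuration for one agent might look like the snippet below. The model name is illustrative, and API_DEPLOYMENT_ID applies only when API_TYPE is "aoai"; consult ufo/config/config.yaml.template for the exact field structure, including BACKUP_AGENT.

VISUAL_MODE: False, # Disable visual input for this agent
API_TYPE: "openai", # The API type, or "aoai" for Azure OpenAI
API_BASE: "https://api.openai.com/v1/chat/completions", # The API endpoint
API_KEY: "sk-",  # Your API key
API_MODEL: "gpt-4",  # A text-only model (illustrative)
# API_DEPLOYMENT_ID: "YOUR_AOAI_DEPLOYMENT", # Required only for the AOAI API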

📔 Step 3: Additional Settings for RAG (optional)

If you want to enhance UFO's ability with external knowledge, you can optionally configure it with an external database for retrieval augmented generation (RAG) in the ufo/config/config.yaml file.

RAG from Offline Help Document

Before enabling this function, you need to create an offline indexer for your help documents. Please refer to the README to learn how to create an offline vector database for retrieval. You can enable this function by setting the following configuration:

## RAG Configuration for the offline docs
RAG_OFFLINE_DOCS: True  # Whether to use the offline RAG.
RAG_OFFLINE_DOCS_RETRIEVED_TOPK: 1  # The topk for the offline retrieved documents

Adjust RAG_OFFLINE_DOCS_RETRIEVED_TOPK to optimize performance.

RAG from Online Bing Search Engine

Enhance UFO's ability by utilizing the most up-to-date online search results! To use this function, you need to obtain a Bing search API key. Activate this feature by setting the following configuration:

## RAG Configuration for the Bing search
BING_API_KEY: "YOUR_BING_SEARCH_API_KEY"  # The Bing search API key
RAG_ONLINE_SEARCH: True  # Whether to use the online search for the RAG.
RAG_ONLINE_SEARCH_TOPK: 5  # The topk for the online search
RAG_ONLINE_RETRIEVED_TOPK: 1 # The topk for the online retrieved documents

Adjust RAG_ONLINE_SEARCH_TOPK and RAG_ONLINE_RETRIEVED_TOPK to get better performance.

RAG from Self-Demonstration

Save task completion trajectories into UFO's memory for future reference. This can improve its future success rates based on its previous experiences!

After completing a task, you'll see the following message:

Would you like to save the current conversation flow for future reference by the agent?
[Y] for yes, any other key for no.

Press Y to save it into its memory and enable memory retrieval via the following configuration:

## RAG Configuration for experience
RAG_EXPERIENCE: True  # Whether to use the RAG from its self-experience.
RAG_EXPERIENCE_RETRIEVED_TOPK: 5  # The topk for the retrieved experience demonstrations

🎉 Step 4: Start UFO

⌨️ You can execute the following on your Windows Command Line (CLI):

# assume you are in the cloned UFO folder
python -m ufo --task <your_task_name>

This will start the UFO process and you can interact with it through the command line interface. If everything goes well, you will see the following message:

Welcome to use UFO🛸, A UI-focused Agent for Windows OS Interaction. 
 _   _  _____   ___
| | | ||  ___| / _ \
| | | || |_   | | | |
| |_| ||  _|  | |_| |
 \___/ |_|     \___/
Please enter your request to be completed🛸:

⚠️Reminder:

  • Before UFO executes your request, please make sure the targeted applications are active on the system.
  • GPT-V accepts screenshots of your desktop and application GUI as input. Please ensure that no sensitive or confidential information is visible or captured during the execution process. For further information, refer to DISCLAIMER.md.

🎥 Step 5: Execution Logs

You can find the screenshots taken and request & response logs in the following folder:

./ufo/logs/<your_task_name>/

You may use them to debug, replay, or analyze the agent output.
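
If you prefer to inspect these artifacts programmatically, a minimal sketch such as the following lists them. It assumes only the folder layout stated above; the exact file names and formats inside the task folder may vary between versions.

# Minimal sketch: list the screenshots and logs UFO saved for one task.
# Only the folder layout documented above is assumed; file names may differ.
from pathlib import Path

task_name = "your_task_name"  # the name passed to `python -m ufo --task`
log_dir = Path("ufo") / "logs" / task_name

for path in sorted(log_dir.rglob("*")):
    if path.is_file():
        kind = "screenshot" if path.suffix.lower() in {".png", ".jpg", ".jpeg"} else "log"
        print(f"[{kind}] {path.relative_to(log_dir)} ({path.stat().st_size} bytes)")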

❓Get help


🎬 Demo Examples

We present two demo videos of UFO completing user requests on Windows OS. For more case studies, please consult our technical report.

1️⃣🗑️ Example 1: Deleting all notes on a PowerPoint presentation.

In this example, we will demonstrate how to efficiently use UFO to delete all notes on a PowerPoint presentation with just a few simple steps. Explore this functionality to enhance your productivity and work smarter, not harder!

ufo_delete_note.mp4

2️⃣📧 Example 2: Composing an email using text from multiple sources.

In this example, we will demonstrate how to utilize UFO to extract text from Word documents, describe an image, compose an email, and send it seamlessly. Enjoy the versatility and efficiency of cross-application experiences with UFO!

ufo_meeting_note_crossed_app_demo_new.mp4

📊 Evaluation

Please consult the WindowsBench provided in Section A of the Appendix within our technical report. Here are some tips (and requirements) to aid in completing your request:

  • Prior to UFO executing your request, ensure that the targeted application is active (though it may be minimized).
  • Occasionally, requests to GPT-V may trigger content safety measures. UFO will attempt to retry, but adjusting the size or scale of the application window may prove helpful. We are actively addressing this issue.
  • Currently, UFO supports a limited set of applications and UI controls that are compatible with the Windows UI Automation API. Our future plans include extending support to the Win32 API to enhance its capabilities.
  • Please note that the output of GPT-V may not consistently align with the same request. If unsuccessful with your initial attempt, consider trying again.

📚 Citation

Our technical report is available on arXiv (arXiv:2402.07939). If you use UFO in your research, please cite our paper:

@article{ufo,
  title={{UFO: A UI-Focused Agent for Windows OS Interaction}},
  author={Zhang, Chaoyun and Li, Liqun and He, Shilin and  Zhang, Xu and Qiao, Bo and  Qin, Si and Ma, Minghua and Kang, Yu and Lin, Qingwei and Rajmohan, Saravan and Zhang, Dongmei and  Zhang, Qi},
  journal={arXiv preprint arXiv:2402.07939},
  year={2024}
}

📝 Todo List

  • RAG enhanced UFO.
  • Documentation.
  • Support locally hosted GUI interaction models.
  • Support more controls using the Win32 API.
  • Chatbox GUI for UFO.

🎨 Related Project

You may also find TaskWeaver useful, a code-first LLM agent framework for seamlessly planning and executing data analytics tasks.

⚠️ Disclaimer

By choosing to run the provided code, you acknowledge and agree to the terms and conditions regarding functionality and data handling practices described in DISCLAIMER.md.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.


ufo's Issues

Local models?

Will local models be supported one day as well?
(Unless they are, and I didn't find it in the readme XD)

Azure API base instruction wrong?

Hi. I think you have the Azure API base instruction wrong.

I tried https://{your-resource-name}.openai.azure.com/openai/deployments/{deployment-id}/completions?api-version={api-version}
However got errors - error message was the model was ?looking for GPT 4

Instead I used https://{your-resource-name}.openai.azure.com/openai/deployments/{deployment-id}/chat/completions?api-version=2023-12-01-preview, which seems to work (i.e. adding '/chat' after the deployment ID, which also appears in some other general instructions around using Azure GPT-4V).

Is this correct now, or will this create issues for me?

thanks

Connection not working with AOAI

Hey there!
Thank you for your amazing work so far! I was looking forward to trying your agent, but I'm not able to use my AOAI credentials with it. I'm putting this in config.yaml:

[config.yaml screenshot omitted]

Error making API request: Invalid URL 'YOUR_ENDPOINT': No scheme supplied.

Created a config.yaml and populated with API key and "gpt-4-vision-preview" as the model as per the instructions, but am getting the following error:
Step 0: Selecting an application.
Error making API request: Invalid URL 'YOUR_ENDPOINT': No scheme supplied. Perhaps you meant https://YOUR_ENDPOINT?
Error occurs when calling LLM.
Running Win11.

Error making API request: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

I followed the Getting Started steps to configure the OpenAI endpoint, but encountered an error during execution.
Error making API request: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

In the config.yml, I ONLY modified the following parameters:
OPENAI_API_BASE: "https://api.openai.com/v1/chat/completions"
OPENAI_API_KEY: "###"

Could anybody tell me why this happens and how to solve it?

Train or fine-tune models for computer automation agents

Hello there, Microsoft UFO Team! Excellent work; you have done a remarkable job bringing AI closer to the Windows system. I am doing similar work, training custom GPT-2 models on computer automation datasets.

I have created two comprehensive datasets covering terminal and GUI environments. My strategy is to create data through random keyboard and mouse actions and to collect the resulting observations, mixed with other textual datasets.

This naive attempt reflects my strong interest in computer agents. I like the idea of GUI agent benchmark systems like WindowsBench, and have thought of building a reward system based on program exit codes or VimGolf.

If you consider my suggestion useful, I would love to hear your reply! Furthermore, if cooperation is possible, I would be thrilled to join your team to build better computer agents!


Update: Google has posted an unsupervised action-space training method called Genie. I consider it highly applicable to the area of computer agents.

How to get all user requests

How can we get all the user requests used in your paper?
And how many user requests are there for each of the following applications: Outlook, Photos, PowerPoint, Word, Adobe Acrobat, File Explorer, Visual Studio Code, WeChat, Edge Browser, and cross-app?
