Self-Operating Computer Framework

A framework to enable multimodal models to operate a computer.

Using the same inputs and outputs of a human operator, the model views the screen and decides on a series of mouse and keyboard actions to reach an objective.
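The loop described above can be sketched roughly as follows. This is a minimal illustration, not the framework's actual code: `Action`, `run_objective`, and the `capture_screen`, `decide`, and `execute` callables are hypothetical names introduced here, and in a real setup `decide` would call a multimodal model with the screenshot.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass
class Action:
    """One step chosen by the model (hypothetical action vocabulary)."""
    kind: str                      # "click", "type", or "done"
    x: Optional[float] = None      # click position, if any
    y: Optional[float] = None
    text: Optional[str] = None     # text to type, if any


def run_objective(
    objective: str,
    capture_screen: Callable[[], Any],
    decide: Callable[[str, Any, List[Action]], Action],
    execute: Callable[[Action], None],
    max_steps: int = 10,
) -> List[Action]:
    """View the screen, pick an action, perform it, and repeat."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = capture_screen()                    # same input a human sees
        action = decide(objective, screenshot, history)  # model chooses next step
        history.append(action)
        if action.kind == "done":                        # model reports it is finished
            break
        execute(action)                                  # mouse or keyboard output
    return history
```

Passing the capture, decision, and execution steps in as callables keeps the loop independent of any particular model or input library, and `max_steps` bounds the run if the model never reports completion.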

Key Features

  • Compatibility: Designed for various multimodal models.
  • Integration: Currently integrated with GPT-4V as the default model.
  • Future Plans: Support for additional models.

Current Challenges

Note: GPT-4V's error rate in estimating XY mouse click locations is currently quite high. This framework aims to track the progress of multimodal models over time, aspiring to achieve human-level performance in computer operation.

Ongoing Development

At HyperWriteAI, we are developing Agent-1-Vision, a multimodal model with more accurate click location predictions.

Agent-1-Vision Model API Access

We will soon be offering API access to our Agent-1-Vision model.

If you're interested in gaining access to this API, sign up here.

Additional Thoughts

We recognize that some operating system functions may be executed more efficiently with hotkeys, such as focusing the browser address bar with command + L, rather than by simulating a mouse click at the correct XY location. We plan to make these improvements over time. However, many actions require accurately selecting visual elements on the screen, which in turn requires precise XY mouse click locations. A primary focus of this project is to improve the accuracy of these click locations. We believe this is essential for achieving a fully self-operating computer in the current technological landscape.
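One way to picture the hotkey-first idea is a small lookup that falls back to an XY click only when no hotkey is known. This is a sketch under assumptions: the `HOTKEYS` table, the intent name `focus_address_bar`, and `plan_input` are hypothetical and are not part of the framework; only the command + L shortcut comes from the text above.

```python
from typing import Dict, Optional, Tuple

# Hypothetical table mapping a high-level intent to a hotkey.
HOTKEYS: Dict[str, Tuple[str, ...]] = {
    "focus_address_bar": ("command", "l"),
}


def plan_input(intent: str, click_target: Optional[Tuple[float, float]] = None) -> dict:
    """Prefer a known hotkey; otherwise fall back to a click at an XY location."""
    keys = HOTKEYS.get(intent)
    if keys is not None:
        return {"type": "hotkey", "keys": keys}
    if click_target is None:
        raise ValueError(f"no hotkey for {intent!r} and no click target given")
    x, y = click_target
    return {"type": "click", "x": x, "y": y}
```

Intents without a hotkey still depend on an accurate click location, which is why click accuracy remains the main concern.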

Demo

Demo video: final-low.mp4

Quick Start Instructions

Below are instructions to set up the Self-Operating Computer Framework locally on your computer.

  1. Clone the repo to a directory on your computer:
     git clone https://github.com/OthersideAI/self-operating-computer.git
  2. Change into the directory:
     cd self-operating-computer
  3. Create a Python virtual environment:
     python3 -m venv venv
  4. Activate the virtual environment:
     source venv/bin/activate
  5. Install the project requirements:
     pip install -r requirements.txt
  6. Install the project and its command-line interface:
     pip install .
  7. Rename the .example.env file to .env so that you can save your OpenAI key in it:
     mv .example.env .env
  8. Add your OpenAI key to your new .env file (if you don't have one, obtain a key from OpenAI first):
     OPENAI_API_KEY='your-key-here'
  9. Run it:
     operate
  10. Final step: on macOS, the Terminal app will ask for "Screen Recording" and "Accessibility" permissions on the "Security & Privacy" page of "System Preferences".
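Before running `operate`, it can help to confirm that the key in `.env` is readable. The check below is a standard-library sketch, not the framework's own loader: `load_env_file` and `get_openai_key` are hypothetical helpers, and the framework may read the file differently.

```python
import os
from pathlib import Path
from typing import Dict


def load_env_file(path: str = ".env") -> Dict[str, str]:
    """Parse simple KEY=value lines, ignoring blanks and # comments."""
    values: Dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw in env_path.read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding single or double quotes, as in OPENAI_API_KEY='...'
        values[key.strip()] = value.strip().strip("'\"")
    return values


def get_openai_key(path: str = ".env") -> str:
    """Return the key from the environment or the .env file, or raise."""
    key = os.environ.get("OPENAI_API_KEY") or load_env_file(path).get("OPENAI_API_KEY")
    if not key or key == "your-key-here":
        raise RuntimeError("OPENAI_API_KEY is missing or still the placeholder value")
    return key
```

Calling `get_openai_key()` from the project directory raises an error if the placeholder from the example file was never replaced.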

Contributions are Welcome! Some Ideas:

  • Improve performance by finding the optimal screenshot grid: A primary element of the framework is that it overlays a percentage grid on the screenshot, which GPT-4V uses to estimate click locations. If someone finds a better grid, along with evaluation metrics confirming that it improves on the current method, we will merge that PR.
  • Improve the SUMMARY_PROMPT
  • Create an evaluation system
  • Improve Linux and Windows compatibility: There are still some issues with Linux and Windows compatibility. PRs to fix the issues are encouraged.
  • Enabling New Mouse Capabilities: (drag, hover, etc.)
  • Adding New Multimodal Models: Integration of new multimodal models is welcomed. If you have a specific model in mind that you believe would be a valuable addition, please feel free to integrate it and submit a PR.
  • Framework Architecture Improvements: Think you can enhance the framework architecture described in the intro? We welcome suggestions and PRs.
  • Implement a Reflective Mouse Click Mode: Introduce a new mode that enhances click accuracy by adding a 'reflect and correct' step. In this mode, the system will 'move mouse, reflect on position, and click if accurate; otherwise, adjust position closer.' This approach, more akin to human interaction, could increase accuracy before Agent-1-Vision is available for precise clicking. The main challenge is the added time caused by current multimodal model latency. We propose an optional -accurate terminal flag to activate this mode. This feature has the potential to significantly boost performance and offers an interesting area for development.
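Two of the ideas above, the percentage grid and the proposed 'reflect and correct' mode, can be sketched together. This is an illustration under assumptions rather than the framework's implementation: `percent_to_pixels` assumes the model reports positions as percentages from 0 to 100, and `reflective_click` with its `move_mouse`, `reflect`, and `click` callables is hypothetical. In practice `reflect` would take a fresh screenshot and ask the model whether the cursor is on target.

```python
from typing import Callable, Tuple

Point = Tuple[int, int]


def percent_to_pixels(x_percent: float, y_percent: float, width: int, height: int) -> Point:
    """Convert a percentage-grid estimate into pixel coordinates on the screen."""
    x_percent = min(max(x_percent, 0.0), 100.0)   # clamp out-of-range estimates
    y_percent = min(max(y_percent, 0.0), 100.0)
    x = min(round(width * x_percent / 100), width - 1)
    y = min(round(height * y_percent / 100), height - 1)
    return x, y


def reflective_click(
    start: Point,
    move_mouse: Callable[[Point], None],
    reflect: Callable[[Point], Tuple[bool, Point]],
    click: Callable[[Point], None],
    max_attempts: int = 3,
) -> bool:
    """Move, reflect on the position, click if accurate, otherwise adjust closer."""
    position = start
    for _ in range(max_attempts):
        move_mouse(position)
        on_target, correction = reflect(position)  # one model call per attempt
        if on_target:
            click(position)
            return True
        # Apply the suggested (dx, dy) correction and try again.
        position = (position[0] + correction[0], position[1] + correction[1])
    return False                                   # gave up without clicking
```

Each attempt costs one additional model call, which is the latency trade-off noted above; `max_attempts` caps that cost.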

For any input on improving this project, feel free to reach out to Josh on Twitter. If you want to contribute yourself, see CONTRIBUTING.md.

Compatibility

  • This project is compatible with macOS, Windows, and Linux (with X server installed).
