gym_screen_task

brainstormed subtasks:

[ ] find a button

[x] click a button

[ ] click the right button of many

[x] drag a slider

[ ] click a slider

[x] press keyboard button

[ ] draw a figure

[ ] track a target with the mouse

[ ] press a button with timing

[ ] move a cursor object with keyboard

[ ] move the viewport with the keyboard

[ ] move the viewport with the mouse

[ ] drag select multiple objects with mouse

[ ] avoid object with cursor/mouse

[ ] copy text

[ ] write into textfield

[ ] paste text

[ ] drag the viewport with mouse

[ ] let the mouse get captured in a field

[ ] exit the mouse capture in a field

[ ] navigate through elements with keyboard

design:

it's implemented as a gym environment, so the usual gym interface applies: gym.make, reset, step, observations and actions

the observation space is:

"screen": rgb pixels, with an adjustable resolution

"task_description": a textual description of the task or subtask, optionally with textual hints/prompts for a vlm

note: absolute mouse movement ("mouse_abs_move" in the action space) is ignored for now

the action space is:

"keyboard": printable ascii keys

"mouse_buttons": left, right, scroll-click

"mouse_rel_move": relative mouse movement

"mouse_abs_move": absolute mouse movement (set mouse position in px)

"mouse_scroll": relative mouse scroll-wheel movement

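As a rough illustration of the action space above, a single action can be pictured as a plain dictionary keyed by the five names listed. This is only a sketch and not the environment's code: the helper `apply_mouse_action`, the `(x, y)` tuple layout, the `scroll_click` key name, and clamping the cursor to the screen bounds are all assumptions made for the example. It follows the note that absolute mouse movement is ignored for now.

```python
def apply_mouse_action(pos, action, width, height):
    """Return the new (x, y) cursor position after one action (hypothetical helper)."""
    x, y = pos
    dx, dy = action.get("mouse_rel_move", (0, 0))
    # "mouse_abs_move" is deliberately not read here: it is ignored for now.
    # Clamping to the screen is an assumption, not confirmed behaviour.
    new_x = min(max(x + dx, 0), width - 1)
    new_y = min(max(y + dy, 0), height - 1)
    return (new_x, new_y)


# One example action using the five keys from the list above (values are made up).
example_action = {
    "keyboard": "a",
    "mouse_buttons": {"left": True, "right": False, "scroll_click": False},
    "mouse_rel_move": (15, -5),
    "mouse_abs_move": (300, 200),
    "mouse_scroll": 0,
}

new_pos = apply_mouse_action((630, 2), example_action, width=640, height=480)
print(new_pos)
```

Keeping the cursor update in one small function like this would make it easy to switch on absolute movement later without touching the rest of the step logic.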
additional data:

"semantic space": an image of the same size as the screen that represents the semantic classes of ui elements: "text", "button", "image", ...

maybe we could use some kind of "ui-element-graph" to highly compress the screen semantics
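One way to picture the "semantic space" is a per-pixel map of class ids with the same width and height as the screen. The sketch below builds such a map from a list of ui element rectangles; the class ids, the `render_semantic_map` helper, and the `(x0, y0, x1, y1)` rectangle layout are assumptions for illustration, not part of the project.

```python
# Hypothetical class ids; the real set of semantic classes is not fixed yet.
SEMANTIC_CLASSES = {"background": 0, "text": 1, "button": 2, "image": 3}


def render_semantic_map(width, height, elements):
    """Return a height x width grid of class ids.

    `elements` is a list of (class_name, (x0, y0, x1, y1)) with exclusive
    right/bottom edges; later elements overwrite earlier ones.
    """
    sem = [[SEMANTIC_CLASSES["background"]] * width for _ in range(height)]
    for cls, (x0, y0, x1, y1) in elements:
        for y in range(max(y0, 0), min(y1, height)):
            for x in range(max(x0, 0), min(x1, width)):
                sem[y][x] = SEMANTIC_CLASSES[cls]
    return sem


# A tiny 8x4 "screen" with one button and one text label.
semantic = render_semantic_map(
    8, 4, [("button", (1, 1, 4, 3)), ("text", (5, 0, 8, 1))]
)
for row in semantic:
    print(row)
```

The same element list could also serve as the starting point for a "ui-element-graph", since it already describes the screen far more compactly than the pixel map.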

we also need a format for a "screen-action-trace" that records the actions for every frame
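A possible starting point for such a trace, purely as a sketch: one JSON object per line (JSON Lines), holding the frame index and the action taken on that frame. The record layout and the `write_trace` / `read_trace` helpers are hypothetical; no format has been decided.

```python
import io
import json


def write_trace(actions, fp):
    """Write one JSON record per frame: {"frame": index, "action": {...}}."""
    for frame_index, action in enumerate(actions):
        record = {"frame": frame_index, "action": action}
        fp.write(json.dumps(record, sort_keys=True) + "\n")


def read_trace(fp):
    """Read the records back as a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in fp if line.strip()]


# Two made-up frames; lists are used because JSON has no tuple type.
actions = [
    {"keyboard": "", "mouse_rel_move": [3, 0], "mouse_scroll": 0},
    {"keyboard": "a", "mouse_rel_move": [0, -2], "mouse_scroll": 1},
]

buffer = io.StringIO()  # stands in for a real trace file
write_trace(actions, buffer)
buffer.seek(0)
trace = read_trace(buffer)
print(len(trace), trace[1]["action"]["keyboard"])
```

A line-per-frame layout keeps the trace appendable while an episode is running and easy to align with recorded screen frames by index.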

related work

openai vpt: https://arxiv.org/pdf/2206.11795.pdf

https://github.com/mlfoundations/open_clip

https://github.com/mlfoundations/open_flamingo/tree/main

other stuff

this was initially generated by cookiecutter and the https://github.com/flowpoint/pyproject template
