[ ] find a button
[x] click a button
[ ] click the right button of many
[x] drag a slider
[ ] click a slider
[x] press keyboard button
[ ] draw a figure
[ ] track a target with the mouse
[ ] press a button with timing
[ ] move a cursor object with keyboard
[ ] move the viewport with the keyboard
[ ] move the viewport with the mouse
[ ] drag select multiple objects with mouse
[ ] avoid object with cursor/mouse
[ ] copy text
[ ] write into textfield
[ ] paste text
[ ] drag the viewport with mouse
[ ] let the mouse get captured in a field
[ ] exit the mouse capture in a field
[ ] navigate through elements with keyboard
it's implemented as a gym environment, so you can use the standard gym.make, step and reset calls with the usual observations and actions
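a minimal usage sketch; the environment id "UiTasks-v0" is hypothetical and the gym>=0.26 / gymnasium-style reset/step signatures are assumed:

```python
import gym

# "UiTasks-v0" is a hypothetical id; the real id depends on how the package registers its environments
env = gym.make("UiTasks-v0")

obs, info = env.reset()            # assumes the gym>=0.26 / gymnasium-style API
print(obs["task_description"])     # textual task description for the agent / vlm
print(obs["screen"].shape)         # rgb pixels, e.g. (height, width, 3)

for _ in range(100):
    action = env.action_space.sample()   # random placeholder policy
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()

env.close()
```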
the observation space is:
"screen": rgb pixels, with an adjustable resolution
"task_description": a textual description of the task, or subtask with optionally textual hints/prompts for a vlm
the action space is:
"keyboard": printable ascii keys
"mouse_buttons": left, right, scroll-click
"mouse_rel_move": relative mouse movement
"mouse_abs_move": absolute mouse movement (set mouse position in px); note: absolute mouse movement is ignored for now
"mouse_scroll": relative mouse scroll-wheel movement
"semantic space": is an image of size screen, that represents semantic classes of ui elements: "text", "button", "image", ...
maybe we could use some kind of "ui-element-graph" to compress the screen semantics into a much more compact form
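a speculative sketch of such a ui-element-graph; the node/edge structure shown here is only one possible design:

```python
from dataclasses import dataclass, field

@dataclass
class UiNode:
    """One ui element as a graph node."""
    node_id: int
    cls: str                      # semantic class, e.g. "button", "text"
    bbox: tuple                   # (x, y, width, height) in screen pixels
    text: str = ""                # visible text, if any
    children: list = field(default_factory=list)  # ids of contained elements

# example: a window containing one button with a label
graph = {
    0: UiNode(0, "window", (0, 0, 640, 480), children=[1]),
    1: UiNode(1, "button", (50, 200, 100, 30), text="OK"),
}
```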
we also need a format for a "screen-action-trace" that records the actions taken at every frame
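a sketch of one possible per-frame trace record, serialized as json lines; the field names are only a suggestion:

```python
import json

# one record per frame: what the agent saw (by reference) and what it did
trace_record = {
    "frame": 1234,                         # frame index within the episode
    "timestamp": 41.2,                     # seconds since episode start
    "screen_ref": "frames/001234.png",     # path or key of the stored screen image
    "action": {
        "keyboard": [],                    # printable ascii keys pressed this frame
        "mouse_buttons": ["left"],
        "mouse_rel_move": [5, 0],
        "mouse_abs_move": None,            # ignored for now
        "mouse_scroll": 0,
    },
}

# a "screen-action-trace" is then one such record per frame, e.g. appended as jsonl
with open("trace.jsonl", "a") as f:
    f.write(json.dumps(trace_record) + "\n")
```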
OpenAI VPT: https://arxiv.org/pdf/2206.11795.pdf
https://github.com/mlfoundations/open_clip
https://github.com/mlfoundations/open_flamingo/tree/main
this project was initially generated with cookiecutter from the https://github.com/flowpoint/pyproject template