Starting off as a muggle, naïve to the world of math and data science.

ScreenAgent: A Vision Language Model-driven Computer Control Agent

ref: https://arxiv.org/pdf/2402.07945

github: https://github.com/niuzaisheng/ScreenAgent

Summary

The authors developed an end-to-end VLM agent by constructing a real desktop interaction environment over VNC, through which the agent perceives screenshots and issues mouse and keyboard actions.
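As a sketch of what this VNC bridge might look like, the snippet below maps a parsed action dict onto a client object. The `RecordingClient` stand-in and its method names are hypothetical; the actual ScreenAgent code implements its own VNC connector.

```python
# Sketch: translate the agent's JSON actions into VNC input events.
# The client interface here is hypothetical, not the project's real API.

def dispatch_action(client, action: dict) -> None:
    """Send one parsed action dict to a VNC client (drag/wait omitted for brevity)."""
    if action["action_type"] == "MouseAction":
        pos = action.get("mouse_position", {})
        kind = action["mouse_action_type"]
        if kind in ("click", "double_click"):
            clicks = 2 if kind == "double_click" else 1
            for _ in range(clicks):
                client.click(pos["width"], pos["height"], action["mouse_button"])
        elif kind == "move":
            client.move(pos["width"], pos["height"])
        elif kind in ("scroll_up", "scroll_down"):
            client.scroll(kind, action.get("scroll_repeat", 1))
    elif action["action_type"] == "KeyboardAction":
        if action["keyboard_action_type"] == "press":
            client.key_press(action["keyboard_key"])
        elif action["keyboard_action_type"] == "text":
            client.type_text(action["keyboard_text"])

class RecordingClient:
    """Stand-in that records calls instead of talking to a real VNC server."""
    def __init__(self):
        self.events = []
    def click(self, x, y, button): self.events.append(("click", x, y, button))
    def move(self, x, y): self.events.append(("move", x, y))
    def scroll(self, direction, n): self.events.append((direction, n))
    def key_press(self, key): self.events.append(("key", key))
    def type_text(self, text): self.events.append(("text", text))
```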

A UI automation process was introduced, with actions formalized as function calls and organized into a loop of planning, acting, and reflecting. Within the acting and reflecting phases, the agent decomposes tasks, executes actions, and decides whether to proceed, retry, or replan based on screen feedback.

Interaction traces of “screen–action” sequences were then collected to support evaluation with fine-grained metrics that emphasize action correctness and UI precision, and later to train a new model whose performance approaches that of GPT-4V.

Unlike HTML-dependent approaches, this method operates directly on pixels and generic input events, targeting device-agnostic UI agents.
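The three-phase loop described above can be sketched as a small state machine. The model calls are stubbed out: `plan_fn`, `act_fn`, and `reflect_fn` are hypothetical placeholders for prompting the VLM with a screenshot at each step.

```python
# Minimal sketch of the planning-acting-reflecting control loop.
# plan_fn() -> list of subtask strings; act_fn(subtask) executes it;
# reflect_fn(subtask) -> one of the three "situation" strings below.

def run_mission(plan_fn, act_fn, reflect_fn, max_steps=20):
    """Drive one mission; return True if every subtask succeeded."""
    plan = plan_fn()                       # planning phase
    i, steps = 0, 0
    while i < len(plan) and steps < max_steps:
        steps += 1
        act_fn(plan[i])                    # acting phase
        verdict = reflect_fn(plan[i])      # reflecting phase
        if verdict == "sub_task_success":
            i += 1                         # proceed to the next subtask
        elif verdict == "need_retry":
            continue                       # retry the same subtask
        elif verdict == "need_reformulate":
            plan, i = plan_fn(), 0         # replan from scratch
    return i == len(plan)
```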


Planning phase prompt
You are familiar with the Linux operating system.
You can see a computer screen with height: {{ video_height }}, width: {{ video_width }}, and the current task is "{{ task_prompt }}", you need to give a plan to accomplish this goal.

Please output your plan in json format, e.g. my task is to search the web for "What’s the deal with the Wheat Field Circle?", the steps to disassemble this task are:
```json
[
   {"action_type": "PlanAction", "element": "Open web browser."},
   {"action_type": "PlanAction", "element": "Search in your browser for \"What’s the deal with the Wheat Field Circle?\""},
   {"action_type": "PlanAction", "element": "Open the first search result"},
   {"action_type": "PlanAction", "element": "Browse the content of the page"},
   {"action_type": "PlanAction", "element": "Answer the question \"What’s the deal with the Wheat Field Circle?\" according to the content."}
]
```

Another example, my task is "Write a brief paragraph about artificial intelligence in a notebook", the steps to disassemble this task are:
```json
[
   {"action_type": "PlanAction", "element": "Open Notebook"},
   {"action_type": "PlanAction", "element": "Write a brief paragraph about AI in the notebook"}
]
```

{% if advice_ %}
Here are some suggestions for making a plan: {{ advice_ }}
{% endif %}

Now, your current task is "{{ task_prompt }}", give the disassembly steps of the task based on the state of the existing screen image.
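The templates above use Jinja2 placeholder syntax (`{{ ... }}`, `{% if %}`). A minimal sketch of rendering one, reproducing only a fragment of the planning template:

```python
# Rendering a fragment of the planning-phase template with Jinja2.
from jinja2 import Template

planning_template = Template(
    'You can see a computer screen with height: {{ video_height }}, '
    'width: {{ video_width }}, and the current task is "{{ task_prompt }}".'
    '{% if advice_ %} Suggestions: {{ advice_ }}{% endif %}'
)

prompt = planning_template.render(
    video_height=768, video_width=1024,
    task_prompt="Open a web browser", advice_=None,
)
```

With `advice_=None` the `{% if advice_ %}` block is skipped, matching the optional advice section of the template.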

Acting phase prompt
You’re very familiar with the Linux operating system and UI operations. Now you need to use the Linux operating system to complete a mission.
Your goal now is to manipulate a computer screen, video width: {{ video_width }}, video height: {{ video_height }}, the overall mission is: "{{ task_prompt }}".

We have developed an implementation plan for this overall mission:
{% for item in sub_task_list %}
   {{ loop.index }}. {{ item }}
{% endfor %}

The current subtask is "{{ current_task }}".

You can use the mouse and keyboard, the optional actions are:
```json
[
   {"action_type":"MouseAction","mouse_action_type":"click","mouse_button":"left","mouse_position":{"width":int,"height":int} },
   {"action_type":"MouseAction","mouse_action_type":"double_click","mouse_button":"left","mouse_position":{"width":int,"height":int} },
   {"action_type":"MouseAction","mouse_action_type":"scroll_up","scroll_repeat":int},
   {"action_type":"MouseAction","mouse_action_type":"scroll_down","scroll_repeat":int},
   {"action_type":"MouseAction","mouse_action_type":"move","mouse_position":{"width":int, "height":int} },
   {"action_type":"MouseAction","mouse_action_type":"drag","mouse_button":"left","mouse_position":{"width":int,"height":int} },
   {"action_type":"KeyboardAction","keyboard_action_type":"press","keyboard_key":"KeyName in keysymdef"},
   {"action_type":"KeyboardAction","keyboard_action_type":"press","keyboard_key":"Ctrl+A"},
   {"action_type":"KeyboardAction","keyboard_action_type":"text","keyboard_text": "Hello, world!"},
   {"action_type":"WaitAction","wait_time":float}
]
```

Where the mouse position is relative to the top-left corner of the screen, and the keyboard keys are described in [keysymdef.h].

Please output your execution actions in json format, e.g. my plan is to click the Start button, it’s on the left bottom corner, so my action will be:
```json
[
   {"action_type":"MouseAction","mouse_action_type":"click","mouse_button":"left","mouse_position":{"width":10,"height":760} }
]
```

Another example, my plan is to open Notepad and I see Mousepad app on the screen, so my action will be:
```json
[
   {"action_type":"MouseAction","mouse_action_type":"double_click","mouse_button":"left","mouse_position":{"width":60,"height":135} }
]
```

{% if advice_ %}
Here are some suggestions for performing this subtask: "{{ advice_ }}".
{% endif %}

The current subtask is "{{ current_task }}", please give the detailed next actions based on the state of the existing screen image.
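Since the prompts ask the model to surround its JSON output with fenced code blocks, the controller has to recover the action list from a free-text reply. A plausible sketch (not the project's actual parser):

```python
# Sketch: extract the fenced ```json block from a model reply and parse
# it into a list of action dicts.
import json
import re

FENCE = re.compile(r"```json\s*(.*?)\s*```", re.DOTALL)

def parse_actions(reply: str) -> list:
    match = FENCE.search(reply)
    payload = match.group(1) if match else reply  # fall back to the raw text
    actions = json.loads(payload)
    if isinstance(actions, dict):                 # single action, not a list
        actions = [actions]
    return actions
```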

Reflecting phase prompt
You’re very familiar with the Linux operating system and UI operations. 
Your current goal is to act as a reward model to judge whether or not this image meets the goal, video width: {{ video_width }}, video height: {{ video_height }}, the overall mission is: "{{ task_prompt }}".

We have developed an implementation plan for this overall mission:

{% for item in sub_task_list %}
   {{ loop.index }}. {{ item }}
{% endfor %}

Now the current subtask is: "{{ current_task }}".

Please describe whether or not this image meets the current subtask, and answer in json format:
Here are a few options, if you think the current subtask is done well, then output this:
```json 
{"action_type":"EvaluateSubTaskAction", "situation": "sub_task_success"} 
```

The mission will go on.

If you think the current subtask is not done well, need to retry, then output this:
```json 
{"action_type":"EvaluateSubTaskAction", "situation": "need_retry", "advice": "I don’t think you’re clicking in the right place."} 
```

You can give some suggestions for implementation improvements in the "advice" field.
If you feel that the whole plan does not match the current situation and you need to reformulate the implementation plan, please output:
```json 
{"action_type":"EvaluateSubTaskAction", "situation": "need_reformulate", "advice": "I think the current plan is not suitable for the current situation, because the system does not have .... installed"} 
```

You can give some suggestions for reformulating the plan in the "advice" field. Please surround the json output with the symbols "```json" and "```".

The current goal is: "{{ task_prompt }}", please describe in json format whether or not this image meets the goal, and whether or not our mission can continue.
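A controller consuming this reflection verdict might map the three situations onto transitions of the outer loop. A sketch; the transition names here are my own, not the paper's:

```python
# Sketch: map an EvaluateSubTaskAction verdict onto the controller's
# next move, carrying the optional advice string back into the prompt.
import json

def next_move(verdict_json: str):
    """Return (transition, advice) for the planning/acting loop."""
    verdict = json.loads(verdict_json)
    assert verdict["action_type"] == "EvaluateSubTaskAction"
    transitions = {
        "sub_task_success": "advance",   # move on to the next subtask
        "need_retry": "retry",           # re-run the acting phase with advice
        "need_reformulate": "replan",    # go back to the planning phase
    }
    return transitions[verdict["situation"]], verdict.get("advice", "")
```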