1. Preparing the config.yaml file
The model used here is from ByteDance (Volcengine): https://www.volcengine.com/
MODEL: "OpenAI" # The type of multi-modal LLM you would like to use to power the AppAgent, must be either OpenAI or Qwen
OPENAI_API_BASE: "https://ark.cn-beijing.volces.com/api/v3/chat/completions" # Full chat-completions endpoint; the code below POSTs to this URL as-is
OPENAI_API_KEY: "your api key" # The API key for the endpoint above
OPENAI_API_MODEL: "your model name" # Must be a model that accepts image input
MAX_TOKENS: 300 # The max token limit for the response completion
TEMPERATURE: 0.0 # The temperature of the model: the lower the value, the more consistent the output of the model
REQUEST_INTERVAL: 10 # Time in seconds between consecutive GPT-4V requests
DASHSCOPE_API_KEY: "sk-" # The dashscope API key that gives you access to Qwen-VL model
QWEN_MODEL: "qwen-vl-max"
ANDROID_SCREENSHOT_DIR: "/sdcard" # Set the directory on your Android device to store the intermediate screenshots. Make sure the directory EXISTS on your phone!
ANDROID_XML_DIR: "/sdcard" # Set the directory on your Android device to store the intermediate XML files used for determining locations of UI elements on your screen. Make sure the directory EXISTS on your phone!
DOC_REFINE: false # Set this to true will make the agent refine existing documentation based on the latest demonstration; otherwise, the agent will not regenerate a new documentation for elements with the same resource ID.
MAX_ROUNDS: 20 # Set the round limit for the agent to complete the task
DARK_MODE: false # Set this to true if your app is in dark mode to enhance the element labeling
MIN_DIST: 30 # The minimum distance between elements to prevent overlapping during the labeling process
2. Loading the configuration file
import os
import yaml


def load_config(config_path="./config.yaml"):
    # Start from the environment variables, then let the YAML values override them
    configs = dict(os.environ)
    with open(config_path, "r") as file:
        yaml_data = yaml.safe_load(file)
    configs.update(yaml_data)
    return configs


configs = load_config()
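One detail of load_config is worth spelling out: the dictionary starts from the environment variables and the YAML values are applied on top, so a key in config.yaml wins over an environment variable with the same name. The sketch below isolates that merge step in a hypothetical helper, merge_configs, which is not part of the original code. It also guards against an empty YAML file, for which yaml.safe_load returns None.

```python
from typing import Any, Dict, Mapping, Optional


def merge_configs(env: Mapping[str, str], yaml_data: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Hypothetical helper: merge environment variables with YAML settings.

    YAML values override environment variables that share a key,
    mirroring the order used in load_config.
    """
    configs: Dict[str, Any] = dict(env)
    # An empty YAML document parses to None, so fall back to an empty dict
    configs.update(yaml_data or {})
    return configs


# Made-up values, only to show the precedence
env = {"OPENAI_API_KEY": "from-env", "HOME": "/home/user"}
yaml_data = {"OPENAI_API_KEY": "from-yaml", "MAX_TOKENS": 300}
merged = merge_configs(env, yaml_data)
print(merged["OPENAI_API_KEY"])  # prints "from-yaml"
```

Keys that appear only in the environment (HOME here) are carried over unchanged.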
3. Defining the model
# Import the required modules
from abc import ABC, abstractmethod  # For defining an abstract base class and its methods
from typing import List, Tuple  # Type annotations
import requests  # For sending HTTP requests
from utils import print_with_color, encode_image  # Custom helper functions


# Abstract base class that every model wrapper inherits from
class BaseModel(ABC):
    def __init__(self):
        pass

    # Abstract method: subclasses must implement it
    @abstractmethod
    def get_model_response(self, prompt: str, images: List[str]) -> Tuple[bool, str]:
        pass


# OpenAIModel inherits from BaseModel and calls an OpenAI-compatible endpoint
class OpenAIModel(BaseModel):
    def __init__(
        self,
        base_url: str,  # URL of the chat-completions endpoint
        api_key: str,  # API key used for authentication
        model: str,  # Name of the model to call
        temperature: float,  # Controls the randomness of the generated text
        max_tokens: int,  # Maximum number of tokens to generate
    ):
        super().__init__()  # Call the parent initializer
        self.base_url = base_url
        self.api_key = api_key
        self.model = model
        self.temperature = temperature
        self.max_tokens = max_tokens

    # Implement the abstract method: returns (success flag, text or error message)
    def get_model_response(self, prompt: str, images: List[str]) -> Tuple[bool, str]:
        # Build the message content, starting with the text prompt
        content = [{"type": "text", "text": prompt}]
        # Encode each image as base64 and append it to the content
        for img in images:
            base64_img = encode_image(img)
            content.append(
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{base64_img}"},
                }
            )
        # Request headers: content type and authentication
        headers = {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {self.api_key}",
        }
        # Request body: model name, messages, temperature, and max token count
        payload = {
            "model": self.model,
            "messages": [{"role": "user", "content": content}],
            "temperature": self.temperature,
            "max_tokens": self.max_tokens,
        }
        # Send the POST request and parse the JSON response
        response = requests.post(self.base_url, headers=headers, json=payload).json()
        # Check whether the response contains an error
        if "error" not in response:
            # No error: read the token usage and print an estimated cost.
            # The per-1K-token rates are hard-coded and may not match your provider's pricing.
            usage = response["usage"]
            prompt_tokens = usage["prompt_tokens"]
            completion_tokens = usage["completion_tokens"]
            cost = prompt_tokens / 1000 * 0.01 + completion_tokens / 1000 * 0.03
            print_with_color(f"Request cost is ${cost:.2f}", "yellow")
        else:
            # On error, return False and the error message
            return False, response["error"]["message"]
        # Return True and the text generated by the model
        return True, response["choices"][0]["message"]["content"]
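The encode_image helper imported from utils is not shown in this post. Assuming it simply reads the file and returns its bytes as a base64 string, which is what the data:image/png;base64,... URL needs, a minimal stand-in might look like the sketch below. The names encode_image_sketch and build_image_part are made up for illustration, and the demo uses a throwaway file rather than a real screenshot.

```python
import base64
import os
import tempfile


def encode_image_sketch(image_path: str) -> str:
    """Read a file and return its contents as a base64-encoded string."""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


def build_image_part(image_path: str) -> dict:
    """Wrap an encoded image in the content entry built by get_model_response."""
    encoded = encode_image_sketch(image_path)
    return {
        "type": "image_url",
        "image_url": {"url": f"data:image/png;base64,{encoded}"},
    }


# Write a few placeholder bytes to a temporary file to stand in for a screenshot
with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as tmp:
    tmp.write(b"\x89PNG fake bytes")
    tmp_path = tmp.name

part = build_image_part(tmp_path)
os.remove(tmp_path)
print(part["image_url"]["url"][:22])
```

The image travels inside the JSON body itself, so large screenshots make for large requests.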
4. Instantiating the model
if configs["MODEL"] == "OpenAI":
    mllm = OpenAIModel(base_url=configs["OPENAI_API_BASE"],
                       api_key=configs["OPENAI_API_KEY"],
                       model=configs["OPENAI_API_MODEL"],
                       temperature=configs["TEMPERATURE"],
                       max_tokens=configs["MAX_TOKENS"])
5. Example
Example image: a labeled screenshot (1_before_labeled.png) from a self-exploration demo.
# The prompt string "解释这张图片" means "Explain this image"
result = mllm.get_model_response("解释这张图片", [r"D:\mobile agent\AppAgent\scripts\apps\gaode\demos\self_explore_2024-12-26_03-10-14\1_before_labeled.png"])
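result
Request cost is $0.02
(True,
 'This image shows the home screen of a phone. The wallpaper is a small golden dragon, rendered very realistically with sharp horns and scales, looking both mysterious and cute. There are several app icons on the screen, arranged in a grid.\n\nAt the top of the screen the current time is shown as 3:10, and the top-right corner has some status icons, including Bluetooth, an HD indicator, and the battery level (45%). The app icons are described in detail below:\n\n- **Amap (icon 21)**: in the top-left corner; the icon combines blue and yellow and is shaped like a paper plane.\n- **Clock (icon 8)**: below Amap; the icon combines black and orange and shows a clock.\n- **Calculator (icon 10)**: below Clock; the icon combines black and orange and shows a calculator.\n- **Settings (icon 11)**: to the right of Calculator; the icon combines black and orange and shows a gear.\n- **Notes (icon 9)**: to the right of Clock; the icon combines black and orange and carries the word "Note".\n- **Calendar (icon 12)**: to the right of Settings; the icon combines black and orange and shows a calendar.\n- **Weather (icon 13)**: to the right of Calendar; the icon combines black and orange and shows a weather symbol.\n\nAt the bottom of the screen there are also several')
The model replied in Chinese; the text above is an English translation. The reply stops mid-sentence, most likely because it hit the MAX_TOKENS limit of 300.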
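The cost line in the output comes from a fixed formula inside get_model_response: prompt tokens are charged at $0.01 per 1,000 and completion tokens at $0.03 per 1,000. Those rates are hard-coded and will not necessarily match what your provider charges, so the figure is best read as a rough indicator. A small sketch with the rates exposed as parameters follows; estimate_cost is a hypothetical helper and the token counts are invented for the example.

```python
def estimate_cost(
    prompt_tokens: int,
    completion_tokens: int,
    prompt_rate: float = 0.01,      # USD per 1,000 prompt tokens (value hard-coded in the post)
    completion_rate: float = 0.03,  # USD per 1,000 completion tokens (value hard-coded in the post)
) -> float:
    """Estimate the cost of one request from its token usage."""
    return prompt_tokens / 1000 * prompt_rate + completion_tokens / 1000 * completion_rate


# Invented token counts, only to show the arithmetic: 0.012 + 0.009 = 0.021
cost = estimate_cost(1200, 300)
print(f"Request cost is ${cost:.2f}")  # prints "Request cost is $0.02"
```

To reflect a different price list, pass that provider's per-1,000-token rates as prompt_rate and completion_rate.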
6. AppAgent prompts
6.1 Prompt template
self_explore_task_template = """You are an agent that is trained to complete certain tasks on a smartphone. You will be
given a screenshot of a smartphone app. The interactive UI elements on the screenshot are labeled with numeric tags
starting from 1.
You can call the following functions to interact with those labeled elements to control the smartphone:
1. tap(element: int)
This function is used to tap an UI element shown on the smartphone screen.
"element" is a numeric tag assigned to an UI element shown on the smartphone screen.
A simple use case can be tap(5), which taps the UI element labeled with the number 5.
2. text(text_input: str)
This function is used to insert text input in an input field/box. text_input is the string you want to insert and must
be wrapped with double quotation marks. A simple use case can be text("Hello, world!"), which inserts the string
"Hello, world!" into the input area on the smartphone screen. This function is only callable when you see a keyboard
showing in the lower half of the screen.
3. long_press(element: int)
This function is used to long press an UI element shown on the smartphone screen.
"element" is a numeric tag assigned to an UI element shown on the smartphone screen.
A simple use case can be long_press(5), which long presses the UI element labeled with the number 5.
4. swipe(element: int, direction: str, dist: str)
This function is used to swipe an UI element shown on the smartphone screen, usually a scroll view or a slide bar.
"element" is a numeric tag assigned to an UI element shown on the smartphone screen. "direction" is a string that
represents one of the four directions: up, down, left, right. "direction" must be wrapped with double quotation
marks. "dist" determines the distance of the swipe and can be one of the three options: short, medium, long. You should
choose the appropriate distance option according to your need.
A simple use case can be swipe(21, "up", "medium"), which swipes up the UI element labeled with the number 21 for a
medium distance.
The task you need to complete is to <task_description>. Your past actions to proceed with this task are summarized as
follows: <last_act>
Now, given the following labeled screenshot, you need to think and call the function needed to proceed with the task.
Your output should include three parts in the given format:
Observation: <Describe what you observe in the image>
Thought: <To complete the given task, what is the next step I should do>
Action: <The function call with the correct parameters to proceed with the task. If you believe the task is completed or
there is nothing to be done, you should output FINISH. You cannot output anything else except a function call or FINISH
in this field.>
Summary: <Summarize your past actions along with your latest action in one or two sentences. Do not include the numeric
tag in your summary>
You can only take one action at a time, so please directly call the function."""
6.2 Adding the task to the prompt
task_desc = "open the setting"
import re
last_act = "None"
prompt = re.sub(r"<task_description>", task_desc, self_explore_task_template)
prompt = re.sub(r"<last_act>", last_act, prompt)
print(prompt)
You are an agent that is trained to complete certain tasks on a smartphone. You will be
given a screenshot of a smartphone app. The interactive UI elements on the screenshot are labeled with numeric tags
starting from 1.
You can call the following functions to interact with those labeled elements to control the smartphone:
1. tap(element: int)
This function is used to tap an UI element shown on the smartphone screen.
"element" is a numeric tag assigned to an UI element shown on the smartphone screen.
A simple use case can be tap(5), which taps the UI element labeled with the number 5.
2. text(text_input: str)
This function is used to insert text input in an input field/box. text_input is the string you want to insert and must
be wrapped with double quotation marks. A simple use case can be text("Hello, world!"), which inserts the string
"Hello, world!" into the input area on the smartphone screen. This function is only callable when you see a keyboard
showing in the lower half of the screen.
3. long_press(element: int)
This function is used to long press an UI element shown on the smartphone screen.
"element" is a numeric tag assigned to an UI element shown on the smartphone screen.
A simple use case can be long_press(5), which long presses the UI element labeled with the number 5.
4. swipe(element: int, direction: str, dist: str)
This function is used to swipe an UI element shown on the smartphone screen, usually a scroll view or a slide bar.
"element" is a numeric tag assigned to an UI element shown on the smartphone screen. "direction" is a string that
represents one of the four directions: up, down, left, right. "direction" must be wrapped with double quotation
marks. "dist" determines the distance of the swipe and can be one of the three options: short, medium, long. You should
choose the appropriate distance option according to your need.
A simple use case can be swipe(21, "up", "medium"), which swipes up the UI element labeled with the number 21 for a
medium distance.
The task you need to complete is to open the setting. Your past actions to proceed with this task are summarized as
follows: None
Now, given the following labeled screenshot, you need to think and call the function needed to proceed with the task.
Your output should include three parts in the given format:
Observation: <Describe what you observe in the image>
Thought: <To complete the given task, what is the next step I should do>
Action: <The function call with the correct parameters to proceed with the task. If you believe the task is completed or
there is nothing to be done, you should output FINISH. You cannot output anything else except a function call or FINISH
in this field.>
Summary: <Summarize your past actions along with your latest action in one or two sentences. Do not include the numeric
tag in your summary>
You can only take one action at a time, so please directly call the function.
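Because <task_description> and <last_act> are fixed literal placeholders, the substitution does not strictly need regular expressions. One caveat with re.sub is that backslashes in the replacement string are interpreted as escape sequences, which can misfire if a task description happens to contain one. The sketch below uses str.replace instead; fill_prompt and the shortened template are illustrative and not from the original code.

```python
def fill_prompt(template: str, task_desc: str, last_act: str = "None") -> str:
    """Substitute the two literal placeholders without any regex processing."""
    return template.replace("<task_description>", task_desc).replace("<last_act>", last_act)


# Shortened stand-in for self_explore_task_template
template = (
    "The task you need to complete is to <task_description>. "
    "Your past actions to proceed with this task are summarized as follows: <last_act>"
)

prompt = fill_prompt(template, "open the setting")
print(prompt)
```

With str.replace, a task description such as r"type C:\new" is inserted character for character, whereas re.sub would turn the \n into a newline.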
6.3 Running it
result = mllm.get_model_response(prompt, [r"D:\mobile agent\AppAgent\scripts\apps\gaode\demos\self_explore_2024-12-26_03-10-14\1_before_labeled.png"])
result
Request cost is $0.02
(True,
'Observation: The screenshot shows a smartphone home screen with a dragon image as the wallpaper. There are several app icons arranged in a grid pattern, including AMap, Clock, Notes, Calculator, Settings, Calendar, Weather, and others. The time displayed at the top left corner is 3:10, and the battery level is 45%.\nThought: To open the settings, I need to tap on the Settings app icon which is labeled with the number 11.\nAction: tap(11)\nSummary: I tapped on the Settings app icon to open the settings.')
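To act on a reply like this, the agent has to pull the Action field out of the text and turn it into a function name and arguments. The sketch below shows one way to do that with a regular expression plus ast.literal_eval. parse_action is a hypothetical helper written for this example, not taken from the AppAgent source, and it only covers the four functions and FINISH described in the prompt.

```python
import ast
import re
from typing import List, Optional, Tuple


def parse_action(response_text: str) -> Optional[Tuple[str, List[object]]]:
    """Return (function name, arguments) from the Action line, or None if it cannot be parsed."""
    match = re.search(r"^Action:\s*(.+)$", response_text, re.MULTILINE)
    if match is None:
        return None
    action = match.group(1).strip()
    if action == "FINISH":
        return "FINISH", []
    # Accept only the functions listed in the prompt
    call = re.fullmatch(r"(tap|text|long_press|swipe)\((.*)\)", action)
    if call is None:
        return None
    name, raw_args = call.group(1), call.group(2)
    try:
        # Parse the argument list as a tuple literal, e.g. '21, "up", "medium"'
        args = ast.literal_eval(f"({raw_args},)")
    except (ValueError, SyntaxError):
        return None
    return name, list(args)


# Shortened version of the reply shown above
reply = (
    "Observation: The screenshot shows a smartphone home screen.\n"
    "Thought: I need to tap on the Settings app icon labeled 11.\n"
    "Action: tap(11)\n"
    "Summary: I tapped on the Settings app icon to open the settings."
)
print(parse_action(reply))  # prints ('tap', [11])
```

Using ast.literal_eval rather than eval means only plain literals (numbers and quoted strings) are accepted as arguments, so a malformed or unexpected Action line yields None instead of executing anything.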
From: https://blog.csdn.net/qq_41472205/article/details/144751613