AppAgent source code (the OpenAIModel class)

Date: 2024-12-26 19:56:06
Tags: AppAgent, smartphone, self, element, source code, UI, OpenAIModel, screen, icon

1. Preparing config.yaml

The model used here is from ByteDance, served through Volcengine: https://www.volcengine.com/

MODEL: "OpenAI"  # The type of multi-modal LLM you would like to use to power the AppAgent, must be either OpenAI or Qwen

OPENAI_API_BASE: "https://ark.cn-beijing.volces.com/api/v3/chat/completions"
OPENAI_API_KEY: "your api key"  # Set the value to sk-xxx if you host the openai interface for open llm model
OPENAI_API_MODEL: "your model name"  # The only OpenAI model by now that accepts visual input
MAX_TOKENS: 300  # The max token limit for the response completion
TEMPERATURE: 0.0  # The temperature of the model: the lower the value, the more consistent the output of the model
REQUEST_INTERVAL: 10  # Time in seconds between consecutive GPT-4V requests

DASHSCOPE_API_KEY: "sk-"  # The dashscope API key that gives you access to Qwen-VL model
QWEN_MODEL: "qwen-vl-max"

ANDROID_SCREENSHOT_DIR: "/sdcard"  # Set the directory on your Android device to store the intermediate screenshots. Make sure the directory EXISTS on your phone!
ANDROID_XML_DIR: "/sdcard"  # Set the directory on your Android device to store the intermediate XML files used for determining locations of UI elements on your screen. Make sure the directory EXISTS on your phone!

DOC_REFINE: false  # Setting this to true makes the agent refine existing documentation based on the latest demonstration; otherwise, the agent will not regenerate documentation for elements with the same resource ID.
MAX_ROUNDS: 20  # Set the round limit for the agent to complete the task
DARK_MODE: false  # Set this to true if your app is in dark mode to enhance the element labeling
MIN_DIST: 30  # The minimum distance between elements to prevent overlapping during the labeling process

2. Loading the configuration file

import os
import yaml


def load_config(config_path="./config.yaml"):
    configs = dict(os.environ)
    with open(config_path, "r") as file:
        yaml_data = yaml.safe_load(file)
    configs.update(yaml_data)
    return configs

configs = load_config()
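
Because `load_config` layers the YAML values on top of `os.environ`, a missing or misspelled key only shows up later as a `KeyError`. A minimal sketch of an early check is given below. `validate_config` and `REQUIRED_KEYS` are hypothetical names that are not part of AppAgent, and the key list is an assumption based on the config shown in section 1.

```python
# Hypothetical helper, not part of AppAgent: report incomplete configs early.
REQUIRED_KEYS = {
    "MODEL": str,
    "OPENAI_API_BASE": str,
    "OPENAI_API_KEY": str,
    "OPENAI_API_MODEL": str,
    "MAX_TOKENS": int,
    "TEMPERATURE": (int, float),  # YAML parses 0 as int and 0.0 as float
}


def validate_config(configs: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for key, expected in REQUIRED_KEYS.items():
        if key not in configs:
            problems.append(f"missing key: {key}")
        elif not isinstance(configs[key], expected):
            problems.append(
                f"{key} has unexpected type {type(configs[key]).__name__}"
            )
    # The config comment states MODEL must be either OpenAI or Qwen.
    if configs.get("MODEL") not in ("OpenAI", "Qwen"):
        problems.append("MODEL must be either OpenAI or Qwen")
    return problems
```

Calling `validate_config(configs)` right after `load_config()` and stopping when the list is non-empty keeps configuration mistakes separate from API errors.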

3. Model definition

# Import the required modules
from abc import ABC, abstractmethod  # for declaring an abstract base class and its abstract methods
from typing import List, Tuple  # type annotations for lists and tuples
import requests  # for sending HTTP requests
from utils import print_with_color, encode_image  # project helper functions


# Abstract base class; inheriting from ABC is what makes @abstractmethod enforceable
class BaseModel(ABC):
    def __init__(self):
        pass

    # Abstract method that every subclass must implement
    @abstractmethod
    def get_model_response(self, prompt: str, images: List[str]) -> Tuple[bool, str]:
        pass


# OpenAIModel, a subclass of BaseModel
class OpenAIModel(BaseModel):
    def __init__(
        self,
        base_url: str,  # URL of the chat completions endpoint
        api_key: str,  # API key used for authentication
        model: str,  # name of the model to call
        temperature: float,  # controls the randomness of the generated text
        max_tokens: int,  # maximum number of tokens in the completion
    ):
        super().__init__()  # call the parent initializer
        self.base_url = base_url
        self.api_key = api_key
        self.model = model
        self.temperature = temperature
        self.max_tokens = max_tokens

    # Implementation of the abstract method: query the model and return its reply
    def get_model_response(self, prompt: str, images: List[str]) -> Tuple[bool, str]:
        # Build the message content, starting with the text prompt
        content = [{"type": "text", "text": prompt}]
        # Encode every image as Base64 and append it to the content as a data URL
        for img in images:
            base64_img = encode_image(img)
            content.append(
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{base64_img}"},
                }
            )
        # Request headers: content type and bearer authentication
        headers = {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {self.api_key}",
        }
        # Request body: model name, messages, temperature, and token limit
        payload = {
            "model": self.model,
            "messages": [{"role": "user", "content": content}],
            "temperature": self.temperature,
            "max_tokens": self.max_tokens,
        }
        # Send the POST request and decode the JSON response
        response = requests.post(self.base_url, headers=headers, json=payload).json()
        # Check whether the response carries an error
        if "error" not in response:
            # No error: read the token usage and print the estimated request cost
            # (the per-1,000-token rates are hard-coded here)
            usage = response["usage"]
            prompt_tokens = usage["prompt_tokens"]
            completion_tokens = usage["completion_tokens"]
            cost = prompt_tokens / 1000 * 0.01 + completion_tokens / 1000 * 0.03
            print_with_color(f"Request cost is ${cost:.2f}", "yellow")
        else:
            # Error: return False together with the error message
            return False, response["error"]["message"]
        # Return True together with the text generated by the model
        return True, response["choices"][0]["message"]["content"]
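
`encode_image` is imported from `utils` but is not shown in this article. A plausible sketch, assuming it simply Base64-encodes the raw file bytes, is given below together with a hypothetical `build_content` helper that mirrors how `get_model_response` assembles the message content. Neither function is claimed to be AppAgent's actual implementation.

```python
import base64
from typing import Dict, List


def encode_image(image_path: str) -> str:
    """Assumed behavior: read the file and return its Base64 text."""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


def build_content(prompt: str, images: List[str]) -> List[Dict]:
    """Hypothetical helper: the text part first, then one data URL per image."""
    content = [{"type": "text", "text": prompt}]
    for img in images:
        content.append(
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{encode_image(img)}"},
            }
        )
    return content
```

Separating the content construction from the HTTP call makes it possible to inspect the request body without spending tokens.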

4. Instantiating the model

if configs["MODEL"] == "OpenAI":
    mllm = OpenAIModel(base_url=configs["OPENAI_API_BASE"],
                       api_key=configs["OPENAI_API_KEY"],
                       model=configs["OPENAI_API_MODEL"],
                       temperature=configs["TEMPERATURE"],
                       max_tokens=configs["MAX_TOKENS"])

5. Example

Example image: [image not included]

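result = mllm.get_model_response("Explain this image", [r"D:\mobile agent\AppAgent\scripts\apps\gaode\demos\self_explore_2024-12-26_03-10-14\1_before_labeled.png"])
result
Request cost is $0.02

(True,
 'This image shows the home screen of a phone. The background is a small golden dragon, rendered very realistically with sharp horns and scales, looking both mysterious and cute. There are several app icons on the screen, arranged in a grid.\n\nThe top of the screen shows the current time, 3:10, and the upper right corner has some status icons, including Bluetooth, an HD indicator, and the battery level (45%). Detailed descriptions of the app icons follow:\n\n- **Amap (icon 21)**: in the upper left corner; the icon combines blue and yellow and is shaped like a paper plane.\n- **Clock (icon 8)**: below Amap; the icon combines black and orange, with a clock pattern.\n- **Calculator (icon 10)**: below Clock; the icon combines black and orange, with a calculator pattern.\n- **Settings (icon 11)**: to the right of Calculator; the icon combines black and orange, with a gear pattern.\n- **Notes (icon 9)**: to the right of Clock; the icon combines black and orange, with the word “Note”.\n- **Calendar (icon 12)**: to the right of Settings; the icon combines black and orange, with a calendar pattern.\n- **Weather (icon 13)**: to the right of Calendar; the icon combines black and orange, with a weather pattern.\n\nAt the bottom of the screen there are also several')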
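
The `Request cost is $0.02` line comes from the formula inside `get_model_response`: $0.01 per 1,000 prompt tokens plus $0.03 per 1,000 completion tokens. These rates are hard-coded and appear to be GPT-4V-era prices, so they probably do not match what another provider actually bills. The sketch below isolates the arithmetic; `estimate_cost` is a hypothetical helper, and the token counts are illustrative values, not the ones from the run above.

```python
def estimate_cost(
    prompt_tokens: int,
    completion_tokens: int,
    prompt_rate: float = 0.01,
    completion_rate: float = 0.03,
) -> float:
    """Cost in dollars; both rates are per 1,000 tokens."""
    return (
        prompt_tokens / 1000 * prompt_rate
        + completion_tokens / 1000 * completion_rate
    )


# Illustrative numbers only: 1,000 prompt tokens and 300 completion tokens
# give 0.010 + 0.009 = 0.019 dollars.
print(f"Request cost is ${estimate_cost(1000, 300):.2f}")
# prints: Request cost is $0.02
```

Passing the provider's real rates as `prompt_rate` and `completion_rate` would make the printed figure meaningful for models other than the one the constants were written for.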

6. AppAgent prompts

6.1 Prompt template

self_explore_task_template = """You are an agent that is trained to complete certain tasks on a smartphone. You will be 
given a screenshot of a smartphone app. The interactive UI elements on the screenshot are labeled with numeric tags 
starting from 1. 

You can call the following functions to interact with those labeled elements to control the smartphone:

1. tap(element: int)
This function is used to tap an UI element shown on the smartphone screen.
"element" is a numeric tag assigned to an UI element shown on the smartphone screen.
A simple use case can be tap(5), which taps the UI element labeled with the number 5.

2. text(text_input: str)
This function is used to insert text input in an input field/box. text_input is the string you want to insert and must 
be wrapped with double quotation marks. A simple use case can be text("Hello, world!"), which inserts the string 
"Hello, world!" into the input area on the smartphone screen. This function is only callable when you see a keyboard 
showing in the lower half of the screen.

3. long_press(element: int)
This function is used to long press an UI element shown on the smartphone screen.
"element" is a numeric tag assigned to an UI element shown on the smartphone screen.
A simple use case can be long_press(5), which long presses the UI element labeled with the number 5.

4. swipe(element: int, direction: str, dist: str)
This function is used to swipe an UI element shown on the smartphone screen, usually a scroll view or a slide bar.
"element" is a numeric tag assigned to an UI element shown on the smartphone screen. "direction" is a string that 
represents one of the four directions: up, down, left, right. "direction" must be wrapped with double quotation 
marks. "dist" determines the distance of the swipe and can be one of the three options: short, medium, long. You should 
choose the appropriate distance option according to your need.
A simple use case can be swipe(21, "up", "medium"), which swipes up the UI element labeled with the number 21 for a 
medium distance.

The task you need to complete is to <task_description>. Your past actions to proceed with this task are summarized as 
follows: <last_act>
Now, given the following labeled screenshot, you need to think and call the function needed to proceed with the task. 
Your output should include three parts in the given format:
Observation: <Describe what you observe in the image>
Thought: <To complete the given task, what is the next step I should do>
Action: <The function call with the correct parameters to proceed with the task. If you believe the task is completed or 
there is nothing to be done, you should output FINISH. You cannot output anything else except a function call or FINISH 
in this field.>
Summary: <Summarize your past actions along with your latest action in one or two sentences. Do not include the numeric 
tag in your summary>
You can only take one action at a time, so please directly call the function."""

6.2 Adding the task to the prompt

task_desc = "open the setting"
last_act = "None"
# str.replace substitutes literally; re.sub would interpret backslashes in the replacement text
prompt = self_explore_task_template.replace("<task_description>", task_desc)
prompt = prompt.replace("<last_act>", last_act)
print(prompt)
You are an agent that is trained to complete certain tasks on a smartphone. You will be 
given a screenshot of a smartphone app. The interactive UI elements on the screenshot are labeled with numeric tags 
starting from 1. 

You can call the following functions to interact with those labeled elements to control the smartphone:

1. tap(element: int)
This function is used to tap an UI element shown on the smartphone screen.
"element" is a numeric tag assigned to an UI element shown on the smartphone screen.
A simple use case can be tap(5), which taps the UI element labeled with the number 5.

2. text(text_input: str)
This function is used to insert text input in an input field/box. text_input is the string you want to insert and must 
be wrapped with double quotation marks. A simple use case can be text("Hello, world!"), which inserts the string 
"Hello, world!" into the input area on the smartphone screen. This function is only callable when you see a keyboard 
showing in the lower half of the screen.

3. long_press(element: int)
This function is used to long press an UI element shown on the smartphone screen.
"element" is a numeric tag assigned to an UI element shown on the smartphone screen.
A simple use case can be long_press(5), which long presses the UI element labeled with the number 5.

4. swipe(element: int, direction: str, dist: str)
This function is used to swipe an UI element shown on the smartphone screen, usually a scroll view or a slide bar.
"element" is a numeric tag assigned to an UI element shown on the smartphone screen. "direction" is a string that 
represents one of the four directions: up, down, left, right. "direction" must be wrapped with double quotation 
marks. "dist" determines the distance of the swipe and can be one of the three options: short, medium, long. You should 
choose the appropriate distance option according to your need.
A simple use case can be swipe(21, "up", "medium"), which swipes up the UI element labeled with the number 21 for a 
medium distance.

The task you need to complete is to open the setting. Your past actions to proceed with this task are summarized as 
follows: None
Now, given the following labeled screenshot, you need to think and call the function needed to proceed with the task. 
Your output should include three parts in the given format:
Observation: <Describe what you observe in the image>
Thought: <To complete the given task, what is the next step I should do>
Action: <The function call with the correct parameters to proceed with the task. If you believe the task is completed or 
there is nothing to be done, you should output FINISH. You cannot output anything else except a function call or FINISH 
in this field.>
Summary: <Summarize your past actions along with your latest action in one or two sentences. Do not include the numeric 
tag in your summary>
You can only take one action at a time, so please directly call the function.
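
Placeholder filling is plain text substitution, so a literal `str.replace` is enough; `re.sub` would additionally interpret backslashes in the replacement string, which matters as soon as a task description contains a Windows path. A minimal sketch follows. `fill_prompt` and `demo_template` are hypothetical names used only for illustration.

```python
def fill_prompt(template: str, task_desc: str, last_act: str = "None") -> str:
    """Hypothetical helper: substitute both placeholders literally."""
    prompt = template.replace("<task_description>", task_desc)
    prompt = prompt.replace("<last_act>", last_act)
    return prompt


# A shortened stand-in for the real template, for demonstration only.
demo_template = "The task is to <task_description>. Past actions: <last_act>"
print(fill_prompt(demo_template, r"open the folder D:\notes"))
# prints: The task is to open the folder D:\notes. Past actions: None
```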

6.3 Running it

result = mllm.get_model_response(prompt, [r"D:\mobile agent\AppAgent\scripts\apps\gaode\demos\self_explore_2024-12-26_03-10-14\1_before_labeled.png"])
result
Request cost is $0.02

(True,
 'Observation: The screenshot shows a smartphone home screen with a dragon image as the wallpaper. There are several app icons arranged in a grid pattern, including AMap, Clock, Notes, Calculator, Settings, Calendar, Weather, and others. The time displayed at the top left corner is 3:10, and the battery level is 45%.\nThought: To open the settings, I need to tap on the Settings app icon which is labeled with the number 11.\nAction: tap(11)\nSummary: I tapped on the Settings app icon to open the settings.')
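
The reply follows the Observation / Thought / Action / Summary layout requested by the template, so the natural next step is to pull the function call out of the `Action` field. The sketch below shows one way to do that with regular expressions. `parse_action` is a hypothetical helper, not AppAgent's own parser, and it only handles the four functions and `FINISH` listed in the template.

```python
import re
from typing import List, Optional, Tuple


def parse_action(rsp: str) -> Optional[Tuple[str, List[str]]]:
    """Return (function_name, raw_args), ("FINISH", []), or None if unparsable."""
    match = re.search(r"^Action:\s*(.+)$", rsp, flags=re.MULTILINE)
    if match is None:
        return None
    action = match.group(1).strip()
    if action.startswith("FINISH"):
        return "FINISH", []
    call = re.match(r"(tap|text|long_press|swipe)\((.*)\)$", action)
    if call is None:
        return None
    name, arg_str = call.groups()
    if name == "text":
        # The single argument is a quoted string that may itself contain commas.
        return name, [arg_str.strip().strip('"')]
    # Remaining functions take comma-separated arguments; drop optional quotes.
    args = [a.strip().strip('"') for a in arg_str.split(",")]
    return name, args
```

For the response above, `parse_action(result[1])` would yield `("tap", ["11"])`; arguments are returned as strings, so the caller still has to convert the element tag with `int(...)` before using it.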

From: https://blog.csdn.net/qq_41472205/article/details/144751613
