Application and Ability
https://arxiv.org/pdf/2402.15116
LMAs, proficient in processing diverse data modalities, surpass language-only agents in decision-
making and response generation across varied scenarios. Their adaptability makes them exceptionally
useful in real-world, multisensory environments, as illustrated in Figure 4.
GUI Automation. In this application, the objective of LMAs is to understand and simulate human
actions within user interfaces, enabling the execution of repetitive tasks, navigation across multiple
applications, and the simplification of complex workflows. This automation has the potential to
save users time and energy, allowing them to focus on the more critical and creative aspects of
their work [44, 6, 53, 64, 75, 69, 54, 17, 8]. For example, GPT-4V-Act [6] is an agent that
combines GPT-4V's capabilities with web browsing to improve human-computer interaction. Its
main goals are to make user interfaces more accessible, simplify workflow automation, and enhance
automated UI testing. It is especially beneficial for people with disabilities or limited technical
skills, helping them navigate complex interfaces more easily.
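A minimal sketch of the perceive-decide-act loop that such GUI agents typically follow is shown below; every helper in it (capture_screenshot, label_elements, ask_multimodal_llm, execute) is a hypothetical placeholder for illustration, not an API of the cited systems.

```python
# Illustrative GUI-automation loop in the spirit of GPT-4V-Act.
# All helpers below are hypothetical stubs, not real APIs of the cited systems.
import json
from dataclasses import dataclass

@dataclass
class Action:
    kind: str                     # "click", "type", or "stop"
    target_id: int | None = None
    text: str | None = None

def capture_screenshot():         # stub: would grab the current screen image
    return b""

def label_elements(screenshot):   # stub: would tag interactive widgets with numeric ids
    return {1: "Search box", 2: "Submit button"}

def ask_multimodal_llm(image, text):  # stub: would query a multimodal LLM
    return '{"kind": "stop"}'

def execute(action: Action):      # stub: would dispatch the click or keystrokes to the UI
    pass

def run_gui_agent(goal: str, max_steps: int = 10) -> None:
    for _ in range(max_steps):
        screenshot = capture_screenshot()
        elements = label_elements(screenshot)
        prompt = (
            f"Goal: {goal}\n"
            f"Interactive elements: {json.dumps(elements)}\n"
            'Reply as JSON: {"kind": "click|type|stop", "target_id": int, "text": str}'
        )
        reply = ask_multimodal_llm(image=screenshot, text=prompt)
        action = Action(**json.loads(reply))
        if action.kind == "stop":  # the model judges the goal complete
            break
        execute(action)

run_gui_agent("Submit the search form")
```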
[Figure 4: A variety of applications of LMAs. Panels group representative systems by application: UI assistants / GUI automation (MM-Navigator, WebWISE, AutoDroid, ASSISTGUI, DroidBot-GPT, Auto-UI, Copilot, AppAgent, MemoDroid, GPT-4V-Act, DiscussNav, Mobile-Agent, OS-Copilot, OpenAgents); Embodied AI & Robotics (STEVE, EMMA, Octopus, GRID, JARVIS-1, MP5, DEPS); Game developer (SMARTPLAY, VisualWebArena); Autonomous driving (GPT-Driver, DLAH); Complex visual reasoning tasks (VisProgram, Visual ChatGPT, HuggingGPT, MM-ReAct, Chameleon, MLLM-Tool, CRAFT, AVIS, ViperGPT, GPT4Tools, LLaVA-Plus, AssistGPT, M3, DDCoT, CLOVA); Video understanding (AssistGPT, DoraemonGPT, ChatVideo); Visual generation & editing (LLaVA-Interactive, MM-ReAct, MuLan); Audio editing & generation (MusicAgent, WavJourney, AudioGPT).]
Robotics and Embodied AI. This application [37, 51, 68, 52, 45, 65, 79] focuses on integrating the
perceptual, reasoning, and action capabilities of robots with physical interactions in their environments.
With a multimodal agent, a robot can draw on diverse sensory channels, such as vision,
audition, and touch, to acquire comprehensive environmental data. For example, MP5
[37] is a multimodal embodied system in Minecraft that uses active perception to
decompose and carry out long-horizon, open-ended tasks with the help of large language models.
Game Development. Game AI [58, 16] endeavors to design and implement agents that
exhibit intelligence and realism, thereby providing engaging and challenging player experiences. The
successful integration of agent technology in games has led to the creation of more sophisticated and
interactive virtual environments.
Autonomous Driving. Traditional approaches to autonomous vehicles [33] face obstacles in
effectively perceiving and interpreting complex scenarios. Recent progress in multimodal agent-
based technologies, notably driven by LLMs, marks a substantial advancement in overcoming these
challenges and bridging the perception gap [32, 7, 81, 55]. GPT-Driver [32] is a pioneering
approach that employs the OpenAI GPT-3.5 model as a reliable motion planner for autonomous
vehicles, with a specific focus on generating safe and comfortable driving trajectories. By harnessing
the inherent reasoning capabilities of LLMs, the method provides a promising solution to the issue
of limited generalization in novel driving scenarios.
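The sketch below shows, in hedged form, how an LLM can be prompted as a motion planner in this style: perception outputs are serialized to text and waypoints are parsed back. The prompt wording, JSON format, and call_llm stub are illustrative assumptions, not GPT-Driver's actual interface.

```python
# Illustrative LLM-as-motion-planner call, loosely in the style of GPT-Driver.
# The prompt, output format, and call_llm stub are assumptions for illustration.
import json

def call_llm(prompt: str) -> str:
    # Stub: a real system would issue e.g. a GPT-3.5 chat completion here.
    return '{"waypoints": [[0.0, 0.0], [1.5, 0.1], [3.0, 0.2]]}'

def plan_trajectory(ego_state: dict, detections: list[dict],
                    horizon_s: float = 3.0) -> list[tuple[float, float]]:
    prompt = (
        "You are a motion planner for an autonomous vehicle.\n"
        f"Ego state: {json.dumps(ego_state)}\n"
        f"Detected objects: {json.dumps(detections)}\n"
        f"Output a safe, comfortable trajectory for the next {horizon_s} s "
        'as JSON: {"waypoints": [[x, y], ...]} in ego-vehicle coordinates.'
    )
    waypoints = json.loads(call_llm(prompt))["waypoints"]
    return [(float(x), float(y)) for x, y in waypoints]

# Example call with toy perception output.
print(plan_trajectory({"speed_mps": 5.0}, [{"type": "car", "x": 12.0, "y": 0.5}]))
```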
Video Understanding. Video understanding agents [9, 71] are AI systems
specifically designed to analyze and comprehend video content. They use deep learning
techniques to extract essential information from videos, identifying objects, actions, and scenes to
deepen understanding of the content.
Visual Generation & Editing. Applications of this kind [4, 70, 47] are designed for the creation
and manipulation of visual content, letting users create and modify images and offering
a flexible option for creative projects. For instance, LLaVA-Interactive
[4] is an open-source multimodal interactive system that combines the capabilities of pre-trained
AI models to support multi-turn dialogues with visual cues and to generate edited images, thereby
realizing a cost-effective, flexible, and intuitive AI-assisted visual content creation experience.
Complex Visual Reasoning Tasks. This area is a key focus in multimodal agent research, mainly
emphasizing the analysis of multimodal content. This prevalence is attributed to the superior cognitive
capabilities of LLMs in comprehending and reasoning through knowledge-based queries, surpassing
the capabilities of previous models [14, 25, 80]. Within these applications, the primary focus is on
QA tasks [41, 57, 70, 30]. This entails leveraging visual modalities (images or videos) and textual
modalities (questions or questions with accompanying documents) for reasoned responses.
Audio Editing & Generation. The LMAs in this application integrate foundational expert models
in the audio domain, making the editing and creation of music efficient [77, 73, …].
Functions
https://arxiv.org/pdf/2306.13549
In order to further inspect what roles LLMs exactly play
in LLM-Aided Visual Reasoning systems, existing related
works are divided into three types:
• LLM as a Controller
• LLM as a Decision Maker
• LLM as a Semantics Refiner
The first two roles are related to CoT (see §7.2). It is
frequently used because complex tasks need to be broken
down into intermediate simpler steps. When LLMs act as
controllers, the systems often finish the task in a single
round, while multi-round is more common in the case of the
decision maker. We delineate how LLMs serve these roles in
the following parts.
LLM as a Controller. In this case, LLMs act as a central
controller that (1) breaks down a complex task into simpler
sub-tasks/steps and (2) assigns these tasks to appropriate
tools/modules. The first step is often finished by leveraging
the CoT ability of LLMs. Specifically, LLMs are prompted
explicitly to output task planning [181] or, more directly, the
modules to call [107], [169], [170]. For example, VisProg [170]
prompts GPT-3 to output a visual program, where each
program line invokes a module to perform a sub-task. In
addition, LLMs are required to output argument names for
the module input. To handle these complex requirements,
some hand-crafted in-context examples are used as refer-
ences [169], [170], [181]. This is closely related to the opti-
mization of reasoning chains (see §7.2), or more specifically,
the least-to-most prompting [206] technique. In this way,
complex problems are broken down into sub-problems that
are solved sequentially.
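To make the controller pattern concrete, here is a hedged sketch in which the LLM (prompted with in-context examples, not shown) would emit a short program with one module call per line, and a tiny interpreter dispatches each line to a registered tool. The program format and module names are illustrative only, not VisProg's actual syntax.

```python
# Illustrative "LLM as a controller" pattern: the model emits a line-per-module
# program and an interpreter dispatches each line to a registered tool.
# The format and module names are assumptions, not VisProg's real syntax.

MODULES = {
    "DETECT": lambda image, query: f"boxes for '{query}'",       # stand-in for a detector
    "CROP":   lambda image, box: f"crop of [{box}]",              # stand-in for a crop tool
    "VQA":    lambda image, question: f"answer to '{question}'",  # stand-in for a VQA model
}

def run_program(program: str, image) -> dict:
    env = {"IMAGE": image}
    for line in program.strip().splitlines():
        # Each line looks like: OUT = MODULE(arg1=VAL, arg2=VAL)
        out_name, call = [part.strip() for part in line.split("=", 1)]
        module, arg_str = call.split("(", 1)
        args = {}
        for pair in arg_str.rstrip(")").split(","):
            key, val = [p.strip() for p in pair.split("=")]
            args[key] = env.get(val, val.strip("'\""))  # resolve variables or string literals
        env[out_name] = MODULES[module.strip()](**args)
    return env

# A controller LLM would be prompted to produce a program such as:
example_program = """
BOXES = DETECT(image=IMAGE, query='red mug')
REGION = CROP(image=IMAGE, box=BOXES)
ANSWER = VQA(image=REGION, question='is the mug full?')
"""
print(run_program(example_program, image="<pixels>")["ANSWER"])
```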
LLM as a Decision Maker. In this case, complex tasks
are solved in a multi-round manner, often in an iterative
way [195]. Decision-makers often fulfill the following re-
sponsibilities: (1) Summarize the current context and the
history information, and decide if the information available
at the current step is sufficient to answer the question or
complete the task; (2) Organize and summarize the answer
to present it in a user-friendly way.
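A hedged sketch of this multi-round pattern follows: at each step the model inspects the accumulated evidence, decides whether it is sufficient, and otherwise picks another tool to query. The call_llm stub and tool registry are illustrative assumptions.

```python
# Illustrative "LLM as a decision maker" loop: iterate until the model judges
# the gathered evidence sufficient. call_llm and TOOLS are stand-in stubs.
import json

TOOLS = {
    "caption": lambda image: "a dog on a skateboard",  # stand-in for a captioning model
    "ocr":     lambda image: "no text found",          # stand-in for an OCR model
}

def call_llm(prompt: str) -> str:
    # Stub: a real system would query an LLM here.
    return '{"sufficient": true, "tool": "", "answer": "A dog is riding a skateboard."}'

def answer_iteratively(question: str, image, max_rounds: int = 5) -> str:
    evidence: list[str] = []
    for _ in range(max_rounds):
        prompt = (
            f"Question: {question}\n"
            f"Evidence so far: {evidence}\n"
            f"Available tools: {list(TOOLS)}\n"
            'Reply as JSON: {"sufficient": bool, "tool": str, "answer": str}'
        )
        decision = json.loads(call_llm(prompt))
        if decision["sufficient"]:
            return decision["answer"]                     # final, user-friendly summary
        tool = decision["tool"]
        evidence.append(f"{tool}: {TOOLS[tool](image)}")  # gather one more observation
    return "Unable to answer with the available tools."

print(answer_iteratively("What is the animal doing?", image="<pixels>"))
```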
LLM as a Semantics Refiner. When an LLM is used as a
Semantics Refiner, researchers mainly utilize its rich linguistic
and semantic knowledge. Specifically, LLMs are often
instructed to integrate information into consistent and fluent
natural language sentences [202] or to generate texts according
to different specific needs [7…].
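As a hedged illustration of the refiner role, the sketch below hands raw module outputs to the model and asks it to rewrite them as one fluent answer; the prompt and call_llm stub are assumptions, not any paper's interface.

```python
# Illustrative "LLM as a semantics refiner": fuse raw module outputs into one
# fluent sentence. The prompt and call_llm stub are assumptions.
def call_llm(prompt: str) -> str:
    # Stub: a real system would query an LLM here.
    return "The man is holding an umbrella while walking in the rain."

def refine(question: str, module_outputs: dict[str, str]) -> str:
    facts = "\n".join(f"- {name}: {out}" for name, out in module_outputs.items())
    prompt = (
        f"Question: {question}\n"
        f"Raw module outputs:\n{facts}\n"
        "Combine these into a single consistent, fluent answer for the user."
    )
    return call_llm(prompt)

# Example: fuse detector and captioner outputs into one sentence.
print(refine("What is the man holding?",
             {"detector": "umbrella (0.92)", "caption": "a man walking in the rain"}))
```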
What are the limitations of large language models?
https://research.aimultiple.com/large-multimodal-models/
- Data requirements and bias: These models require massive, diverse datasets for training. However, the availability and quality of such datasets can be a challenge. Moreover, if the training data contains biases, the model is likely to inherit and possibly amplify these biases, leading to unfair or unethical outcomes.
- Computational resources: Training and running large multimodal models require significant computational resources, making them expensive and less accessible for smaller organizations or independent researchers.
- Interpretability and explainability: As with any complex AI model, understanding how these models make decisions can be difficult. This lack of transparency can be a critical issue, especially in sensitive applications like healthcare or law enforcement.
- Integration of modalities: Effectively integrating different types of data (like text, images, and audio) in a way that truly understands the nuances of each modality is extremely challenging. The model might not always accurately grasp the context or the subtleties of human communication that come from combining these modalities.
- Generalization and overfitting: While these models are trained on vast datasets, they might struggle with generalizing to new, unseen data or scenarios that significantly differ from their training data. Conversely, they might overfit to the training data, capturing noise and anomalies as patterns.