A Survey on Multimodal Large Language Models
https://arxiv.org/pdf/2306.13549
A Multimodal Large Language Model (MLLM) is built on an LLM and additionally has the ability to receive, reason over, and output multimodal information.
In light of this complementarity, LLM and LVM run towards each other, leading to the new field of Multimodal Large Language Model (MLLM). Formally, it refers to the LLM-based model with the ability to receive, reason, and output with multimodal information.
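The definition is operational: an MLLM is an LLM whose input (and possibly output) is a sequence of mixed-modality parts rather than text alone. Below is a minimal sketch of that interface in Python; the type and method names are invented here for illustration and do not come from the survey.

```python
# Sketch of the MLLM contract implied by the definition above:
# receive interleaved text/image parts, reason over them, output text.
from dataclasses import dataclass
from typing import Protocol, Sequence, Union


@dataclass
class TextPart:
    text: str


@dataclass
class ImagePart:
    data: bytes      # raw image bytes
    mime_type: str   # e.g. "image/png"


Part = Union[TextPart, ImagePart]


class MultimodalLLM(Protocol):
    def generate(self, parts: Sequence[Part]) -> str:
        """Take a mixed-modality prompt, return a text completion."""
        ...
```

Any concrete MLLM (GPT-4V, Gemini, Claude 3) can be read as an implementation of `generate`, differing mainly in which part types it accepts and emits.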
Image-based reasoning capabilities of the three major models
https://hiresynth.ai/blog/googleio_puzzle_multimodal_eval.html#introduction-the-models
OpenAI GPT-4V
The multimodal LLM craze started with the release of GPT-4V in September and the enticing caption:
"ChatGPT can now see, hear, and speak"Google Gemini Ultra
Next Google Gemini Ultra was released in December, along with the following press release:
"[Gemini] was built from the ground up to be multimodal, which means it can generalize and seamlessly understand, operate across and combine different types of information including text, code, audio, image and video."Anthropic Claude3 Opus
Finally, Anthropic Claude3 Opus has just been released in February, with the following caption: "The Claude 3 models have sophisticated vision capabilities on par with other leading models. They can process a wide range of visual formats, including photos, charts, graphs and technical diagrams."
Along with the release of Claude3, we were provided a handy chart comparing the multimodal capabilities of the three models:
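To make such a comparison concrete, here is a hedged sketch of how one might pose the same image question to all three models through their public Python SDKs. The model IDs, the `puzzle.png` file, and the prompt are my assumptions, not from the linked evaluation; Gemini Ultra had no generally available API at the time, so `gemini-pro-vision` stands in. Check current provider docs before running.

```python
# Minimal harness: send one image question to GPT-4V, Claude 3 Opus, and Gemini.
# Assumes `pip install openai anthropic google-generativeai` and API keys in
# OPENAI_API_KEY / ANTHROPIC_API_KEY / GOOGLE_API_KEY.
import base64
import os

import anthropic
import google.generativeai as genai
from openai import OpenAI

IMAGE_PATH = "puzzle.png"  # hypothetical local test image
QUESTION = "Solve the visual puzzle in this image and explain your reasoning."

with open(IMAGE_PATH, "rb") as f:
    image_bytes = f.read()
image_b64 = base64.b64encode(image_bytes).decode()


def ask_gpt4v() -> str:
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": QUESTION},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content


def ask_claude3_opus() -> str:
    client = anthropic.Anthropic()
    msg = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64",
                            "media_type": "image/png",
                            "data": image_b64}},
                {"type": "text", "text": QUESTION},
            ],
        }],
    )
    return msg.content[0].text


def ask_gemini() -> str:
    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    model = genai.GenerativeModel("gemini-pro-vision")
    resp = model.generate_content(
        [QUESTION, {"mime_type": "image/png", "data": image_bytes}]
    )
    return resp.text


for name, fn in [("GPT-4V", ask_gpt4v),
                 ("Claude 3 Opus", ask_claude3_opus),
                 ("Gemini", ask_gemini)]:
    print(f"=== {name} ===\n{fn()}\n")
```

The linked hiresynth evaluation does something similar at larger scale with a scoring harness; this sketch only shows the per-model calls.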