
Inference with Llama 3.2 Vision LLMs on AMD GPUs Using ROCm

Meta's Llama models now support multimodal capabilities, extending their reach beyond traditional text-only applications. The Llama 3.2 models come in multiple sizes, including medium-sized 11B and 90B multimodal models for vision-text reasoning tasks, and lightweight 1B and 3B text-only models designed for edge and mobile devices.

This post explores how to use the Llama 3.2 vision models for a variety of vision-text tasks on AMD GPUs with ROCm.

Llama 3.2 vision models

The Llama 3.2-Vision collection of multimodal large language models (LLMs) includes 11B and 90B pretrained and instruction-tuned models for image reasoning. These models build on the Llama 3.1 text-only foundation, an autoregressive language model with an optimized transformer architecture. Llama 3.2-Vision integrates a vision tower and an image adapter on top of the Llama 3.1 model. The vision tower is an attention-based transformer encoder that extracts semantic information from images, and the adapter consists of cross-attention layers that inject the image encoder's output into the core language model.

The 11B Llama 3.2 vision model uses the Llama 3.1 8B model, while the 90B Llama 3.2 vision model uses the Llama 3.1 70B model. The adapter is trained on image-text pairs to align image features with language representations. During training, the image encoder's parameters are updated while the language model's parameters stay frozen, preserving Llama 3.1's strong performance on text-only tasks.
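
To make the adapter design concrete, below is a minimal PyTorch sketch of a cross-attention block that injects image features into the language model's hidden states. The class name, dimensions, and wiring are illustrative assumptions for exposition, not the actual Llama 3.2 implementation:

import torch
import torch.nn as nn

class CrossAttentionAdapter(nn.Module):
    # Illustrative adapter layer: text hidden states attend to image features
    def __init__(self, hidden_dim=4096, num_heads=32):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, text_hidden, image_features):
        # Queries come from the language model; keys/values from the vision tower
        attn_out, _ = self.attn(text_hidden, image_features, image_features)
        # Residual connection preserves the frozen text-only path
        return self.norm(text_hidden + attn_out)

# Toy shapes: 1 sample, 16 text tokens, 100 image patches, hidden size 4096
out = CrossAttentionAdapter()(torch.randn(1, 16, 4096), torch.randn(1, 100, 4096))
print(out.shape)  # torch.Size([1, 16, 4096])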

This post evaluates the Llama 3.2 vision instruction-tuned models on visual question answering, mathematical reasoning with visual context, image captioning, and chart understanding.

For more information, see Llama 3.2: Revolutionizing edge AI and vision with open, customizable models.

Setup

For comprehensive support details, see the ROCm documentation. This post was created using the following setup.

The demonstrations in this post were run on a Linux machine with an AMD MI300X GPU, using the rocm/pytorch:rocm6.2.1_ubuntu20.04_py3.9_pytorch_release_2.3.0 Docker image.
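
A typical way to launch this container is shown below; the device and IPC flags are the standard ones from AMD's ROCm container documentation, not taken from the original post, so adjust them to your environment:

docker run -it --rm \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video --ipc=host --shm-size 16G \
  rocm/pytorch:rocm6.2.1_ubuntu20.04_py3.9_pytorch_release_2.3.0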

The complete source code and images used in this post can be found in the Llama3_2_vision blog GitHub repository.

Getting started

Install the required packages:

!pip install transformers "accelerate>=0.26.0"

The demos in this post use the meta-llama/Llama-3.2-90B-Vision-Instruct vision model. Access to the Llama 3.2 models is gated and must be requested. Follow the instructions on the meta-llama/Llama-3.2-90B-Vision-Instruct page to get access to the model. Then provide your Hugging Face account token as follows:

from huggingface_hub import login
# Provide your Hugging Face access token
login("hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx")

Output:

    The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
    Token is valid (permission: read).
    Your token has been saved to /root/.cache/huggingface/token
    Login successful

The meta-llama/Llama-3.2-90B-Vision-Instruct model is optimized for visual recognition, image reasoning, captioning, and answering questions about an image. An AMD MI300X (192 GB VRAM) can handle the entire 90B model on a single GPU. If the GPU you're using doesn't have enough memory for the 90B model, use the 11B model instead.
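
Switching checkpoints only requires changing the model id; the rest of the code below stays the same. For example:

# Drop-in alternative for GPUs with less VRAM than the 90B model requires
model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"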

Create the Llama 3.2 vision model and image processor:

import requests
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "meta-llama/Llama-3.2-90B-Vision-Instruct"

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    # ROCm exposes AMD GPUs through PyTorch's CUDA device API,
    # so 'cuda' places the model on the MI300X
    device_map='cuda',
)
processor = AutoProcessor.from_pretrained(model_id)

Make sure the model is on the GPU:

print(model.device)

Output:

    cuda:0

A helper function for inference:

def inference(image, prompt, max_new_tokens):
    # Apply the chat template to turn the message list into model input text
    input_text = processor.apply_chat_template(
        prompt, add_generation_prompt=True,
    )
    # Preprocess the image and tokenize the text, then move everything to the GPU
    inputs = processor(
        image,
        input_text,
        add_special_tokens=False,
        return_tensors="pt",
    ).to(model.device)

    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode and print only the newly generated tokens, skipping the prompt
    print(processor.decode(output[0][inputs["input_ids"].shape[-1]:]))

General visual question answering

Visual question answering (VQA) is the task of answering open-ended, natural-language questions about an image with natural-language responses.

# Image source: https://image-net.org/challenges/LSVRC/2017/
image = Image.open("./images/output_17_0.png")
image.show()

prompt = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Please describe this image."}
    ]}
]
inference(image, prompt, 200)

Output:

    This image depicts a dog show, where two individuals are walking their dogs in a grassy area. The woman on the left is dressed in a blue outfit and has short brown hair, while the man beside her wears a tan suit. The woman on the right sports a purple outfit and has short brown hair as well.
    
    In the background, several tents and canopies are visible, with people standing around them. A few cars are parked nearby, and trees can be seen in the distance. The overall atmosphere suggests a casual outdoor event, possibly a local dog show or pet fair.<|eot_id|>

The chat can continue with questions about the image. If a new image is submitted in a later user turn, the model refers to the new image from that point on; it is not possible to query two different images at the same time.

Here is an example of continuing the conversation, where we include the assistant's response and ask for more detail:

prompt = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Please describe this image."}
    ]},
    {"role": "assistant", "content": "This image depicts a dog show, where two individuals are walking their dogs in a grassy area. The woman on the left is dressed in a blue outfit and has short brown hair, while the man beside her wears a tan suit. The woman on the right sports a purple outfit and has short brown hair as well. In the background, several tents and canopies are visible, with people standing around them. A few cars are parked nearby, and trees can be seen in the distance. The overall atmosphere suggests a casual outdoor event, possibly a local dog show or pet fair."},
    {"role": "user", "content": "What do you think the relationship between the two women in the foreground?"}
]

inference(image, prompt, 150)

Output:

    The two women in the foreground appear to be competitors in the dog show, as they are both walking their dogs in a similar manner and are dressed in attire that suggests they are participating in a formal event. The woman on the left is wearing a blue outfit, while the woman on the right is wearing a purple outfit, which may indicate that they are representing different breeds or teams. However, it is also possible that they are simply friends or acquaintances who happen to be participating in the same event.<|eot_id|>

The model's response describes the image accurately.

Mathematical reasoning in a visual context

Mathematical reasoning in a visual context means interpreting, analyzing, and solving problems based on visual information using logical steps and mathematical principles.

# image source: https://huggingface.co/datasets/MMMU/MMMU
image = Image.open("./images/output_22_0.png")
image.show()
prompt = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Subtract all green cubes and all red matte blocks from the set. How many objects remain? Please provide a step-by-step explanation of your reasoning."}
    ]}
]
inference(image, prompt, 250)


Output:

    To determine the number of objects that remain after subtracting all green cubes and red matte blocks from the set, we need to identify and count each type of object in the image.
    
    **Step 1: Identify Green Cubes**
    There is one green cube in the image.
    
    **Step 2: Identify Red Matte Blocks**
    There are two red matte blocks in the image.
    
    **Step 3: Count Remaining Objects**
    After removing the green cube and red matte blocks, we are left with the following objects:
    
    *   Blue sphere
    *   Purple sphere
    *   Yellow cylinder (left)
    *   Yellow cylinder (right)
    
    **Step 4: Calculate Total Remaining Objects**
    By adding up the remaining objects, we get a total of 4 objects.
    
    Therefore, after subtracting all green cubes and red matte blocks from the set, there are **4 objects** that remain.<|eot_id|>

The model answered the question correctly. It identified the objects' shapes, textures, and colors, and provided a detailed step-by-step explanation that makes the reasoning easy to follow.

Next, we test Llama 3.2's ability to interpret a bar chart.

# image source: https://huggingface.co/datasets/MMMU/MMMU/viewer/Math/test?row=92&image-viewer=image_1-92-CCB9FDFF7AE3EB01AC13B80B2EC0EAA7898D4C84
image = Image.open("./images/output_24_0.png")
image.show()

prompt = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Approximately how many more students ride in a car than walk? Please walk through your reasoning step by step to arrive at the solution."}
    ]}
]
inference(image, prompt, 350)


Output:

    To find the difference between the number of students who ride in a car and those who walk, we need to subtract the number of students who walk from the number of students who ride in a car.
    
    **Step 1: Identify the number of students who walk.**
    The number of students who walk is approximately 50.
    
    **Step 2: Identify the number of students who ride in a car.**
    The number of students who ride in a car is approximately 150.
    
    **Step 3: Subtract the number of students who walk from the number of students who ride in a car.**
    150 (students who ride in a car) - 50 (students who walk) = 100
    
    Therefore, approximately **100 more students ride in a car than walk**.<|eot_id|>

Document-level visual question answering

Document-level visual question answering (VQA) is the task of answering questions about the contents of a document represented as an image.

# Image source: AMD
image = Image.open("./images/output_26_0.png")
image.show()

prompt = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Help me to understand the trend of Company A and B for the desktop market share from 2004 to 2023"}
    ]}
]
inference(image, prompt, 300)

Output:

    Based on the line graph, here is a detailed description of the trend for Company A and Company B from 2004 to 2023:
    
    **Company A (Orange Line)**
    
    * The trend for Company A starts at around 0.4 in 2004 and remains relatively stable until 2007.
    * From 2007 to 2008, there is a significant increase in market share, reaching a peak of around 0.55.
    * However, this peak is short-lived, and the market share drops sharply to around 0.2 in 2009.
    * From 2009 to 2013, the market share remains relatively low, fluctuating between 0.1 and 0.2.
    * Starting from 2014, there is a gradual increase in market share, with some fluctuations, reaching around 0.35 in 2018.
    * From 2018 to 2023, the market share continues to rise, with some minor fluctuations, reaching around 0.65 in 2023.
    
    **Company B (Grey Line)**
    
    * The trend for Company B starts at around 0.6 in 2004 and remains relatively stable until 2007.
    * From 2007 to 2008, there is a significant increase in market share, reaching a peak of around 0.8.
    * The market share remains high until 2011, with some minor fluctuations.
    * From 2011 to 2015

The model accurately analyzed both companies' market-share trends from 2004 to 2023; the response was truncated when it reached the max_new_tokens=300 limit.

image = Image.open("./images/output_27_0.png")
image.show()

prompt = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Help me understand the revenue shown in this chart and sort the items in descending order based on their revenue."}
    ]}
]
inference(image, prompt, 300)

Output:

    Based on the bar chart, here is a breakdown of the revenue for each item and the sorted list in descending order:
    
    **Item Revenue Breakdown:**
    
    * Item 1: $18
    * Item 2: $23
    * Item 3: $17
    * Item 4: $18
    * Item 5: $16
    * Item 6: $14
    * Item 7: $33
    * Item 8: $24
    
    **Sorted List in Descending Order:**
    
    1. Item 7: $33
    2. Item 8: $24
    3. Item 2: $23
    4. Item 1: $18
    5. Item 4: $18
    6. Item 3: $17
    7. Item 5: $16
    8. Item 6: $14
    
    This analysis shows that Item 7 has the highest revenue at $33, while Item 6 has the lowest revenue at $14.<|eot_id|>

The model accurately interpreted the visual information and provided a data analysis.

# Image source: https://huggingface.co/datasets/MMMU/MMMU
url = "https://camo.githubusercontent.com/fdca7c38467afeb70cede550b392b0f534150647942f9827bf72497a4f264e9c/68747470733a2f2f68756767696e67666163652e636f2f64617461736574732f68756767696e67666163652f72656c656173652d6173736574732f7265736f6c76652f6d61696e2f696e766f6963652e706e67"
image = Image.open(requests.get(url, stream=True).raw)
image.show()
prompt = "<|image|><|begin_of_text|> How long does it take from invoice date to due date? Be short and concise."
inputs = processor(image, prompt, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0]))

Output:

    <|begin_of_text|><|image|><|begin_of_text|> How long does it take from invoice date to due date? Be short and concise. The invoice date is 11/02/2019 and the due date is 26/02/2019. Therefore, the time between the invoice date and the due date is 15 days. (26-11=15) *Answer*: 15 days. *Answer*: 15 days. *Answer*: 15 days. *Answer*: 15 days. *Answer*: 15 days. *Answer*: 15 days. *Answer*: 15 days. *Answer*:

The result shows that the model reads the invoice correctly and gives a concise answer (15 days from 11/02/2019 to 26/02/2019), although it then repeats the answer until it hits the max_new_tokens limit.

Image captioning

Image captioning is the task of describing the content of an image in text.

# Image source: COCO dataset - http://cocodataset.org/#explore?id=49097
image = Image.open("./images/output_30_0.png")
image.show()
prompt = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Please provide a caption for the image."}
    ]}
]
inference(image, prompt, 300)

Output:

    The image depicts a dog sitting on the sidewalk next to a bicycle parked in front of a red building. The dog is white with black spots and wears a black collar, facing right towards the camera. It sits on a gray brick sidewalk, positioned to the right of the image.
    
    To the left of the dog, a bicycle leans against the wall of the building. The bike features a green bag attached to its back and has a red frame with silver accents. The building behind the bike is painted bright red, with two windows visible above the bike's handlebars. Through the windows, people can be seen inside the building, although they are not clearly defined due to the reflection of the outside environment on the glass.
    
    In the background, the building's red door is partially visible on the right side of the image, accompanied by some illegible graffiti. Overall, the scene appears to be set in an urban environment, possibly during the daytime given the lighting conditions.<|eot_id|>

The response is accurate and provides a detailed description of the image.

Conclusion

Meta's Llama 3.2 vision models advance the integration of vision and language understanding, enabling AI to interpret and reason about visual data. Combined with the processing power of AMD GPUs running ROCm, these models perform well across a variety of vision-text tasks, such as image-based question answering and visual mathematical reasoning. This combination lets developers build faster, more scalable AI workflows and highlights the potential of multimodal AI in next-generation solutions.

  • 大模型 LLMs 入门指南:小白的学习之路
    前言很明显,这是一个偏学术方向的指南要求,所以我会把整个LLM应用的从数学到编程语言,从框架到常用模型的学习方法,给你捋一个通透。也可能是不爱学习的劝退文。通常要达到熟练的进行LLM相关的学术研究与开发,至少你要准备数学、编码、常用模型的知识,还有LLM相关的知识的准备......