Inference with Llama 3.2 Vision LLMs on AMD GPUs Using ROCm
Meta's Llama models now offer multimodal capabilities, extending their use beyond traditional text-only applications. The Llama 3.2 models are available in a range of sizes, including mid-sized 11B and 90B multimodal models for vision-text reasoning tasks, and lightweight 1B and 3B text-only models designed for edge and mobile devices.
This post explores how to use the Llama 3.2 vision models for a variety of vision-text tasks on AMD GPUs with ROCm…
Llama 3.2 vision models
The Llama 3.2-Vision collection of multimodal large language models (LLMs) comprises 11B and 90B pretrained and instruction-tuned models for image reasoning. These models build on the text-only Llama 3.1 foundation models, which are autoregressive language models using an optimized transformer architecture. Llama 3.2-Vision adds a vision tower and an image adapter on top of the Llama 3.1 models. The vision tower is an attention-based transformer encoder that extracts semantic information from images. The adapter consists of cross-attention layers that inject the image encoder's output into the core language model.
The 11B Llama 3.2 vision model uses the Llama 3.1 8B model, while the 90B Llama 3.2 vision model uses the Llama 3.1 70B model. The adapter is trained on image-text pairs to align image features with language representations. During adapter training, the image encoder's parameters are updated, but the language model's parameters are kept frozen to preserve Llama 3.1's strong performance on text-only tasks.
This blog evaluates the Llama 3.2 vision instruction-tuned models on tasks such as visual question answering, mathematical reasoning with visual context, image captioning, and chart understanding.
For more information, see Llama 3.2: Revolutionizing edge AI and vision with open, customizable models.
Prerequisites
For comprehensive support details about the setup, please refer to the ROCm documentation. This blog was created using the following setup.

- Hardware and operating system:
  - A Linux machine with an AMD MI300X GPU
  - Ubuntu 22.04.3 LTS
- Software:
  - rocm/pytorch:rocm6.2.1_ubuntu20.04_py3.9_pytorch_release_2.3.0 Docker image
The complete source code and images used in this blog are available in this Llama3_2_vision blog GitHub repository.
Getting started
Install the required packages:
!pip install transformers "accelerate>=0.26.0"
The demonstrations in this post use the meta-llama/Llama-3.2-90B-Vision-Instruct vision model. Access to the Llama 3.2 models is gated: follow the instructions on the meta-llama/Llama-3.2-90B-Vision-Instruct page to request access. Then provide your Hugging Face account token as follows to access the model:
from huggingface_hub import login

# Provide your Hugging Face access token
login("hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx")
Output:
The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful
The meta-llama/Llama-3.2-90B-Vision-Instruct model is optimized for visual recognition, image reasoning, image captioning, and answering questions about images. An AMD MI300X (192 GB VRAM) can host the entire 90B model on a single GPU. If the GPU you are using doesn't have enough memory for the 90B model, use the 11B model instead.
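If you want to choose the checkpoint programmatically, the sketch below picks a model ID based on the VRAM reported by the current GPU. The helper name pick_model_id and the 180 GB threshold (roughly the bfloat16 weight size of the 90B model) are our own assumptions, not part of the original post.

import torch

def pick_model_id():
    # Total VRAM of GPU 0 in GiB (torch.cuda maps to the HIP backend on ROCm).
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    # ~90B parameters in bfloat16 need roughly 180 GB for the weights alone.
    if total_gb >= 180:
        return "meta-llama/Llama-3.2-90B-Vision-Instruct"
    return "meta-llama/Llama-3.2-11B-Vision-Instruct"

print(pick_model_id())  # prints the 90B ID on an MI300X (192 GB)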
Create the Llama 3.2 vision model and the image preprocessor:
import requests
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "meta-llama/Llama-3.2-90B-Vision-Instruct"

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map='cuda',
)
processor = AutoProcessor.from_pretrained(model_id)
Make sure the model is on the GPU:
print(model.device)
Output:
cuda:0
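As an additional check, you can inspect how much memory the loaded weights occupy. This sketch uses get_memory_footprint, a standard transformers model method that returns the size in bytes:

# Size of the model weights (plus buffers) in GiB.
print(f"{model.get_memory_footprint() / 1024**3:.1f} GiB")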
A helper function for inference:
def inference(image, prompt, max_new_tokens):
    input_text = processor.apply_chat_template(
        prompt, add_generation_prompt=True,
    )
    inputs = processor(
        image,
        input_text,
        add_special_tokens=False,
        return_tensors="pt",
    ).to(model.device)

    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    print(processor.decode(output[0][inputs["input_ids"].shape[-1]:]))
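For long generations it can be nicer to watch tokens appear as they are produced. A minimal streaming variant is sketched below; it assumes the processor exposes its underlying tokenizer as processor.tokenizer, which holds for recent transformers releases:

from transformers import TextStreamer

def inference_streaming(image, prompt, max_new_tokens):
    # Same preprocessing as the helper above.
    input_text = processor.apply_chat_template(prompt, add_generation_prompt=True)
    inputs = processor(
        image, input_text, add_special_tokens=False, return_tensors="pt",
    ).to(model.device)
    # skip_prompt=True prints only the newly generated tokens.
    streamer = TextStreamer(processor.tokenizer, skip_prompt=True)
    model.generate(**inputs, streamer=streamer, max_new_tokens=max_new_tokens)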
General visual question answering
Visual question answering (VQA) is the task of answering open-ended questions about an image. VQA takes a natural-language question and produces a natural-language answer.
# Image source: https://image-net.org/challenges/LSVRC/2017/
image = Image.open("./images/output_17_0.png")
image.show()

prompt = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Please describe this image."}
    ]}
]

inference(image, prompt, 200)
Output:
This image depicts a dog show, where two individuals are walking their dogs in a grassy area. The woman on the left is dressed in a blue outfit and has short brown hair, while the man beside her wears a tan suit. The woman on the right sports a purple outfit and has short brown hair as well. In the background, several tents and canopies are visible, with people standing around them. A few cars are parked nearby, and trees can be seen in the distance. The overall atmosphere suggests a casual outdoor event, possibly a local dog show or pet fair.<|eot_id|>
The chat can continue about the image. If a new image is submitted in a later user turn, the model refers to the new image from that point on; it is not possible to query two different images at the same time.
Here is an example of continuing the conversation, where we include the assistant's response and ask for more detail:
prompt = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Please describe this image."}
    ]},
    {"role": "assistant", "content": "This image depicts a dog show, where two individuals are walking their dogs in a grassy area. The woman on the left is dressed in a blue outfit and has short brown hair, while the man beside her wears a tan suit. The woman on the right sports a purple outfit and has short brown hair as well. In the background, several tents and canopies are visible, with people standing around them. A few cars are parked nearby, and trees can be seen in the distance. The overall atmosphere suggests a casual outdoor event, possibly a local dog show or pet fair."},
    {"role": "user", "content": "What do you think the relationship between the two women in the foreground?"}
]

inference(image, prompt, 150)
Output:
The two women in the foreground appear to be competitors in the dog show, as they are both walking their dogs in a similar manner and are dressed in attire that suggests they are participating in a formal event. The woman on the left is wearing a blue outfit, while the woman on the right is wearing a purple outfit, which may indicate that they are representing different breeds or teams. However, it is also possible that they are simply friends or acquaintances who happen to be participating in the same event.<|eot_id|>
The model's response describes the image accurately.
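Manually copying the assistant's reply back into the prompt gets tedious for longer conversations. Below is a minimal sketch of a helper that maintains the history automatically, using the same message format as above; the chat function and the global messages list are our own additions:

messages = []

def chat(image, question, max_new_tokens=200):
    if not messages:
        # The first turn carries the image placeholder.
        messages.append({"role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": question},
        ]})
    else:
        messages.append({"role": "user", "content": question})
    input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(
        image, input_text, add_special_tokens=False, return_tensors="pt",
    ).to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    reply = processor.decode(
        output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
    )
    # Store the reply so the next turn sees the full history.
    messages.append({"role": "assistant", "content": reply})
    return reply

print(chat(image, "Please describe this image."))
print(chat(image, "What do you think the relationship between the two women in the foreground?"))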
Mathematical reasoning in visual contexts
Mathematical reasoning in a visual context means interpreting, analyzing, and solving problems grounded in visual information, using logical steps and mathematical principles.
# image source: https://huggingface.co/datasets/MMMU/MMMU
image = Image.open("./images/output_22_0.png")
image.show()

prompt = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Subtract all green cubes and all red matte blocks from the set. How many objects remain? Please provide a step-by-step explanation of your reasoning."}
    ]}
]

inference(image, prompt, 250)
Output:
To determine the number of objects that remain after subtracting all green cubes and red matte blocks from the set, we need to identify and count each type of object in the image.

**Step 1: Identify Green Cubes**
There is one green cube in the image.

**Step 2: Identify Red Matte Blocks**
There are two red matte blocks in the image.

**Step 3: Count Remaining Objects**
After removing the green cube and red matte blocks, we are left with the following objects:

* Blue sphere
* Purple sphere
* Yellow cylinder (left)
* Yellow cylinder (right)

**Step 4: Calculate Total Remaining Objects**
By adding up the remaining objects, we get a total of 4 objects.

Therefore, after subtracting all green cubes and red matte blocks from the set, there are **4 objects** that remain.<|eot_id|>
The model answered the question. It identified the objects' shapes, textures, and colors, and provided a detailed step-by-step explanation of its solution, which makes the reasoning easy to follow.
Next, we test Llama 3.2's ability to interpret a bar chart.
# image source: https://huggingface.co/datasets/MMMU/MMMU/viewer/Math/test?row=92&image-viewer=image_1-92-CCB9FDFF7AE3EB01AC13B80B2EC0EAA7898D4C84
image = Image.open("./images/output_24_0.png")
image.show()

prompt = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Approximately how many more students ride in a car than walk? Please walk through your reasoning step by step to arrive at the solution."}
    ]}
]

inference(image, prompt, 350)
Output:
To find the difference between the number of students who ride in a car and those who walk, we need to subtract the number of students who walk from the number of students who ride in a car.

**Step 1: Identify the number of students who walk.**
The number of students who walk is approximately 50.

**Step 2: Identify the number of students who ride in a car.**
The number of students who ride in a car is approximately 150.

**Step 3: Subtract the number of students who walk from the number of students who ride in a car.**
150 (students who ride in a car) - 50 (students who walk) = 100

Therefore, approximately **100 more students ride in a car than walk**.<|eot_id|>
Document-level visual question answering
Document-level visual question answering (VQA) is the task of answering questions about the content of a document that is presented as an image.
# Image source: AMD
image = Image.open("./images/output_26_0.png")
image.show()

prompt = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Help me to understand the trend of Company A and B for the desktop market share from 2004 to 2023"}
    ]}
]

inference(image, prompt, 300)
Output:
Based on the line graph, here is a detailed description of the trend for Company A and Company B from 2004 to 2023:

**Company A (Orange Line)**

* The trend for Company A starts at around 0.4 in 2004 and remains relatively stable until 2007.
* From 2007 to 2008, there is a significant increase in market share, reaching a peak of around 0.55.
* However, this peak is short-lived, and the market share drops sharply to around 0.2 in 2009.
* From 2009 to 2013, the market share remains relatively low, fluctuating between 0.1 and 0.2.
* Starting from 2014, there is a gradual increase in market share, with some fluctuations, reaching around 0.35 in 2018.
* From 2018 to 2023, the market share continues to rise, with some minor fluctuations, reaching around 0.65 in 2023.

**Company B (Grey Line)**

* The trend for Company B starts at around 0.6 in 2004 and remains relatively stable until 2007.
* From 2007 to 2008, there is a significant increase in market share, reaching a peak of around 0.8.
* The market share remains high until 2011, with some minor fluctuations.
* From 2011 to 2015
The model accurately analyzed both companies' market-share trends from 2004 to 2023.
image = Image.open("./images/output_27_0.png")
image.show()

prompt = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Help me understand the revenue shown in this chart and sort the items in descending order based on their revenue."}
    ]}
]

inference(image, prompt, 300)
Output:
Based on the bar chart, here is a breakdown of the revenue for each item and the sorted list in descending order:

**Item Revenue Breakdown:**

* Item 1: $18
* Item 2: $23
* Item 3: $17
* Item 4: $18
* Item 5: $16
* Item 6: $14
* Item 7: $33
* Item 8: $24

**Sorted List in Descending Order:**

1. Item 7: $33
2. Item 8: $24
3. Item 2: $23
4. Item 1: $18
5. Item 4: $18
6. Item 3: $17
7. Item 5: $16
8. Item 6: $14

This analysis shows that Item 7 has the highest revenue at $33, while Item 6 has the lowest revenue at $14.<|eot_id|>
The model interpreted the visual information accurately and provided an analysis of the data.
# Image source: https://huggingface.co/datasets/MMMU/MMMU
url = "https://camo.githubusercontent.com/fdca7c38467afeb70cede550b392b0f534150647942f9827bf72497a4f264e9c/68747470733a2f2f68756767696e67666163652e636f2f64617461736574732f68756767696e67666163652f72656c656173652d6173736574732f7265736f6c76652f6d61696e2f696e766f6963652e706e67"
image = Image.open(requests.get(url, stream=True).raw)
image.show()

prompt = "<|image|><|begin_of_text|> How long does it take from invoice date to due date? Be short and concise."
inputs = processor(image, prompt, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0]))
Output:
<|begin_of_text|><|image|><|begin_of_text|> How long does it take from invoice date to due date? Be short and concise. The invoice date is 11/02/2019 and the due date is 26/02/2019. Therefore, the time between the invoice date and the due date is 15 days. (26-11=15) *Answer*: 15 days. *Answer*: 15 days. *Answer*: 15 days. *Answer*: 15 days. *Answer*: 15 days. *Answer*: 15 days. *Answer*: 15 days. *Answer*:
The result shows that the model can accurately interpret the visual information and give a concise answer to the question. Note that this example decodes the full sequence, so the prompt is echoed and the model keeps repeating its answer until the 100-token budget is exhausted.
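To print only the newly generated text, slice off the prompt tokens before decoding, as the inference helper above does; a minimal sketch:

# Decode only the tokens generated after the prompt and drop special tokens.
new_tokens = output[0][inputs["input_ids"].shape[-1]:]
print(processor.decode(new_tokens, skip_special_tokens=True))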
Image captioning
Image captioning is a deep-learning task that describes the content of an image in text.
# Image source: COCO dataset - http://cocodataset.org/#explore?id=49097
image = Image.open("./images/output_30_0.png")
image.show()

prompt = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Please provide a caption for the image."}
    ]}
]

inference(image, prompt, 300)
Output:
The image depicts a dog sitting on the sidewalk next to a bicycle parked in front of a red building. The dog is white with black spots and wears a black collar, facing right towards the camera. It sits on a gray brick sidewalk, positioned to the right of the image. To the left of the dog, a bicycle leans against the wall of the building. The bike features a green bag attached to its back and has a red frame with silver accents. The building behind the bike is painted bright red, with two windows visible above the bike's handlebars. Through the windows, people can be seen inside the building, although they are not clearly defined due to the reflection of the outside environment on the glass. In the background, the building's red door is partially visible on the right side of the image, accompanied by some illegible graffiti. Overall, the scene appears to be set in an urban environment, possibly during the daytime given the lighting conditions.<|eot_id|>
The response is accurate and gives a detailed description of the image.
Conclusion
Meta's Llama 3.2 vision models advance the integration of vision and language understanding, enabling AI to interpret and reason over visual data. Combined with the processing power of AMD GPUs running ROCm, these models perform well across a variety of vision-text tasks, such as image-based question answering and visual mathematical reasoning. This combination lets developers build faster, more scalable AI workflows and highlights the potential of multimodal AI applications in next-generation solutions.