The LAVIS Library

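1. Introduction to LAVIS

LAVIS is a Python deep learning library for language-and-vision research and applications. It has a unified design that gives access to state-of-the-art foundation language-vision models (ALBEF, BLIP, ALPRO, CLIP), common tasks (retrieval, captioning, visual question answering, multimodal classification, etc.) and datasets (COCO, Flickr, NoCaps, Conceptual Captions, SBU, etc.).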

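LAVIS has six key modules:

lavis.runners: manages the overall training and evaluation lifecycle. It also lazily creates required components on demand, such as optimizers, learning-rate schedulers and data loaders. Currently, RunnerBase implements epoch-based training and RunnerIters implements iteration-based training.
lavis.tasks: implements the concrete training and evaluation logic for each task. A task can be retrieval, captioning, pre-training and so on. The task abstraction exists to accommodate task-specific training and evaluation; for example, evaluating a retrieval model differs from evaluating a classification model.
lavis.datasets: creates datasets. lavis.datasets.builders loads dataset configs, downloads annotations and returns dataset objects; lavis.datasets.datasets defines the supported datasets, each of which is a torch.utils.data.Dataset instance. Automatic download tools are also provided in datasets/download_scripts to help prepare common public datasets.
lavis.models: holds the definitions of the supported models and shared model layers.
lavis.processors: preprocesses text and images/videos before they are fed to a model. For images and videos, a processor can be viewed as a torchvision transform; for text input, this may include lowercasing, truncation and so on.
lavis.common: contains shared classes and methods used by several other modules. For example:
- lavis.common.config: contains the classes that store and manipulate the configuration files used by LAVIS. In particular, a hierarchical configuration design is used to allow highly customizable training and evaluation.
- lavis.common.registry: serves as a central place for managing modules that share the same functionality. It allows datasets, models, tasks and learning-rate schedulers to be built at runtime by specifying their names as strings in the configuration file.
- lavis.common.optims: contains the definitions of the learning-rate schedulers.
- lavis.common.dist_utils: contains utilities for distributed training and evaluation.
- lavis.common.utils: contains miscellaneous utilities, mostly I/O-related helper functions.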
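
The registry idea described above can be illustrated with a minimal sketch. This is not the LAVIS implementation: the `Registry` class, the `toy_caption` name and `ToyCaptionModel` are hypothetical stand-ins, used only to show how a name string in a config can be resolved to a class at runtime.

```python
# Hypothetical, simplified registry; lavis.common.registry itself is richer.
class Registry:
    def __init__(self):
        self._models = {}

    def register_model(self, name):
        """Return a class decorator that stores the class under `name`."""
        def decorator(cls):
            if name in self._models:
                raise KeyError(f"model name '{name}' is already registered")
            self._models[name] = cls
            return cls
        return decorator

    def get_model_class(self, name):
        return self._models[name]


registry = Registry()


@registry.register_model("toy_caption")
class ToyCaptionModel:
    def __init__(self, model_type):
        self.model_type = model_type


# A config only carries strings; the class is resolved at runtime.
config = {"arch": "toy_caption", "model_type": "base"}
model_cls = registry.get_model_class(config["arch"])
model = model_cls(config["model_type"])
print(type(model).__name__, model.model_type)
```

Keeping the mapping in one place means a new model only needs a decorator to become selectable from a config file.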

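2. Hands-on examples

This section shows how to run inference on example data with the models in LAVIS. We first load an example image from local disk.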

from PIL import Image

raw_image = Image.open("../data/11.png").convert("RGB")
raw_image

Image Captioning

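Here we use the BLIP model to generate a caption for the image.

To make inference easier, each pretrained model is associated with its preprocessors (transforms), which are accessed through load_model_and_preprocess().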

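import torch
from lavis.models import load_model_and_preprocess

# Select the GPU if one is available; `device` is used by the calls below.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the BLIP captioning base model, fine-tuned on the MSCOCO caption dataset.
# The matching image preprocessors are returned as well.
model, vis_processors, _ = load_model_and_preprocess(name="blip_caption", model_type="base_coco", is_eval=True, device=device)

# Preprocess the image.
# vis_processors stores image transforms for "train" and "eval" (validation / testing / inference)
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# generate caption
# Pay attention to the generation parameters.
model.generate({"image": image}, num_beams=1, top_p=0.9, max_length=20, min_length=5)

# output
['a cat with a tie and a nose']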

Visual question answering (VQA)

The BLIP model can answer free-form questions about an image in natural language.

To access the VQA model, simply change the name and model_type passed to load_model_and_preprocess().

from lavis.models import load_model_and_preprocess

# Returns the visual preprocessors and the text preprocessors
model, vis_processors, txt_processors = load_model_and_preprocess(name="blip_vqa", model_type="vqav2", is_eval=True, device=device)

# ask a random question.
# question = "Which city is this photo taken?"
question = "what is this photo taken?"
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
question = txt_processors["eval"](question)
model.predict_answers(samples = {"image":image, "text_input":question}, inference_method="generate", num_beams=1)

# output
['in front of camera']

Note that num_beams=1 must be passed here; otherwise the call fails with a dimension-mismatch error.

Unified Feature Extraction Interface

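LAVIS provides a unified interface for extracting features from each architecture.

To extract features, we load the feature-extractor variant of each model.

The multimodal features can be used for multimodal classification. The low-dimensional unimodal features can be used to compute cross-modal similarity.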

import torch
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model, vis_processors, txt_processors = load_model_and_preprocess(name="blip_feature_extractor", model_type="base", is_eval=True, device=device)

caption = "a large fountain spewing water into the air"
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
text_input = txt_processors["eval"](caption)
sample = {"image": image, "text_input": [text_input]}

# Multimodal features: torch.Size([1, 12, 768])
# use features_multimodal[:,0,:] for multimodal classification tasks
features_multimodal = model.extract_features(sample)
print(features_multimodal.multimodal_embeds.shape)

# Image features: torch.Size([1, 197, 768])
features_image = model.extract_features(sample, mode="image")
print(features_image.image_embeds.shape)

# Text features: torch.Size([1, 12, 768])
features_text = model.extract_features(sample, mode="text")
print(features_text.text_embeds.shape)

# low-dimensional projected features
print(features_image.image_embeds_proj.shape)
# torch.Size([1, 197, 256])

print(features_text.text_embeds_proj.shape)
# torch.Size([1, 12, 256])

similarity = features_image.image_embeds_proj[:,0,:] @ features_text.text_embeds_proj[:,0,:].t()
print(similarity)
# tensor([[0.1090]], device='cuda:0')
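
To make the similarity computation above concrete without loading a model, here is a small sketch with made-up 4-dimensional vectors standing in for the first rows of image_embeds_proj and text_embeds_proj. It assumes the projected features are unit-normalized, in which case the dot product is the cosine similarity; the numbers are invented for illustration only.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Made-up stand-ins for the [CLS] rows of the projected features.
image_cls = l2_normalize([0.2, -0.1, 0.7, 0.4])
text_cls = l2_normalize([0.1, 0.0, 0.9, 0.3])
unrelated_cls = l2_normalize([-0.6, 0.8, -0.1, 0.0])

# For unit vectors the dot product equals the cosine similarity.
sim_match = dot(image_cls, text_cls)
sim_other = dot(image_cls, unrelated_cls)
print(sim_match > sim_other)
```

A higher score means the image and the text point in a more similar direction in the shared low-dimensional space.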

Loading datasets

# List the available datasets
from lavis.datasets.builders import dataset_zoo

dataset_names = dataset_zoo.get_names()
print(dataset_names)
'''
['aok_vqa', 'avsd_dialogue', 'coco_caption', 'coco_retrieval', 'coco_vqa', 'conceptual_caption_12m', 'conceptual_caption_3m', 'didemo_retrieval', 'flickr30k', 'gqa', 'imagenet', 'laion2B_multi', 'msrvtt_caption', 'msrvtt_qa', 'msrvtt_retrieval', 'msvd_caption', 'msvd_qa', 'nlvr', 'nocaps', 'ok_vqa', 'sbu_caption', 'snli_ve', 'vatex_caption', 'vg_caption', 'vg_vqa']
'''

from lavis.datasets.builders import load_dataset
# Load a dataset
coco_dataset = load_dataset("coco_caption")
print(coco_dataset.keys())  # dict_keys(['train', 'val', 'test'])
print(len(coco_dataset["train"]))   # 566747
print(coco_dataset["train"].annotation[0])
''' 
{
    'caption': 'A woman wearing a net on her head cutting a cake. ', 
    'image': 'val2014/COCO_val2014_000000522418.jpg', 
    'image_id': 'coco_522418', 
    'instance_id': '0'
}
'''

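Evaluating pretrained models on task datasets

LAVIS provides ready-to-use evaluation of pretrained and fine-tuned models on task datasets.

Let's look at an example that evaluates the BLIP model on the captioning task using the MSCOCO dataset.

Preparing the data

LAVIS provides automatic download scripts to help prepare most public datasets. To download the MSCOCO dataset, simply run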

cd lavis/datasets/download_scripts && python download_coco.py

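The downloaded dataset is placed in cache, the default cache location used by LAVIS.

To use a different cache location, update cache_root in lavis/configs/default.yaml.

If you already have a local copy of the dataset, it is recommended to create a symbolic link in the cache location that points to the local copy.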

ln -s /path/local/coco cache/coco

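Evaluating the pretrained model

Evaluate the pretrained model: bash run_scripts/blip/eval/eval_coco_cap.sh

Evaluate the large model: bash run_scripts/blip/eval/eval_coco_cap_large.sh

Where is the model path specified? The command and its config file are: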

python -m torch.distributed.run --nproc_per_node=8 evaluate.py --cfg-path lavis/projects/blip/eval/caption_coco_eval.yaml

model:
  arch: blip_caption
  model_type: base_coco

datasets:
  coco_caption: # name of the dataset builder
    vis_processor:
        eval:
          name: "blip_image_eval"
    text_processor:
        eval:
          name: "blip_caption"

run:
  # task: retrieval
  task: captioning
  # optimizer
  batch_size_train: 32
  batch_size_eval: 64
  num_workers: 4

  max_len: 20
  min_len: 5
  num_beams: 3

  seed: 42
  output_dir: "output/BLIP/Caption_coco"

  evaluate: True
  test_splits: ["test"]

  device: "cuda"
  world_size: 1
  dist_url: "env://"
  distributed: True

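Neither this config nor evaluate.py names the checkpoint to evaluate. As the model configuration section below explains, arch and model_type select a default model config through PRETRAINED_MODEL_CONFIG_DICT, and that default config determines which weights are loaded.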

Fine-tuning BLIP on the COCO captioning dataset

bash run_scripts/blip/train/train_caption_coco_large.sh

This fine-tunes the pretrained BLIP large model into a new model for image captioning.

A closer look

python -m torch.distributed.run --nproc_per_node=8 train.py --cfg-path lavis/projects/blip/train/caption_coco_large_ft.yaml

Model configuration

model:
    arch: blip_caption
    model_type: large_coco
    load_finetuned: False

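arch specifies the model architecture to use.

The available model architectures can be listed through model_zoo.

The runner looks for the model class registered under this name.

In this case, BlipCaption is the model class registered under the name blip_caption.

The registry holds the mapping from name strings to model classes, which lets the runner find the model class dynamically from the name string in the config file.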

# lavis/models/blip_models/blip_caption.py
# shows how BlipCaption is registered with the name string blip_caption

@registry.register_model("blip_caption")
class BlipCaption(BlipBase):
    """
    BLIP captioning model.
    Supported model types:
        - base_coco: fine-tuned BLIP base model on COCO caption dataset (Karparthy split).
        - large_coco: fine-tuned BLIP large model on COCO caption dataset (Karparthy split).
    Usage:
        >>> from lavis.models import load_model
        >>> model = load_model("blip_caption", "base_coco")
        >>> model = load_model("blip_caption", "large_coco")
    """
    PRETRAINED_MODEL_CONFIG_DICT = {
        "base_coco": "configs/models/blip_caption_base_coco.yaml",
        "large_coco": "configs/models/blip_caption_large_coco.yaml",
    }

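model_type specifies the model variant to fine-tune.

For example, BlipCaption has a pretrained base model and a pretrained large model.

Setting load_finetuned to False starts fine-tuning from the pretrained weights; setting it to True instead loads weights that have already been fine-tuned on COCO captioning.

Given the model architecture and type, the library looks up the default model config for large_coco in lavis/models/blip_models/blip_caption.py. As the snippet above shows, the corresponding config path is stored in BlipCaption.PRETRAINED_MODEL_CONFIG_DICT. The library then loads lavis/configs/models/blip_caption_large_coco.yaml as the config for building the model.

Config priority: note that the run config takes priority over the default model config, so parameters in the run config override those in the default model config. For example, the default model config sets load_finetuned to True, while the run config sets it to False so that fine-tuning starts from the pretrained weights only.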
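
The priority rule can be sketched as a recursive dictionary merge in which the run config wins. This is only an illustration under stated assumptions: the dictionaries below are hypothetical stand-ins for the YAML files (the `vit_type` key is invented), and `merge` is a hypothetical helper, not a LAVIS function.

```python
import copy

# Hypothetical default model config, standing in for the model's default YAML file.
default_model_cfg = {
    "model": {"arch": "blip_caption", "load_finetuned": True, "vit_type": "large"}
}
# Run config, mirroring the model section of the training YAML above.
run_cfg = {
    "model": {"arch": "blip_caption", "model_type": "large_coco", "load_finetuned": False}
}


def merge(base, override):
    """Recursively merge two dicts; values from `override` win."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


final_cfg = merge(default_model_cfg, run_cfg)
print(final_cfg["model"]["load_finetuned"])
```

Keys that appear only in the default config survive the merge, so a run config needs to list only the values it wants to change.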

Dataset configuration

datasets:
  coco_caption: # name of the dataset builder
    vis_processor:
        train:
          name: "blip_image_train"
        eval:
          name: "blip_image_eval"
    text_processor:
        train:
          name: "blip_caption"
          prompt: "a picture of "
        eval:
          name: "blip_caption"

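Each dataset has a vis_processor and a text_processor, which handle the visual and text inputs respectively.

The processor classes are loaded dynamically through the registry mechanism.

blip_image_train is the name string of the BlipImageTrainProcessor class, which is registered in lavis/processors/blip_processors.py.

The dataset name string is also registered in the registry and points to the dataset builder class COCOCapBuilder.

By default, the builder loads the default dataset config file listed in DATASET_CONFIG_DICT: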

datasets:
  coco_caption: # name of the dataset builder
    dataset_card: dataset_card/coco_caption.md
    # data_dir: ${env.data_dir}/datasets
    data_type: images # [images|videos|features]

    build_info:
      # Be careful not to append minus sign (-) before split to avoid itemizing
      annotations:
        train:
          url: https://storage.googleapis.com/sfr-vision-language-research/datasets/coco_karpathy_train.json
          md5: aa31ac474cf6250ebb81d18348a07ed8
          storage: coco/annotations/coco_karpathy_train.json
        val:
          url: https://storage.googleapis.com/sfr-vision-language-research/datasets/coco_karpathy_val.json
          md5: b273847456ef5580e33713b1f7de52a0
          storage:  coco/annotations/coco_karpathy_val.json
        test:
          url: https://storage.googleapis.com/sfr-vision-language-research/datasets/coco_karpathy_test.json
          md5: 3ff34b0ef2db02d01c37399f6a2a6cd1
          storage: coco/annotations/coco_karpathy_test.json
      images:
        # Root path of the images, relative to the cache directory
        storage: coco/images/

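build_info is divided into annotations and images.

LAVIS supports training on multiple datasets. See lavis/projects/blip/train/pretrain_14m.yaml.

3. Custom modules in LAVIS

3.1 Custom datasets

Use the lavis.datasets module to create a new dataset.

LAVIS includes a standard dataset module that allows new datasets to be added.

This involves creating a dataset configuration, and defining and associating the new dataset classes.

We walk through the steps for adding dataset classes for the Audio-Visual Scene-Aware Dialogue (AVSD) benchmark, a video-grounded dialogue task.

Dataset configuration

First, define a basic configuration file for the dataset, covering the new dataset name avsd_dialogue, the dataset card and the data type.

New dataset configurations are defined under lavis.configs.datasets.

For example, avsd/defaults_dial.yaml: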

datasets:
  avsd_dialogue:
    dataset_card: dataset_card/avsd_dialogue.md # path to the dataset card
    data_type: features # [images|videos|features] we use pre-extracted video features in this case

    build_info:
      annotations:
        train:
          url: /export/home/data/avsd/train_set4DSTC7-AVSD.json
          storage: avsd/annotations/train.json
        val:
          url: /export/home/data/avsd/valid_set4DSTC7-AVSD.json
          storage: avsd/annotations/val.json
        test:
          url: /export/home/data/avsd/test_set4DSTC7-AVSD.json
          storage: avsd/annotations/test.json
      features:
        storage: /export/home/data/avsd/features/

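Dataset card

An optional step when setting up the dataset configuration is to define a dataset card, which contains more details about the dataset such as its description, tasks and metrics. For example, a dataset card for the AVSD benchmark can be defined in dataset_card/avsd_dialogue.md. Depending on the dataset, the card may include a command for downloading the data automatically (using Python code defined in lavis.datasets.download_scripts), which loads the data and stores it in a specific folder. Otherwise, the card should describe the external instructions for downloading the data from its original source so that the dataset can be loaded correctly.

Example dataset card for the AVSD benchmark: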

![Samples from the AVSD dataset (Image credit: "https://arxiv.org/pdf/1901.09107.pdf").](imgs/avsd_dialogue.png)(Samples from the AVSD dataset. Image credit: "https://arxiv.org/pdf/1901.09107.pdf")

# Audio-Visual Scene-Aware Dialogues (AVSD)
## Description
[Audio-Visual Scene-Aware Dialogues (AVSD)](https://github.com/hudaAlamri/DSTC7-Audio-Visual-Scene-Aware-Dialog-AVSD-Challenge) contains more than 10,000 dialogues, each of which is grounded on a unique video. In the test split, for each test sample, 6 reference dialogue responses are provided.
## Task
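In a video-grounded dialogue task, the system must generate responses to user input in the context of a given dialogue.
This context consists of the dialogue history (previous utterances by both the user and the system) as well as the video and audio information that make up the scene.
The quality of the automatically generated sentences is evaluated with objective measures to determine whether the generated responses are natural and informative.
## Metrics
Models are typically evaluated with the [BLEU], [CIDER], [METEOR] and [ROUGE-L] metrics.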
## Leaderboard
....
## Auto-Downloading
Please refer to [benchmark website](https://github.com/hudaAlamri/DSTC7-Audio-Visual-Scene-Aware-Dialog-AVSD-Challenge) for instructions to download the dataset.
## References

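Visual data types

Visual data types are currently limited to one of three options: images, videos and features.

Images and videos refer to raw visual data, which suits models that process visual data in its original form (for example, ViT models).

Features are visual representations extracted from a pretrained model (for example, a CNN model).

Here, the AVSD benchmark consists of video features extracted from a 3D-CNN model.

Build Info

Build info refers to the specific locations where the data is stored and cached.

For text annotations (for example, captions or dialogues), three data splits are included by default, namely train, val and test, which are commonly used in machine learning projects.

For each split, two parameters are specified: url and storage.

url can be an online URL from which the data can be loaded automatically (for example, from googleapis), or a local directory where the data has already been downloaded.

storage is the directory where the data is cached, which avoids downloading it repeatedly.

For the visual data, make sure the field name matches the data type defined earlier: images, videos or features.

Since visual features are usually large and should be downloaded beforehand, only a storage parameter is kept for caching the visual data.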
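
The roles of url and storage can be sketched roughly as follows. Both helpers are hypothetical and only approximate the behaviour described above: it is an assumption here that a relative storage path is resolved against the cache root, and that a remote url is fetched only when nothing is cached yet. POSIX-style paths are used so the example behaves the same on every platform.

```python
import os
import posixpath


def resolve_storage(cache_root, storage):
    """Absolute storage paths are used as-is; relative ones live under the cache root."""
    if posixpath.isabs(storage):
        return storage
    return posixpath.join(cache_root, storage)


def needs_download(url, storage_path):
    """A remote url is fetched only if nothing is cached at the storage path yet."""
    is_remote = url.startswith(("http://", "https://"))
    return is_remote and not os.path.exists(storage_path)


# Relative storage from the COCO config, absolute storage from the AVSD config.
print(resolve_storage("/cache", "coco/annotations/coco_karpathy_train.json"))
print(resolve_storage("/cache", "/export/home/data/avsd/features/"))
```

With this split, the same config works whether the annotations come from a remote server or from a directory that was prepared by hand.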

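Base Dataset lavis.datasets.datasets.base_dataset

A new dataset is defined by inheriting from lavis.datasets.datasets.base_dataset.

The base dataset class already defines some standard methods, for example a collater that uses the default collate function from PyTorch.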

import json
from typing import Iterable

from torch.utils.data import Dataset, ConcatDataset
from torch.utils.data.dataloader import default_collate

# Default implementation of BaseDataset
class BaseDataset(Dataset):
    def __init__(self, vis_processor=None, text_processor=None, vis_root=None, ann_paths=None):
        '''
        vis_root: root directory of the images
        ann_paths: list of paths to the annotation files
        '''
        self.vis_root = vis_root
        self.annotation = []
        # Avoid a mutable default argument; None means no annotation files.
        for ann_path in ann_paths or []:
            with open(ann_path, "r") as f:
                self.annotation.extend(json.load(f))
        self.vis_processor = vis_processor
        self.text_processor = text_processor
        self._add_instance_ids()

    def __len__(self):
        return len(self.annotation)

    def collater(self, samples):
        return default_collate(samples)

    def set_processors(self, vis_processor, text_processor):
        self.vis_processor = vis_processor
        self.text_processor = text_processor

    def _add_instance_ids(self, key="instance_id"):
        # Give every annotation a unique string id based on its position.
        for idx, ann in enumerate(self.annotation):
            ann[key] = str(idx)

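Any dataset subclass inherits these methods and can selectively define or override them according to the specifics of the dataset.

Users are encouraged not to modify the base dataset class, because any change has a knock-on effect on every other dataset class that inherits from it.

Instead, users should create new, independent dataset classes for their specific needs.

Creating a new dialogue dataset

For the AVSD dataset, define a new dataset subclass DialogueDataset for dialogue tasks.

It is defined in lavis.datasets.datasets.dialogue_datasets.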

import os
from collections import OrderedDict

from lavis.datasets.datasets.base_dataset import BaseDataset
import json
import copy

class DialogueDataset(BaseDataset):
    def __init__(self, vis_processor, text_processor, vis_root, ann_paths):
        self.vis_root = vis_root
        self.annotation = []
        for ann_path in ann_paths:
            dialogs = json.load(open(ann_path, "r"))['dialogs']
            for dialog in dialogs:
                all_turns = dialog['dialog']
                dialog_context = []
                for turn in all_turns:
                    dialog_instance = copy.deepcopy(dialog)
                    question = turn['question']
                    answer = turn['answer']
                    dialog_instance['dialog'] = copy.deepcopy(dialog_context)
                    dialog_instance['question'] = question
                    dialog_instance['answer'] = answer
                    self.annotation.append(dialog_instance)
                    dialog_context.append(turn)
        self.vis_processor = vis_processor
        self.text_processor = text_processor
        self._add_instance_ids()

        # Map each image id to a consecutive integer index.
        self.image_ids = {}
        n = 0
        for ann in self.annotation:
            img_id = ann["image_id"]
            if img_id not in self.image_ids:
                self.image_ids[img_id] = n
                n += 1

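If we want a dialogue dataset that is used only for testing, we can define another dataset class, DialogueEvalDataset, in a similar way but with the annotations processed differently.

Typically, in dialogue tasks, only a single test sample is built per dialogue at test time (rather than splitting every dialogue turn into its own sample, as is done for training). The dataset class can then be defined as follows: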

class DialogueEvalDataset(BaseDataset):
    def __init__(self, vis_processor, text_processor, vis_root, ann_paths):
        ...
        # The difference from the class above lies in how the dialogue annotations are built
        self.annotation = []
        for ann_path in ann_paths:
            dialogs = json.load(open(ann_path, "r"))['dialogs']
            for dialog in dialogs:
                all_turns = dialog['dialog']
                dialog_context = all_turns[:-1]
                last_turn = all_turns[-1]
                question = last_turn['question']
                answer = last_turn['answer']
                dialog['dialog'] = dialog_context
                dialog['question'] = question
                dialog['answer'] = answer
                self.annotation.append(dialog)
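
The difference between the two annotation schemes can be seen on a toy dialogue. The sketch below re-implements only the expansion logic of the two classes above as standalone functions (the names `expand_for_training` and `expand_for_eval` are hypothetical). Unlike the evaluation listing above, it works on a copy instead of modifying the input dictionary in place.

```python
import copy


def expand_for_training(dialog):
    """One sample per turn; each sample carries the turns that preceded it."""
    samples, context = [], []
    for turn in dialog["dialog"]:
        sample = copy.deepcopy(dialog)
        sample["dialog"] = copy.deepcopy(context)
        sample["question"] = turn["question"]
        sample["answer"] = turn["answer"]
        samples.append(sample)
        context.append(turn)
    return samples


def expand_for_eval(dialog):
    """One sample per dialogue; the last turn is the target."""
    sample = copy.deepcopy(dialog)
    *context, last = dialog["dialog"]
    sample["dialog"] = copy.deepcopy(context)
    sample["question"] = last["question"]
    sample["answer"] = last["answer"]
    return [sample]


toy_dialog = {
    "image_id": "vid_001",
    "dialog": [
        {"question": "q1", "answer": "a1"},
        {"question": "q2", "answer": "a2"},
        {"question": "q3", "answer": "a3"},
    ],
}
train_samples = expand_for_training(toy_dialog)
eval_samples = expand_for_eval(toy_dialog)
print(len(train_samples), len(eval_samples))  # 3 1
```

A three-turn dialogue therefore yields three training samples with 0, 1 and 2 turns of context, but a single evaluation sample whose target is the last turn.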

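Defining datasets through class inheritance allows more fine-grained class implementations, each tailored to a specific benchmark.

For example, for dialogue-based tasks we can define a further dataset subclass for the AVSD dataset.

The new class AVSDDialDataset further specifies how samples are loaded and how they are collated according to the dataset's requirements.

import os
from lavis.datasets.datasets.base_dataset import BaseDataset
from lavis.datasets.datasets.dialogue_datasets import DialogueDataset, DialogueEvalDataset
import torch

class AVSDDialDataset(DialogueDataset):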
    def __init__(self, vis_processor, text_processor, vis_root, ann_paths):
        super().__init__(vis_processor, text_processor, vis_root, ann_paths)
    def __getitem__(self, index):
        ann = self.annotation[index]
        vname = ann['image_id']
        video = self.vis_processor(self.vis_root, vname)
        dialogue = self.text_processor(ann)
        return {
            "video_fts": video['video_fts'],
            "video_token_type_ids": video['token_type_ids'],
            "input_ids": dialogue['input_ids'],
            "token_type_ids": dialogue['token_type_ids'],
            "labels": dialogue['labels'],
            "image_id": ann["image_id"],
            "instance_id": ann["instance_id"]
        }
        
    def collater(self, samples):
        input_ids, token_type_ids, labels, video_fts, video_token_type_ids = [], [], [], [], []
        for i in samples:
            input_ids.append(i['input_ids'])
            token_type_ids.append(i['token_type_ids'])
            labels.append(i['labels'])
            video_fts.append(i['video_fts'])
            video_token_type_ids.append(i['video_token_type_ids'])

        input_ids = self.text_processor.padding(input_ids)
        labels = self.text_processor.padding(labels, -1)
        video_fts = self.vis_processor.padding(video_fts)
        
        token_type_ids = self.text_processor.padding(token_type_ids)
        video_token_type_ids = self.text_processor.padding(video_token_type_ids)
        token_type_ids = torch.cat([video_token_type_ids, token_type_ids], dim=1)

        attn_mask = self.text_processor.get_attention_mask(input_ids)
        video_mask = self.vis_processor.get_attention_mask(video_fts)
        attn_mask = torch.cat([video_mask, attn_mask], dim=1)

        video_labels = torch.ones((video_fts.size(0), video_fts.size(1))).long() * -1 # ignore token indice -1 by default

        labels = torch.cat([video_labels, labels], dim=1)

        samples = {}
        samples['input_ids'] = input_ids
        samples['token_type_ids'] = token_type_ids
        samples['labels'] = labels
        samples['video_fts'] = video_fts
        samples['attn_mask'] = attn_mask

        return samples

If a dataset subclass does not define its own collater, the collater inherited from the BaseDataset class is used by default to collate data samples.
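
What the custom collater above does to variable-length samples can be sketched with plain Python lists instead of tensors. The pad id, the toy token ids and the helper names are assumptions made for illustration; for simplicity the text token ids are reused as labels, whereas the real labels come from the text processor.

```python
PAD_ID = 0    # hypothetical pad token id
IGNORE = -1   # label value ignored by the loss, as in the collater above


def pad(seqs, value):
    """Right-pad every sequence to the length of the longest one."""
    width = max(len(s) for s in seqs)
    return [s + [value] * (width - len(s)) for s in seqs]


def attention_mask(padded, pad_value):
    """True for real tokens, False for padding."""
    return [[tok != pad_value for tok in row] for row in padded]


# Two toy samples: text token ids and the number of video frames per sample.
text_ids = [[5, 6, 7], [8, 9]]
video_lens = [2, 2]

padded_text = pad(text_ids, PAD_ID)
text_mask = attention_mask(padded_text, PAD_ID)
video_mask = [[True] * n for n in video_lens]

# Video positions come first, followed by the text positions.
attn_mask = [v + t for v, t in zip(video_mask, text_mask)]
# Video positions never contribute to the loss, so their labels are IGNORE.
labels = [[IGNORE] * n + row for n, row in zip(video_lens, pad(text_ids, IGNORE))]
print(attn_mask[1], labels[1])
```

The key point is that the mask and the labels must be concatenated in the same order as the model input: video first, text second.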

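Dataset Builder

The dataset builder is the data-processing module that controls the dataset classes and associates them with specific dataset configurations.

lavis.datasets.builders.base_dataset_builder

Base Dataset Builder

A new dataset builder inherits from BaseDatasetBuilder.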

class BaseDatasetBuilder:
    train_dataset_cls, eval_dataset_cls = None, None
    def __init__(self, cfg=None):
        super().__init__()

        if cfg is None:
            # help to create datasets from default config.
            self.config = load_dataset_config(self.default_config_path())
        elif isinstance(cfg, str):
            self.config = load_dataset_config(cfg)
        else:
            # when called from task.build_dataset()
            self.config = cfg

        self.data_type = self.config.data_type

        self.vis_processors = {"train": BaseProcessor(), "eval": BaseProcessor()}
        self.text_processors = {"train": BaseProcessor(), "eval": BaseProcessor()}

        # additional processors, each specified by a name in string.
        self.kw_processors = {}

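Take a closer look at the standard methods defined in the base builder class, including _download_data and build_datasets, which download the data and create instances of the dataset classes: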

class BaseDatasetBuilder:
    ...
    def build_datasets(self):
        # download, split, etc...
        # only called on 1 GPU/TPU in distributed
        # Only the main process downloads the data
        if is_main_process():
            self._download_data()
        if is_dist_avail_and_initialized():
            dist.barrier()

        # at this point, all the annotations and image/videos should be all downloaded to the specified locations.
        logging.info("Building datasets...")
        datasets = self.build()  # dataset['train'/'val'/'test']
        return datasets

    def _download_data(self):
        self._download_ann()
        self._download_vis()

Dialogue dataset builder

lavis.datasets.builders.dialogue_builder

from lavis.datasets.builders.base_dataset_builder import BaseDatasetBuilder
from lavis.datasets.datasets.avsd_dialogue_datasets import(
    AVSDDialDataset,
    AVSDDialEvalDataset
)
from lavis.common.registry import registry

@registry.register_builder("avsd_dialogue")
class AVSDDialBuilder(BaseDatasetBuilder):
    train_dataset_cls = AVSDDialDataset
    eval_dataset_cls = AVSDDialEvalDataset

    DATASET_CONFIG_DICT = {
        "default": "configs/datasets/avsd/defaults_dial.yaml"
    }

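Note that train_dataset_cls and eval_dataset_cls are defined separately to account for cases where data is processed differently at training and test time.

For example, in captioning tasks each test sample usually includes multiple ground-truth captions, rather than the single ground-truth caption used during training.

If the data is processed identically at training and test time, both parameters can point to the same dataset class.

Finally, DATASET_CONFIG_DICT is defined to associate the dataset configuration with the assigned dataset classes.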
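
How a builder can use the two class attributes is sketched below. `ToyBuilder` and both toy dataset classes are hypothetical; the sketch assumes that the train split uses train_dataset_cls and that every other split uses eval_dataset_cls.

```python
class ToyTrainDataset:
    def __init__(self, ann_path):
        self.ann_path = ann_path


class ToyEvalDataset(ToyTrainDataset):
    pass


class ToyBuilder:
    # Mirrors the two class attributes of the builder above.
    train_dataset_cls = ToyTrainDataset
    eval_dataset_cls = ToyEvalDataset

    def __init__(self, annotations):
        self.annotations = annotations  # split name -> annotation path

    def build(self):
        datasets = {}
        for split, ann_path in self.annotations.items():
            # Pick the dataset class according to the split.
            cls = self.train_dataset_cls if split == "train" else self.eval_dataset_cls
            datasets[split] = cls(ann_path)
        return datasets


datasets = ToyBuilder(
    {"train": "train.json", "val": "val.json", "test": "test.json"}
).build()
print({split: type(ds).__name__ for split, ds in datasets.items()})
```

Because the classes are plain class attributes, a subclass of the builder only has to reassign them to change how each split is constructed.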

Registering Builder

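First, the newly defined class needs to be imported in the __init__.py file of the builders package. __init__.py is a special file that marks a directory as a Python package and can contain the package's initialization code.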

from lavis.datasets.builders.dialogue_builder import (
    AVSDDialBuilder
)

__all__ = [
    ...,
    "AVSDDialBuilder"
]

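The __all__ list in __init__.py specifies which names are exported when another module runs a wildcard import (from package import *) on the package. The classes imported in __init__.py can also be accessed directly through the package name.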
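
The effect of __all__ can be checked with a throwaway in-memory module. The module name `toy_builders` and both classes are hypothetical; the sketch only demonstrates standard Python import behaviour and does not involve LAVIS.

```python
import sys
import types

# Source of a throwaway module; `toy_builders` is a hypothetical name.
source = '''
__all__ = ["PublicBuilder"]

class PublicBuilder:
    pass

class HelperBuilder:
    pass
'''
module = types.ModuleType("toy_builders")
exec(source, module.__dict__)
sys.modules["toy_builders"] = module

# Run a wildcard import into an empty namespace.
namespace = {}
exec("from toy_builders import *", namespace)

print("PublicBuilder" in namespace)      # True: listed in __all__
print("HelperBuilder" in namespace)      # False: not listed in __all__
print(hasattr(module, "HelperBuilder"))  # True: still reachable by explicit name
```

In other words, __all__ limits what a wildcard import brings in; it does not hide names from explicit imports.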

Assigning Builder

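During data loading and processing, the builder must be referenced by the name under which it was registered so that it can be loaded correctly.

For example, the following should be specified in the config file dialogue_avsd_ft.yaml: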

datasets:
  avsd_dialogue: # name of the dataset builder
    ...
    # processor configuration
    ...

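Any process (for example, training) should then load this config file to assign the correct builder, which in turn associates the correct dataset classes for building data samples.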

python train.py --cfg-path dialogue_avsd_ft.yaml

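To summarize, reviewing top-down, adding a new dialogue dataset for fine-tuning involves:

1.) Training loads a config file, which must reference the dialogue dataset builder correctly.

2.) Create the dialogue dataset builder, inheriting from BaseDatasetBuilder. It must map to the dataset's default config file and name the training and evaluation dataset classes.

3.) Define the task's dataset classes, inheriting from BaseDataset, and implement the dataset-specific details.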

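3.1.1 Example: custom datasets in miniGPT4_Qwen

The dataset configuration lists the name string of each dataset builder; under each builder it configures the visual processor and the text processor.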

datasets:
  minigpt4_instruction: # name of the dataset builder
    vis_processor:
        train:
          name: "blip2_image_train"
          image_size: 224
    text_processor:
        train:
          name: "base_instruction"
          max_words: 100
  llava_instruction: # name of the dataset builder
    vis_processor:
        train:
          name: "blip2_image_train"
          image_size: 224
    text_processor:
        train:
          name: "base_instruction"
          max_words: 100

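Creating the dataset builders

The file lavis/datasets/builders/minigpt4qwen_builder.py defines the dataset builders for minigpt4_instruction and llava_instruction.

Each dataset builder holds the dataset class for its dataset and the dataset's default configuration.

Both builders use InstructionDataset as the training dataset class, which means the two datasets are constructed in the same way for training.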

from lavis.common.registry import registry
from lavis.datasets.datasets.minigpt4_instructions import InstructionDataset
from lavis.datasets.builders.base_dataset_builder import BaseDatasetBuilder

@registry.register_builder("minigpt4_instruction")
class Minigpt4InstructionBuilder(BaseDatasetBuilder):
    # Training dataset class = InstructionDataset
    train_dataset_cls = InstructionDataset
    DATASET_CONFIG_DICT = {
        'default': 'configs/datasets/minigpt4_instruction/defaults_instruction.yaml'
    }

@registry.register_builder("llava_instruction")
class LlavaInstructionBuilder(BaseDatasetBuilder):
    # Training dataset class = InstructionDataset
    train_dataset_cls = InstructionDataset
    DATASET_CONFIG_DICT = {
        'default': 'configs/datasets/llava_instruction/defaults_instruction.yaml'
    }

The custom dataset classes

# Custom instruction dataset, inheriting from Minigpt4QwenDataset
class InstructionDataset(Minigpt4QwenDataset, __DisplMixin):
    def __init__(self, vis_processor, text_processor, vis_root, ann_paths):
        self.vis_root = vis_root

        self.annotation = []
        for ann_path in ann_paths:
            self.annotation.extend(json.load(open(ann_path, "r")))

        self.vis_processor = vis_processor
        self.text_processor = text_processor
        # Call the inherited _add_instance_ids method
        self._add_instance_ids()

    def __getitem__(self, index):
        ann = self.annotation[index]
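        # Path to the image
        image_path = os.path.join(self.vis_root,ann['image'])
        # Load the image
        image = Image.open(image_path).convert("RGB")
        # Apply the visual processor
        image = self.vis_processor(image)
        # Multi-turn annotations store lists of instructions and outputs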
        if isinstance(ann['instruction'],list):
            instructions = ann['instruction']
            outputs = ann['output']
            conversations = []
            for turn_i, instruction in enumerate(instructions):
                instruction = self.text_processor(instruction)
                output = outputs[turn_i]
                conversations.extend(
                    [
                        {"from": "user", "value":instruction},
                        {"from": "assistant", "value": output},
                    ]
                )
        else:
            instruction = self.text_processor(ann['instruction'])
            output = ann['output']
            conversations = [
                {"from": "user", "value":instruction},
                {"from": "assistant", "value": output},
            ]
        # Return the image and the conversation
        return {
            "image": image,
            "conversations": conversations,
        }
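
# Custom dataset inheriting from BaseDataset.
# In a single file, this class must be defined before InstructionDataset, which inherits from it.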
class Minigpt4QwenDataset(BaseDataset):
    def __init__(self, vis_processor, text_processor, vis_root, ann_paths):
        super().__init__(vis_processor, text_processor, vis_root, ann_paths)

    def collater(self, samples):
        image_list, conversation_list = [], []
        num_answers = []
        # Gather the images and the conversations into lists
        for sample in samples:
            if isinstance(sample['image'],list):
                image_list.extend(sample['image'])
            else:
                image_list.append(sample["image"])
            conversation_list.append(sample["conversations"])

        return {
            "image": torch.stack(image_list, dim=0),
            "conversations": conversation_list,
        }
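
The single-turn and multi-turn branches of __getitem__ above can be expressed as one small standalone function. `build_conversations` is a hypothetical helper written for illustration; it treats a single-turn annotation as a one-element multi-turn annotation and uses an identity function in place of the text processor.

```python
def build_conversations(ann, text_processor=lambda text: text):
    """Turn an instruction annotation into alternating user/assistant turns."""
    instructions, outputs = ann["instruction"], ann["output"]
    if not isinstance(instructions, list):
        # Treat a single-turn annotation as a one-element multi-turn one.
        instructions, outputs = [instructions], [outputs]
    conversations = []
    for instruction, output in zip(instructions, outputs):
        conversations.append({"from": "user", "value": text_processor(instruction)})
        conversations.append({"from": "assistant", "value": output})
    return conversations


single = build_conversations({"instruction": "Describe the image.", "output": "A cat."})
multi = build_conversations({
    "instruction": ["What animal is this?", "What color is it?"],
    "output": ["A cat.", "Black."],
})
print(len(single), len(multi))  # 2 4
```

Only the user side passes through the text processor, matching the listing above, where the outputs are stored unchanged.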

Default dataset configuration

# configs/datasets/minigpt4_instruction/defaults_instruction.yaml
datasets:
  minigpt4_instruction: # name string of the dataset
    # data_dir: ${env.data_dir}/datasets
    data_type: images # [images|videos|features]

    build_info:
      # Be careful not to append minus sign (-) before split to avoid itemizing
      annotations:
        train:
          url: /root/autodl-tmp/cache/dataset/minigpt4/minigpt4_minigpt4qwen_format.json
          storage: /root/autodl-tmp/cache/dataset/minigpt4/minigpt4_minigpt4qwen_format.json
      images:
          storage: /root/autodl-tmp/cache/dataset/minigpt4/image

# configs/datasets/llava_instruction/defaults_instruction.yaml
datasets:
  llava_instruction:
    # data_dir: ${env.data_dir}/datasets
    data_type: images # [images|videos|features]

    build_info:
      # Be careful not to append minus sign (-) before split to avoid itemizing
      annotations:
        train:
          url: /root/autodl-tmp/cache/dataset/llava/llava_minigpt4qwen_format.json
          storage: /root/autodl-tmp/cache/dataset/llava/llava_minigpt4qwen_format.json
      images:
          storage: /root/autodl-tmp/cache/dataset/llava/image

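3.2 Custom processors

Use the lavis.processors module to define new processors.

LAVIS includes a standard processor module for preprocessing data, for example image transforms and sequence concatenation.

This tutorial demonstrates how to add visual and text processors for a video-grounded dialogue task.

In addition, the processors should make the data samples compatible with GPT-style models.

Base Processor

lavis.processors.base_processors

A new processor should inherit from the base processor BaseProcessor.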

# OmegaConf is a library for handling configuration files
from omegaconf import OmegaConf

class BaseProcessor:
    def __init__(self):
        # Initialized to a lambda that takes one argument x and returns it unchanged.
        # By default, the processor therefore applies no transformation to the data.
        self.transform = lambda x: x
        return

    def __call__(self, item):
        # Calling the instance applies self.transform.
        # In the base class, self.transform returns item untouched.
        return self.transform(item)

    @classmethod
    def from_config(cls, cfg=None):
        # Create an instance from a config
        return cls()

    def build(self, **kwargs):
        # Convert the keyword arguments into a config object
        cfg = OmegaConf.create(kwargs)
        # Create an instance from the config object
        return self.from_config(cfg)
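
A subclass typically overrides the transform and from_config. The sketch below is a simplified illustration: it uses a trimmed copy of the base class with a plain dict in place of OmegaConf so that it runs without extra dependencies, and `ToyCaptionProcessor` with its `max_words` option is a hypothetical example, not a LAVIS processor.

```python
class BaseProcessor:
    """Trimmed copy of the base class above, using a plain dict instead of OmegaConf."""

    def __init__(self):
        self.transform = lambda x: x

    def __call__(self, item):
        return self.transform(item)

    @classmethod
    def from_config(cls, cfg=None):
        return cls()


class ToyCaptionProcessor(BaseProcessor):
    """Hypothetical text processor: lowercases and keeps at most `max_words` words."""

    def __init__(self, max_words=50):
        self.max_words = max_words
        self.transform = self._clean

    def _clean(self, text):
        words = text.lower().split()
        return " ".join(words[: self.max_words])

    @classmethod
    def from_config(cls, cfg=None):
        # Fall back to the default when the config does not set max_words.
        cfg = cfg or {}
        return cls(max_words=cfg.get("max_words", 50))


processor = ToyCaptionProcessor.from_config({"max_words": 3})
print(processor("A Large Fountain Spewing Water"))  # a large fountain
```

Reading options in from_config keeps the constructor free of config-handling code, so the processor can be built either from a config or directly with keyword arguments.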

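GPT-style Processors

lavis.processors.gpt_processors

Defining new processor classes

For example, under lavis.processors.gpt_processors we define processors designed specifically for GPT models used in video-grounded dialogue.

We assume the video features have been extracted beforehand, so this processor simply loads the features from npy files.

Other specially defined methods include padding (used by dataset instances to pad multiple video samples) and get_attention_mask (which creates the attention mask for Transformer attention in the GPT model).

Video features are handled by the GPTVideoFeatureProcessor class.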

# Special tokens that will be added to the GPT model's tokenizer
SPECIAL_TOKENS_DICT = {'bos_token': "<bos>", 'eos_token': "<eos>", 'additional_special_tokens': ["<speaker1>", "<speaker2>", "<video>", "<cap>"], 'pad_token': "<pad>"}
...

# Register a new processor class under the name string gpt_video_ft
@registry.register_processor("gpt_video_ft")
class GPTVideoFeatureProcessor(BaseProcessor):
    def __init__(self, visual_ft, audio_ft):
        # Names of the visual and audio feature types to load
        self.visual_ft = visual_ft
        self.audio_ft = audio_ft
        # Initialize a GPT2Tokenizer instance and add the special tokens.
        self.tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
        self.tokenizer.add_special_tokens(SPECIAL_TOKENS_DICT)

    def padding(self, seq):
        padded_seq = torch.nn.utils.rnn.pad_sequence(seq, batch_first=True, padding_value=1.0)
        return padded_seq

    def get_attention_mask(self, seq):
        # Padding positions hold the value 1.0; a position counts as real
        # if at least one of its feature values differs from 1.
        return torch.sum(seq != 1, dim=2) != 0

    def __call__(self, ft_root, vname):
        all_ft = []
        # Load the visual features
        for ft_name in self.visual_ft:
            ft_path = os.path.join(ft_root, ft_name, vname)
            all_ft.append(np.load(ft_path + '.npy'))
        # Load the audio features
        for ft_name in self.audio_ft:
            ft_path = os.path.join(ft_root, ft_name, vname)
            all_ft.append(np.load(ft_path + '.npy'))
        # Find the shortest length among all feature arrays
        min_len = min([len(ft) for ft in all_ft])
        # Truncate every feature array to the shortest length
        sampled_ft = [ft[:min_len] for ft in all_ft]
        # Concatenate the truncated arrays along the second (feature) dimension
        sampled_ft = np.concatenate(sampled_ft, axis=1)
        
        item = {}
        # Video features
        item['video_fts'] = torch.Tensor(sampled_ft)
        # Token type id of the video positions
        video_type_token = self.tokenizer.convert_tokens_to_ids('<video>')
        item['token_type_ids'] = torch.Tensor([video_type_token] * len(sampled_ft)).long()
        
        return item

    @classmethod
    def from_config(cls, cfg=None):
        # Create a processor instance from a config
        if cfg is None:
            cfg = OmegaConf.create()

        visual_ft = cfg.get("visual_ft", ["i3d_rgb"])
        audio_ft = cfg.get("audio_ft", ["vggish"])

        return cls(
            visual_ft=visual_ft,
            audio_ft=audio_ft
        )
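
The truncate-then-concatenate step in __call__ above can be followed on toy data. The sketch uses nested Python lists instead of NumPy arrays so that it runs without dependencies; `merge_features` is a hypothetical helper and the feature values are invented.

```python
def merge_features(feature_streams):
    """Truncate every stream to the shortest length, then join them per time step."""
    min_len = min(len(stream) for stream in feature_streams)
    merged = []
    for t in range(min_len):
        row = []
        for stream in feature_streams:
            row.extend(stream[t])
        merged.append(row)
    return merged


# Toy streams: 3 steps of 2-dim "visual" features, 4 steps of 1-dim "audio" features.
visual = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
audio = [[1.0], [2.0], [3.0], [4.0]]

merged = merge_features([visual, audio])
print(len(merged), len(merged[0]))  # 3 3
```

The extra audio step is dropped so that both streams cover the same number of time steps, and each remaining step holds the visual values followed by the audio values.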

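Another useful processor class handles the dialogue data.

Define a GPTDialogueProcessor class. This processor takes the raw annotations and builds the model input as a concatenation of the input sequences (questions, dialogue contexts and responses), so that it can be fed to a GPT model.

Other specially defined methods include padding and get_attention_mask.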
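
Before the full listing, the layout produced by the sequence-building step can be illustrated with toy token ids. The special-token ids below are hypothetical (the real ones come from the tokenizer), and `build_instance` is a standalone re-statement of that step for illustration only.

```python
from itertools import chain

# Hypothetical ids for the special tokens; real ids come from the tokenizer.
EOS, SPEAKER1, SPEAKER2, CAP = 1, 2, 3, 4


def build_instance(caption, history, answer):
    # Every segment is closed with an end-of-sequence token.
    sequence = [s + [EOS] for s in [caption] + history + [answer]]
    input_ids = list(chain(*sequence))
    # Caption positions get CAP; the remaining segments alternate between speakers.
    token_type_ids = [CAP] * len(sequence[0]) + [
        SPEAKER2 if i % 2 else SPEAKER1
        for i, s in enumerate(sequence[1:])
        for _ in s
    ]
    # Only the answer tokens contribute to the loss; everything before them is -1.
    labels = [-1] * sum(len(s) for s in sequence[:-1]) + sequence[-1]
    return input_ids, token_type_ids, labels


input_ids, token_type_ids, labels = build_instance(
    caption=[10, 11], history=[[20], [21], [22]], answer=[30, 31]
)
print(input_ids)
print(token_type_ids)
print(labels)
```

All three lists have the same length, which is what the assertions in the processor's own sequence-building method check.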

SPECIAL_TOKENS = [
    "<bos>",
    "<eos>",
    "<speaker1>",
    "<speaker2>",
    "<cap>",
    "<video>",
    "<pad>",
]

# The same special tokens as a mapping, passed to tokenizer.add_special_tokens
SPECIAL_TOKENS_DICT = {'bos_token': "<bos>", 'eos_token': "<eos>", 'additional_special_tokens': ["<speaker1>", "<speaker2>", "<video>", "<cap>"], 'pad_token': "<pad>"}
...

# Register a new processor class under the name string gpt_dialogue
@registry.register_processor("gpt_dialogue")
class GPTDialogueProcessor(BaseProcessor):
    def __init__(self, max_turns=3, use_caption=True):
        # Maximum number of turns kept from the dialogue history
        self.max_turns = max_turns
        # Whether to use the caption information
        self.use_caption = use_caption
        # Initialize the tokenizer and add the special tokens
        self.tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
        self.tokenizer.add_special_tokens(SPECIAL_TOKENS_DICT)

    def sample_sequence(self, caption, history, answer):
        # Build the input sequence
        bos, eos, speaker1, speaker2, cap = self.tokenizer.convert_tokens_to_ids(SPECIAL_TOKENS[:-2])
        instance = {}
        sequence = [caption] + history + [answer]
        # Append an end-of-sequence (EOS) token to every segment
        sequence = [s + [eos] for s in sequence]

        instance["input_ids"] = list(chain(*sequence))
        instance["token_type_ids"] = [cap] * len(sequence[0]) + [speaker2 if i % 2 else speaker1 for i, s in enumerate(sequence[1:]) for _ in s]
        instance["labels"] = ([-1]*sum(len(s) for s in sequence[:-1])) + sequence[-1]

        assert len(instance["input_ids"])==len(instance["token_type_ids"])
        assert len(instance["token_type_ids"])==len(instance["labels"])

        for k,v in instance.items():
            instance[k] = torch.Tensor(v).long()
        return instance

    def padding(self, seq, pad_token=-1):
        if pad_token==-1: pad_token = self.tokenizer.pad_token_id
        padded_seq = torch.nn.utils.rnn.pad_sequence(seq, batch_first=True, padding_value=pad_token)
        return padded_seq

    def get_attention_mask(self, seq, pad_token=-1):
        if pad_token==-1: pad_token = self.tokenizer.pad_token_id
        return seq != pad_token

    def __call__(self, ann):
        # If captions are used, join the caption and the summary and encode them together
        if self.use_caption:
            caption = ' '.join([ann['caption'], ann['summary']])
            caption = self.tokenizer.encode(caption)
        else:
            caption = []
        # Build the dialogue history
        dial_history = []
        # Take the questions and answers of the last max_turns turns
        for turn in ann['dialog'][-self.max_turns:]:
            dial_history.append(turn['question'])
            dial_history.append(turn['answer'])
        # Current question
        dial_history.append(ann['question'])
        dial_history = [self.tokenizer.encode(t) for t in dial_history]
        # Encode the current answer
        answer = self.tokenizer.encode(ann['answer'])
        # Build the dict of tensors from the caption, the dialogue history, and the current answer
        item = self.sample_sequence(caption, dial_history, answer)

        return item

    @classmethod
    def from_config(cls, cfg=None):
        # Build a processor instance from a config
        if cfg is None:
            cfg = OmegaConf.create()
        use_caption = cfg.get("use_caption", True)
        max_turns = cfg.get("max_turns", 3)
        return cls(max_turns=max_turns, use_caption=use_caption)
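To see what sample_sequence produces, the sketch below replays its list logic on small hand-written token ids, without loading a tokenizer. All ids (the special-token ids and the segment contents) are made up for illustration, and `build_instance` is a hypothetical name.

```python
from itertools import chain

# Hypothetical ids standing in for the tokenizer's special tokens
EOS, SPEAKER1, SPEAKER2, CAP = 1, 2, 3, 4

def build_instance(caption, history, answer):
    """Mirror of the list logic in sample_sequence, on plain Python lists."""
    sequence = [caption] + history + [answer]
    sequence = [s + [EOS] for s in sequence]  # close every segment with EOS
    input_ids = list(chain(*sequence))
    # Caption tokens get the <cap> type; the remaining segments alternate speakers
    token_type_ids = [CAP] * len(sequence[0]) + [
        SPEAKER2 if i % 2 else SPEAKER1
        for i, s in enumerate(sequence[1:])
        for _ in s
    ]
    # Only the answer segment is supervised; everything before it is masked with -1
    labels = [-1] * sum(len(s) for s in sequence[:-1]) + sequence[-1]
    return {"input_ids": input_ids, "token_type_ids": token_type_ids, "labels": labels}

instance = build_instance(caption=[10, 11], history=[[20, 21]], answer=[40])
print(instance["input_ids"])       # [10, 11, 1, 20, 21, 1, 40, 1]
print(instance["token_type_ids"])  # [4, 4, 4, 2, 2, 2, 3, 3]
print(instance["labels"])          # [-1, -1, -1, -1, -1, -1, 40, 1]
```

The -1 entries line up with the `ignore_index=-1` used by the cross-entropy loss in the model code later in this post, so only the answer tokens contribute to the language-model loss.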

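Registering New Processors

lavis.processors.__init__

Finally, every new processor must be registered as part of the lavis.processors module.

For example, to add processor classes for a GPT-based dialogue model, namely GPTDialogueProcessor for dialogue data and GPTVideoFeatureProcessor for video features, we can modify the __init__.py file as follows: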

from lavis.processors.gpt_processors import (
    GPTVideoFeatureProcessor,
    GPTDialogueProcessor,
)

__all__ = [
    ...
    # GPT
    "GPTVideoFeatureProcessor",
    "GPTDialogueProcessor"
]
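The name strings passed to @registry.register_processor work because a registry keeps a mapping from names to classes. The following is a minimal sketch of that pattern, not the actual lavis.common.registry code; the Registry class, its internals, and EchoProcessor are hypothetical and exist only for illustration.

```python
class Registry:
    """Toy name-to-class registry (hypothetical, for illustration only)."""

    def __init__(self):
        self._processors = {}

    def register_processor(self, name):
        def wrap(cls):
            if name in self._processors:
                raise KeyError(f"processor '{name}' is already registered")
            self._processors[name] = cls
            return cls  # return the class unchanged so it stays usable directly
        return wrap

    def get_processor_class(self, name):
        return self._processors[name]

registry = Registry()

@registry.register_processor("echo")
class EchoProcessor:
    def __call__(self, item):
        return item

processor_cls = registry.get_processor_class("echo")
print(processor_cls is EchoProcessor)  # True
print(processor_cls()("hello"))        # hello
```

Registration happens when the decorated class is defined, which is why the module that defines a processor has to be imported (for instance from __init__.py) before its name string can be resolved.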

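Assigning Processors

In the example processor classes above, note that each class defines a from_config method. This method reads a configuration and passes the relevant parameters (e.g. max_turns, visual_ft) on, so that the processor class is initialized correctly.

For this to work, the configuration file must associate each dataset with the registered name of the right processor class.

For example, a configuration file such as dialogue_avsd_ft.yaml should specify the following:

The name field holds the name string of the processor class.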

datasets:
  avsd_dialogue: # name of the dataset builder
    vis_processor:
        train:
          name: "gpt_video_ft" # name of the visual processor for training data
          visual_ft: ["i3d_flow", "i3d_rgb"]
          audio_ft: ["vggish"]
        eval:
          name: "gpt_video_ft" # name of the visual processor for evaluation data
          visual_ft: ["i3d_flow", "i3d_rgb"]
          audio_ft: ["vggish"]
    text_processor:
        train:
          name: "gpt_dialogue" # name of the textual processor for training data
          max_turns:  3
          use_caption: True
        eval:
          name: "gpt_dialogue" # name of the textual processor for evaluation data
          max_turns:  3
          use_caption: True
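The link between such a YAML block and from_config can be sketched with plain dictionaries. This assumes the configuration has already been parsed into nested dicts (LAVIS itself uses OmegaConf); DialogueProcessorSketch, PROCESSORS, and build_processors are hypothetical stand-ins, not LAVIS APIs.

```python
class DialogueProcessorSketch:
    """Hypothetical processor that only stores its settings."""

    def __init__(self, max_turns=3, use_caption=True):
        self.max_turns = max_turns
        self.use_caption = use_caption

    @classmethod
    def from_config(cls, cfg=None):
        cfg = cfg or {}
        return cls(
            max_turns=cfg.get("max_turns", 3),
            use_caption=cfg.get("use_caption", True),
        )

# Hypothetical name -> class table, playing the role of the registry
PROCESSORS = {"gpt_dialogue": DialogueProcessorSketch}

# A text_processor block shaped like the YAML above, with different values
# so that overrides and defaults are both visible
text_processor_cfg = {
    "train": {"name": "gpt_dialogue", "max_turns": 5, "use_caption": False},
    "eval": {"name": "gpt_dialogue"},
}

def build_processors(split_cfgs):
    """Look up each split's class by its name string; from_config reads the rest."""
    return {
        split: PROCESSORS[cfg["name"]].from_config(cfg)
        for split, cfg in split_cfgs.items()
    }

processors = build_processors(text_processor_cfg)
print(processors["train"].max_turns, processors["train"].use_caption)  # 5 False
print(processors["eval"].max_turns, processors["eval"].use_caption)    # 3 True
```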

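3.2 Example: Defining Processors in MiniGPT4_Qwen

Look at the configuration file for MiniGPT4_Qwen instruction fine-tuning.

The name string of its vis_processor (the visual processor) is blip2_image_train.

This one should already be implemented in the LAVIS library.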

datasets:
  minigpt4_instruction: # name of the dataset builder
    vis_processor:
        train:
          name: "blip2_image_train"
          image_size: 224
    text_processor:
        train:
          name: "base_instruction"
          max_words: 100
  llava_instruction: # name of the dataset builder
    vis_processor:
        train:
          name: "blip2_image_train"
          image_size: 224
    text_processor:
        train:
          name: "base_instruction"
          max_words: 100

With the processor's name string, we can look up the corresponding processor under lavis.processors.

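# Register the new processor class
# (BlipImageBaseProcessor, shown further below, has to be defined before this class in a real module)
@registry.register_processor("blip2_image_train")
class Blip2ImageTrainProcessor(BlipImageBaseProcessor):
    def __init__(
        self, image_size=364, mean=None, std=None, min_scale=0.5, max_scale=1.0
    ):
        '''
        image_size: target size of the random crop
        mean: per-channel image mean used for normalization
        std: per-channel image standard deviation used for normalization
        min_scale: lower bound of the random crop's area fraction
        max_scale: upper bound of the random crop's area fraction
        '''
        # Call the parent constructor with the mean and standard deviation
        super().__init__(mean=mean, std=std)
        
        self.transform = transforms.Compose(  # chain of image transforms
            [
                transforms.RandomResizedCrop(  # random crop, then resize to image_size
                    image_size,
                    scale=(min_scale, max_scale),
                    interpolation=InterpolationMode.BICUBIC,
                ),
                transforms.RandomHorizontalFlip(),  # random horizontal flip
                transforms.ToTensor(),  # convert to a Tensor
                self.normalize,  # normalization created by the parent class
            ]
        )

    def __call__(self, item):
        # Calling the processor applies the transform
        return self.transform(item)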

    @classmethod
    def from_config(cls, cfg=None):
        if cfg is None:
            cfg = OmegaConf.create()
        image_size = cfg.get("image_size", 364)
        mean = cfg.get("mean", None)
        std = cfg.get("std", None)
        min_scale = cfg.get("min_scale", 0.5)
        max_scale = cfg.get("max_scale", 1.0)
        return cls(
            image_size=image_size,
            mean=mean,
            std=std,
            min_scale=min_scale,
            max_scale=max_scale,
        )


class BlipImageBaseProcessor(BaseProcessor):
    def __init__(self, mean=None, std=None):
        if mean is None:
            mean = (0.48145466, 0.4578275, 0.40821073)
        if std is None:
            std = (0.26862954, 0.26130258, 0.27577711)
        # Create a normalization transform from the given mean and standard deviation
        self.normalize = transforms.Normalize(mean, std)
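transforms.Normalize applies (x - mean) / std to each channel. Here is a pure-Python sketch of that arithmetic on a single pixel, using the default mean and std from BlipImageBaseProcessor; `normalize_pixel` is a hypothetical helper and the pixel values are arbitrary.

```python
MEAN = (0.48145466, 0.4578275, 0.40821073)
STD = (0.26862954, 0.26130258, 0.27577711)

def normalize_pixel(rgb, mean=MEAN, std=STD):
    """Apply (x - mean) / std per channel to one RGB pixel with values in [0, 1]."""
    return tuple((x - m) / s for x, m, s in zip(rgb, mean, std))

# A pixel equal to the mean maps to zero in every channel
print(normalize_pixel(MEAN))  # (0.0, 0.0, 0.0)

# A pure white pixel, i.e. what ToTensor produces for the 8-bit value 255
white = normalize_pixel((1.0, 1.0, 1.0))
print([round(v, 3) for v in white])
```

After normalization the values are no longer confined to [0, 1]; a white pixel ends up roughly two standard deviations above the mean in each channel.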

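3.3 Adding New Models

New models are added through the lavis.models module.

LAVIS includes a standard model module that provides the foundation for many major language-vision models, such as ALBEF, BLIP, ALPRO, and CLIP.

The following walkthrough adds a GPT-style model for the video-grounded dialogue task.

Base Model

lavis.models.base_model

Any new model definition should inherit from the base model class BaseModel.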

from omegaconf import OmegaConf

import numpy as np

import torch
import torch.nn as nn

from lavis.common.utils import get_abs_path

# BaseModel inherits from nn.Module
class BaseModel(nn.Module):
    """Base class for models."""
    def __init__(self):
        super().__init__()
    # Forward pass that returns features only
    def forward_features(self, *args, **kwargs):
        """Similar to *forward* but only return features."""
        raise NotImplementedError
    # Load weights from a pretrained checkpoint
    def load_from_pretrained(self, url_or_filename):
        raise NotImplementedError

    @classmethod
    def _from_config(cls, cfg=None, model_type="base"):
        if not cfg:
            # useful when building model without a provided configuration file
            cfg = OmegaConf.load(cls.default_config_path(model_type)).model
        return cls.from_config(cfg)

    @classmethod
    def from_pretrained(cls, model_type="base"):
        """
        Build a pretrained model from the default configuration file, specified by model_type.
        """
        # Build the pretrained model from the default config of the given model type
        return cls._from_config(cfg=None, model_type=model_type)

    @property
    def device(self):
        # Return the device that holds the model's parameters
        return list(self.parameters())[0].device

    @classmethod
    def default_config_path(cls, model_type="base"):
        assert (
            model_type in cls.PRETRAINED_MODEL_CONFIG_DICT
        ), "Unknown model type {}".format(model_type)
        return get_abs_path(cls.PRETRAINED_MODEL_CONFIG_DICT[model_type])
    # Hook for any preparation needed before evaluation
    def before_evaluation(self, **kwargs):
        pass
    # Report the number of model parameters
    def show_n_params(self, return_str=True):
        tot = 0
        for p in self.parameters():
            w = 1
            for x in p.shape:
                w *= x
            tot += w
        if return_str:
            if tot >= 1e6:
                return "{:.1f}M".format(tot / 1e6)
            else:
                return "{:.1f}K".format(tot / 1e3)
        else:
            return tot
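show_n_params multiplies out each parameter's shape and formats the total. The same arithmetic can be run on plain shape tuples, with math.prod replacing the inner loop. The shapes below are those of an nn.Linear(4224, 768) layer, which assumes the n_embd of 768 used by the stock GPT-2 small configuration; `format_n_params` is a hypothetical helper.

```python
import math

def format_n_params(shapes, return_str=True):
    """Count parameters from a list of shape tuples, formatted like show_n_params."""
    total = sum(math.prod(shape) for shape in shapes)
    if not return_str:
        return total
    if total >= 1e6:
        return "{:.1f}M".format(total / 1e6)
    return "{:.1f}K".format(total / 1e3)

# Weight and bias shapes of nn.Linear(4224, 768)
shapes = [(768, 4224), (768,)]
print(format_n_params(shapes, return_str=False))  # 3244800
print(format_n_params(shapes))                    # 3.2M
```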

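GPT-style Video-grounded Dialogue Model

lavis.models.gpt_models.gpt_dialogue

We define a new model class under lavis.models.gpt_models.gpt_dialogue: a GPT-based dialogue model built specifically for the video-grounded dialogue task.

Note that the model class is assumed to inherit from GPT2LMHeadModel, the standard model superclass from the transformers library.

It also inherits from the LAVIS BaseModel as a secondary superclass, which integrates the model into the LAVIS framework.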

import torch
from lavis.common.registry import registry
from lavis.models.base_model import BaseModel

from transformers import GPT2Model, GPT2LMHeadModel
from transformers.modeling_outputs import CausalLMOutputWithCrossAttentions
import math
import torch
import torch.nn as nn
from torch.nn import CrossEntropyLoss, MSELoss

@registry.register_model("gpt_dialogue")
class GPTDialogue(GPT2LMHeadModel, BaseModel):
    ...

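Next, the architecture is adjusted at initialization time to fit the video-grounded dialogue task.

We add the parameters of a linear layer that projects the video features to the model dimension.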

class GPTDialogue(GPT2LMHeadModel, BaseModel):
    def __init__(self, config, len_video_ft=4224):
        super().__init__(config)
        
        self.video_ff = nn.Linear(len_video_ft, config.n_embd)
        
        # Model parallel
        self.model_parallel = False
        self.device_map = None

        # Initialize weights and apply final processing
        self.post_init()

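For each new model class, we recommend overriding the from_config method inherited from BaseModel.

Because each model usually has its own configuration, overriding this method ensures that model instances are created correctly.

For example, GPTDialogue needs an extra parameter for the video feature length (len_video_ft), which should be part of model initialization. Another extra parameter is the number of tokens (len_tokenizer), because the dialogue task adds special tokens to the vocabulary.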

class GPTDialogue(GPT2LMHeadModel, BaseModel):
    ...
    @classmethod
    def from_config(cls, cfg):
        # Build a model instance from the config
        model = cls.from_pretrained('gpt2', len_video_ft=cfg['len_video_ft'])
        model.resize_token_embeddings(cfg['len_tokenizer'])
        return model
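The value passed to resize_token_embeddings has to match the tokenizer size after the special tokens are added. A rough sketch of that arithmetic follows, assuming the stock gpt2 vocabulary of 50257 entries and that none of the special tokens already exist in it; `count_new_tokens` is a hypothetical helper.

```python
SPECIAL_TOKENS_DICT = {
    "bos_token": "<bos>",
    "eos_token": "<eos>",
    "additional_special_tokens": ["<speaker1>", "<speaker2>", "<video>", "<cap>"],
    "pad_token": "<pad>",
}

def count_new_tokens(special_tokens):
    """Count the distinct token strings in an add_special_tokens-style dict."""
    tokens = set()
    for value in special_tokens.values():
        tokens.update(value if isinstance(value, list) else [value])
    return len(tokens)

GPT2_VOCAB_SIZE = 50257  # assumption: vocabulary size of the stock "gpt2" tokenizer
len_tokenizer = GPT2_VOCAB_SIZE + count_new_tokens(SPECIAL_TOKENS_DICT)
print(count_new_tokens(SPECIAL_TOKENS_DICT))  # 7
print(len_tokenizer)                          # 50264
```

In practice the safer source for this number is `len(tokenizer)` taken after `add_special_tokens` has run, since that reflects whatever the tokenizer actually added.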

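The new model class should also explicitly define the forward function.

For example, in the GPT model for video-grounded dialogue, the forward pass also projects the video features and merges them with the text embeddings before the representations are passed to the Transformer layers.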

class GPTDialogue(GPT2LMHeadModel, BaseModel):
    ...

    def forward(self, samples,
                past_key_values=None,
                position_ids=None,
                head_mask=None,
                encoder_hidden_states=None,
                encoder_attention_mask=None,
                use_cache=None,
                output_attentions=None,
                output_hidden_states=None,
                return_dict=None):
                # Text embeddings: look up the input IDs in GPT-2's token embedding matrix (wte)
                input_embs = self.transformer.wte(samples['input_ids'])
                # Video embeddings: project the video features to the model dimension
                video_embs = self.video_ff(samples['video_fts'])
                # Concatenate video and text embeddings along dim 1, the sequence dimension
                input_embs = torch.cat([video_embs, input_embs], dim=1)
                # Run the Transformer part of GPT-2
                transformer_outputs = self.transformer(
                    attention_mask=samples['attn_mask'],
                    token_type_ids=samples['token_type_ids'],
                    inputs_embeds=input_embs,
                    position_ids=position_ids,
                    head_mask=head_mask,
                    encoder_hidden_states=encoder_hidden_states,
                    encoder_attention_mask=encoder_attention_mask,
                    use_cache=use_cache,
                    output_attentions=output_attentions,
                    output_hidden_states=output_hidden_states,
                    return_dict=return_dict,
                )
                # Hidden states used for the language-model prediction
                hidden_states = transformer_outputs[0]
                # lm_head is GPT-2's language-modeling head; it produces the logits (e.g. scores over the next token)
                lm_logits = self.lm_head(hidden_states)
                ...
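The shape bookkeeping in this forward pass can be checked with NumPy alone. The sketch below uses small made-up sizes instead of 4224 and GPT-2's n_embd, and a random matrix `W` with bias `b` as a stand-in for the video_ff linear layer; none of these values come from the source.

```python
import numpy as np

batch, n_frames, n_tokens = 2, 5, 7
len_video_ft, n_embd = 16, 8  # small hypothetical sizes for illustration

rng = np.random.default_rng(0)
video_fts = rng.normal(size=(batch, n_frames, len_video_ft))
text_embs = rng.normal(size=(batch, n_tokens, n_embd))  # stand-in for wte(input_ids)

# Stand-in for the video_ff linear layer: y = x @ W.T + b
W = rng.normal(size=(n_embd, len_video_ft))
b = np.zeros(n_embd)
video_embs = video_fts @ W.T + b

# Concatenate along axis 1 (the sequence axis): video positions first, then text
inputs_embeds = np.concatenate([video_embs, text_embs], axis=1)
print(video_embs.shape)     # (2, 5, 8)
print(inputs_embeds.shape)  # (2, 12, 8)
```

Because the sequence grows to n_frames + n_tokens positions, the attn_mask and token_type_ids passed in samples presumably have to cover the video positions as well as the text positions.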

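Full code: GPTDialogue

import torch
import torch.nn as nn
from torch.nn import CrossEntropyLoss, MSELoss
from transformers import GPT2LMHeadModel
from transformers.modeling_outputs import CausalLMOutputWithCrossAttentions

from lavis.common.registry import registry
from lavis.models.base_model import BaseModel

# Register the model class under the name string "gpt_dialogue"
@registry.register_model("gpt_dialogue")
class GPTDialogue(BaseModel, GPT2LMHeadModel):
    # Path of the config file for each pretrained model type
    PRETRAINED_MODEL_CONFIG_DICT = {"base": "configs/models/gpt_dialogue_base.yaml"}

    def __init__(self, config, len_video_ft=4224):
        super().__init__(config)
        # Video feature layers: two linear layers.
        # video_ff projects video features to the text embedding dimension;
        # video_ff_out projects hidden states back to the original video feature dimension.
        self.video_ff = nn.Linear(len_video_ft, config.n_embd)
        self.video_ff_out = nn.Linear(config.n_embd, len_video_ft)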

        # Model parallel settings
        self.model_parallel = False
        self.device_map = None

        # Initialize weights and apply final processing
        self.post_init()

    def forward(
        self,
        samples,
        past_key_values=None,
        position_ids=None,
        head_mask=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        use_cache=None,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None,
    ):
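        # Look up the input IDs in GPT-2's token embedding matrix (wte) to get text embeddings
        input_embs = self.transformer.wte(samples["input_ids"])
        # Project the video features to embeddings with the video feature layer
        video_embs = self.video_ff(samples["video_fts"])
        # Concatenate video and text embeddings along dim 1, the sequence dimension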
        input_embs = torch.cat([video_embs, input_embs], dim=1)

        transformer_outputs = self.transformer(
            attention_mask=samples["attn_mask"],
            token_type_ids=samples["token_type_ids"],
            inputs_embeds=input_embs,
            position_ids=position_ids,
            head_mask=head_mask,
            encoder_hidden_states=encoder_hidden_states,
            encoder_attention_mask=encoder_attention_mask,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        # Take the hidden states
        hidden_states = transformer_outputs[0]
        # Produce the final predictions with GPT-2's language-modeling head
        lm_logits = self.lm_head(hidden_states)

        loss = None
        # Compute the loss if labels are provided
        if samples["labels"] is not None:
            # Shift so that tokens < n predict n:
            # drop the last logit and the first label so that each position's logits predict the next token
            shift_logits = lm_logits[..., :-1, :].contiguous()
            shift_labels = samples["labels"][..., 1:].contiguous()
            # Flatten the tokens and compute the cross-entropy loss
            loss_fct = CrossEntropyLoss(ignore_index=-1)
            loss = loss_fct(
                shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1)
            )
        # Compute the video feature loss and add it to the language-model loss
        if samples["video_fts"] is not None:
            len_video_fts = samples["video_fts"].shape[1]
            # Project the hidden states at the video positions back to the video feature dimension
            video_logits = self.video_ff_out(hidden_states[:, :len_video_fts, :])
            # Shift so that frames < n predict frame n
            shift_logits = video_logits[..., :-1, :].contiguous()
            shift_labels = samples["video_fts"][..., 1:, :].contiguous()
            # Compute the mean squared error loss
            loss_fct = MSELoss(reduction="mean")
            video_loss = loss_fct(shift_logits, shift_labels)
            
            if loss is not None:
                loss = loss + video_loss
            else:
                loss = video_loss
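        # Return loss, lm_logits, past_key_values, hidden_states, attentions, and cross_attentions
        # (cross_attentions is only populated when the model is configured with cross-attention layers; otherwise it is None)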
        return CausalLMOutputWithCrossAttentions(
            loss=loss,
            logits=lm_logits,
            past_key_values=transformer_outputs.past_key_values,
            hidden_states=transformer_outputs.hidden_states,
            attentions=transformer_outputs.attentions,
            cross_attentions=transformer_outputs.cross_attentions,
        )

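    @classmethod
    def from_config(cls, cfg):
        # With the bases (BaseModel, GPT2LMHeadModel), cls.__bases__[1] is GPT2LMHeadModel,
        # so the pretrained "gpt2" weights are loaded through the transformers class
        model = cls.__bases__[1].from_pretrained("gpt2")
        # Resize the embeddings to cover the added special tokens
        model.resize_token_embeddings(cfg["len_tokenizer"])
        return model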
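The "shift so that tokens < n predict n" step and the role of ignore_index=-1 can be reproduced in a few lines of plain Python. This is a sketch of the idea, not the PyTorch implementation; `shifted_cross_entropy` is a hypothetical helper and the logits and labels are made up.

```python
import math

def shifted_cross_entropy(logits, labels, ignore_index=-1):
    """Next-token cross entropy: the logits at position t are scored against labels[t + 1].
    logits is a list of per-position score lists; labels is a list of ints."""
    shift_logits = logits[:-1]  # the last position has nothing left to predict
    shift_labels = labels[1:]   # the first label is never a prediction target
    losses = []
    for scores, target in zip(shift_logits, shift_labels):
        if target == ignore_index:
            continue  # masked positions contribute nothing
        log_norm = math.log(sum(math.exp(s) for s in scores))
        losses.append(log_norm - scores[target])
    # Average over the non-ignored targets only
    return sum(losses) / len(losses) if losses else float("nan")

labels = [-1, -1, 2, 0]  # the first two positions are context and are not supervised

# Uniform scores over a 3-token vocabulary: the loss is log(3) per supervised target
uniform = [[0.0, 0.0, 0.0]] * 4
print(round(shifted_cross_entropy(uniform, labels), 4))  # 1.0986

# Scores that strongly favor the next token: the loss is close to zero
confident = [[0.0, 0.0, 0.0], [0.0, 0.0, 20.0], [20.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
print(shifted_cross_entropy(confident, labels) < 1e-6)  # True
```

Note that the confident logits sit one position before the label they predict, which is exactly what the shift encodes.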

Supplementary code: GPT2LMHeadModel (from transformers)

# GPT2LMHeadModel source
from transformers import GPT2LMHeadModel

@add_start_docstrings(
    """
    The GPT2 Model transformer with a language modeling head on top (linear layer with weights tied to the input
    embeddings).
    """,
    GPT2_START_DOCSTRING,
)
class GPT2LMHeadModel(GPT2PreTrainedModel):
    _tied_weights_keys = ["lm_head.weight"]
    def __init__(self, config):
        super().__init__(config)
        # The GPT-2 model
        self.transformer = GPT2Model(config)
        # Language-modeling head, mapping n_embd to vocab_size
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)

        # Model parallel
        self.model_parallel = False
        self.device_map = None

        # Initialize weights and apply final processing
        # post_init is defined neither in GPT2LMHeadModel nor in its parent GPT2PreTrainedModel
        self.post_init()

        
# It is implemented in PreTrainedModel, the parent class of GPT2PreTrainedModel
class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin, PushToHubMixin, PeftAdapterMixin):
    ...
    def post_init(self):
        """
        A method executed at the end of each Transformer model initialization, to execute code that needs the model's
        modules properly initialized (such as weight initialization).
        """
        # Initialize the weights
        self.init_weights()
        self._backward_compatibility_gradient_checkpointing()
        
    def init_weights(self):
        # Prunes heads if needed and initializes the weights
        """
        If needed prunes and maybe initializes weights. If using a custom `PreTrainedModel`, you need to implement any
        initialization logic in `_init_weights`.
        """
        # Prune heads if needed
        if self.config.pruned_heads:
            self.prune_heads(self.config.pruned_heads)

        # _init_weights here is a module-level flag in transformers, not the method below
        if _init_weights:
            # Initialize the weights; the per-module logic lives in self._init_weights,
            # which the PreTrainedModel base class leaves empty
            self.apply(self._initialize_weights)

            # Tie weights should be skipped when not initializing all weights
            # since from_pretrained(...) calls tie weights anyways
            # Tying makes certain parameters share the same weights, e.g. the input and
            # output embedding matrices of a language model
            self.tie_weights()

    # PreTrainedModel does not implement the actual initialization logic;
    # GPT2PreTrainedModel overrides this method with the concrete initialization
    def _init_weights(self, module):
        """
        Initialize the weights. This method should be overridden by derived class and is
        the only initialization method that will be called when loading a checkpoint
        using `from_pretrained`. Any attempt to initialize outside of this function
        will be useless as the torch.nn.init function are all replaced with skip.
        """
        pass

    def _initialize_weights(self, module):
        """
        Initialize the weights if they are not already initialized.
        """
        if getattr(module, "_is_hf_initialized", False):
            return
        self._init_weights(module)
        module._is_hf_initialized = True


class GPT2PreTrainedModel(PreTrainedModel):
    """
    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
    models.
    """

    config_class = GPT2Config
    load_tf_weights = load_tf_weights_in_gpt2
    base_model_prefix = "transformer"
    is_parallelizable = True
    supports_gradient_checkpointing = True
    _no_split_modules = ["GPT2Block"]
    _skip_keys_device_placement = "past_key_values"

    def __init__(self, *inputs, **kwargs):
        super().__init__(*inputs, **kwargs)

    def _init_weights(self, module):
        """Initialize the weights."""
        # nn.Linear and Conv1D: normal initialization with mean 0.0 and std self.config.initializer_range (0.02 by default)
        if isinstance(module, (nn.Linear, Conv1D)):
            # Slightly different from the TF version which uses truncated_normal for initialization
            # cf https://github.com/pytorch/pytorch/pull/5617
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
            if module.bias is not None:
                module.bias.data.zero_()
        # nn.Embedding: normal initialization with mean 0.0 and std self.config.initializer_range
        # If a padding index exists, the weights at that index are set to 0
        elif isinstance(module, nn.Embedding):
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
            if module.padding_idx is not None:
                module.weight.data[module.padding_idx].zero_()
        # nn.LayerNorm: bias initialized to 0, weight initialized to 1.0
        elif isinstance(module, nn.LayerNorm):
            module.bias.data.zero_()
            module.weight.data.fill_(1.0)

        # Reinitialize selected weights subject to the OpenAI GPT-2 Paper Scheme:
        #   > A modified initialization which accounts for the accumulation on the residual path with model depth. Scale
        #   > the weights of residual layers at initialization by a factor of 1/√N where N is the # of residual layers.
        #   >   -- GPT-2 :: https://openai.com/blog/better-language-models/
        #
        # Reference (Megatron-LM): https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/model/gpt_model.py
        # GPT-2's special scaled initialization
        # c_proj is the output projection of the attention and feed-forward (MLP) blocks
        for name, p in module.named_parameters():
            if name == "c_proj.weight":
                # Special Scaled Initialization --> There are 2 Layer Norms per Transformer Block
                p.data.normal_(mean=0.0, std=(self.config.initializer_range / math.sqrt(2 * self.config.n_layer)))
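The scaled initialization at the end of _init_weights reduces the standard deviation of c_proj.weight by a factor of sqrt(2 * n_layer). A quick sketch of the resulting values, assuming the default initializer_range of 0.02 and the layer counts of the stock GPT-2 small (12) and GPT-2 XL (48) configurations; `residual_proj_std` is a hypothetical helper.

```python
import math

def residual_proj_std(initializer_range=0.02, n_layer=12):
    """Std used for c_proj.weight: initializer_range / sqrt(2 * n_layer)."""
    return initializer_range / math.sqrt(2 * n_layer)

print(round(residual_proj_std(), 5))            # 0.00408
print(round(residual_proj_std(n_layer=48), 5))  # 0.00204
```

Deeper models therefore start with smaller residual projections, which keeps the sum accumulated along the residual path at a comparable scale as the number of layers grows.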

From： https://www.cnblogs.com/mudou/p/18314953

Learning the LAVIS Library and Its Implementation in MiniGPT4-Qwen
