LLaVA Multimodal Large Model Fine-Tuning Tutorial


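After writing this, I realized he probably doesn't really need it, so I'm posting it on my own blog for now. Since I'm not submitting it to the front page or the candidate area, it shouldn't get much traffic anyway, so it should be fine. Worst case, if I'm later told it can't stay online, I'll just delete this post, hehe.

LLaVA is one of the earliest open-source vision-language models (VLMs). This tutorial shows how to fine-tune llava-v1.5-7b. Unlike the existing xtuner-based fine-tuning tutorials on this blog, it uses DeepSpeed, which removes the dependency on the InternLM ecosystem.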

Environment setup

The official guide for setting up the environment is the project README.

First, download the LLaVA source code:

git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA
pwd

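Next, set up the Python environment. If you are running on your own machine, don't forget to create a conda virtual environment first (uncomment the first two lines):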

# conda create -n llava python=3.10 -y
# conda activate llava
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
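
Before moving on, it can help to confirm that the packages the later steps rely on are actually importable. The snippet below is only a rough sketch and not part of the LLaVA project: the module names in REQUIRED are my assumption about what the install commands above provide, and missing_packages is a hypothetical helper.

```python
import importlib.util

# Assumed module names; adjust them to match what your install provides.
REQUIRED = ["torch", "transformers", "peft", "deepspeed", "flash_attn", "llava"]


def missing_packages(names):
    """Return the module names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Not importable:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

If anything is reported as not importable, rerun the matching pip command above inside the same environment before continuing.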

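Finally, download the models. You can download them directly with huggingface-cli. If Hugging Face is not reachable from your region, download through a mirror site instead.

# If you cannot reach Hugging Face, uncomment the next line to download through the hf-mirror mirror site
# export HF_ENDPOINT=https://hf-mirror.com

# Download the llava-v1.5-7b model weights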
huggingface-cli download "liuhaotian/llava-v1.5-7b" --local-dir "./checkpoints/llava-v1.5-7b"

# Download the clip-vit-large-patch14-336 model weights
huggingface-cli download "openai/clip-vit-large-patch14-336" --local-dir "./checkpoints/clip-vit-large-patch14-336"
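
Interrupted downloads are easy to miss, so a quick check of the checkpoint folders may save a failed training run later. This is a minimal sketch under assumptions: missing_files is a hypothetical helper, and the file lists only cover config.json plus the mm_projector.bin file that the training command in this tutorial passes explicitly; they are not a complete manifest of either repository.

```python
from pathlib import Path


def missing_files(directory, filenames):
    """Return the filenames that do not exist inside `directory`."""
    root = Path(directory)
    return [name for name in filenames if not (root / name).is_file()]


# Assumed minimal set of files that later steps read.
checks = {
    "./checkpoints/llava-v1.5-7b": ["config.json", "mm_projector.bin"],
    "./checkpoints/clip-vit-large-patch14-336": ["config.json"],
}
for directory, names in checks.items():
    print(directory, "missing:", missing_files(directory, names))
```

An empty list for each folder means the files checked here are in place; anything listed should be downloaded again.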

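Preparing the training data

The official pretraining stage (which trains the projection layer) uses the LAION-CC-SBU dataset, and visual instruction tuning uses llava_v1_5_mix665k.json together with several other datasets; the project README describes all of this clearly. I won't cover those here or retrain a new model. Instead, we'll build a minimal dataset consisting of a single image.

The format requirements for custom training datasets are described in the LLaVA project's documentation.

First, download the image: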

mkdir -p ./playground/data/yuanshen

# Download the image
wget -O ./playground/data/yuanshen/4.jpg https://avatars.githubusercontent.com/u/86307756

Then prepare the image-text pairs. Here we prepare just one:

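import json

# A single-sample dataset in LLaVA's conversation format.
# "image" is a path relative to the --image_folder passed to the training script,
# so it must match the file downloaded above (yuanshen/4.jpg).
dataset = [
    {
        "id": "yuanshen-628d-4724-b370-b84de974a19f",
        "image": "yuanshen/4.jpg",
        "conversations": [
            {
                "from": "human",
                "value": "<image>\nWho is in the picture?",
            },
            {
                "from": "gpt",
                "value": "The person in the picture is Nathida, who is a character in the Original God and its derivative works produced by Mihoyo. Her real name is Buyel, the grass god in the \"Earthly Seven rulers\", and is given the nickname of \"Little Lucky Grass King\" by the XuMi people, the youngest of the seven gods today. ",
            },
        ],
    }
]

# json.dump escapes the newline and the inner quotes, so the file is valid JSON.
# (Writing the text from a plain triple-quoted string would not be.)
with open("./playground/data/yuanshen.json", "w", encoding="utf-8") as f:
    json.dump(dataset, f, indent=4, ensure_ascii=False)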

The dataset image:

[Image: Nahida from Genshin Impact]
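
A mismatch between the JSON file and the files on disk only surfaces once training starts, so it may be worth validating the dataset first. The following is a rough sketch, not an official LLaVA tool: validate_dataset is a hypothetical helper, and the rules it checks (the three keys, and an <image> placeholder in the first turn) are assumptions drawn from the sample used in this tutorial.

```python
import json
from pathlib import Path


def validate_dataset(json_path, image_folder):
    """Return a list of human-readable problems found in the dataset file."""
    problems = []
    # json.loads raises an error here if the file is not valid JSON.
    samples = json.loads(Path(json_path).read_text(encoding="utf-8"))
    for i, sample in enumerate(samples):
        for key in ("id", "image", "conversations"):
            if key not in sample:
                problems.append(f"sample {i}: missing key '{key}'")
        image = sample.get("image")
        if image and not (Path(image_folder) / image).is_file():
            problems.append(f"sample {i}: image not found: {image}")
        turns = sample.get("conversations", [])
        if not turns or "<image>" not in turns[0].get("value", ""):
            problems.append(f"sample {i}: first turn lacks the <image> placeholder")
    return problems


dataset_file = Path("./playground/data/yuanshen.json")
if dataset_file.is_file():
    for problem in validate_dataset(dataset_file, "./playground/data"):
        print(problem)
```

No output means the checks passed. The second argument should be the same folder that is later passed as --image_folder.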

Fine-tuning the model

In this step we use DeepSpeed ZeRO-2 to fine-tune the model with LoRA. The resulting fine-tuned weights are saved in ./checkpoints/llava-v1.5-7b-lora.

deepspeed llava/train/train_mem.py \
    --lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path ./checkpoints/llava-v1.5-7b \
    --version v1 \
    --data_path ./playground/data/yuanshen.json \
    --image_folder ./playground/data \
    --vision_tower ./checkpoints/clip-vit-large-patch14-336 \
    --pretrain_mm_mlp_adapter ./checkpoints/llava-v1.5-7b/mm_projector.bin \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --output_dir ./checkpoints/llava-v1.5-7b-lora \
    --num_train_epochs 10 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 10 \
    --save_total_limit 1 \
    --learning_rate 2e-4 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 2 \
    --lazy_preprocess True \
    --report_to wandb

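If you hit an error at this step, check the GitHub issues to see whether someone has run into the same problem. If you confirm that nobody has, try opening a new issue.

Once training finishes, merge the LoRA weights into the original model weights: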
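
To see what the training command's settings imply, here is a small back-of-the-envelope calculation. It is only an approximation of how the Hugging Face Trainer counts steps, written under the assumption of a single GPU; training_schedule is a hypothetical helper, not part of any library.

```python
import math


def training_schedule(num_samples, per_device_batch, grad_accum, num_gpus,
                      epochs, warmup_ratio, save_steps):
    """Estimate total optimizer steps, warmup steps, and checkpoint steps."""
    batches_per_epoch = math.ceil(num_samples / (per_device_batch * num_gpus))
    steps_per_epoch = max(1, batches_per_epoch // grad_accum)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    checkpoints = list(range(save_steps, total_steps + 1, save_steps))
    return total_steps, warmup_steps, checkpoints


# Values from the command above: 1 sample, batch size 1, no accumulation,
# 10 epochs, warmup_ratio 0.03, save_steps 10. One GPU is assumed.
print(training_schedule(1, 1, 1, 1, 10, 0.03, 10))  # (10, 1, [10])
```

So with a single training sample, the run is about 10 optimizer steps long, and the only periodic checkpoint lands on the last step.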

python scripts/merge_lora_weights.py --model-path "./checkpoints/llava-v1.5-7b-lora" \
       --model-base "./checkpoints/llava-v1.5-7b" \
       --save-model-path "./checkpoints/llava-v1.5-7b-merged"

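This produces a model that can be used directly for inference, stored in the ./checkpoints/llava-v1.5-7b-merged folder.

Testing the model

Testing the model shows that fine-tuning took effect: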
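
Conceptually, merging folds the low-rank update into each adapted weight matrix using the standard LoRA formula W' = W + (lora_alpha / lora_r) * B A. The toy example below illustrates that formula in plain Python; it is a sketch of the idea, not the code the merge script runs, and matmul and merge_lora are hypothetical helpers.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, lora_alpha, lora_r):
    """Return weight + (lora_alpha / lora_r) * (lora_b @ lora_a)."""
    scale = lora_alpha / lora_r
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy 2x2 layer with rank-1 adapters; real layers are far larger.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]      # shape (r, in_features) = (1, 2)
B = [[0.5], [0.0]]    # shape (out_features, r) = (2, 1)
print(merge_lora(W, A, B, lora_alpha=2, lora_r=1))  # [[2.0, 2.0], [0.0, 1.0]]
```

The toy scale factor of 2 matches the tutorial's settings, since --lora_alpha 256 divided by --lora_r 128 is also 2. After merging, the adapter matrices are no longer needed at inference time.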

from llava.eval.run_llava import eval_model

prompt = "Who is in the picture?"
image_file = "https://avatars.githubusercontent.com/u/86307756"


def make_args(model_path):
    # Build the attribute-style argument object that eval_model expects.
    return type('Args', (), {
        "model_path": model_path,
        "model_base": None,
        # Kept the same for both runs so only the weights differ.
        "model_name": "liuhaotian/llava-v1.5-7b",
        "query": prompt,
        "conv_mode": None,
        "image_file": image_file,
        "sep": ",",
        "temperature": 0,
        "top_p": None,
        "num_beams": 1,
        "max_new_tokens": 512
    })()


print("Output of the original model:")
eval_model(make_args("./checkpoints/llava-v1.5-7b"))

print("Output of the fine-tuned model:")
eval_model(make_args("./checkpoints/llava-v1.5-7b-merged"))

After fine-tuning, the model reproduces the label for our training sample:

Output of the fine-tuned model:

The person in the picture is Nathida, who is a character in the Original God and its derivative works produced by Mihoyo. Her real name is Buyel, the grass god in the "Earthly Seven rulers", and is given the nickname of "Little Lucky Grass King" by the XuMi people, the youngest of the seven gods today.

Without fine-tuning, the model only tells you that there is a little girl in the picture.
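
To check the match with the label programmatically rather than by eye, one option is to capture what gets printed and compare it with the expected answer. This is a hedged sketch: it assumes eval_model prints its answer instead of returning it, and captured_output, matches_label, and fake_eval_model are hypothetical names used only for illustration.

```python
import contextlib
import io


def captured_output(func, *args, **kwargs):
    """Run func and return everything it printed to stdout, stripped."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        func(*args, **kwargs)
    return buffer.getvalue().strip()


def matches_label(output, label):
    """Compare two strings while ignoring differences in whitespace."""
    return " ".join(output.split()) == " ".join(label.split())


# Stand-in for eval_model so the sketch runs without a GPU.
def fake_eval_model():
    print("The person in the picture is Nathida.  ")


answer = captured_output(fake_eval_model)
print(matches_label(answer, "The person in the picture is Nathida."))  # True
```

In a real run you would pass a call to eval_model in place of fake_eval_model. Anything else written to stdout during the call is captured too, so a strict equality check may need loosening.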

From： https://www.cnblogs.com/xiangcaoacao/p/18188100
