LoRA fine-tuning of qwen-7b-chat on Ascend 910 with the MindFormers + MindSpore framework

Date: 2024-09-04 10:27:35
8-card LoRA fine-tuning of qwen-7b-chat on Ascend 910 with the MindFormers + MindSpore framework

Main reference: https://gitee.com/mindspore/mindformers/tree/r1.0/research/qwen

STEP 1: Environment setup

I use the official MindFormers docker image for fine-tuning. Pull it with:

docker pull swr.cn-central-221.ovaijisuan.com/mindformers/mindformers1.0.2_mindspore2.2.13:20240416

Example script for starting the container:

#!/bin/bash
CONTAINER_NAME=mindformers-r1.0
CHECKPOINT_PATH=/var/images/llm_setup/model/qwen/Qwen-7B-Chat
DOCKER_CHECKPOINT_PATH=/data/qwen/models/Qwen-7B-Chat
IMAGE_NAME=swr.cn-central-221.ovaijisuan.com/mindformers/mindformers1.0.2_mindspore2.2.13:20240416

docker run -it -u root \
--device=/dev/davinci0 \
--device=/dev/davinci1 \
--device=/dev/davinci2 \
--device=/dev/davinci3 \
--device=/dev/davinci4 \
--device=/dev/davinci5 \
--device=/dev/davinci6 \
--device=/dev/davinci7 \
--device=/dev/davinci_manager \
--device=/dev/devmm_svm \
--device=/dev/hisi_hdc \
-v /etc/localtime:/etc/localtime \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /var/log/npu/:/usr/slog \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v ${CHECKPOINT_PATH}:${DOCKER_CHECKPOINT_PATH} \
--name ${CONTAINER_NAME} \
${IMAGE_NAME} \
/bin/bash

Verifying the environment

Run the following command to verify the installation:

python -c "import mindspore;mindspore.set_context(device_target='Ascend');mindspore.run_check()"

If the output looks like this, the environment is fine:

MindSpore version: <version number>
The result of multiplication calculation is correct, MindSpore has been installed on platform [Ascend] successfully!

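Downloading the fine-tuning code

Most of the fine-tuning code comes from the official MindFormers repository. Inside the container, clone it and enter the directory: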

git clone -b r1.0 https://gitee.com/mindspore/mindformers.git
cd mindformers

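Generating the RANK_TABLE_FILE

Before fine-tuning, prepare the RANK_TABLE_FILE required for multi-card training. When working inside a container, exit the container and generate the file on the host:

# If the mindformers repository has not been cloned outside the container, fetch the script with wget
wget https://gitee.com/mindspore/models/raw/master/utils/hccl_tools/hccl_tools.py
# Generate the rank_table_file
python hccl_tools.py --device_num "[0,8)"

Copy the generated hccl_8p_01234567_xx.xx.xx.xx.json file into the container; the fine-tuning steps below depend on it.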
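
Before copying the file, it can help to confirm that it really lists all eight devices. The sketch below assumes the usual rank table layout (a top-level `server_list`, each server holding a `device` list whose entries carry a `rank_id` string); if your file is laid out differently, adjust the keys. The function name `summarize_rank_table` is hypothetical and not part of hccl_tools.py.

```python
import json


def summarize_rank_table(table: dict) -> dict:
    """Return the device count and sorted rank ids from a parsed rank table.

    Assumed layout: {"server_list": [{"device": [{"rank_id": "0", ...}, ...]}]}
    """
    rank_ids = []
    for server in table.get("server_list", []):
        for device in server.get("device", []):
            rank_ids.append(int(device["rank_id"]))
    rank_ids.sort()
    return {
        "device_count": len(rank_ids),
        "rank_ids": rank_ids,
        # Ranks should be exactly 0..N-1 with no gaps or duplicates.
        "contiguous": rank_ids == list(range(len(rank_ids))),
    }


def summarize_rank_table_file(path: str) -> dict:
    """Load a rank table JSON file from disk and summarize it."""
    with open(path, encoding="utf-8") as handle:
        return summarize_rank_table(json.load(handle))
```

Calling `summarize_rank_table_file` on the generated json should report a `device_count` of 8 for the 8-card setup used here.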

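STEP 2: Download the model

Because the MindFormers framework is used, the weights must be converted to its format. In this image the weight conversion runs into environment conflicts and the required packages cannot be installed, so download the already converted weights and the vocabulary file directly from the official site:

# Weights (ckpt), about 29 GB
wget https://ascend-repo-modelzoo.obs.cn-east-2.myhuaweicloud.com/MindFormers/qwen/qwen_7b_base.ckpt
# Vocabulary file
wget https://ascend-repo-modelzoo.obs.cn-east-2.myhuaweicloud.com/MindFormers/qwen/qwen.tiktoken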

STEP 3: Data preparation

To fine-tune the qwen model, first convert the data to the following json format:

  {
    "id": "1",
    "conversations": [
      {
        "from": "user",
        "value": "Give three tips for staying healthy."
      },
      {
        "from": "assistant",
        "value": "1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule."
      }
    ]
  },
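
If the source data is in the common Alpaca layout (records with `instruction`, an optional `input`, and `output`), a conversion along these lines produces the format above. This is a minimal sketch, not the converter shipped with MindFormers; the function name and the way `instruction` and `input` are joined are assumptions.

```python
import json


def alpaca_to_conversations(records):
    """Convert Alpaca-style records to the id/conversations layout."""
    converted = []
    for index, record in enumerate(records, start=1):
        prompt = record["instruction"]
        # Assumption: append the optional input on a new line after the instruction.
        if record.get("input"):
            prompt = f"{prompt}\n{record['input']}"
        converted.append({
            "id": str(index),
            "conversations": [
                {"from": "user", "value": prompt},
                {"from": "assistant", "value": record["output"]},
            ],
        })
    return converted


sample = [{
    "instruction": "Give three tips for staying healthy.",
    "input": "",
    "output": "Eat well, exercise regularly, and sleep enough.",
}]
print(json.dumps(alpaca_to_conversations(sample), ensure_ascii=False, indent=2))
```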

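Then convert it to MindRecord data that MindFormers can read, using the following script:

# --input_glob:  source data, already converted to the format above
# --model_file:  vocabulary file
# --output_file: output path for the MindRecord data
python research/qwen/qwen_preprocess.py \
--input_glob /path/alpaca-data-conversation.json \
--model_file /path/qwen.tiktoken \
--seq_length 2048 \
--output_file /path/alpaca.mindrecord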

Result: [image: output of the preprocessing script]

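STEP 4: Start fine-tuning

Note: before fine-tuning, make sure the RANK_TABLE_FILE from STEP 1 has been generated and that hccl_8p_01234567_xx.xx.xx.xx.json is present inside the container.

Modify the yaml file, then launch fine-tuning with the script.

Run the following command to start fine-tuning: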

cd mindformers/research
bash run_singlenode.sh "python qwen/run_qwen.py \
--config qwen/run_qwen_7b_lora.yaml \
--load_checkpoint /data/qwen/models/Qwen-7B-Chat \
--use_parallel True \
--run_mode finetune \
--auto_trans_ckpt True \
--train_data /path/alpaca.mindrecord" \
/data/hccl_8p_01234567_10.17.2.76.json [0,8] 8

A few points to watch:

  • qwen/run_qwen_7b_lora.yaml is the configuration file. Check that the following entries are set correctly:

    load_checkpoint: 'model_dir'    # full weights, stored as `model_dir/rank_0/xxx.ckpt`
    
    model_config:
       seq_length: 2048 # must match the seq_length of the dataset
    
    train_dataset: &train_dataset
      data_loader:
        type: MindDataset
        dataset_dir: "/path/alpaca.mindrecord"  # path to the training dataset directory
        shuffle: True
    
    pet_config:
       pet_type: lora
       lora_rank: 64
       lora_alpha: 16
       lora_dropout: 0.05
       target_modules: '.*wq|.*wk|.*wv|.*wo|.*w1|.*w2|.*w3'
       freeze_exclude: ["*wte*", "*lm_head*"] # remove this entry when fine-tuning from the chat weights
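
To get a feel for what `lora_rank: 64` and `lora_alpha: 16` mean, the sketch below applies the standard LoRA formulation, in which a weight of shape (out, in) gains two low-rank matrices and the update is scaled by alpha / rank. The hidden size of 4096 is an assumption for a 7B-class model, and the helper names are hypothetical; this is an illustration of the arithmetic, not MindFormers code.

```python
def lora_extra_params(in_features: int, out_features: int, rank: int) -> int:
    """Parameters added by one adapter: A is (rank x in), B is (out x rank)."""
    return rank * in_features + out_features * rank


def lora_scaling(alpha: float, rank: int) -> float:
    """Scale applied to the low-rank update in the standard LoRA formulation."""
    return alpha / rank


hidden = 4096  # assumption: hidden size of a 7B-class model
extra = lora_extra_params(hidden, hidden, rank=64)
full = hidden * hidden
print(f"extra parameters per square projection: {extra}")
print(f"fraction of the full matrix: {extra / full:.4f}")
print(f"scaling factor: {lora_scaling(alpha=16, rank=64)}")
```

Under these assumptions each adapted square projection trains only a few percent of the parameters of the full matrix, which is why LoRA fits on hardware where full fine-tuning would not.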
    

Fine-tuning succeeded: [image: training log]

Q&A

  • Error: ValueError x.shape and y.shape need to broadcast. The full error output is:
...
[INFO] 2024-07-16 13:52:49,028 [mindformers/trainer/base_trainer.py:682] training_process: .........Build Running Wrapper From Config For Train..........
[INFO] 2024-07-16 13:52:49,028 [mindformers/trainer/base_trainer.py:500] create_model_wrapper: .........Build Model Wrapper for Train From Config..........
[INFO] 2024-07-16 13:52:49,040 [mindformers/trainer/base_trainer.py:689] training_process: .........Build Callbacks For Train..........
[INFO] 2024-07-16 13:52:49,042 [mindformers/core/callback/callback.py:530] __init__: Integrated_save is changed to False when using auto_parallel.
[INFO] 2024-07-16 13:52:49,043 [mindformers/trainer/base_trainer.py:724] training_process: .........Starting Init Train Model..........
[INFO] 2024-07-16 13:52:49,043 [mindformers/trainer/utils.py:321] transform_and_load_checkpoint: .........Building model.........
[ERROR] 2024-07-16 14:16:46,150 [mindformers/tools/cloud_adapter/cloud_monitor.py:43] wrapper: Traceback (most recent call last):
  File "/root/miniconda3/envs/mindspore2.2.13_py39/lib/python3.9/site-packages/mindformers/tools/cloud_adapter/cloud_monitor.py", line 34, in wrapper
    result = run_func(*args, **kwargs)
  File "/data/mindformers/research/qwen/run_qwen.py", line 137, in main
    trainer.finetune(finetune_checkpoint=ckpt, auto_trans_ckpt=auto_trans_ckpt)
  File "/root/miniconda3/envs/mindspore2.2.13_py39/lib/python3.9/site-packages/mindspore/_checkparam.py", line 1313, in wrapper
    return func(*args, **kwargs)
  File "/root/miniconda3/envs/mindspore2.2.13_py39/lib/python3.9/site-packages/mindformers/trainer/trainer.py", line 485, in finetune
    self.trainer.train(
  File "/root/miniconda3/envs/mindspore2.2.13_py39/lib/python3.9/site-packages/mindformers/trainer/causal_language_modeling/causal_language_modeling.py", line 97, in train
    self.training_process(
  File "/root/miniconda3/envs/mindspore2.2.13_py39/lib/python3.9/site-packages/mindformers/trainer/base_trainer.py", line 739, in training_process
    transform_and_load_checkpoint(config, model, network, dataset)
  File "/root/miniconda3/envs/mindspore2.2.13_py39/lib/python3.9/site-packages/mindformers/trainer/utils.py", line 322, in transform_and_load_checkpoint
    build_model(config, model, dataset, do_eval=do_eval, do_predict=do_predict)
  File "/root/miniconda3/envs/mindspore2.2.13_py39/lib/python3.9/site-packages/mindformers/trainer/utils.py", line 446, in build_model
    model.build(train_dataset=dataset, epoch=config.runner_config.epochs,
  File "/root/miniconda3/envs/mindspore2.2.13_py39/lib/python3.9/site-packages/mindspore/train/model.py", line 1274, in build
    self._init(train_dataset, valid_dataset, sink_size, epoch)
  File "/root/miniconda3/envs/mindspore2.2.13_py39/lib/python3.9/site-packages/mindspore/train/model.py", line 529, in _init
    train_network.compile(*inputs)
  File "/root/miniconda3/envs/mindspore2.2.13_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 997, in compile
    _cell_graph_executor.compile(self, phase=self.phase,
  File "/root/miniconda3/envs/mindspore2.2.13_py39/lib/python3.9/site-packages/mindspore/common/api.py", line 1547, in compile
    result = self._graph_executor.compile(obj, args, kwargs, phase, self._use_vm_mode())
  File "/root/miniconda3/envs/mindspore2.2.13_py39/lib/python3.9/site-packages/mindspore/ops/primitive.py", line 647, in __infer__
    out[track] = fn(*(x[track] for x in args))
  File "/root/miniconda3/envs/mindspore2.2.13_py39/lib/python3.9/site-packages/mindspore/ops/operations/math_ops.py", line 80, in infer_shape
    return get_broadcast_shape(x_shape, y_shape, self.name)
  File "/root/miniconda3/envs/mindspore2.2.13_py39/lib/python3.9/site-packages/mindspore/ops/_utils/utils.py", line 70, in get_broadcast_shape
    raise ValueError(f"For '{prim_name}', {arg_name1}.shape and {arg_name2}.shape need to "
ValueError: For 'Mul', x.shape and y.shape need to broadcast. The value of x.shape[-1] or y.shape[-1] must be 1 or -1 when they are not the same, but got x.shape = [8, 1, 1024] and y.shape = [1, 2048, 2048].

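Fix: make sure model_config.seq_length in the fine-tuning yaml matches the seq_length used when converting the data to MindRecord in STEP 3. The error above came from one being set to 1024 and the other to 2048.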
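
To see why the two shapes in the error cannot be combined, here is a sketch of the usual broadcasting rule: dimensions are compared from the last axis backwards, and each pair must be equal or contain a 1. It is a plain-Python illustration, not MindSpore's implementation, and it ignores dynamic (-1) dimensions.

```python
from itertools import zip_longest


def broadcast_shape(x_shape, y_shape):
    """Return the broadcast result shape, or raise ValueError if incompatible."""
    result = []
    # Walk both shapes from the last axis backwards; missing axes count as 1.
    pairs = zip_longest(reversed(x_shape), reversed(y_shape), fillvalue=1)
    for x_dim, y_dim in pairs:
        if x_dim == y_dim or y_dim == 1:
            result.append(x_dim)
        elif x_dim == 1:
            result.append(y_dim)
        else:
            raise ValueError(
                f"cannot broadcast {list(x_shape)} with {list(y_shape)}"
            )
    return list(reversed(result))


# Matching seq_length: the last axes agree, so the shapes combine.
print(broadcast_shape([8, 1, 2048], [1, 2048, 2048]))

# Mismatched seq_length, as in the traceback: 1024 vs 2048 on the last axis.
try:
    broadcast_shape([8, 1, 1024], [1, 2048, 2048])
except ValueError as err:
    print(err)
```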

  • dst_strategy_path = local_strategy_paths[0] fails with IndexError: list index out of range

...
[INFO] 2024-07-16 10:52:20,510 [mindformers/trainer/base_trainer.py:682] training_process: .........Build Running Wrapper From Config For Train..........
[INFO] 2024-07-16 10:52:20,510 [mindformers/trainer/base_trainer.py:500] create_model_wrapper: .........Build Model Wrapper for Train From Config..........
[INFO] 2024-07-16 10:52:20,523 [mindformers/trainer/base_trainer.py:689] training_process: .........Build Callbacks For Train..........
[INFO] 2024-07-16 10:52:20,525 [mindformers/trainer/base_trainer.py:724] training_process: .........Starting Init Train Model..........
[INFO] 2024-07-16 10:52:20,527 [mindformers/trainer/utils.py:334] transform_and_load_checkpoint: /data/qwen_ft/output is_share_disk: False
[INFO] 2024-07-16 10:52:20,527 [mindformers/trainer/utils.py:335] transform_and_load_checkpoint: world_size: 8
[INFO] 2024-07-16 10:52:20,528 [mindformers/trainer/utils.py:516] get_src_and_dst_strategy: .........Collecting strategy.........
[ERROR] 2024-07-16 10:52:20,530 [mindformers/tools/cloud_adapter/cloud_monitor.py:43] wrapper: Traceback (most recent call last):
  File "/root/miniconda3/envs/mindspore2.2.13_py39/lib/python3.9/site-packages/mindformers/tools/cloud_adapter/cloud_monitor.py", line 34, in wrapper
    result = run_func(*args, **kwargs)
  File "/data/qwen_ft/qwen/run_qwen.py", line 137, in main
    trainer.finetune(finetune_checkpoint=ckpt, auto_trans_ckpt=auto_trans_ckpt)
  File "/root/miniconda3/envs/mindspore2.2.13_py39/lib/python3.9/site-packages/mindspore/_checkparam.py", line 1313, in wrapper
    return func(*args, **kwargs)
  File "/root/miniconda3/envs/mindspore2.2.13_py39/lib/python3.9/site-packages/mindformers/trainer/trainer.py", line 485, in finetune
    self.trainer.train(
  File "/root/miniconda3/envs/mindspore2.2.13_py39/lib/python3.9/site-packages/mindformers/trainer/causal_language_modeling/causal_language_modeling.py", line 97, in train
    self.training_process(
  File "/root/miniconda3/envs/mindspore2.2.13_py39/lib/python3.9/site-packages/mindformers/trainer/base_trainer.py", line 739, in training_process
    transform_and_load_checkpoint(config, model, network, dataset)
  File "/root/miniconda3/envs/mindspore2.2.13_py39/lib/python3.9/site-packages/mindformers/trainer/utils.py", line 338, in transform_and_load_checkpoint
    src_ckpt_strategy, dst_ckpt_strategy = get_src_and_dst_strategy(config)
  File "/root/miniconda3/envs/mindspore2.2.13_py39/lib/python3.9/site-packages/mindformers/trainer/utils.py", line 522, in get_src_and_dst_strategy
    dst_strategy_path = local_strategy_paths[0]
IndexError: list index out of range

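This error arises as follows. When the full weights are used (qwen_7b_base.ckpt from STEP 2) and fine-tuning is configured with auto_trans_ckpt=True, the script automatically starts a weight conversion: it splits the full weights into distributed weights for training on 8 cards and generates the 8-card strategy files. If the weight file has not been placed at the expected location, or the file itself is corrupted, the splitting and strategy-file generation do not happen as expected. local_strategy_paths then contains no files of the expected format, or is empty, and indexing it raises this error. Possible causes and fixes:

  1. Check that the weights are stored as model_dir/rank_0/xxx.ckpt; a wrong path can prevent the strategy files from being generated.
  2. Check whether the weight file is corrupted; re-downloading it as in STEP 2 is recommended.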
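
The first check can be scripted. The sketch below only verifies the directory layout and that the checkpoint files are not empty; it cannot detect a partially corrupted file. The function name is hypothetical, and the example path is the container-side path used earlier in this post.

```python
from pathlib import Path


def check_checkpoint_layout(model_dir: str) -> list:
    """Return a list of problems under model_dir; an empty list means the layout looks fine."""
    rank_dir = Path(model_dir) / "rank_0"
    if not rank_dir.is_dir():
        return [f"missing directory: {rank_dir}"]

    problems = []
    ckpt_files = sorted(rank_dir.glob("*.ckpt"))
    if not ckpt_files:
        problems.append(f"no .ckpt file in {rank_dir}")
    for ckpt in ckpt_files:
        # A zero-byte file usually points to an interrupted download or copy.
        if ckpt.stat().st_size == 0:
            problems.append(f"empty checkpoint file: {ckpt}")
    return problems


print(check_checkpoint_layout("/data/qwen/models/Qwen-7B-Chat"))
```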

From: https://www.cnblogs.com/tungsten106/p/18395937
