第四期书生大模型实战营基础岛第5关

标签：实战第四期 text finetune 书生 xtuner file root 模型

XTuner 微调个人小助手认知任务

注：微调内容需要使用 30% A100 才能完成！！！

环境配置与数据准备

1. 使用 conda 先构建一个 Python-3.10 的虚拟环境，并激活

ps：若root中存在Tutorial文件夹记得先删除哦，不然之后在“创建一个用于存储微调数据的文件夹”时有出现问题的TAT。

cd ~
#git clone 本repo
git clone https://kkgithub.com/InternLM/Tutorial.git -b camp4
mkdir -p /root/finetune && cd /root/finetune
conda create -n xtuner-env python=3.10 -y
conda activate xtuner-env

2. 安装 XTuner等相关库文件

git clone https://kkgithub.com/InternLM/xtuner.git
cd /root/finetune/xtuner

pip install  -e '.[all]'
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu121
pip install transformers==4.39.0

安装时间有点长，需要耐心等待一下哦。

3.打印配置文件

xtuner list-cfg

若打印的内容下图所示，说明安装成功。

修改提供的数据

1. 创建一个用于存储微调数据的文件夹

mkdir -p /root/finetune/data && cd /root/finetune/data
cp -r /root/Tutorial/data/assistant_Tuner.jsonl  /root/finetune/data

此时 `finetune` 文件夹下应该有如下结构

finetune
├── data
│   └── assistant_Tuner.jsonl
└── xtuner

2. 创建 `change_script.py` 文件

# 创建 `change_script.py` 文件
touch /root/finetune/data/change_script.py

3.打开该`change_script.py`文件后将下面的内容复制进去。

import json
import argparse
from tqdm import tqdm

def process_line(line, old_text, new_text):
    # 解析 JSON 行
    data = json.loads(line)
    
    # 递归函数来处理嵌套的字典和列表
    def replace_text(obj):
        if isinstance(obj, dict):
            return {k: replace_text(v) for k, v in obj.items()}
        elif isinstance(obj, list):
            return [replace_text(item) for item in obj]
        elif isinstance(obj, str):
            return obj.replace(old_text, new_text)
        else:
            return obj
    
    # 处理整个 JSON 对象
    processed_data = replace_text(data)
    
    # 将处理后的对象转回 JSON 字符串
    return json.dumps(processed_data, ensure_ascii=False)

def main(input_file, output_file, old_text, new_text):
    with open(input_file, 'r', encoding='utf-8') as infile, \
         open(output_file, 'w', encoding='utf-8') as outfile:
        
        # 计算总行数用于进度条
        total_lines = sum(1 for _ in infile)
        infile.seek(0)  # 重置文件指针到开头
        
        # 使用 tqdm 创建进度条
        for line in tqdm(infile, total=total_lines, desc="Processing"):
            processed_line = process_line(line.strip(), old_text, new_text)
            outfile.write(processed_line + '\n')

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Replace text in a JSONL file.")
    parser.add_argument("input_file", help="Input JSONL file to process")
    parser.add_argument("output_file", help="Output file for processed JSONL")
    parser.add_argument("--old_text", default="尖米", help="Text to be replaced")
    parser.add_argument("--new_text", default="闻星", help="Text to replace with")
    args = parser.parse_args()

    main(args.input_file, args.output_file, args.old_text, args.new_text)

4.修改第44行的代码

#原代码：
	parser.add_argument("--new_text", default="闻星", help="Text to replace with")
#修改后的代码
    parser.add_argument("--new_text", default="你的名字", help="Text to replace with")

5.运行文件

# usage：python change_script.py {input_file.jsonl} {output_file.jsonl}
cd ~/finetune/data
python change_script.py ./assistant_Tuner.jsonl ./assistant_Tuner_change.jsonl

6.查看数据

cat assistant_Tuner_change.jsonl | head -n 3

此处结果太长不再展示，主要是检查自己要修改的名字是否在数据中。

训练启动

1.复制模型

mkdir /root/finetune/models

ln -s /root/share/new_models/Shanghai_AI_Laboratory/internlm2_5-7b-chat /root/finetune/models/internlm2_5-7b-chat

2. 修改 Config

# cd {path/to/finetune}
cd /root/finetune
mkdir ./config
cd config
xtuner copy-cfg internlm2_5_chat_7b_qlora_alpaca_e3 ./

3. 启动微调

完成了所有的准备工作后，我们就可以正式的开始我们下一阶段的旅程：XTuner 启动~！

当我们准备好了所有内容，我们只需要将使用 xtuner train 命令令即可开始训练。

cd /root/finetune
conda activate xtuner-env

xtuner train ./config/internlm2_5_chat_7b_qlora_alpaca_e3_copy.py --deepspeed deepspeed_zero2 --work-dir ./work_dirs/assistTuner

部分训练过程如下图所示：

4. 权重转换

模型转换的本质其实就是将原本使用 Pytorch 训练出来的模型权重文件转换为目前通用的 HuggingFace 格式文件，那么我们可以通过以下命令来实现一键转换。

cd /root/finetune/work_dirs/assistTuner

conda activate xtuner-env

# 先获取最后保存的一个pth文件
pth_file=`ls -t /root/finetune/work_dirs/assistTuner/*.pth | head -n 1 | sed 's/:$//'`
export MKL_SERVICE_FORCE_INTEL=1
export MKL_THREADING_LAYER=GNU
xtuner convert pth_to_hf ./internlm2_5_chat_7b_qlora_alpaca_e3_copy.py ${pth_file} ./hf

转换完成之后，hf文件结构如图：

5. 模型合并

对于 LoRA 或者 QLoRA 微调出来的模型其实并不是一个完整的模型，而是一个额外的层（Adapter），训练完的这个层最终还是要与原模型进行合并才能被正常的使用。

对于全量微调的模型（full）其实是不需要进行整合这一步的，因为全量微调修改的是原模型的权重而非微调一个新的 Adapter ，因此是不需要进行模型整合的。

在 XTuner 中提供了一键合并的命令 xtuner convert merge，在使用前我们需要准备好三个路径，包括原模型的路径、训练好的 Adapter 层的（模型格式转换后的）路径以及最终保存的路径。

xtuner convert merge命令用于合并模型。该命令需要三个参数：LLM 表示原模型路径，ADAPTER 表示 Adapter 层的路径， SAVE_PATH 表示合并后的模型最终的保存路径。

在模型合并这一步还有其他很多的可选参数，包括：

参数名	解释
--max-shard-size {GB}	代表每个权重文件最大的大小（默认为2GB）
--device {device_name}	这里指的就是device的名称，可选择的有cuda、cpu和auto，默认为cuda即使用gpu进行运算
--is-clip	这个参数主要用于确定模型是不是CLIP模型，假如是的话就要加上，不是就不需要添加

cd /root/finetune/work_dirs/assistTuner
conda activate xtuner-env

export MKL_SERVICE_FORCE_INTEL=1
export MKL_THREADING_LAYER=GNU
xtuner convert merge /root/finetune/models/internlm2_5-7b-chat ./hf ./merged --max-shard-size 2GB

合并完成啦！

模型 WebUI 对话

微调完成后，我们可以再次运行 xtuner_streamlit_demo.py 脚本来观察微调后的对话效果，不过在运行之前，我们需要将脚本中的模型路径修改为微调后的模型的路径。

cd ~/Tutorial/tools/L1_XTuner_code

修改xtuner_streamlit_demo.py 文件第33行代码

# 直接修改脚本文件第33行
- model_name_or_path = "Shanghai_AI_Laboratory/internlm2_5-7b-chat"
+ model_name_or_path = "/root/finetune/work_dirs/assistTuner/merged"

然后，我们可以直接启动应用。

conda activate xtuner-env

streamlit run /root/Tutorial/tools/L1_XTuner_code/xtuner_streamlit_demo.py

接着，在本地远程连接开发机

最后，通过浏览器访问：http://127.0.0.1:8501 来进行对话啦~

标签：实战,第四期,text,finetune,书生,xtuner,file,root,模型
From： https://blog.csdn.net/2401_87331158/article/details/143996516

第四期书生大模型实战营基础岛第5关

XTuner 微调个人小助手认知任务

环境配置与数据准备

1. 使用 conda 先构建一个 Python-3.10 的虚拟环境，并激活

2. 安装 XTuner等相关库文件

3.打印配置文件

修改提供的数据

1. 创建一个用于存储微调数据的文件夹

2. 创建 `change_script.py` 文件

3.打开该`change_script.py`文件后将下面的内容复制进去。

4.修改第44行的代码

5.运行文件

6.查看数据

训练启动

1.复制模型

2. 修改 Config

3. 启动微调

4. 权重转换

5. 模型合并

模型 WebUI 对话

相关文章

赞助商

阅读排行

第四期书生大模型实战营 基础岛 第5关

XTuner 微调个人小助手认知任务

环境配置与数据准备

1. 使用 conda 先构建一个 Python-3.10 的虚拟环境，并激活

2. 安装 XTuner等 相关库文件

3.打印配置文件

修改提供的数据

1. 创建一个用于存储微调数据的文件夹

2. 创建 `change_script.py` 文件

3.打开该change_script.py文件后将下面的内容复制进去。

4.修改第44行的代码

5.运行文件

6.查看数据

训练启动

1.复制模型

2. 修改 Config

3. 启动微调

4. 权重转换

5. 模型合并

模型 WebUI 对话

相关文章

赞助商

阅读排行

第四期书生大模型实战营基础岛第5关

2. 安装 XTuner等相关库文件

3.打开该`change_script.py`文件后将下面的内容复制进去。