
CUDA out of memory when training the Llama-2-7b-hf model locally

Tags: python pytorch artificial-intelligence huggingface-transformers

I want to fine-tune meta-llama/Llama-2-7b-hf locally on my laptop. I run out of CUDA memory when instantiating the Trainer class. I have 16 GB of system RAM and a GTX 1060 with 6 GB of GPU memory. I have split the model layers between the CPU and the GPU to keep the GPU from filling up, and I am using a small batch size. However, when the Trainer class is instantiated, the code tries to fill the GPU again and runs out of memory. I tried some suggestions from ChatGPT and used a custom trainer, but it still does not work. Is it possible to train the "meta-llama/Llama-2-7b-hf" model locally? What am I missing below?

import pandas as pd
from datasets import load_dataset, Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer, BitsAndBytesConfig, AutoConfig
import torch
from peft import get_peft_model, LoraConfig, TaskType
from transformers import DefaultDataCollator

from huggingface_hub import login
login(token='hf_abcdefghijklmnopqrstuvwxyz')

import os
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:64'
os.environ['TRANSFORMERS_CACHE'] = '~/.cache/huggingface/transformers/'


##########################
class CustomTrainer(Trainer):
    def training_step(self, model, inputs):
        model.train()
        inputs = self._prepare_inputs(inputs)
        outputs = model(**inputs)
        loss = outputs.loss
        return loss

    def evaluation_step(self, model, inputs):
        model.eval()
        inputs = self._prepare_inputs(inputs)
        with torch.no_grad():
            outputs = model(**inputs)
        return outputs

    def _prepare_inputs(self, inputs):
        # Define the target devices here; otherwise gpu/cpu are undefined in this scope
        cpu = torch.device("cpu")
        gpu = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        for k, v in inputs.items():
            if isinstance(v, torch.Tensor):
                device = gpu if 'input_ids' in k else cpu  # Adjust based on your logic
                inputs[k] = v.to(device)
        return inputs
##########################       
    
# Load your dataset
def load_data(train_path, val_path):
    train_df = pd.read_csv(train_path)
    val_df = pd.read_csv(val_path)

    train_dataset = Dataset.from_pandas(train_df)
    val_dataset = Dataset.from_pandas(val_df)

    return train_dataset, val_dataset

# Preprocess the data
def preprocess_function(examples, tokenizer, max_length=128):
    return tokenizer(examples['text'], padding='max_length', truncation=True, max_length=max_length)

# Main function to fine-tune the model
def fine_tune_model(model_name, train_dataset, val_dataset, output_dir, num_labels, local_model_path, num_train_epochs=3):

    # Load LLaMA tokenizer
    tokenizer = AutoTokenizer.from_pretrained(local_model_path)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right" # Fix weird overflow issue with fp16 training

    #tokenizer = AutoTokenizer.from_pretrained(model_name)
    #if tokenizer.pad_token is None:
    #    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    
    tokenized_train = train_dataset.map(lambda x: preprocess_function(x, tokenizer), batched=True)
    tokenized_val = val_dataset.map(lambda x: preprocess_function(x, tokenizer), batched=True)

    quantization_config = BitsAndBytesConfig(llm_int8_enable_fp32_cpu_offload=True)
    device_map = {
        "transformer.wte": "cpu",
        "transformer.h": "cuda",
        "transformer.ln_f": "cpu",
        'model.embed_tokens': 'cuda',
        'model.encoder': 'cpu',
        'model.decoder': 'cuda',
        "lm_head": "cpu",
        "model.layers": "cuda",
        "model.norm": "cuda",
        "score": "cuda"
    }
    
    config = AutoConfig.from_pretrained(local_model_path)
    config.num_labels = num_labels
    config.device_map=device_map
    config.load_in_8bit=True
    config.llm_int8_enable_fp32_cpu_offload=True

    model = AutoModelForSequenceClassification.from_pretrained(local_model_path, config=config)
    
    
    for param in model.parameters():
        param.requires_grad = False
    
    # Add LoRA adapters for efficient fine-tuning
    peft_config = LoraConfig(
        task_type=TaskType.SEQ_CLS,
        r=8,
        lora_alpha=32,
        lora_dropout=0.1,
        bias="none",
    )
    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters()  # Check the number of trainable parameters
    print(model)

    cpu = torch.device("cpu")
    gpu = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    
    for idx, layer in enumerate(model.base_model.model.model.layers):
        if idx % 5 == 0:
            layer.to(gpu)  # Move few layers to GPU
        else:
            layer.to(cpu)  # Move more layers to CPU

    # Verify device allocation
    for idx, layer in enumerate(model.base_model.model.model.layers[:10]):  # Checking the first 10 layers as an example
        print(f"Layer {idx} is on {next(layer.parameters()).device}")
        
    class CustomDataCollator(DefaultDataCollator):
        def __call__(self, features):
            batch = super().__call__(features)
            for key in batch:
                if isinstance(batch[key], torch.Tensor):
                    batch[key] = batch[key].to(gpu if 'input_ids' in key else cpu)  # Adjust based on your logic
            return batch
        
    data_collator = CustomDataCollator()


    training_args = TrainingArguments(
        output_dir=output_dir,
        eval_strategy="epoch",
        learning_rate=2e-5,
        per_device_train_batch_size=4,
        per_device_eval_batch_size=4,
        num_train_epochs=num_train_epochs,
        weight_decay=0.01,
    )
   

    trainer = CustomTrainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_train,
        eval_dataset=tokenized_val,
        tokenizer=tokenizer,
        data_collator = data_collator
    )
    
    trainer.train()

if __name__ == "__main__":
    train_path = "workspace/learn_AI/llm_demo/train.csv"
    val_path = "workspace/learn_AI/llm_demo/validate.csv"
    model_name = "meta-llama/Llama-2-7b-hf"  
    local_model_path = str("local_directory/" + model_name)
    output_dir = "./results"
    num_labels = 2  # Number of classes in your classification task

    train_dataset, val_dataset = load_data(train_path, val_path)
    fine_tune_model(model_name, train_dataset, val_dataset, output_dir, num_labels, local_model_path)

I downloaded all the model files from https://huggingface.co/meta-llama/Llama-2-7b-hf/tree/main into a local folder.

train.csv

text,labels
"I absolutely love this product! It works perfectly.",1
"The quality of this item is outstanding, highly recommend it.",1
"This is the worst purchase I've ever made. Completely useless.",0
"I am very dissatisfied with this product. It broke after one use.",0
"Fantastic experience! Will definitely buy again.",1
"Terrible customer service. I will not be buying from this store again.",0
"The delivery was quick and the product was as described. Very happy.",1
"The product arrived damaged and the return process was a nightmare.",0
"This item is okay, but it doesn't meet my expectations.",0
"Excellent quality and easy to use. Very satisfied.",1
"Awful experience, the product doesn't work as advertised.",0
"I love the design and functionality of this product. Great buy.",1
"Not worth the money. Very disappointed.",0
"Superb! Exactly what I was looking for.",1
"The product is decent, but the packaging was poor.",0
"Very pleased with my purchase. Highly recommend.",1
"Product stopped working after a week. Very frustrated.",0
"Best purchase I've made in a long time. Extremely happy.",1
"Would not recommend this product. It's a waste of money.",0
"Overall, a good product. Satisfied with my purchase.",1

validate.csv

text,labels
"This is a fantastic product. Exceeded my expectations.",1
"Very poor quality, not worth the money.",0
"Absolutely love it! Will recommend to friends.",1
"Completely dissatisfied with this item.",0
"Great value for the price. Happy with the purchase.",1
"The product did not work as expected. Disappointed.",0
"Wonderful experience, product arrived on time and works perfectly.",1
"Terrible quality. It broke within a few days.",0
"Satisfied with my purchase. Would buy again.",1
"Not what I expected. Very disappointed.",0
"Excellent product, great customer service.",1
"Regret buying this product. Total waste of money.",0
"Product is as described and works well. Happy customer.",1
"Product arrived late and was damaged. Very unhappy.",0
"Exceeded my expectations in every way. Fantastic!",1
"Low quality and poor performance. Do not recommend.",0
"Very pleased with the product. Will purchase again.",1
"Did not meet my expectations. Unsatisfied.",0
"Best product I've bought recently. Highly recommend.",1
"Terrible experience. Will not buy from this brand again.",0

Running into CUDA out-of-memory errors when trying to fine-tune Llama-2-7b-hf on a laptop with limited resources is expected. Even with layers strategically offloaded to the CPU, the model and its activations are too large for your 6 GB of GPU memory.

However, let's explore some options that can make this work:

  1. Lower the batch size: You are already doing this, but it is worth noting that this is the most direct lever on memory usage. Try experimenting with a batch size of 1 or 2.

  2. Gradient accumulation: This technique simulates a larger batch size without having to fit everything into memory at once. It accumulates gradients over several steps before updating the model weights. You can enable it with the gradient_accumulation_steps argument of the Transformers Trainer.

training_args = TrainingArguments(
    ...,
    gradient_accumulation_steps=2,  # or 4, or 8, depending on your memory limits
    ...,
)

  3. Gradient checkpointing: This method saves memory, at the cost of extra computation, by storing only a subset of the computation graph during training. You can enable it by setting gradient_checkpointing=True in TrainingArguments.

training_args = TrainingArguments(
    ...,
    gradient_checkpointing=True,
    ...,
)

  4. Model quantization: Quantization reduces the memory footprint by representing model weights in lower-precision data types such as 8-bit or 4-bit integers. You are already creating a BitsAndBytesConfig, but it only takes effect when it is passed to from_pretrained as quantization_config (setting load_in_8bit on the AutoConfig does nothing), so make sure to try different quantization settings, for example the 4-bit setup sketched below.
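
A minimal sketch of such a 4-bit setup, not the original poster's configuration; local_model_path and num_labels are the names used in the question, everything else is an assumption:

from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig
import torch

# 4-bit NF4 quantization via bitsandbytes; compute in fp16 to save memory
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,  # also compress the quantization constants
)

# The quantization only happens when the config is passed to from_pretrained
model = AutoModelForSequenceClassification.from_pretrained(
    local_model_path,
    num_labels=num_labels,
    quantization_config=bnb_config,
    device_map="auto",  # let accelerate decide the CPU/GPU placement
)

When fine-tuning a quantized model with LoRA, it is also common to call peft's prepare_model_for_kbit_training(model) before get_peft_model.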

  5. Use a smaller model: If you are constrained, consider a model smaller than Llama-2-7b-hf (note that Llama-2-13b-hf and Llama-2-70b-hf are larger, not smaller; 7B is already the smallest Llama-2 variant). A model in the 1-3B parameter range needs far less memory and can still perform well on a simple classification task.

  6. Cloud computing: For serious fine-tuning of large language models, cloud platforms such as Google Colab, AWS, or Paperspace offer instances with plenty of RAM and more capable GPUs, letting you train the model effectively.

Additional notes on your code:

  • Your custom CustomTrainer and CustomDataCollator look reasonable and should ensure that the right tensors are sent to the right devices.

  • The logic for assigning layers to specific devices (CPU or GPU) seems sound, but make sure it matches your particular hardware and model architecture; an alternative that lets accelerate compute the placement automatically is sketched after this list.

  • Keep monitoring your memory usage to understand which components consume the most memory, and adjust your strategy accordingly; a small helper for this is also sketched below.
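
As an alternative to the hand-written device_map in the question, transformers can compute a placement automatically. This is a hedged sketch, not the original code; the max_memory values are assumptions for a 6 GB GPU / 16 GB RAM laptop:

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    local_model_path,
    num_labels=num_labels,
    device_map="auto",                       # accelerate decides the CPU/GPU split
    max_memory={0: "5GiB", "cpu": "12GiB"},  # leave headroom on both devices
    offload_folder="offload",                # spill remaining weights to disk if needed
)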
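
For monitoring, a small helper like the one below (an illustration, not part of the original post) can be called after model loading and after the first training step to see where the memory goes:

import torch

def report_gpu_memory(tag=""):
    # Print currently allocated and reserved CUDA memory in GiB
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1024**3
        reserved = torch.cuda.memory_reserved() / 1024**3
        print(f"[{tag}] allocated: {allocated:.2f} GiB, reserved: {reserved:.2f} GiB")

report_gpu_memory("after model load")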

Remember that fine-tuning a large language model on a resource-constrained machine can be challenging and time-consuming. It is essential to experiment with different techniques and find what works best for your particular setup.

From: 78776756
