I want to fine-tune meta-llama/Llama-2-7b-hf locally on my laptop. When instantiating the Trainer class, I run out of CUDA memory. I have 16 GB of system RAM and a GTX 1060 with 6 GB of GPU memory. I have split the model layers between CPU and GPU to keep the GPU from filling up, and I use a small batch size. However, when the Trainer class is instantiated, the code tries to fill the GPU again and runs out of memory. I tried some of Chat-GPT's suggestions and used a custom trainer, but it still doesn't work. Is it possible to train the "meta-llama/Llama-2-7b-hf" model locally? What am I missing below?
import pandas as pd
from datasets import load_dataset, Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer, BitsAndBytesConfig, AutoConfig
import torch
from peft import get_peft_model, LoraConfig, TaskType
from transformers import DefaultDataCollator
from huggingface_hub import login
login(token='hf_abcdefghijklmnopqrstuvwxyz')
import os
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:64'
os.environ['TRANSFORMERS_CACHE'] = '~/.cache/huggingface/transformers/'
##########################
class CustomTrainer(Trainer):
    def training_step(self, model, inputs):
        model.train()
        inputs = self._prepare_inputs(inputs)
        outputs = model(**inputs)
        loss = outputs.loss
        return loss

    def evaluation_step(self, model, inputs):
        model.eval()
        inputs = self._prepare_inputs(inputs)
        with torch.no_grad():
            outputs = model(**inputs)
        return outputs

    def _prepare_inputs(self, inputs):
        for k, v in inputs.items():
            if isinstance(v, torch.Tensor):
                device = gpu if 'input_ids' in k else cpu  # Adjust based on your logic
                inputs[k] = v.to(device)
        return inputs
##########################
# Load your dataset
def load_data(train_path, val_path):
    train_df = pd.read_csv(train_path)
    val_df = pd.read_csv(val_path)
    train_dataset = Dataset.from_pandas(train_df)
    val_dataset = Dataset.from_pandas(val_df)
    return train_dataset, val_dataset

# Preprocess the data
def preprocess_function(examples, tokenizer, max_length=128):
    return tokenizer(examples['text'], padding='max_length', truncation=True, max_length=max_length)
# Main function to fine-tune the model
def fine_tune_model(model_name, train_dataset, val_dataset, output_dir, num_labels, local_model_path, num_train_epochs=3):
    # Load LLaMA tokenizer
    tokenizer = AutoTokenizer.from_pretrained(local_model_path)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"  # Fix weird overflow issue with fp16 training
    #tokenizer = AutoTokenizer.from_pretrained(model_name)
    #if tokenizer.pad_token is None:
    #    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    tokenized_train = train_dataset.map(lambda x: preprocess_function(x, tokenizer), batched=True)
    tokenized_val = val_dataset.map(lambda x: preprocess_function(x, tokenizer), batched=True)
    quantization_config = BitsAndBytesConfig(llm_int8_enable_fp32_cpu_offload=True)
    device_map = {
        "transformer.wte": "cpu",
        "transformer.h": "cuda",
        "transformer.ln_f": "cpu",
        'model.embed_tokens': 'cuda',
        'model.encoder': 'cpu',
        'model.decoder': 'cuda',
        "lm_head": "cpu",
        "model.layers": "cuda",
        "model.norm": "cuda",
        "score": "cuda"
    }
    config = AutoConfig.from_pretrained(local_model_path)
    config.num_labels = num_labels
    config.device_map = device_map
    config.load_in_8bit = True
    config.llm_int8_enable_fp32_cpu_offload = True
    model = AutoModelForSequenceClassification.from_pretrained(local_model_path, config=config)
    for param in model.parameters():
        param.requires_grad = False
    # Add LoRA adapters for efficient fine-tuning
    peft_config = LoraConfig(
        task_type=TaskType.SEQ_CLS,
        r=8,
        lora_alpha=32,
        lora_dropout=0.1,
        bias="none",
    )
    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters()  # Check the number of trainable parameters
    print(model)
    cpu = torch.device("cpu")
    gpu = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    for idx, layer in enumerate(model.base_model.model.model.layers):
        if idx % 5 == 0:
            layer.to(gpu)  # Move few layers to GPU
        else:
            layer.to(cpu)  # Move more layers to CPU
    # Verify device allocation
    for idx, layer in enumerate(model.base_model.model.model.layers[:10]):  # Checking the first 10 layers as an example
        print(f"Layer {idx} is on {next(layer.parameters()).device}")

    class CustomDataCollator(DefaultDataCollator):
        def __call__(self, features):
            batch = super().__call__(features)
            for key in batch:
                if isinstance(batch[key], torch.Tensor):
                    batch[key] = batch[key].to(gpu if 'input_ids' in key else cpu)  # Adjust based on your logic
            return batch

    data_collator = CustomDataCollator()
    training_args = TrainingArguments(
        output_dir=output_dir,
        eval_strategy="epoch",
        learning_rate=2e-5,
        per_device_train_batch_size=4,
        per_device_eval_batch_size=4,
        num_train_epochs=num_train_epochs,
        weight_decay=0.01,
    )
    trainer = CustomTrainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_train,
        eval_dataset=tokenized_val,
        tokenizer=tokenizer,
        data_collator=data_collator
    )
    trainer.train()
if __name__ == "__main__":
    train_path = "workspace/learn_AI/llm_demo/train.csv"
    val_path = "workspace/learn_AI/llm_demo/validate.csv"
    model_name = "meta-llama/Llama-2-7b-hf"
    local_model_path = str("local_directory/" + model_name)
    output_dir = "./results"
    num_labels = 2  # Number of classes in your classification task
    train_dataset, val_dataset = load_data(train_path, val_path)
    fine_tune_model(model_name, train_dataset, val_dataset, output_dir, num_labels, local_model_path)
I downloaded all the model files from https://huggingface.co/meta-llama/Llama-2-7b-hf/tree/main to a local folder.
train.csv
text,labels
"I absolutely love this product! It works perfectly.",1
"The quality of this item is outstanding, highly recommend it.",1
"This is the worst purchase I've ever made. Completely useless.",0
"I am very dissatisfied with this product. It broke after one use.",0
"Fantastic experience! Will definitely buy again.",1
"Terrible customer service. I will not be buying from this store again.",0
"The delivery was quick and the product was as described. Very happy.",1
"The product arrived damaged and the return process was a nightmare.",0
"This item is okay, but it doesn't meet my expectations.",0
"Excellent quality and easy to use. Very satisfied.",1
"Awful experience, the product doesn't work as advertised.",0
"I love the design and functionality of this product. Great buy.",1
"Not worth the money. Very disappointed.",0
"Superb! Exactly what I was looking for.",1
"The product is decent, but the packaging was poor.",0
"Very pleased with my purchase. Highly recommend.",1
"Product stopped working after a week. Very frustrated.",0
"Best purchase I've made in a long time. Extremely happy.",1
"Would not recommend this product. It's a waste of money.",0
"Overall, a good product. Satisfied with my purchase.",1
validate.csv
text,labels
"This is a fantastic product. Exceeded my expectations.",1
"Very poor quality, not worth the money.",0
"Absolutely love it! Will recommend to friends.",1
"Completely dissatisfied with this item.",0
"Great value for the price. Happy with the purchase.",1
"The product did not work as expected. Disappointed.",0
"Wonderful experience, product arrived on time and works perfectly.",1
"Terrible quality. It broke within a few days.",0
"Satisfied with my purchase. Would buy again.",1
"Not what I expected. Very disappointed.",0
"Excellent product, great customer service.",1
"Regret buying this product. Total waste of money.",0
"Product is as described and works well. Happy customer.",1
"Product arrived late and was damaged. Very unhappy.",0
"Exceeded my expectations in every way. Fantastic!",1
"Low quality and poor performance. Do not recommend.",0
"Very pleased with the product. Will purchase again.",1
"Did not meet my expectations. Unsatisfied.",0
"Best product I've bought recently. Highly recommend.",1
"Terrible experience. Will not buy from this brand again.",0
Running into CUDA out-of-memory errors when trying to fine-tune Llama-2-7b-hf on a resource-limited laptop is to be expected. Even with strategic offloading of layers to the CPU, the model and its activations are too large for your 6 GB of GPU memory.
However, let's explore some options that could make this work:
- Reduce the batch size: you are already doing this, but it is worth noting that this is the most direct way to cut memory usage. Try experimenting with a batch size of 1 or 2.
- Gradient accumulation: this technique simulates a larger batch size without having to fit everything into memory at once. It accumulates gradients over several steps and only then updates the model weights. You can enable it with the `gradient_accumulation_steps` argument of the Transformers Trainer:

  ```python
  training_args = TrainingArguments(
      ...,
      gradient_accumulation_steps=2,  # or 4, or 8, depending on your memory limits
      ...
  )
  ```

- Gradient checkpointing: this method trades extra compute time for memory by storing only a subset of the computation graph during training. You can enable it by setting `gradient_checkpointing=True` in `TrainingArguments`:

  ```python
  training_args = TrainingArguments(
      ...,
      gradient_checkpointing=True,
      ...
  )
  ```

- Model quantization: quantization reduces the memory footprint by representing model weights and activations with lower-precision data types (such as `bfloat16`). You are already using `BitsAndBytesConfig`, but make sure you have tried different quantization settings (see the sketch after this list).
- Use a smaller model: if you are constrained, consider a model smaller than `Llama-2-7b-hf`; note that `Llama-2-13b-hf` and `Llama-2-70b-hf` are larger variants and need even more memory. A genuinely smaller model requires less memory and can still deliver good performance.
- Cloud computing: for serious fine-tuning of large language models, cloud platforms such as Google Colab, AWS, or Paperspace offer instances with plenty of RAM and more capable GPUs, letting you train the model effectively.
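To make the quantization point concrete, here is a minimal sketch (not a drop-in fix) of loading the model in 4-bit with bitsandbytes and attaching LoRA adapters, which typically brings the 7B weights down to roughly 3.5-4 GB on the GPU. It reuses `local_model_path`, `num_labels`, and the LoRA settings from your script; the exact savings depend on your environment, so treat it as a starting point.

```python
import torch
from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization with double quantization; compute in fp16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForSequenceClassification.from_pretrained(
    local_model_path,              # path from your script
    num_labels=num_labels,         # 2 for your sentiment task
    quantization_config=bnb_config,
    device_map="auto",             # let accelerate place layers, offloading to CPU if needed
)

# Cast norms/heads to fp32 and enable gradient checkpointing for k-bit training.
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    bias="none",
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()  # only the LoRA adapters (and the classification head) train
```

With `device_map="auto"` on a quantized model, accelerate decides where each layer lives, so the manual `layer.to(cpu/gpu)` loop and the device-moving data collator should be dropped; moving layers of a dispatched model by hand is likely part of why the GPU fills up again when the Trainer is instantiated.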
Additional notes on your code:
- Your custom `CustomTrainer` and `CustomDataCollator` look fine and should ensure that the right tensors are sent to the right devices.
- The logic for assigning layers to specific devices (CPU or GPU) seems reasonable. Still, make sure it matches your particular hardware and model architecture.
- Keep monitoring your memory usage to see which components consume the most memory, and adjust your strategy accordingly (a small monitoring helper is sketched below).
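For the monitoring point, a small helper along these lines (the `report_gpu_memory` name is just for illustration) can be called at the places in your script where memory jumps, e.g. after `from_pretrained`, after `get_peft_model`, and after building the Trainer:

```python
import torch

def report_gpu_memory(tag: str) -> None:
    """Print current and peak CUDA memory usage in GiB (illustrative helper)."""
    if not torch.cuda.is_available():
        print(f"[{tag}] CUDA not available")
        return
    allocated = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    peak = torch.cuda.max_memory_allocated() / 1024**3
    print(f"[{tag}] allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB, peak={peak:.2f} GiB")

# Example usage at the points where your script runs out of memory:
# report_gpu_memory("after from_pretrained")
# report_gpu_memory("after get_peft_model")
# report_gpu_memory("after Trainer init")
```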
Keep in mind that fine-tuning a large language model on a resource-limited machine can be challenging and time-consuming. Experimenting with different techniques to find what works best for your particular setup is essential.
Tags: python, pytorch, artificial-intelligence, huggingface-transformers From: 78776756