
[LLM in Practice] Fine-tuning Microsoft Phi-2 for Dialogue Summarization with QLoRA


This post fine-tunes the 2.7B microsoft/phi-2 on the DialogSum dataset with QLoRA (4-bit quantization plus LoRA) on two T4 GPUs.

The post is adapted from the Kaggle notebook fine-tuning-llm-for-dialogue-summarization:
https://www.kaggle.com/code/aisuko/fine-tuning-llm-for-dialogue-summarization

1. Install Dependencies

First, install the required packages:

%%capture
!pip install transformers==4.36.2
!pip install accelerate==0.25.0
!pip install evaluate==0.4.1
!pip install datasets==2.15.0
!pip install peft==0.7.1
!pip install bitsandbytes==0.41.3
# !pip install tqdm==4.66.1

Then configure your Hugging Face token and your Weights & Biases token (used to monitor the training run) in Kaggle.
In the notebook, open Add-ons > Secrets and add your own tokens there.

import os
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient

user_secrets = UserSecretsClient()

login(token=user_secrets.get_secret("HUGGINGFACE_TOKEN"))

os.environ["WANDB_API_KEY"]=user_secrets.get_secret("wandb_key")

# Create a W&B project with a run named sft-microsoft-phi2-on-dialogsum under it
os.environ["WANDB_PROJECT"] = "Supervised-fine-tune-models"
os.environ["WANDB_NOTES"] = "Supervised fine tune models"
os.environ["WANDB_NAME"] = "sft-microsoft-phi2-on-dialogsum"
os.environ["MODEL_NAME"] = "microsoft/phi-2"
os.environ["DATASET_NAME"] = "neil-code/dialogsum-test"

Output:

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful

Estimate the GPU memory needed for fine-tuning:

!accelerate estimate-memory microsoft/phi-2 --library_name transformers

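As a rough sanity check of what that command reports, the weight footprint can be estimated from the parameter count alone. A minimal sketch, assuming phi-2's roughly 2.78B parameters (the total printed later by print_trainable_parameters()):

# Back-of-the-envelope memory needed just to hold phi-2's weights at different precisions.
n_params = 2.78e9  # approximate phi-2 parameter count
for dtype_name, bytes_per_param in [("float32", 4), ("float16", 2), ("int8", 1), ("nf4", 0.5)]:
    print(f"{dtype_name:>8}: {n_params * bytes_per_param / 2**30:.1f} GiB")

This ignores activations, gradients and optimizer state, which is exactly why the QLoRA setup below keeps the frozen weights in 4-bit NF4.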

2. Prepare the Data

DialogSum is a dialogue-summarization dataset of 13,460 dialogues, each human-annotated with a summary and a topic.

2.1 The DialogSum dataset

Load a subset of DialogSum with the Hugging Face datasets library:

from datasets import load_dataset

dataset=load_dataset("neil-code/dialogsum-test")
dataset

Training: 1999, validation: 499, test: 499

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1999
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 499
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 499
    })
})

Each sample has the fields ['id', 'dialogue', 'summary', 'topic']: the dialogue text, its summary, and its topic.

print(dataset["train"][0])

Inspect one sample:

{'id': 'train_0', 
'dialogue': "#Person1#: Hi, Mr. Smith. I'm Doctor Hawkins. Why are you here today?\n#Person2#: I found it would be a good idea to get a check-up.\n#Person1#: Yes, well, you haven't had one for 5 years. You should have one every year.\n#Person2#: I know. I figure as long as there is nothing wrong, why go see the doctor?\n#Person1#: Well, the best way to avoid serious illnesses is to find out about them early. So try to come at least once a year for your own good.\n#Person2#: Ok.\n#Person1#: Let me see here. Your eyes and ears look fine. Take a deep breath, please. Do you smoke, Mr. Smith?\n#Person2#: Yes.\n#Person1#: Smoking is the leading cause of lung cancer and heart disease, you know. You really should quit.\n#Person2#: I've tried hundreds of times, but I just can't seem to kick the habit.\n#Person1#: Well, we have classes and some medications that might help. I'll give you more information before you leave.\n#Person2#: Ok, thanks doctor.", 
'summary': "Mr. Smith's getting a check-up, and Doctor Hawkins advises him to have one every year. Hawkins'll give some information about their classes and medications to help Mr. Smith quit smoking.", 
'topic': 'get a check-up'}

Each turn starts with #PersonX# (the speaker role) and ends with a newline \n:

#Person1#: Hi, Mr. Smith. I'm Doctor Hawkins. Why are you here today?
#Person2#: I found it would be a good idea to get a check-up.
#Person1#: Yes, well, you haven't had one for 5 years. You should have one every year.
#Person2#: I know. I figure as long as there is nothing wrong, why go see the doctor?
#Person1#: Well, the best way to avoid serious illnesses is to find out about them early. So try to come at least once a year for your own good.
#Person2#: Ok.
#Person1#: Let me see here. Your eyes and ears look fine. Take a deep breath, please. Do you smoke, Mr. Smith?
#Person2#: Yes.
#Person1#: Smoking is the leading cause of lung cancer and heart disease, you know. You really should quit.
#Person2#: I've tried hundreds of times, but I just can't seem to kick the habit.
#Person1#: Well, we have classes and some medications that might help. I'll give you more information before you leave.
#Person2#: Ok, thanks doctor.
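If the turns are ever needed individually (for filtering or per-speaker statistics, say), here is a minimal sketch for splitting a dialogue into (speaker, utterance) pairs, relying only on the #PersonX# / newline convention above; split_turns is a hypothetical helper, not part of the notebook:

# Split a DialogSum dialogue into (speaker, utterance) pairs.
def split_turns(dialogue: str):
    turns = []
    for line in dialogue.split("\n"):
        speaker, _, utterance = line.partition(": ")
        turns.append((speaker.strip("#"), utterance))
    return turns

print(split_turns(dataset["train"][0]["dialogue"])[:2])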

[Optional] Reduce the dataset size
To fine-tune within limited compute, the dataset can be shrunk:

smaller_training=dataset['train'].select(range(100))
smaller_validation=dataset['validation'].select(range(80))
smaller_test=dataset['test'].select(range(80))

dataset['train']=smaller_training
dataset['validation']=smaller_validation
dataset['test']=smaller_test

Here the training set is cut from 1999 samples to 100, and validation and test to 80 each.

dataset

Check the result:

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 100
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 80
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 80
    })
})

2.2 Load the tokenizer

from transformers import AutoTokenizer
# see https://github.com/huggingface/transformers/issues/18388 for description about padding
tokenizer=AutoTokenizer.from_pretrained(
    'microsoft/phi-2',
    padding_side='left',
    add_eos_token=True,
    add_bos_token=True,
    use_fast=False
)
tokenizer.pad_token=tokenizer.eos_token

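To see what that configuration actually did, you can print a few attributes. A quick check (the expected values in the comments are illustrative, matching the pad id 50256 that shows up in the collator test below):

# Sanity-check the tokenizer: pad token now equals EOS, padding is on the left,
# and BOS/EOS should be added around encoded text if add_bos_token/add_eos_token took effect.
print(tokenizer.pad_token, tokenizer.pad_token_id)  # expected: <|endoftext|> 50256
print(tokenizer.padding_side)                       # expected: left
print(tokenizer("hello world").input_ids)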

Import DataCollatorForLanguageModeling

Use the end-of-sequence token as the padding token and set mlm to False. With mlm=False the labels are simply a copy of the input_ids (padding positions set to -100); the one-position shift for next-token prediction happens inside the model:

from transformers import DataCollatorForLanguageModeling

data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)

DataCollatorForLanguageModeling efficiently converts a batch of samples (from train_dataset or eval_dataset) into model inputs, dynamically padding each batch to its longest sequence.

It inherits from DataCollatorMixin, which dispatches to the torch, TensorFlow or numpy implementation depending on return_tensors.

  • mlm is set to False (True is for BERT-style masked-language-model pretraining)
  • _torch_collate_batch does the actual batch construction and padding
@dataclass
class DataCollatorForLanguageModeling(DataCollatorMixin):
    """
    Data collator used for language modeling. Inputs are dynamically padded to the maximum length of a batch if they are not all of the same length.

    Args:
        tokenizer ([`PreTrainedTokenizer`] or [`PreTrainedTokenizerFast`]):
            The tokenizer used for encoding the data.
        mlm (`bool`, *optional*, defaults to `True`):
            Whether or not to use masked language modeling. If set to `False`, the labels are the same as the inputs
            with the padding tokens ignored (by setting them to -100). Otherwise, the labels are -100 for non-masked
            tokens and the value to predict for the masked token.
        mlm_probability (`float`, *optional*, defaults to 0.15):
            The probability with which to (randomly) mask tokens in the input, when `mlm` is set to `True`.
        pad_to_multiple_of (`int`, *optional*):
            If set will pad the sequence to a multiple of the provided value.
        return_tensors (`str`):
            The type of Tensor to return. Allowable values are "np", "pt" and "tf".

    <Tip>

    For best performance, this data collator should be used with a dataset having items that are dictionaries or
    BatchEncoding, with the `"special_tokens_mask"` key, as returned by a [`PreTrainedTokenizer`] or a
    [`PreTrainedTokenizerFast`] with the argument `return_special_tokens_mask=True`.

    </Tip>"""

    tokenizer: PreTrainedTokenizerBase
    mlm: bool = True
    mlm_probability: float = 0.15
    pad_to_multiple_of: Optional[int] = None
    tf_experimental_compile: bool = False
    return_tensors: str = "pt"

    def torch_call(self, examples: List[Union[List[int], Any, Dict[str, Any]]]) -> Dict[str, Any]:
        # Handle dict or lists with proper padding and conversion to tensor.
        if isinstance(examples[0], Mapping):
            batch = pad_without_fast_tokenizer_warning(
                self.tokenizer, examples, return_tensors="pt", pad_to_multiple_of=self.pad_to_multiple_of
            )
        else:
            batch = {
                "input_ids": _torch_collate_batch(examples, self.tokenizer, pad_to_multiple_of=self.pad_to_multiple_of)
            }

        # If special token mask has been preprocessed, pop it from the dict.
        special_tokens_mask = batch.pop("special_tokens_mask", None)
        if self.mlm:
            batch["input_ids"], batch["labels"] = self.torch_mask_tokens(
                batch["input_ids"], special_tokens_mask=special_tokens_mask
            )
        else:
            labels = batch["input_ids"].clone()
            if self.tokenizer.pad_token_id is not None:
                labels[labels == self.tokenizer.pad_token_id] = -100
            batch["labels"] = labels
        return batch

def _torch_collate_batch(examples, tokenizer, pad_to_multiple_of: Optional[int] = None):
    """Collate `examples` into a batch, using the information in `tokenizer` for padding if necessary."""
    import torch

    # Tensorize if necessary.
    if isinstance(examples[0], (list, tuple, np.ndarray)):
        examples = [torch.tensor(e, dtype=torch.long) for e in examples]

    length_of_first = examples[0].size(0)

    # Check if padding is necessary.
    are_tensors_same_length = all(x.size(0) == length_of_first for x in examples)
    if are_tensors_same_length and (pad_to_multiple_of is None or length_of_first % pad_to_multiple_of == 0):
        return torch.stack(examples, dim=0)

    # If yes, check if we have a `pad_token`.
    if tokenizer._pad_token is None:
        raise ValueError(
            "You are attempting to pad samples but the tokenizer you are using"
            f" ({tokenizer.__class__.__name__}) does not have a pad token."
        )

    # Creating the full tensor and filling it with our data.
    max_length = max(x.size(0) for x in examples)
    if pad_to_multiple_of is not None and (max_length % pad_to_multiple_of != 0):
        max_length = ((max_length // pad_to_multiple_of) + 1) * pad_to_multiple_of
    result = examples[0].new_full([len(examples), max_length], tokenizer.pad_token_id)
    for i, example in enumerate(examples):
        if tokenizer.padding_side == "right":
            result[i, : example.shape[0]] = example
        else:
            result[i, -example.shape[0] :] = example
    return result

A quick test:

from transformers import DataCollatorForLanguageModeling

data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
features = [tokenizer('实事求是'), tokenizer('没有调查就没发言权')]
print(data_collator(features))
{'input_ids': tensor([[50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 22522, 252, 12859, 233, 162, 109, 224, 42468], [50256, 162, 110, 94, 17312, 231, 164, 108, 225, 162, 253, 98, 22887, 109, 162, 110, 94, 20998, 239, 164, 101, 222, 30266, 225]]), 
'attention_mask': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 
'labels': tensor([[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 22522, 252, 12859, 233, 162, 109, 224, 42468], [-100, 162, 110, 94, 17312, 231, 164, 108, 225, 162, 253, 98, 22887, 109, 162, 110, 94, 20998, 239, 164, 101, 222, 30266, 225]])}

As shown, the two samples are tokenized and padded into a single batch (token 50256 is padded on the left, and the corresponding labels are set to -100).

2.3 Preprocess the data

We need a few preprocessing functions to format the input dataset so that it is suitable for fine-tuning.

Here each dialogue-summary pair is converted into an explicit instruction for the LLM.

from functools import partial
from transformers import set_seed

seed=42
set_seed(seed)

def create_prompt_formats(sample):
    """构造微调的prompt格式: ('instruction','output')"""
    INTRO_BLURB = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
    INSTRUCTION_KEY = "### Instruct: Summarize the below conversation."
    RESPONSE_KEY = "### Output:"
    END_KEY = "### End"
    
    blurb = f"\n{INTRO_BLURB}"
    instruction = f"{INSTRUCTION_KEY}"
    input_context = f"{sample['dialogue']}" if sample["dialogue"] else None
    response = f"{RESPONSE_KEY}\n{sample['summary']}"
    end = f"{END_KEY}"
    
    parts = [part for part in [blurb, instruction, input_context, response, end] if part]

    formatted_prompt = "\n\n".join(parts)
    sample["text"] = formatted_prompt

    return sample

# Get the model's maximum context length
def get_max_length(model):
    conf = model.config
    max_length = None
    for length_setting in ["n_positions", "max_position_embeddings", "seq_length"]:
        max_length = getattr(model.config, length_setting, None)
        if max_length:
            print(f"Found max lenth: {max_length}")
            break
    if not max_length:
        max_length = 1024
        print(f"Using default max length: {max_length}")
    return max_length
# Preprocess a batch of samples (tokenize and truncate to max_length)
def preprocess_batch(batch, tokenizer, max_length):
    """ Tokenizing a batch """
    return tokenizer(batch["text"], max_length=max_length, truncation=True)

# SOURCE https://github.com/databrickslabs/dolly/blob/master/training/trainer.py
def preprocess_dataset(tokenizer: AutoTokenizer, max_length: int,seed, dataset):
    """Format & tokenize it so it is ready for training
    :param tokenizer (AutoTokenizer): Model Tokenizer
    :param max_length (int): Maximum number of tokens to emit from tokenizer
    """
    # Add the instruction prompt to every sample
    print("Preprocessing dataset...")
    dataset = dataset.map(create_prompt_formats)#, batched=True)
    
    # Tokenize each batch with preprocess_batch, then drop the original text columns ('id', 'topic', 'dialogue', 'summary')
    _preprocessing_function = partial(preprocess_batch, max_length=max_length, tokenizer=tokenizer)
    dataset = dataset.map(
        _preprocessing_function,
        batched=True,
        remove_columns=['id', 'topic', 'dialogue', 'summary'],
    )

    # Keep only samples whose input_ids are shorter than max_length
    dataset = dataset.filter(lambda sample: len(sample["input_ids"]) < max_length)
    
    dataset = dataset.shuffle(seed=seed)
    return dataset

Process the datasets (note: max_length is produced by get_max_length(model) in Section 3.1, so the model has to be loaded before these lines are run):

train_dataset = preprocess_dataset(tokenizer, max_length, seed, dataset['train'])
eval_dataset = preprocess_dataset(tokenizer, max_length, seed, dataset['validation'])
train_dataset
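As a quick check (a hypothetical snippet, not in the original notebook), one preprocessed sample can be decoded to confirm the prompt template survived tokenization:

# Inspect one preprocessed training sample: remaining columns and the decoded prompt.
print(train_dataset.column_names)  # expected: ['text', 'input_ids', 'attention_mask']
print(tokenizer.decode(train_dataset[0]["input_ids"])[:300])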

3. Training

3.1 Load the model
Use QLoRA quantization to reduce the GPU memory needed for fine-tuning.

Rules of thumb for the bitsandbytes settings:

  • use double quantization if memory is tight
  • use NF4 for higher precision
  • use a 16-bit float compute dtype to speed up fine-tuning
import torch
from transformers import BitsAndBytesConfig, AutoModelForCausalLM

# bitsandbytes settings (see https://zhuanlan.zhihu.com/p/665601576 for a walkthrough)
# Load the model in 4-bit with NF4 quantization (weights stored in 4 bit)
bnb_config=BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,       # nested (double) quantization
    bnb_4bit_compute_dtype=torch.float16, # compute in 16-bit; the original post misspelled this as
                                          # bnb_4bit_compute_type, which is why the config printed
                                          # below still reports float32 as the compute dtype
    llm_int8_enable_fp32_cpu_offload=True
)
model=AutoModelForCausalLM.from_pretrained(
    'microsoft/phi-2',
    device_map='auto',
    quantization_config=bnb_config,
    # Solving the issue: ValueError: PhiForCausalLM does not support `device_map='auto'`. To implement support, the model class needs to implement the `_no_split_modules` attribute.
    trust_remote_code=True,
#     attn_implementation="flash_attention_2", # Does not be supported in here
    torch_dtype=torch.float16
)

The model checkpoint is about 5.5 GB:

model.safetensors.index.json: 100%
35.7k/35.7k [00:00<00:00, 2.46MB/s]
Downloading shards: 100%
2/2 [00:23<00:00, 10.09s/it]
model-00001-of-00002.safetensors: 100%
5.00G/5.00G [00:20<00:00, 249MB/s]
model-00002-of-00002.safetensors: 100%
564M/564M [00:02<00:00, 240MB/s]
Loading checkpoint shards: 100%
2/2 [00:00<00:00, 2.39it/s]
generation_config.json: 100%
124/124 [00:00<00:00, 9.46kB/s]
adapter_config.json: 100%
617/617 [00:00<00:00, 39.0kB/s]
model.config.quantization_config

BitsAndBytesConfig {
  "bnb_4bit_compute_dtype": "float32",
  "bnb_4bit_quant_type": "nf4",
  "bnb_4bit_use_double_quant": true,
  "llm_int8_enable_fp32_cpu_offload": true,
  "llm_int8_has_fp16_weight": false,
  "llm_int8_skip_modules": null,
  "llm_int8_threshold": 6.0,
  "load_in_4bit": true,
  "load_in_8bit": false,
  "quant_method": "bitsandbytes"
}

Check the model's maximum context length (this is the max_length passed to preprocess_dataset above):

max_length=get_max_length(model)

Found max length: 2048

Call prepare_model_for_kbit_training(model) to make the quantized model trainable with LoRA:

from peft import prepare_model_for_kbit_training

# save memory
model.gradient_checkpointing_enable()
model=prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

model

Model architecture:

PhiForCausalLM(
  (model): PhiModel(
    (embed_tokens): Embedding(51200, 2560)
    (embed_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-31): 32 x PhiDecoderLayer(
        (self_attn): PhiAttention(
          (q_proj): Linear4bit(in_features=2560, out_features=2560, bias=True)
          (k_proj): Linear4bit(in_features=2560, out_features=2560, bias=True)
          (v_proj): Linear4bit(in_features=2560, out_features=2560, bias=True)
          (dense): Linear4bit(in_features=2560, out_features=2560, bias=True)
          (rotary_emb): PhiRotaryEmbedding()
        )
        (mlp): PhiMLP(
          (activation_fn): NewGELUActivation()
          (fc1): Linear4bit(in_features=2560, out_features=10240, bias=True)
          (fc2): Linear4bit(in_features=10240, out_features=2560, bias=True)
        )
        (input_layernorm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
        (resid_dropout): Dropout(p=0.1, inplace=False)
      )
    )
    (final_layernorm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=2560, out_features=51200, bias=True)
)
from peft import LoraConfig, TaskType, get_peft_model

# LoRA rank r = 16
# Target modules: the Q/K/V projections and dense layer of PhiAttention, plus the two feed-forward layers (fc1, fc2) of PhiMLP
peft_config=LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        'q_proj',
        'k_proj',
        'v_proj',
        'dense',
        'fc1',
        'fc2',
    ],
    bias="none",
    lora_dropout=0.05,
    task_type=TaskType.CAUSAL_LM
)

peft_model = get_peft_model(model, peft_config)
peft_model.print_trainable_parameters()

Check the trainable parameters:

trainable params: 23,592,960 || all params: 2,803,276,800 || trainable%: 0.8416207775129448
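The 23,592,960 figure can be reproduced by hand: a LoRA adapter adds r * (d_in + d_out) parameters per targeted linear layer (the two low-rank matrices A and B), and phi-2 has 32 decoder layers with the shapes shown in the model printout above. A small check:

# LoRA adds r * (d_in + d_out) parameters per target module; phi-2 has 32 decoder layers.
r = 16
per_layer = (
    3 * r * (2560 + 2560)   # q_proj, k_proj, v_proj
    + r * (2560 + 2560)     # dense
    + r * (2560 + 10240)    # fc1
    + r * (10240 + 2560)    # fc2
)
print(32 * per_layer)  # 23592960, matching print_trainable_parameters()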

Train with the Trainer:

import time
from transformers import TrainingArguments, Trainer
training_args=TrainingArguments(
    output_dir=os.getenv("WANDB_NAME"),
    overwrite_output_dir=True,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=5,
    gradient_checkpointing=True,  # enable gradient checkpointing
    gradient_checkpointing_kwargs={"use_reentrant": False},
    warmup_steps=50,
    max_steps=100,      # Total number of training steps
    num_train_epochs=2, # number of epochs (ignored here because max_steps is set)
    learning_rate=5e-5, # learning rate
    weight_decay=0.01,  # AdamW weight decay, equivalent to L2 regularization
    optim="paged_adamw_8bit", # Keep the optimizer state and quantize it
#     bf16=True, # Do not supported in Kaggle environment, require Ampere....
    fp16=True, # use fp16 16bit(mixed) precision training instead of 32-bit training.
    logging_dir='./logs',
    logging_strategy="steps",
    logging_steps=10,
    save_strategy="steps",
    save_steps=100,
    save_total_limit=2, # Limit the total number of checkpoints
    do_eval=True,
    evaluation_strategy="steps",
    eval_steps=50,
    load_best_model_at_end=True, # Load the best model at the end of training,
    report_to="wandb",
    run_name=os.getenv("WANDB_NAME")
)

peft_model.config.use_cache=False
trainer=Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator  # builds padded batches from the dataset samples
)
start_time=time.time()
trainer.train()
end_time=time.time()

training_time=end_time-start_time
print(f"Training completed in {training_time} seconds.")

Training status (two T4 GPUs, roughly 10 GB of memory used on each):

-------------------------------[100/100 20:52, Epoch 10/10]
Step	Training Loss	Validation Loss
50	1.409500	1.382088
100	1.281700	1.351633
Training completed in 1266.7061877250671 seconds.

Note: this run only walks through the pipeline; the trained LoRA adapter can be uploaded as needed (the evaluation below loads it from the aisuko/sft-microsoft-phi2-on-dialogsum repo).

4. Evaluation

Delete the variables and free GPU memory:

import gc

del model,peft_model, tokenizer, trainer
gc.collect()

torch.cuda.empty_cache()

Load the tokenizer from the uploaded adapter repo:

eval_tokenizer=AutoTokenizer.from_pretrained(
    'aisuko/'+os.getenv('WANDB_NAME'),
    add_bos_token=True,
    trust_remote_code=True,
    use_fast=False
)

eval_tokenizer.pad_token=eval_tokenizer.eos_token
print(eval_tokenizer)

Load the base model and attach the trained LoRA weights (PeftModel.from_pretrained attaches the adapter; it does not merge the weights into the base model):

from peft import PeftModel
model=AutoModelForCausalLM.from_pretrained(
    os.getenv('MODEL_NAME'),
    device_map='auto',
    trust_remote_code=True,
    torch_dtype=torch.float16
)
eval_model=PeftModel.from_pretrained(
    model, 'aisuko/'+os.getenv('WANDB_NAME'), device_map='auto'
)
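If a single merged model is wanted (for faster inference or export), PEFT's merge_and_unload() folds the LoRA weights into the base model. A minimal sketch, assuming the adapter loaded above and a hypothetical output directory:

# Fold the LoRA weights into the base model and save the result.
merged_model = eval_model.merge_and_unload()
merged_model.save_pretrained('phi2-dialogsum-merged')  # hypothetical path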

Inference with model.generate:

def inference(model, prompt, max_length=200):
    tokens=eval_tokenizer(prompt, return_tensors='pt')
    res=model.generate(
        **tokens.to('cuda'),
        max_new_tokens=max_length,
        do_sample=True,
        num_return_sequences=1,
        temperature=0.1,
        num_beams=1,
        top_p=0.95
    )
    return eval_tokenizer.batch_decode(res, skip_special_tokens=False)

Run dialogue-summarization inference on a test sample:

dialogue=dataset['test'][5]['dialogue']
summary=dataset['test'][5]['summary']

prompt=f'Instruct: Summarize the following conversation.\n{dialogue}\nOutput:\n'

peft_model_res=inference(eval_model, prompt, 100)
peft_model_output=peft_model_res[0].split('Output:\n')[1]

prefix, success, result=peft_model_output.partition('###')

dashline='-'.join('' for x in range(100))
print(prompt)
print(dashline)

Example 1

Instruct: Summarize the following conversation.
#Person1#: You're finally here! What took so long?
#Person2#: I got stuck in traffic again. There was a terrible traffic jam near the Carrefour intersection.
#Person1#: It's always rather congested down there during rush hour. Maybe you should try to find a different route to get home.
#Person2#: I don't think it can be avoided, to be honest.
#Person1#: perhaps it would be better if you started taking public transport system to work.
#Person2#: I think it's something that I'll have to consider. The public transport system is pretty good.
#Person1#: It would be better for the environment, too.
#Person2#: I know. I feel bad about how much my car is adding to the pollution problem in this city.
#Person1#: Taking the subway would be a lot less stressful than driving as well.
#Person2#: The only problem is that I'm going to really miss having the freedom that you have with a car.
#Person1#: Well, when it's nicer outside, you can start biking to work. That will give you just as much freedom as your car usually provides.
#Person2#: That's true. I could certainly use the exercise!
#Person1#: So, are you going to quit driving to work then?
#Person2#: Yes, it's not good for me or for the environment.
Output:

---------------------------------------------------------------------------------------------------

The reference (human-written) summary:

print(summary)
#Person2# complains to #Person1# about the traffic jam, #Person1# suggests quitting driving and taking public transportation instead.

The fine-tuned model's prediction (the prefix variable above):

#Person2# tells #Person1# that they got stuck in traffic again and that it's always congested near the Carrefour intersection during rush hour. #Person1# suggests that #Person2# should consider taking public transport system to work, and #Person2# agrees that it would be better for the environment. #Person1# also suggests that #Person2# could start biking to work when it's nicer outside. #Person2# decides to quit driving to work.
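Beyond eyeballing the two texts, the evaluate package installed earlier can quantify the comparison with ROUGE. A minimal sketch (not in the original post; it additionally needs the rouge_score package installed):

import evaluate

# ROUGE between the model prediction (prefix) and the human reference (summary).
rouge = evaluate.load('rouge')
print(rouge.compute(predictions=[prefix], references=[summary]))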

Example 2

Instruct: Summarize the following conversation.
#Person1#: Ms. Dawson, I need you to take a dictation for me.
#Person2#: Yes, sir...
#Person1#: This should go out as an intra-office memorandum to all employees by this afternoon. Are you ready?
#Person2#: Yes, sir. Go ahead.
#Person1#: Attention all staff... Effective immediately, all office communications are restricted to email correspondence and official memos. The use of Instant Message programs by employees during working hours is strictly prohibited.
#Person2#: Sir, does this apply to intra-office communications only? Or will it also restrict external communications?
#Person1#: It should apply to all communications, not only in this office between employees, but also any outside communications.
#Person2#: But sir, many employees use Instant Messaging to communicate with their clients.
#Person1#: They will just have to change their communication methods. I don't want any - one using Instant Messaging in this office. It wastes too much time! Now, please continue with the memo. Where were we?
#Person2#: This applies to internal and external communications.
#Person1#: Yes. Any employee who persists in using Instant Messaging will first receive a warning and be placed on probation. At second offense, the employee will face termination. Any questions regarding this new policy may be directed to department heads.
#Person2#: Is that all?
#Person1#: Yes. Please get this memo typed up and distributed to all employees before 4 pm.
Output:

This kind of dialogue summarization still suffers from hallucinations, such as fabricating content or altering the source; PEFT, or prompting alone, mostly just fixes the output format.

The appeal of fine-tuning a small LLM is the low memory requirement and easy deployment, and, unlike architectures such as T5-PEGASUS, it can be used out of the box with just a prompt. To actually meet a business target there is still plenty of room for improvement:

  • clean and filter the input data,
  • switch to a stronger base model, add more training data, or even fine-tune all parameters,
  • tune the LoRA hyperparameters 【2】【3】 or modify the method (e.g. LoRAMoE).


