
Large Model Quantization 3


https://huggingface.co/blog/4bit-transformers-bitsandbytes

 

1. 8-bit float

The FP8 (floating point 8) format has been first introduced in the paper “FP8 for Deep Learning” with two different FP8 encodings: E4M3 (4-bit exponent and 3-bit mantissa) and E5M2 (5-bit exponent and 2-bit mantissa).

 

 The potential floating points that can be represented in the E4M3 format are in the range -448 to 448, whereas in the E5M2 format, as the number of bits of the exponent increases, the range increases to -57344 to 57344 - but with a loss of precision because the number of possible representations remains constant. It has been empirically proven that the E4M3 is best suited for the forward pass, and the second version is best suited for the backward computation

This part is about 8-bit floating point. E4M3 has higher precision, so it suits the forward pass; E5M2 has lower precision and suits the backward pass. This is an empirical conclusion. My understanding is that gradients do not need to be that precise, since gradient descent only needs a roughly correct direction to move in, whereas the forward computation must keep enough precision for the results to be good. This is a fairly trivial point.

https://aijishu.com/a/1060000000385081 explains FP8 in detail.

Two points to note: first, the exponent offset is -7; second, after summing the fractional (mantissa) bits you add an implicit leading 1. A few special cases (e.g. subnormals, NaN) are handled separately.
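To make the offset and the implicit leading 1 concrete, here is a minimal sketch in plain Python that decodes an E4M3 byte by hand (the NaN/subnormal conventions follow the FP8 paper; this is only for illustration, not how bitsandbytes actually operates):

# Minimal sketch: hand-decode an FP8 E4M3 byte (illustration only).
def decode_e4m3(byte: int) -> float:
    sign = -1.0 if (byte >> 7) & 0x1 else 1.0
    exp = (byte >> 3) & 0xF            # 4 exponent bits
    man = byte & 0x7                   # 3 mantissa bits
    if exp == 0xF and man == 0x7:      # this bit pattern is reserved for NaN
        return float("nan")
    if exp == 0:                       # subnormal: no implicit leading 1
        return sign * (man / 8) * 2.0 ** (-6)
    return sign * (1 + man / 8) * 2.0 ** (exp - 7)   # offset -7, implicit leading 1

print(decode_e4m3(0b01111110))   # 448.0, the E4M3 maximum mentioned above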

FP4:  

for example, with 2 exponent bits and one mantissa bit the representation 1101 would be:

-1 * 2^(2) * (1 + 2^-1) = -1 * 4 * 1.5 = -6

1101: the leading 1 is the sign bit (negative), 10 is the exponent (2), and the trailing 1 gives the mantissa factor 1.5.

Overall it is not much different from FP8. This layout is called E2M1; E3M0 is also possible.
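The same arithmetic for the 4-bit case, as a sketch that reproduces the worked example above (to match that example, no exponent offset is applied here; the actual FP4 data type in bitsandbytes may define its layout differently):

# Sketch: decode a 4-bit E2M1 value as sign * 2^exponent * (1 + mantissa/2).
def decode_e2m1(nibble: int) -> float:
    sign = -1.0 if (nibble >> 3) & 0x1 else 1.0
    exp = (nibble >> 1) & 0x3          # 2 exponent bits
    man = nibble & 0x1                 # 1 mantissa bit
    return sign * 2.0 ** exp * (1 + man / 2)

print(decode_e2m1(0b1101))   # -1 * 2^2 * 1.5 = -6.0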

QLoRA:

  More specifically, QLoRA uses 4-bit quantization to compress a pretrained language model. The LM parameters are then frozen and a relatively small number of trainable parameters are added to the model in the form of Low-Rank Adapters. During finetuning, QLoRA backpropagates gradients through the frozen 4-bit quantized pretrained language model into the Low-Rank Adapters. The LoRA layers are the only parameters being updated during training. Read more about LoRA in the original LoRA paper.

  QLoRA is basically the FP4 (4-bit) version of LoRA.

Practical usage:

Advanced usage: the code in this part is very practical.

 

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "EleutherAI/gpt-neox-20b"   # example model id; substitute your own

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # load the weights in 4 bit
    bnb_4bit_quant_type="nf4",               # NF4 (normal float 4) quantization
    bnb_4bit_use_double_quant=True,          # nested / double quantization
    bnb_4bit_compute_dtype=torch.bfloat16,   # computation runs in bfloat16
)

model_nf4 = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=nf4_config)

 

 

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

double_quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,   # nested (double) quantization: also quantizes the quantization constants, saving extra memory
)

model_double_quant = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=double_quant_config)

Tips for choosing parameters:

A rule of thumb is: use double quant if you have problems with memory, use NF4 for higher precision, and use a 16-bit dtype for faster finetuning. For instance in the inference demo, we use nested quantization, bfloat16 compute dtype and NF4 quantization to fit gpt-neo-x-20b (40GB) entirely in 4bit in a single 16GB GPU.

So the recommendation is double quantization ("nested" just means nested, i.e. double), together with the bfloat16 compute dtype and NF4. That is enough.
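A quick way to see what the recommended combination saves is to compare memory footprints. This sketch reuses model_id, nf4_config and model_nf4 from the snippet above and relies on get_memory_footprint(), a standard transformers model method; loading an fp16 copy of a 20B model needs a lot of memory, so treat it as illustrative:

# Compare the 4-bit model against an fp16 baseline (illustrative sketch).
print(model_nf4.get_memory_footprint() / 1e9, "GB for the 4-bit model")

model_fp16 = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
print(model_fp16.get_memory_footprint() / 1e9, "GB for the fp16 baseline")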

Hardware:

  Note that this method is only compatible with GPUs, hence it is not possible to quantize models in 4bit on a CPU. Among GPUs, there should not be any hardware requirement about this method, therefore any GPU could be used to run the 4bit quantization as long as you have CUDA>=11.2 installed. Keep also in mind that the computation is not done in 4bit, the weights and activations are compressed to that format and the computation is still kept in the desired or native dtype.
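To see that only the storage is 4 bit while computation stays in the compute dtype, one can peek at the parameters of the model loaded above; with bitsandbytes the packed 4-bit weights typically show up as torch.uint8 tensors, while layer norms and similar modules keep a higher-precision dtype (a sketch under that assumption):

# Sketch: inspect how the 4-bit model stores its weights.
for name, param in model_nf4.named_parameters():
    if "layers.0." in name:            # only look at the first transformer block
        print(name, tuple(param.shape), param.dtype)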

Training:

Can we train 4bit/8bit models?

It is not possible to perform pure 4bit training on these models. However, you can train these models by leveraging parameter efficient fine tuning methods (PEFT) and train for example adapters on top of them. That is what is done in the paper and is officially supported by the PEFT library from Hugging Face. We also provide a training notebook and recommend users to check the QLoRA repository if they are interested in replicating the results from the paper.

Training is possible with PEFT.
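A minimal sketch of what training adapters on top of a frozen 4-bit model looks like with the PEFT library; the LoRA hyperparameters and the target_modules name are assumed for a GPT-NeoX-style model, not taken from the paper:

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Prepare the quantized model for k-bit training (casts norms to fp32, enables input grads, etc.).
model = prepare_model_for_kbit_training(model_nf4)

lora_config = LoraConfig(
    r=16,                                  # adapter rank (assumed value)
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["query_key_value"],    # assumed module name for GPT-NeoX-style attention
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # only the LoRA parameters are trainable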

Application outlook:

  In RLHF (Reinforcement Learning with Human Feedback) it is possible to load a single base model, in 4bit and train multiple adapters on top of it, one for the reward modeling, and another for the value policy training. A more detailed blogpost and announcement will be made soon about this use case.

  In RLHF, the same base model can be loaded once (in 4 bit) and fitted with two different adapter heads: one serving as the reward model and one as the policy model.
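A rough sketch of that multi-adapter idea, assuming PEFT's add_adapter / set_adapter API and reusing lora_config from the training sketch above; this is not the official RLHF recipe the blog refers to:

# Sketch: one frozen 4-bit base model, two independent LoRA adapters on top.
model = get_peft_model(model_nf4, lora_config, adapter_name="policy")
model.add_adapter("reward", lora_config)

model.set_adapter("policy")   # activate the policy adapter for policy updates
# ... policy training step ...
model.set_adapter("reward")   # switch to the reward-model adapter
# ... reward-model forward pass ...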

 

From: https://www.cnblogs.com/zhangbo2008/p/17740319.html
