A few notes up front
Many issues are still unclear, and I am still adjusting things.
What is known so far:
I am running on 8x RTX 3090 GPUs.
Training uses DeepSpeed ZeRO-3; the DeepSpeed ZeRO-3 configuration is below.
{
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },

  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },

  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  },

  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },

  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 2000,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
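The "auto" fields are filled in at launch time by the HuggingFace Trainer / DeepSpeed integration from the training arguments, so they do not need to be set by hand. Since the JSON above was reconstructed from a flattened dump, it is worth a quick check that the file parses and that ZeRO stage 3 is really set before launching. A minimal Python sketch, assuming the file is saved as deepspeed.json (the name passed to --deepspeed in the command below):

import json

# Make sure the config is valid JSON and ZeRO stage 3 is enabled.
with open("deepspeed.json") as f:
    ds_config = json.load(f)

assert ds_config["zero_optimization"]["stage"] == 3
assert ds_config["fp16"]["enabled"] == "auto"  # resolved from --fp16 True at launch
print("deepspeed.json parses and ZeRO-3 is enabled")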
This is the launch command.
One thing that is already known: per_device_train_batch_size has to be increased (see the batch-size arithmetic sketched after the command).
deepspeed --num_gpus=8 src/train_bash.py \
    --stage sft \
    --model_name_or_path /hy-tmp/tigerbot-70b-chat-v4-4k \
    --do_train True \
    --finetuning_type lora \
    --template tigerbot \
    --dataset_dir data \
    --dataset self_cognition_golden \
    --cutoff_len 1024 \
    --learning_rate 0.01 \
    --num_train_epochs 1.0 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --save_steps 100 \
    --lora_rank 256 \
    --lora_dropout 0.1 \
    --lora_target q_proj,v_proj \
    --output_dir saves \
    --fp16 True \
    --plot_loss True \
    --deepspeed deepspeed.json
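With these flags, the "auto" batch-size fields in the DeepSpeed config are resolved from the launcher arguments; DeepSpeed requires that train_batch_size equal train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size. A quick sketch of the arithmetic for this run, with values taken from the flags above:

# Effective batch size implied by the launch flags.
world_size = 8            # --num_gpus=8
micro_batch_per_gpu = 4   # --per_device_train_batch_size 4
grad_accum_steps = 1      # --gradient_accumulation_steps 1

# DeepSpeed enforces: train_batch_size ==
#   train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size
train_batch_size = micro_batch_per_gpu * grad_accum_steps * world_size
print(train_batch_size)   # 32

So raising --per_device_train_batch_size (or --gradient_accumulation_steps) directly raises the effective batch size that DeepSpeed trains with.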