Test Script
PRE_SEQ_LEN=64
CHECKPOINT=dsbtpg-chatglm-6b-pt-64-2e-2
STEP=500
CUDA_VISIBLE_DEVICES=0 python3 main.py \
--do_predict \
--validation_file devVX.json \
--test_file devVX.json \
--overwrite_cache \
--prompt_column content \
--response_column summary \
--model_name_or_path /home/lyc/workspace/ChatGLM-6B/chatglm-6b \
--ptuning_checkpoint ./output/$CHECKPOINT/checkpoint-$STEP \
--output_dir ./output/$CHECKPOINT \
--overwrite_output_dir \
--max_source_length 64 \
--max_target_length 64 \
--per_device_eval_batch_size 1 \
--predict_with_generate \
--pre_seq_len $PRE_SEQ_LEN \
--quantization_bit 8
Test Process
99%|████████████████████████████████████████████████████████████████████████████████▍| 139/140 [01:50<00:00, 1.27it/s][INFO|configuration_utils.py:575] 2024-05-21 13:41:44,210 >> Generate config GenerationConfig {
"_from_model_config": true,
"bos_token_id": 130004,
"eos_token_id": 130005,
"pad_token_id": 3,
"transformers_version": "4.27.1"
}
100%|█████████████████████████████████████████████████████████████████████████████████| 140/140 [01:51<00:00, 1.27it/s]Building prefix dict from the default dictionary ...
05/21/2024 13:41:45 - DEBUG - jieba - Building prefix dict from the default dictionary ...
Dumping model to file cache /tmp/jieba.cache
05/21/2024 13:41:45 - DEBUG - jieba - Dumping model to file cache /tmp/jieba.cache
Loading model cost 0.680 seconds.
05/21/2024 13:41:45 - DEBUG - jieba - Loading model cost 0.680 seconds.
Prefix dict has been built successfully.
05/21/2024 13:41:45 - DEBUG - jieba - Prefix dict has been built successfully.
100%|█████████████████████████████████████████████████████████████████████████████████| 140/140 [01:51<00:00, 1.25it/s]
***** predict metrics *****
predict_bleu-4 = 76.3107
predict_rouge-1 = 83.1915
predict_rouge-2 = 77.6409
predict_rouge-l = 91.1686
predict_runtime = 0:01:53.47
predict_samples = 140
predict_samples_per_second = 1.234
predict_steps_per_second = 1.234
The evaluate routine in main.py assesses how well the fine-tuned model performs on a specified dataset. It uses two classic metrics, BLEU and ROUGE: BLEU computes similarity by comparing the n-gram overlap between the machine output and the human reference translation, while ROUGE evaluates summary quality by measuring the recall of words or phrases from the reference. For both metrics, a higher score indicates better model quality.
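To make these two metrics concrete, here is a minimal pure-Python sketch of sentence-level BLEU-4 and ROUGE-L. Note these are simplified stand-ins for intuition only, not the exact implementation in main.py (which, as the log above shows, tokenizes Chinese text with jieba and uses dedicated metric packages):

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(candidate, reference):
    """Sentence-level BLEU-4: geometric mean of 1..4-gram precisions
    times a brevity penalty (no smoothing, illustration only)."""
    precisions = []
    for n in range(1, 5):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipped matches
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty: punish candidates shorter than the reference
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / 4)

def rouge_l(candidate, reference):
    """ROUGE-L recall: longest common subsequence length / reference length."""
    m, n = len(candidate), len(reference)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if candidate[i] == reference[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n] / n if n else 0.0

cand = "the cat sat on the mat".split()
ref = "the cat sat on the mat".split()
print(bleu4(cand, ref))    # identical sentences score 1.0
print(rouge_l(cand, ref))  # 1.0
```

Perfectly matching predictions score 1.0 on both metrics (the percentages in the metrics dump above are these values scaled by 100), which is why the near-99 re-run scores below indicate predictions almost identical to the references.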
Note: due to a training issue, the model was retrained; the numbers above are shown only to illustrate the output format. The re-run results were:
***** predict metrics *****
predict_bleu-4 = 99.3069
predict_rouge-1 = 99.449
predict_rouge-2 = 99.3863
predict_rouge-l = 99.7142
predict_runtime = 0:02:19.46
predict_samples = 168
predict_samples_per_second = 1.205
predict_steps_per_second = 1.205
References
[1] Understanding and Applying BLEU and ROUGE, the Evaluation Metrics for Machine Translation and Automatic Summarization, Baidu Developer Center (baidu.com)
From: https://www.cnblogs.com/yichengliu0219/p/18264222