Basic Tasks
Evaluating an API Model
Create a conda environment for the evaluation:
conda create -n opencompass python=3.10
conda activate opencompass
cd /root
git clone -b 0.3.3 https://github.com/open-compass/opencompass
cd opencompass
pip install -e .
Set the API key:
export INTERNLM_API_KEY=xxxxxxxxxxxxxxxxxxxxxxx # fill in the API key you applied for
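Before starting a run, it can help to confirm that the variable is actually visible to Python. The following is a minimal sketch, assuming the key is read from `INTERNLM_API_KEY` as exported above; the helper name `require_api_key` is hypothetical and not part of OpenCompass.

```python
import os


def require_api_key(name="INTERNLM_API_KEY", env=None):
    """Return the API key stored under `name`, or raise if it is missing or blank."""
    # `env` can be any mapping; it defaults to the process environment.
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the evaluation")
    return key
```

Calling `require_api_key()` at the top of a script fails early with a readable message instead of a rejected request later in the run.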
Configure the model:
Configure the dataset:
Run the evaluation:
python run.py --models puyu_api.py --datasets demo_cmmlu_chat_gen.py --debug
Evaluating a Local Model
Install the required packages:
cd /root/opencompass
conda activate opencompass
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=12.1 -c pytorch -c nvidia -y
apt-get update
apt-get install cmake
pip install protobuf==4.25.3
pip install huggingface-hub==0.23.2
Copy the dataset into the project directory and unzip it:
cp /share/temp/datasets/OpenCompassData-core-20231110.zip /root/opencompass/
unzip OpenCompassData-core-20231110.zip
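To see what the archive contains before or after unzipping, its top-level entries can be listed from Python. This is a minimal sketch using only the standard library; the helper name `top_level_entries` is hypothetical, and the layout of the archive is not assumed here.

```python
import zipfile


def top_level_entries(zip_path):
    """Return the sorted top-level names (files or directories) inside a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        # Keep only the first path component of every member name.
        return sorted({name.split("/", 1)[0] for name in archive.namelist() if name})


# Example:
# print(top_level_entries("/root/opencompass/OpenCompassData-core-20231110.zip"))
```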
Load the local model for evaluation
Start the evaluation:
python run.py --datasets ceval_gen --models hf_internlm2_5_1_8b_chat --debug
Results:
Problem 1: 'torch' has no attribute 'float8_e4m3fnuz'
Solution: downgrade transformers with pip install transformers==4.39.3
Problem 2: ModuleNotFoundError: No module named 'rouge'
Solution: run pip uninstall rouge, then reinstall it with pip install rouge==1.0.1
Problem 3: ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject
Solution: downgrade numpy with pip install numpy==1.21
Some other packages may also turn out to be missing; install them with pip as they come up.
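The three fixes above all come down to pinning package versions, so a quick check of what is installed can save a failed run. The sketch below is a minimal example using the standard-library `importlib.metadata`; the `PINS` table simply restates the versions mentioned above, and the helper names are hypothetical.

```python
from importlib import metadata

# Versions taken from the fixes above; adjust them if your environment differs.
PINS = {"transformers": "4.39.3", "rouge": "1.0.1", "numpy": "1.21"}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Return {package: (found, wanted)} for every package that does not match its pin."""
    mismatches = {}
    for package, wanted in pins.items():
        found = lookup(package)
        # A pin such as "1.21" also accepts patch releases like "1.21.0".
        matches = found is not None and (found == wanted or found.startswith(wanted + "."))
        if not matches:
            mismatches[package] = (found, wanted)
    return mismatches


# Example:
# for name, (found, wanted) in find_mismatches(PINS).items():
#     print(f"{name}: found {found}, expected {wanted}")
```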
Advanced Tasks
Subjective Evaluation
Configuration file
cd /root/opencompass/configs/
touch eval_zhuguan_demo.py
Paste in the following code:
from mmengine.config import read_base

with read_base():
    from opencompass.configs.datasets.subjective.alignbench.alignbench_judgeby_critiquellm import alignbench_datasets

from opencompass.models import HuggingFacewithChatTemplate, OpenAISDK
from opencompass.partitioners import NaivePartitioner, SizePartitioner
from opencompass.partitioners.sub_naive import SubjectiveNaivePartitioner
from opencompass.partitioners.sub_size import SubjectiveSizePartitioner
from opencompass.partitioners.sub_num_worker import SubjectiveNumWorkerPartitioner
from opencompass.runners import LocalRunner
from opencompass.runners import SlurmSequentialRunner
from opencompass.tasks import OpenICLInferTask
from opencompass.tasks.subjective_eval import SubjectiveEvalTask
from opencompass.summarizers import SubjectiveSummarizer

api_meta_template = dict(
    round=[
        dict(role='HUMAN', api_role='HUMAN'),
        dict(role='BOT', api_role='BOT', generate=True),
    ]
)

# Model under evaluation
models = [
    dict(
        type=HuggingFacewithChatTemplate,
        abbr='internlm2_5-1_8b-chat-hf',
        path='/share/new_models/Shanghai_AI_Laboratory/internlm2_5-1_8b-chat/',
        max_out_len=2048,
        batch_size=8,
        run_cfg=dict(num_gpus=1),
    )
]

datasets = [*alignbench_datasets]

infer = dict(
    partitioner=dict(type=NaivePartitioner),
    runner=dict(type=LocalRunner, max_num_workers=16, task=dict(type=OpenICLInferTask)),
)

# Judge model that scores the answers
judge_models = [dict(
    type=OpenAISDK,
    path='qwen-turbo',  # name of the model used as the judge
    key='sk-xxxxxxxxxxxxxxxxxxxxxxx',  # fill in your API key
    openai_api_base='https://dashscope.aliyuncs.com/compatible-mode/v1',
    meta_template=api_meta_template,
    query_per_second=16,
    max_out_len=2048,
    max_seq_len=2048,
    batch_size=8,
    temperature=0,
)]

eval = dict(
    partitioner=dict(type=SubjectiveNaivePartitioner, models=models, judge_models=judge_models,),
    runner=dict(type=LocalRunner, max_num_workers=16, task=dict(type=SubjectiveEvalTask)),
)

summarizer = dict(type=SubjectiveSummarizer, function='subjective')
work_dir = 'outputs/subjective/'
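If you do not have an OpenAI API key, or do not want to pay for a subscription, you can substitute another provider's API, such as Claude, Llama, or Tongyi Qianwen (Qwen). I used qwen-turbo here. Because this model acts as the judge, pick the most capable large model you can.
For more details, see the subjective evaluation guide and the eval_subjective.py file in the configs folder.
Run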
cd /root/opencompass/
python run.py configs/eval_zhuguan_demo.py --debug
The run takes a long time, roughly 2 hours.
The results are as follows:
The output file written at the end of the run looks like this:
Deploying the Local Model as an API Service and Evaluating It
Install the packages and deploy the model:
pip install lmdeploy==0.6.1 openai==1.52.0
lmdeploy serve api_server /share/new_models/Shanghai_AI_Laboratory/internlm2_5-1_8b-chat/ --server-port 23333
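The server needs some time to load the model before it can answer requests. One rough way to tell when it is reachable is to poll the port until a TCP connection succeeds. This is a minimal sketch using only the standard library; `wait_for_port` is a hypothetical helper, and an open port is only an approximate signal that the service is ready.

```python
import socket
import time


def wait_for_port(host, port, timeout=300.0, interval=2.0):
    """Poll until a TCP connection to (host, port) succeeds; return False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False


# Example, matching the --server-port value used above:
# wait_for_port("127.0.0.1", 23333)
```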
Open a new terminal, create a Python file, and paste in the following code:
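A minimal sketch of such a client script is shown here, assuming the server from the previous step is listening on port 23333 and was started without API-key checks, so any placeholder key is accepted. It uses the `openai` package installed above to ask the server which model name it registered; the helper names `base_url` and `first_model_name` are hypothetical.

```python
def base_url(host="127.0.0.1", port=23333):
    """Build the OpenAI-compatible base URL for the locally deployed server."""
    return f"http://{host}:{port}/v1"


def first_model_name(url, api_key="sk-placeholder"):
    """Return the id of the first model reported by the server."""
    # Imported lazily so the URL helper works even without the openai package.
    from openai import OpenAI

    client = OpenAI(api_key=api_key, base_url=url)
    return client.models.list().data[0].id


# Example (requires the server to be running):
# print(first_model_name(base_url()))
```

The printed model name is the value to use when pointing an evaluation config at this service.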
The output is as follows:
Create the config script /root/opencompass/configs/models/hf_internlm/hf_internlm2_5_1_8b_chat_api.py
Run the evaluation:
opencompass --models hf_internlm2_5_1_8b_chat_api --datasets ceval_gen --debug
Results