AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models

Date: 2023-05-23 14:31:34

Tags: AGIEval, Foundation, Centric, foundation models, capabilities, tasks, human, GPT

Abstract

Evaluating the general abilities of foundation models to tackle human-level tasks is a vital aspect of their development and application in the pursuit of Artificial General Intelligence (AGI). Traditional benchmarks, which rely on artificial datasets, may not accurately represent human-level capabilities. In this paper, we introduce AGIEval, a novel benchmark specifically designed to assess foundation models in the context of human-centric standardized exams, such as college entrance exams, law school admission tests, math competitions, and lawyer qualification tests. We evaluate several state-of-the-art foundation models, including GPT-4, ChatGPT, and Text-Davinci-003, using this benchmark. Impressively, GPT-4 surpasses average human performance on SAT, LSAT, and math competitions, attaining a 95% accuracy rate on the SAT Math test and a 92.5% accuracy on the English test of the Chinese national college entrance exam. This demonstrates the extraordinary performance of contemporary foundation models. In contrast, we also find that GPT-4 is less proficient in tasks that require complex reasoning or specific domain knowledge. Our comprehensive analyses of model capabilities (understanding, knowledge, reasoning, and calculation) reveal these models’ strengths and limitations, providing valuable insights into future directions for enhancing their general capabilities. By concentrating on tasks pertinent to human cognition and decision-making, our benchmark delivers a more meaningful and robust evaluation of foundation models’ performance in real-world scenarios.

6 Conclusion

In this paper, we introduce AGIEval, a novel benchmark specifically designed to assess the general capabilities of large foundation models with respect to human-level cognition. The benchmark comprises high-quality official admission tests, qualification exams, and advanced competitions tailored for human participants, including law school admission tests and college entrance examinations. These assessments establish officially recognized standards for gauging human capabilities, making them well-suited for evaluating foundation models in the context of human-centric tasks. Additionally, AGIEval incorporates bilingual tasks in both Chinese and English, offering a more comprehensive assessment of model behavior. We have carried out an extensive evaluation of three cutting-edge large foundation models: Text-Davinci-003, ChatGPT, and GPT-4, using AGIEval. Remarkably, GPT-4 surpasses average human performance on LSAT, SAT, and math competitions, attaining a 95% accuracy rate on the SAT Math test and a 92.5% accuracy on the Gaokao English test, demonstrating the impressive performance of contemporary foundation models. Despite their significant achievements, our in-depth manual analyses also reveal the limitations of these large language models in terms of understanding, knowledge utilization, reasoning, and calculation. Guided by these findings, we explore potential future research avenues in this domain. By assessing these foundation models on human-centric tasks and probing their capabilities more deeply, we strive to foster the development of models that are more closely aligned with human cognition. Ultimately, this will enable them to tackle a broader range of intricate, human-centric tasks with increased accuracy and reliability.

From: https://blog.51cto.com/u_14897897/6331667