Abstract
Evaluating the general abilities of foundation models to tackle human-level tasks is
a vital aspect of their development and application in the pursuit of Artificial General Intelligence (AGI). Traditional benchmarks, which rely on artificial datasets,
may not accurately represent human-level capabilities. In this paper, we introduce
AGIEval, a novel benchmark specifically designed to assess foundation models in
the context of human-centric standardized exams, such as college entrance exams,
law school admission tests, math competitions, and lawyer qualification tests. We
evaluate several state-of-the-art foundation models, including GPT-4, ChatGPT,
and Text-Davinci-003, using this benchmark. Impressively, GPT-4 surpasses average human performance on SAT, LSAT, and math competitions, attaining a 95%
accuracy rate on the SAT Math test and a 92.5% accuracy on the English test
of the Chinese national college entrance exam. This demonstrates the extraordinary performance of contemporary foundation models. In contrast, we also find
that GPT-4 is less proficient in tasks that require complex reasoning or specific
domain knowledge. Our comprehensive analyses of model capabilities (understanding, knowledge, reasoning, and calculation) reveal these models’ strengths and
limitations, providing valuable insights into future directions for enhancing their
general capabilities. By concentrating on tasks pertinent to human cognition and
decision-making, our benchmark delivers a more meaningful and robust evaluation
of foundation models’ performance in real-world scenarios.
6 Conclusion
In this paper, we introduce AGIEval, a novel benchmark specifically designed to assess the general
capabilities of large foundation models with respect to human-level cognition. The benchmark
comprises high-quality official admission tests, qualification exams, and advanced competitions
tailored for human participants, including law school admission tests and college entrance examinations. These assessments establish officially recognized standards for gauging human capabilities,
making them well-suited for evaluating foundation models in the context of human-centric tasks.
Additionally, AGIEval incorporates bilingual tasks in both Chinese and English, offering a more
comprehensive assessment of model behavior. We have carried out an extensive evaluation of three
cutting-edge large foundation models: Text-Davinci-003, ChatGPT, and GPT-4, using AGIEval.
Remarkably, GPT-4 surpasses average human performance on the LSAT, SAT, and math competitions,
attaining a 95% accuracy rate on the SAT Math test and a 92.5% accuracy on the Gaokao English
test, demonstrating the impressive performance of contemporary foundation models. Despite their
significant achievements, our in-depth manual analyses also reveal the limitations of these large
language models in terms of understanding, knowledge utilization, reasoning, and calculation. Guided
by these findings, we explore potential future research avenues in this domain. By assessing these
foundation models on human-centric tasks and probing their capabilities more deeply, we strive to
foster the development of models that are more closely aligned with human cognition. Ultimately, this
will enable them to tackle a broader range of intricate, human-centric tasks with increased accuracy
and reliability.