
Phi-2: The surprising power of small language models

Posted: 2024-09-20 21:52:10
Tags: Phi, Llama, shot, power, language, models, benchmarks


https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/

 

Phi-2 Evaluation

Below, we summarize Phi-2's performance on academic benchmarks compared to popular language models. Our benchmarks span several categories: Big Bench Hard (BBH) (3-shot with CoT), commonsense reasoning (PIQA, WinoGrande, ARC easy and challenge, SIQA), language understanding (HellaSwag, OpenBookQA, MMLU (5-shot), SQuADv2 (2-shot), BoolQ), math (GSM8k (8-shot)), and coding (HumanEval, MBPP (3-shot)).

With only 2.7 billion parameters, Phi-2 surpasses the performance of Mistral and Llama-2 models at 7B and 13B parameters on various aggregated benchmarks. Notably, it achieves better performance than the 25x larger Llama-2-70B model on multi-step reasoning tasks, i.e., coding and math. Furthermore, Phi-2 matches or outperforms the recently announced Google Gemini Nano 2, despite being smaller in size.

Of course, we acknowledge the current challenges with model evaluation, and that many public benchmarks might leak into the training data. For our first model, Phi-1, we did an extensive decontamination study to rule out this possibility, which can be found in our first report, “Textbooks Are All You Need.” Ultimately, we believe that the best way to judge a language model is to test it on concrete use cases. In that spirit, we also evaluated Phi-2 using several Microsoft internal proprietary datasets and tasks, comparing it again to Mistral and Llama-2. We observed similar trends, i.e., on average, Phi-2 outperforms Mistral-7B, and the latter outperforms the Llama-2 models (7B, 13B, and 70B).

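| Model   | Size | BBH  | Commonsense Reasoning | Language Understanding | Math | Coding |
|---------|------|------|-----------------------|------------------------|------|--------|
| Llama-2 | 7B   | 40.0 | 62.2                  | 56.7                   | 16.5 | 21.0   |
| Llama-2 | 13B  | 47.8 | 65.0                  | 61.9                   | 34.2 | 25.4   |
| Llama-2 | 70B  | 66.5 | 69.2                  | 67.6                   | 64.1 | 38.3   |
| Mistral | 7B   | 57.2 | 66.4                  | 63.7                   | 46.4 | 39.4   |
| Phi-2   | 2.7B | 59.2 | 68.8                  | 62.0                   | 61.1 | 53.7   |

Table 1. Averaged performance on grouped benchmarks compared to popular open-source SLMs.

| Model         | Size | BBH  | BoolQ | MBPP | MMLU |
|---------------|------|------|-------|------|------|
| Gemini Nano 2 | 3.2B | 42.4 | 79.3  | 27.2 | 55.8 |
| Phi-2         | 2.7B | 59.3 | 83.3  | 59.1 | 56.7 |

Table 2. Comparison between Phi-2 and Gemini Nano 2 Model on Gemini’s reported benchmarks.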
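
To make the comparisons in Table 1 easier to check, the following is a minimal Python sketch that computes per-category score margins and the parameter-count ratio between two models. The scores are copied from Table 1; the dictionary layout and the helper names `margins` and `size_ratio` are illustrative assumptions, not part of the original post.

```python
# Averaged scores copied from Table 1; "params_b" is the model size in
# billions of parameters. The structure and names here are illustrative.
TABLE_1 = {
    "Llama-2-7B":  {"params_b": 7.0,  "BBH": 40.0, "Commonsense": 62.2,
                    "Language": 56.7, "Math": 16.5, "Coding": 21.0},
    "Llama-2-13B": {"params_b": 13.0, "BBH": 47.8, "Commonsense": 65.0,
                    "Language": 61.9, "Math": 34.2, "Coding": 25.4},
    "Llama-2-70B": {"params_b": 70.0, "BBH": 66.5, "Commonsense": 69.2,
                    "Language": 67.6, "Math": 64.1, "Coding": 38.3},
    "Mistral-7B":  {"params_b": 7.0,  "BBH": 57.2, "Commonsense": 66.4,
                    "Language": 63.7, "Math": 46.4, "Coding": 39.4},
    "Phi-2":       {"params_b": 2.7,  "BBH": 59.2, "Commonsense": 68.8,
                    "Language": 62.0, "Math": 61.1, "Coding": 53.7},
}

CATEGORIES = ("BBH", "Commonsense", "Language", "Math", "Coding")


def margins(model: str, baseline: str) -> dict[str, float]:
    """Score of `model` minus score of `baseline`, per benchmark group."""
    return {
        cat: round(TABLE_1[model][cat] - TABLE_1[baseline][cat], 1)
        for cat in CATEGORIES
    }


def size_ratio(larger: str, smaller: str) -> float:
    """How many times more parameters `larger` has than `smaller`."""
    return TABLE_1[larger]["params_b"] / TABLE_1[smaller]["params_b"]


if __name__ == "__main__":
    for baseline in ("Llama-2-7B", "Llama-2-13B", "Mistral-7B", "Llama-2-70B"):
        print(f"Phi-2 vs {baseline}: {margins('Phi-2', baseline)}")
    print(f"Llama-2-70B / Phi-2 size ratio: "
          f"{size_ratio('Llama-2-70B', 'Phi-2'):.1f}x")
```

With the Table 1 values, Llama-2-70B has roughly 25.9 times as many parameters as Phi-2. Phi-2's coding average is 15.4 points higher than Llama-2-70B's, while the math average listed in the table is 3.0 points lower (61.1 vs. 64.1).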

 

From: https://www.cnblogs.com/lightsong/p/18423354
