Phi-2: The surprising power of small language models
https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/
Phi-2 Evaluation
Below, we summarize Phi-2's performance on academic benchmarks compared to popular language models. Our benchmarks span several categories, namely, Big Bench Hard (BBH) (3-shot with CoT), commonsense reasoning (PIQA, WinoGrande, ARC easy and challenge, SIQA), language understanding (HellaSwag, OpenBookQA, MMLU (5-shot), SQuADv2 (2-shot), BoolQ), math (GSM8k (8-shot)), and coding (HumanEval, MBPP (3-shot)).
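To make the few-shot setup concrete, below is a minimal sketch of running a k-shot prompt against the public Hugging Face checkpoint "microsoft/phi-2". The prompt template and the toy questions are illustrative assumptions; the post does not publish its evaluation harness.

```python
# A minimal few-shot evaluation sketch against the public Hugging Face
# checkpoint "microsoft/phi-2". The prompt template and the toy questions
# below are illustrative assumptions; the post does not publish its harness.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "microsoft/phi-2"  # public checkpoint (assumed here)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

def few_shot_prompt(examples, question):
    """Concatenate k solved examples before the test question (k-shot)."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\n\nQ: {question}\nA:"

# Two hypothetical solved examples, standing in for real benchmark shots.
examples = [
    ("What is 15 + 27?", "42"),
    ("A box holds 12 eggs. How many eggs are in 3 boxes?", "36"),
]
prompt = few_shot_prompt(examples, "What is 7 * 6?")

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=32, do_sample=False)

# Decode only the newly generated tokens (the model's answer).
answer = tokenizer.decode(
    output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(answer)
```

A real harness would loop this over a benchmark's test set and score the decoded answers against references; this sketch shows only a single query.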
With only 2.7 billion parameters, Phi-2 surpasses the performance of Mistral and Llama-2 models at 7B and 13B parameters on various aggregated benchmarks. Notably, it achieves better performance than the 25x larger Llama-2-70B model on multi-step reasoning tasks, i.e., coding and math. Furthermore, Phi-2 matches or outperforms the recently announced Google Gemini Nano 2, despite being smaller.
Of course, we acknowledge the current challenges with model evaluation, and that many public benchmarks might leak into the training data. For our first model, Phi-1, we did an extensive decontamination study to rule out this possibility, which can be found in our first report, “Textbooks Are All You Need.” Ultimately, we believe that the best way to judge a language model is to test it on concrete use cases. In that spirit, we also evaluated Phi-2 using several Microsoft internal proprietary datasets and tasks, comparing it again to Mistral and Llama-2. We observed similar trends, i.e., on average, Phi-2 outperforms Mistral-7B, which in turn outperforms the Llama-2 models (7B, 13B, and 70B).
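For readers unfamiliar with decontamination, the sketch below shows the general n-gram-overlap technique such studies rely on: flag any training document that shares a long token sequence with a benchmark test example. The 13-token window and the exact-match criterion are assumptions for illustration, not the settings reported for Phi-1 in “Textbooks Are All You Need.”

```python
# A sketch of n-gram-overlap decontamination, the general technique behind
# studies like the one mentioned above. The 13-token window and the
# exact-match criterion are assumptions for illustration, not the settings
# reported for Phi-1.
def ngrams(text: str, n: int = 13) -> set:
    """All whitespace-tokenized n-grams of a text, lowercased."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(train_doc: str, test_example: str, n: int = 13) -> bool:
    """Flag a training document that shares any n-gram with a test example."""
    return bool(ngrams(train_doc, n) & ngrams(test_example, n))
```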
| Model   | Size | BBH  | Commonsense Reasoning | Language Understanding | Math | Coding |
|---------|------|------|-----------------------|------------------------|------|--------|
| Llama-2 | 7B   | 40.0 | 62.2                  | 56.7                   | 16.5 | 21.0   |
| Llama-2 | 13B  | 47.8 | 65.0                  | 61.9                   | 34.2 | 25.4   |
| Llama-2 | 70B  | 66.5 | 69.2                  | 67.6                   | 64.1 | 38.3   |
| Mistral | 7B   | 57.2 | 66.4                  | 63.7                   | 46.4 | 39.4   |
| Phi-2   | 2.7B | 59.2 | 68.8                  | 62.0                   | 61.1 | 53.7   |

Table 1. Averaged performance on grouped benchmarks compared to popular open-source SLMs.
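As a worked illustration, a grouped score in Table 1 could be computed as below, assuming each column is the unweighted mean of its constituent benchmarks as listed in the evaluation description above (the post does not state the weighting).

```python
# How a grouped score in Table 1 could be computed, assuming each column is
# the unweighted mean of its constituent benchmarks as listed in the
# evaluation description above (the post does not state the weighting).
CATEGORIES = {
    "Commonsense Reasoning": ["PIQA", "WinoGrande", "ARC-easy", "ARC-challenge", "SIQA"],
    "Language Understanding": ["HellaSwag", "OpenBookQA", "MMLU", "SQuADv2", "BoolQ"],
    "Math": ["GSM8k"],
    "Coding": ["HumanEval", "MBPP"],
}

def category_score(per_benchmark_scores: dict, category: str) -> float:
    """Unweighted mean over the benchmarks grouped under one category."""
    names = CATEGORIES[category]
    return sum(per_benchmark_scores[b] for b in names) / len(names)
```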
| Model         | Size | BBH  | BoolQ | MBPP | MMLU |
|---------------|------|------|-------|------|------|
| Gemini Nano 2 | 3.2B | 42.4 | 79.3  | 27.2 | 55.8 |
| Phi-2         | 2.7B | 59.3 | 83.3  | 59.1 | 56.7 |

Table 2. Comparison between Phi-2 and Gemini Nano 2 on Gemini’s reported benchmarks.