Phi-2: The surprising power of small language models
https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/
Phi-2 Evaluation
Below, we summarize Phi-2's performance on academic benchmarks compared to popular language models. Our benchmarks span several categories, namely, Big Bench Hard (BBH) (3-shot with CoT), commonsense reasoning (PIQA, WinoGrande, ARC easy and challenge, SIQA), language understanding (HellaSwag, OpenBookQA, MMLU (5-shot), SQuADv2 (2-shot), BoolQ), math (GSM8k (8-shot)), and coding (HumanEval, MBPP (3-shot)).
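To make the few-shot setup concrete, below is a minimal sketch of running a k-shot prompt against the public Hugging Face checkpoint "microsoft/phi-2". The prompt template and the toy questions are illustrative assumptions; the post does not publish its evaluation harness.

```python
# A minimal few-shot evaluation sketch against the public Hugging Face
# checkpoint "microsoft/phi-2". The prompt template and the toy questions
# below are illustrative assumptions; the post does not publish its harness.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "microsoft/phi-2"  # public checkpoint (assumed here)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

def few_shot_prompt(examples, question):
    """Concatenate k solved examples before the test question (k-shot)."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\n\nQ: {question}\nA:"

# Two hypothetical solved examples, standing in for real benchmark shots.
examples = [
    ("What is 15 + 27?", "42"),
    ("A box holds 12 eggs. How many eggs are in 3 boxes?", "36"),
]
prompt = few_shot_prompt(examples, "What is 7 * 6?")

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=32, do_sample=False)

# Decode only the newly generated tokens (the model's answer).
answer = tokenizer.decode(
    output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(answer)
```

A real harness would loop this over a benchmark's test set and score the decoded answers against references; this sketch shows only a single query.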
With only 2.7 billion parameters, Phi-2 surpasses the performance of Mistral and Llama-2 models at 7B and 13B parameters on various aggregated benchmarks. Notably, it achieves better performance than the 25x larger Llama-2-70B model on multi-step reasoning tasks, i.e., coding and math. Furthermore, Phi-2 matches or outperforms the recently announced Google Gemini Nano 2, despite being smaller.
Of course, we acknowledge the current challenges with model evaluation, and that many public benchmarks might leak into the training data. For our first model, Phi-1, we did an extensive decontamination study to rule out this possibility, which can be found in our first report, “Textbooks Are All You Need.” Ultimately, we believe that the best way to judge a language model is to test it on concrete use cases. In that spirit, we also evaluated Phi-2 using several Microsoft internal proprietary datasets and tasks, comparing it again to Mistral and Llama-2. We observed similar trends, i.e., on average, Phi-2 outperforms Mistral-7B, which in turn outperforms the Llama-2 models (7B, 13B, and 70B).
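For readers unfamiliar with decontamination, the sketch below shows the general n-gram-overlap technique such studies rely on: flag any training document that shares a long token sequence with a benchmark test example. The 13-token window and the exact-match criterion are assumptions for illustration, not the settings reported for Phi-1 in “Textbooks Are All You Need.”

```python
# A sketch of n-gram-overlap decontamination, the general technique behind
# studies like the one mentioned above. The 13-token window and the
# exact-match criterion are assumptions for illustration, not the settings
# reported for Phi-1.
def ngrams(text: str, n: int = 13) -> set:
    """All whitespace-tokenized n-grams of a text, lowercased."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(train_doc: str, test_example: str, n: int = 13) -> bool:
    """Flag a training document that shares any n-gram with a test example."""
    return bool(ngrams(train_doc, n) & ngrams(test_example, n))
```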
| Model   | Size | BBH  | Commonsense Reasoning | Language Understanding | Math | Coding |
|---------|------|------|-----------------------|------------------------|------|--------|
| Llama-2 | 7B   | 40.0 | 62.2                  | 56.7                   | 16.5 | 21.0   |
| Llama-2 | 13B  | 47.8 | 65.0                  | 61.9                   | 34.2 | 25.4   |
| Llama-2 | 70B  | 66.5 | 69.2                  | 67.6                   | 64.1 | 38.3   |
| Mistral | 7B   | 57.2 | 66.4                  | 63.7                   | 46.4 | 39.4   |
| Phi-2   | 2.7B | 59.2 | 68.8                  | 62.0                   | 61.1 | 53.7   |

Table 1. Averaged performance on grouped benchmarks compared to popular open-source SLMs.
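As a worked illustration, a grouped score in Table 1 could be computed as below, assuming each column is the unweighted mean of its constituent benchmarks as listed in the evaluation description above (the post does not state the weighting).

```python
# How a grouped score in Table 1 could be computed, assuming each column is
# the unweighted mean of its constituent benchmarks as listed in the
# evaluation description above (the post does not state the weighting).
CATEGORIES = {
    "Commonsense Reasoning": ["PIQA", "WinoGrande", "ARC-easy", "ARC-challenge", "SIQA"],
    "Language Understanding": ["HellaSwag", "OpenBookQA", "MMLU", "SQuADv2", "BoolQ"],
    "Math": ["GSM8k"],
    "Coding": ["HumanEval", "MBPP"],
}

def category_score(per_benchmark_scores: dict, category: str) -> float:
    """Unweighted mean over the benchmarks grouped under one category."""
    names = CATEGORIES[category]
    return sum(per_benchmark_scores[b] for b in names) / len(names)
```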
| Model         | Size | BBH  | BoolQ | MBPP | MMLU |
|---------------|------|------|-------|------|------|
| Gemini Nano 2 | 3.2B | 42.4 | 79.3  | 27.2 | 55.8 |
| Phi-2         | 2.7B | 59.3 | 83.3  | 59.1 | 56.7 |

Table 2. Comparison between Phi-2 and Gemini Nano 2 on Gemini’s reported benchmarks.