
LLM DATASET

Date: 2024-09-20 22:34:57
Tags: datasets, al, DATASET, 2024, code, LLM, 2023, et

Sources of LLM capabilities

https://arxiv.org/pdf/2402.18041

Sources for LLM compliance

https://arxiv.org/html/2402.12193v2

Sources for crime detection with large models

https://www.kaggle.com/datasets/odins0n/ucf-crime-dataset/data

Code and math

https://github.com/mlabonne/llm-datasets

 

Math & Logic

LLMs often struggle with mathematical reasoning and formal logic, which has led to the creation of specialized datasets. These datasets extend beyond pure mathematics, encompassing a wide range of problems that require systematic thinking and step-by-step reasoning, ultimately enabling LLMs to tackle complex real-world challenges that involve logical deduction and quantitative analysis.

| Dataset | # | Authors | Date | Notes |
|---|---|---|---|---|
| OpenMathInstruct-1 | 5.75M | Toshniwal et al. | Feb 2024 | Problems from GSM8K and MATH, solutions generated by Mixtral-8x7B. |
| MetaMathQA | 395k | Yu et al. | Dec 2023 | Bootstraps mathematical questions by rewriting them from multiple perspectives. See MetaMath paper. |
| MathInstruct | 262k | Yue et al. | Sep 2023 | Compiled from 13 math rationale datasets, six of which are newly curated; focuses on chain-of-thought and program-of-thought. |
| Orca-Math | 200k | Mitra et al. | Feb 2024 | Grade school math word problems generated using GPT4-Turbo. See Orca-Math paper. |
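The "program-of-thought" style mentioned for MathInstruct can be illustrated with a minimal sketch: the rationale is a short program, and the answer is obtained by running it rather than by reading it off free text. The field names `question` and `program`, the convention of storing the result in a variable called `answer`, and the sample problem itself are assumptions made for this illustration, not the actual MathInstruct schema.

```python
# Hypothetical program-of-thought sample (illustrative fields, not a real dataset row).
pot_sample = {
    "question": "A box holds 12 pencils. How many pencils are in 7 boxes after 5 are removed?",
    "program": (
        "pencils_per_box = 12\n"
        "boxes = 7\n"
        "removed = 5\n"
        "answer = pencils_per_box * boxes - removed\n"
    ),
}


def run_program_of_thought(program: str):
    """Execute the rationale program and return the value bound to `answer`."""
    namespace = {}
    exec(program, namespace)  # only safe for trusted or sandboxed programs
    return namespace["answer"]


result = run_program_of_thought(pot_sample["program"])
print(result)
```

A chain-of-thought sample would instead store the same reasoning as prose steps, leaving the arithmetic to the model.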

Code

Code is another challenging domain for LLMs that lack specialized pre-training. Code datasets, containing diverse programming language examples, are used to fine-tune LLMs and enhance their ability to understand, generate, and analyze code, enabling them to serve as effective coding assistants.

| Dataset | # | Authors | Date | Notes |
|---|---|---|---|---|
| CodeFeedback-Filtered-Instruction | 157k | Zheng et al. | Feb 2024 | Filtered version of Magicoder-OSS-Instruct, ShareGPT (Python), Magicoder-Evol-Instruct, and Evol-Instruct-Code. |
| Tested-143k-Python-Alpaca | 143k | Vezora | Mar 2024 | Collection of generated Python code that passed automatic tests to ensure high quality. |
| glaive-code-assistant | 136k | Glaive.ai | Sep 2023 | Synthetic data of problems and solutions with ~60% Python samples. Also see the v2 version. |
| Magicoder-Evol-Instruct-110K | 110k | Wei et al. | Nov 2023 | A decontaminated version of evol-codealpaca-v1. Decontamination is done in the same way as StarCoder (bigcode decontamination process). See Magicoder paper. |
| dolphin-coder | 109k | Eric Hartford | Nov 2023 | Dataset transformed from leetcode-rosetta. |
| synthetic_text_to_sql | 100k | Gretel.ai | Apr 2024 | Synthetic text-to-SQL samples (~23M tokens), covering diverse domains. |
| sql-create-context | 78.6k | b-mc2 | Apr 2023 | Cleansed and augmented version of the WikiSQL and Spider datasets. |
| Magicoder-OSS-Instruct-75K | 75k | Wei et al. | Nov 2023 | OSS-Instruct dataset generated by gpt-3.5-turbo-1106. See Magicoder paper. |
| Code-Feedback | 66.4k | Zheng et al. | Feb 2024 | Diverse Code Interpreter-like dataset with multi-turn dialogues and interleaved text and code responses. See OpenCodeInterpreter paper. |
| Open-Critic-GPT | 55.1k | Vezora | Jul 2024 | Uses a local model to create, introduce, and identify bugs in code across multiple programming languages. |
| self-oss-instruct-sc2-exec-filter-50k | 50.7k | Lozhkov et al. | Apr 2024 | Created in three steps with seed functions from TheStack v1, self-instruction with StarCoder2, and self-validation. See the blog post. |
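Several of the datasets above keep only generated code that passes automatic tests. The general idea of such an execution filter can be sketched as follows. This is a simplified illustration, not the pipeline used by any of the listed datasets: the `code` and `test` fields, the helper names, and the toy samples are all assumptions, and real pipelines would run untrusted code in an isolated sandbox with timeouts rather than calling `exec` in-process.

```python
from typing import Dict, List


def passes_tests(code: str, test: str) -> bool:
    """Run a candidate snippet, then its test, in a fresh namespace."""
    namespace: Dict[str, object] = {}
    try:
        exec(code, namespace)  # defines the candidate function(s)
        exec(test, namespace)  # assertions raise on failure
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all reject the sample.
        return False
    return True


def execution_filter(samples: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only the samples whose code passes its accompanying test."""
    return [s for s in samples if passes_tests(s["code"], s["test"])]


# Hypothetical generated samples: one correct, one wrong, one that does not parse.
samples = [
    {"code": "def add(a, b):\n    return a + b", "test": "assert add(2, 3) == 5"},
    {"code": "def add(a, b):\n    return a - b", "test": "assert add(2, 3) == 5"},
    {"code": "def add(a, b) return a + b", "test": "assert add(2, 3) == 5"},
]

kept = execution_filter(samples)
print(len(kept), "of", len(samples), "samples kept")
```

Only the first sample survives: the second fails its assertion and the third raises a syntax error before the test runs.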

 

 

From: https://www.cnblogs.com/lightsong/p/18423409
