Sources of LLM capabilities

https://arxiv.org/pdf/2402.18041

Sources for LLM compliance

https://arxiv.org/html/2402.12193v2

Sources for LLM crime detection

https://www.kaggle.com/datasets/odins0n/ucf-crime-dataset/data

Code and math datasets

https://github.com/mlabonne/llm-datasets

Math & Logic

LLMs often struggle with mathematical reasoning and formal logic, which has led to the creation of specialized datasets. These datasets extend beyond pure mathematics, encompassing a wide range of problems that require systematic thinking and step-by-step reasoning, ultimately enabling LLMs to tackle complex real-world challenges that involve logical deduction and quantitative analysis.

Dataset	# Samples	Authors	Date	Notes
OpenMathInstruct-1	5.75M	Toshniwal et al.	Feb 2024	Problems from GSM8K and MATH, solutions generated by Mixtral-8x7B
MetaMathQA	395k	Yu et al.	Dec 2023	Bootstrap mathematical questions by rewriting them from multiple perspectives. See MetaMath paper.
MathInstruct	262k	Yue et al.	Sep 2023	Compiled from 13 math rationale datasets, six of which are newly curated, and focuses on chain-of-thought and program-of-thought.
Orca-Math	200k	Mitra et al.	Feb 2024	Grade school math word problems generated using GPT4-Turbo. See Orca-Math paper.
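
The MathInstruct row mentions program-of-thought rationales, where the solution is written as a short program and the final answer comes from executing it. The following is a minimal sketch of that idea, not the format used by any dataset listed above: the `sample` dict, its field names (`question`, `program`, `answer`), and the helper `run_program_of_thought` are all hypothetical.

```python
import contextlib
import io


def run_program_of_thought(program: str) -> str:
    """Execute a program-of-thought rationale and return what it prints.

    Assumption: the rationale prints its final answer to stdout.
    exec() is unsafe on untrusted text; real pipelines need a sandbox.
    """
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(program, {})
    return buffer.getvalue().strip()


# Hypothetical sample, invented for illustration only.
sample = {
    "question": "A pen costs 3 dollars and a notebook costs 5 dollars. "
                "What do 4 pens and 2 notebooks cost in total?",
    "program": "pens = 4 * 3\nnotebooks = 2 * 5\nprint(pens + notebooks)",
    "answer": "22",
}

predicted = run_program_of_thought(sample["program"])
is_correct = predicted == sample["answer"]
print(predicted, is_correct)
```

Delegating the arithmetic to an interpreter means the model only has to produce the right steps, not carry out the calculation itself.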

Code

Code is another challenging domain for LLMs that lack specialized pre-training. Code datasets, containing diverse programming language examples, are used to fine-tune LLMs and enhance their ability to understand, generate, and analyze code, enabling them to serve as effective coding assistants.

Dataset	# Samples	Authors	Date	Notes
CodeFeedback-Filtered-Instruction	157k	Zheng et al.	Feb 2024	Filtered version of Magicoder-OSS-Instruct, ShareGPT (Python), Magicoder-Evol-Instruct, and Evol-Instruct-Code.
Tested-143k-Python-Alpaca	143k	Vezora	Mar 2024	Collection of generated Python code that passed automatic tests to ensure high quality.
glaive-code-assistant	136k	Glaive.ai	Sep 2023	Synthetic data of problems and solutions with ~60% Python samples. Also see the v2 version.
Magicoder-Evol-Instruct-110K	110k	Wei et al.	Nov 2023	A decontaminated version of evol-codealpaca-v1. Decontamination is done in the same way as StarCoder (bigcode decontamination process). See Magicoder paper.
dolphin-coder	109k	Eric Hartford	Nov 2023	Dataset transformed from leetcode-rosetta.
synthetic_text_to_sql	100k	Gretel.ai	Apr 2024	Synthetic text-to-SQL samples (~23M tokens), covering diverse domains.
sql-create-context	78.6k	b-mc2	Apr 2023	Cleansed and augmented version of the WikiSQL and Spider datasets.
Magicoder-OSS-Instruct-75K	75k	Wei et al.	Nov 2023	OSS-Instruct dataset generated by `gpt-3.5-turbo-1106`. See Magicoder paper.
Code-Feedback	66.4k	Zheng et al.	Feb 2024	Diverse Code Interpreter-like dataset with multi-turn dialogues and interleaved text and code responses. See OpenCodeInterpreter paper.
Open-Critic-GPT	55.1k	Vezora	Jul 2024	Use a local model to create, introduce, and identify bugs in code across multiple programming languages.
self-oss-instruct-sc2-exec-filter-50k	50.7k	Lozhkov et al.	Apr 2024	Created in three steps with seed functions from TheStack v1, self-instruction with StarCoder2, and self-validation. See the blog post.
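
Two entries in the table (Tested-143k-Python-Alpaca and self-oss-instruct-sc2-exec-filter-50k) describe keeping only generated code that passes automatic checks. A minimal sketch of such an execution filter might look like the following. It is not the pipeline either dataset actually used; the sample layout (dicts with `code` and `test` fields) and the helpers `passes_tests` and `filter_samples` are hypothetical.

```python
def passes_tests(code: str, test: str) -> bool:
    """Return True if `test` runs without error once `code` is defined.

    exec() is unsafe on untrusted code; a real pipeline would run this
    in a sandboxed subprocess with a timeout.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)  # define the candidate solution
        exec(test, namespace)  # run assertions against it
    except Exception:
        # Covers syntax errors, runtime errors, and failed assertions.
        return False
    return True


def filter_samples(samples: list[dict]) -> list[dict]:
    """Keep only the samples whose code passes its own test."""
    return [s for s in samples if passes_tests(s["code"], s["test"])]


# Hypothetical samples: one correct, one wrong, one that does not parse.
samples = [
    {"code": "def add(a, b):\n    return a + b", "test": "assert add(2, 3) == 5"},
    {"code": "def add(a, b):\n    return a - b", "test": "assert add(2, 3) == 5"},
    {"code": "def add(a, b) return a + b", "test": "assert add(2, 3) == 5"},
]

kept = filter_samples(samples)
print(len(kept))
```

Only the first sample survives: the second fails its assertion and the third raises a `SyntaxError` when it is defined.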

From： https://www.cnblogs.com/lightsong/p/18423409


LLM DATASET