
LLM: Contextual Semantic Clustering Retrieval for RAG - GraphRAG


  To date, RAG's biggest shortcoming is that it cannot answer summary-style questions. The previous post (https://www.cnblogs.com/theseventhson/p/18281227) introduced RAPTOR (cluster the chunks with a GMM Gaussian mixture model, then have an LLM write a summary for each cluster), which extracts cluster-level semantics in order to answer general, summarizing questions. Its core step is clustering: the GMM pulls semantically similar chunks together, the LLM summarizes each cluster, and at retrieval time the query is matched first against the upper-level summaries and then against the lower-level chunks that carry the details, so every granularity gets traversed and nothing slips through. The idea itself is sound; semantic chunking follows the same line of thought, only with a different way of clustering the chunks. Along the same lines, Microsoft came up with another clustering-and-retrieval scheme: GraphRAG. Conclusion first: GraphRAG's core pipeline is exactly the same as RAPTOR's: cluster -> summarize each cluster -> embed and store the summaries -> match the query. The biggest difference is that GraphRAG extracts each chunk's key information via named entity recognition (NER) and related techniques, assembles it into a knowledge graph, and then clusters the graph's nodes with community detection algorithms, thereby pulling semantically close elements together. So how exactly does GraphRAG do this, and what are the caveats? The full pipeline is laid out in the paper: https://arxiv.org/pdf/2404.16130

  The overall GraphRAG pipeline is as follows:

  [Figure: the overall GraphRAG pipeline, from the paper]

   1、Source Documents → Text Chunks: A fundamental design decision is the granularity with which input texts extracted from source documents should be split into text chunks for processing. In the following step, each of these chunks will be passed to a set of LLM prompts designed to extract the various elements of a graph index. Longer text chunks require fewer LLM calls for such extraction, but suffer from the recall degradation of longer LLM context windows (Kuratov et al., 2024; Liu et al., 2023). This behavior can be observed in Figure 2 in the case of a single extraction round (i.e., with zero gleanings): on a sample dataset (HotPotQA, Yang et al., 2018), using a chunk size of 600 tokens extracted almost twice as many entity references as when using a chunk size of 2400. While more references are generally better, any extraction process needs to balance recall and precision for the target activity.

   Step 1 splits the source documents; the resulting chunks are then passed through LLM prompts that extract named entities, relationships, and other graph elements. If the chunk size is too large, fewer entities and relationships get extracted and recall drops. In the paper's test with chunk sizes of 600 and 2400, the 600-token chunks recalled almost twice as many entity references.
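
  As a minimal sketch (not GraphRAG's actual code), token-based chunking might look like the following; the 600-token size mirrors the paper's experiment, while the tiktoken cl100k_base encoding and the 100-token overlap are assumptions:

    import tiktoken

    def chunk_text(text: str, chunk_size: int = 600, overlap: int = 100) -> list[str]:
        # Tokenize once, then slide a fixed-size window with overlap.
        enc = tiktoken.get_encoding("cl100k_base")  # assumed tokenizer
        tokens = enc.encode(text)
        chunks, step = [], chunk_size - overlap
        for start in range(0, len(tokens), step):
            chunks.append(enc.decode(tokens[start:start + chunk_size]))
            if start + chunk_size >= len(tokens):
                break
        return chunks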

        2、Text Chunks → Element Instances The baseline requirement for this step is to identify and extract instances of graph nodes and edges from each chunk of source text. We do this using a multipart LLM prompt that first identifies all entities in the text, including their name, type, and description, before identifying all relationships between clearly-related entities, including the source and target entities and a description of their relationship. Both kinds of element instance are output in a single list of delimited tuples. The primary opportunity to tailor this prompt to the domain of the document corpus lies in the choice of few-shot examples provided to the LLM for in-context learning (Brown et al., 2020). For example, while our default prompt extracting the broad class of “named entities” like people, places, and organizations is generally applicable, domains with specialized knowledge (e.g., science, medicine, law) will benefit from few-shot examples specialized to those domains. We also support a secondary extraction prompt for any additional covariates we would like to associate with the extracted node instances. Our default covariate prompt aims to extract claims linked to detected entities, including the subject, object, type, description, source text span, and start and end dates. To balance the needs of efficiency and quality, we use multiple rounds of “gleanings”, up to a specified maximum, to encourage the LLM to detect any additional entities it may have missed on prior extraction rounds. This is a multi-stage process in which we first ask the LLM to assess whether all entities were extracted, using a logit bias of 100 to force a yes/no decision. If the LLM responds that entities were missed, then a continuation indicating that “MANY entities were missed in the last extraction” encourages the LLM to glean these missing entities. This approach allows us to use larger chunk sizes without a drop in quality (Figure 2) or the forced introduction of noise.

        

  • Identify and extract graph node and edge instances

    • From each source text chunk, identify and extract instances of graph nodes and edges.
    • A multipart LLM prompt does this work: first identify all entities in the text, including their name, type, and description; then identify all relationships between clearly related entities, including the source and target entities and a description of their relationship.
    • Both kinds of element instances are output as a single list of delimited tuples.
  • Domain-tailored prompts

    • Some domains need a tailored prompt with domain-specific few-shot examples, e.g. fields with specialized knowledge such as science, medicine, or law.
  • Secondary extraction prompt

    • First extract the broad class of named entities: people, places, organizations, and so on.
    • Then extract additional covariates associated with the extracted node instances, such as claims linked to the detected entities, including the subject, object, type, description, source text span, and start and end dates.
  • Multi-round extraction (gleanings)

    • To balance efficiency and quality, run multiple rounds of "gleanings", up to a specified maximum, so that important entities are not missed (a code sketch follows the example below).
    • This is a multi-stage process: first ask the LLM to assess whether all entities were extracted, using a logit bias of 100 to force a yes/no decision.
    • If the LLM responds that entities were missed, a continuation stating that "MANY entities were missed in the last extraction" is appended so the LLM gleans the missing entities.
    • This allows larger chunk sizes without a drop in quality (as in Figure 2) or the forced introduction of noise.

 

  For example, take an article about binary reverse engineering:

  • Round 1: extract the technical terms, e.g. binary, frida, windows, android
  • Relationship identification: binary reversing includes Windows reversing and Android reversing
  • Secondary extraction: descriptions of those terms, e.g. Android reversing means unpacking the APK, decompiling the class and so files, and analyzing the important code, with the help of professional tools such as frida, ida, and android killer
  • Gleanings: pick up the important entities missed in the earlier rounds
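
  Here is a hedged sketch of the gleanings loop described above. `llm.complete` stands for a hypothetical chat-completion client (real APIs key logit_bias by token id, not by the strings used here), the prompts are paraphrased rather than the paper's actual prompts, and MAX_GLEANINGS is an assumed setting:

    MAX_GLEANINGS = 3  # assumed cap on extra extraction rounds

    def extract_elements(llm, chunk: str) -> list[str]:
        # First pass: entities + relationships as delimited tuples (paraphrased prompt).
        results = [llm.complete(
            "Extract all entities (name, type, description) and relationships "
            "(source, target, description) as delimited tuples:\n" + chunk)]
        for _ in range(MAX_GLEANINGS):
            # Force a yes/no verdict; the paper applies a logit bias of 100.
            verdict = llm.complete("Were ALL entities extracted? Answer YES or NO.",
                                   logit_bias={"YES": 100, "NO": 100})
            if verdict.strip().upper() == "YES":
                break
            # Continuation quoted from the paper to encourage gleaning.
            results.append(llm.complete(
                "MANY entities were missed in the last extraction. "
                "Extract the missing ones:\n" + chunk))
        return results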

    3、Element Instances → Element Summaries The use of an LLM to “extract” descriptions of entities, relationships, and claims represented in source texts is already a form of abstractive summarization, relying on the LLM to create independently meaningful summaries of concepts that may be implied but not stated by the text itself (e.g., the presence of implied relationships). To convert all such instance-level summaries into single blocks of descriptive text for each graph element (i.e., entity node, relationship edge, and claim covariate) requires a further round of LLM summarization over matching groups of instances. A potential concern at this stage is that the LLM may not consistently extract references to the same entity in the same text format, resulting in duplicate entity elements and thus duplicate nodes in the entity graph. However, since all closely-related “communities” of entities will be detected and summarized in the following step, and given that LLMs can understand the common entity behind multiple name variations, our overall approach is resilient to such variations provided there is sufficient connectivity from all variations to a shared set of closely-related entities. Overall, our use of rich descriptive text for homogeneous nodes in a potentially noisy graph structure is aligned with both the capabilities of LLMs and the needs of global, query-focused summarization. These qualities also differentiate our graph index from typical knowledge graphs, which rely on concise and consistent knowledge triples (subject, predicate, object) for downstream reasoning tasks.

  An LLM then summarizes the already-extracted entity nodes, relationship edges (including implied relationships), and claim covariates, producing a single descriptive block for each graph element. By consolidating and abstracting the element instances, this step yields uniform descriptive text that gives the subsequent community detection and global summarization a detailed and coherent data foundation.
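
  A minimal sketch of this consolidation step, assuming the instance-level descriptions have been collected as (element_key, description) pairs and `llm.complete` is again a placeholder client:

    from collections import defaultdict

    def summarize_elements(llm, instances: list[tuple[str, str]]) -> dict[str, str]:
        # Group all instance-level descriptions of the same node/edge/covariate,
        # e.g. ("ENTITY:frida", "a dynamic instrumentation toolkit ...").
        grouped = defaultdict(list)
        for key, desc in instances:
            grouped[key].append(desc)
        # One further round of LLM summarization per graph element.
        return {key: llm.complete(
                    "Merge these descriptions of one graph element into a single "
                    "coherent summary:\n" + "\n".join(descs))
                for key, descs in grouped.items()}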

  4、Element Summaries → Graph Communities The index created in the previous step can be modelled as an homogeneous undirected weighted graph in which entity nodes are connected by relationship edges, with edge weights representing the normalized counts of detected relationship instances. Given such a graph, a variety of community detection algorithms may be used to partition the graph into communities of nodes with stronger connections to one another than to the other nodes in the graph (e.g., see the surveys by Fortunato, 2010 and Jin et al., 2021). In our pipeline, we use Leiden (Traag et al., 2019) on account of its ability to recover hierarchical community structure of large-scale graphs efficiently (Figure 3). Each level of this hierarchy provides a community partition that covers the nodes of the graph in a mutually-exclusive, collective-exhaustive way, enabling divide-and-conquer global summarization.  This step is critical: community detection begins here, and the core goal is still clustering.

  • Graph construction
    • Build the element summaries from the previous step into a homogeneous undirected weighted graph.
  • Community detection
    • Run the Leiden algorithm over the graph to detect communities and produce a hierarchical community structure (at this point the core semantic clustering is done).
  • Hierarchical partitioning
    • Leiden's hierarchical detection yields community partitions at multiple levels.
    • Each level's partition is mutually exclusive and collectively covers all nodes of the graph.

· Comparing the community partitions produced at two different hops/levels of the hierarchy:

    [Figure: graph communities detected at two levels of the Leiden hierarchy]

   Community detection organizes the graph's nodes into communities, giving global summarization and information aggregation a structured foundation; a minimal sketch of this step follows.
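
  A flat (single-level) sketch of this step using the igraph and leidenalg libraries, which are assumed dependencies here; the paper's pipeline uses Leiden's hierarchical variant to obtain multiple levels:

    import igraph as ig
    import leidenalg as la

    def detect_communities(edges: list[tuple[str, str, float]]) -> dict[str, int]:
        # edges: (source_entity, target_entity, weight), where weight is the
        # normalized count of detected relationship instances.
        g = ig.Graph.TupleList(((s, t) for s, t, _ in edges), directed=False)
        g.es["weight"] = [w for _, _, w in edges]
        part = la.find_partition(g, la.RBConfigurationVertexPartition, weights="weight")
        # A mutually exclusive, collectively exhaustive partition of the nodes.
        return {g.vs[v]["name"]: cid for cid, comm in enumerate(part) for v in comm}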

  5、Graph Communities → Community Summaries The next step is to create report-like summaries of each community in the Leiden hierarchy, using a method designed to scale to very large datasets. These summaries are independently useful in their own right as a way to understand the global structure and semantics of the dataset, and may themselves be used to make sense of a corpus in the absence of a question. For example, a user may scan through community summaries at one level looking for general themes of interest, then follow links to the reports at the lower level that provide more details for each of the subtopics. Here, however, we focus on their utility as part of a graph-based index used for answering global queries. Community summaries are generated in the following way:

  • Leaf-level communities. The element summaries of a leaf-level community (nodes, edges, covariates) are prioritized and then iteratively added to the LLM context window until the token limit is reached. The prioritization is as follows: for each community edge in decreasing order of combined source and target node degree (i.e., overall prominence), add descriptions of the source node, target node, linked covariates, and the edge itself.
  • Higher-level communities. If all element summaries fit within the token limit of the context window, proceed as for leaf-level communities and summarize all element summaries within the community. Otherwise, rank sub-communities in decreasing order of element summary tokens and iteratively substitute sub-community summaries (shorter) for their associated element summaries (longer) until fit within the context window is achieved.

   After community detection in the previous step, every node has been clustered into a suitable community. To serve queries at different granularities, each community itself now needs summarization.

  • Leaf-level communities: the element summaries (nodes, edges, covariates) are prioritized and added one by one to the LLM context window until the token limit is reached. The priority order: for each community edge, in decreasing order of combined source and target node degree (higher degree means more prominent), add the descriptions of the source node, the target node, the linked covariates, and the edge itself (see the sketch below).
  • Higher-level communities: if all element summaries fit within the context window's token limit, proceed as for leaf-level communities and summarize all element summaries in the community. Otherwise, rank sub-communities in decreasing order of element-summary tokens and iteratively substitute the (shorter) sub-community summaries for their (longer) element summaries until everything fits in the context window.

  Why summarize at multiple levels? Again, to match user queries. A user can scan the community summaries at one level for themes of interest, then follow the links down to the lower-level reports for more detail on each subtopic.

  Generating community summaries lets users quickly grasp the content and structure of each community. The summaries help with understanding the dataset even in the absence of a concrete question, and they play a central role in answering global queries. Prioritization during summary generation keeps everything within the context-window limit while still producing detailed, useful community reports.
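
  A sketch of the leaf-level prioritization under assumed data structures: the `degree`, `node_desc`, and `edge_desc` lookups and the `count_tokens` helper are placeholders, and the 8000-token budget is an assumption:

    def build_community_context(community_edges, node_desc, edge_desc, degree,
                                token_limit: int = 8000, count_tokens=len) -> str:
        # Rank edges by overall prominence: combined source + target node degree.
        ranked = sorted(community_edges,
                        key=lambda e: degree[e[0]] + degree[e[1]], reverse=True)
        parts, used = [], 0
        for s, t in ranked:
            # Add source node, target node, and edge descriptions in turn
            # (linked covariates would be interleaved here as well).
            for text in (node_desc[s], node_desc[t], edge_desc[(s, t)]):
                cost = count_tokens(text)
                if used + cost > token_limit:
                    return "\n".join(parts)  # token budget exhausted
                parts.append(text)
                used += cost
        return "\n".join(parts)

  The resulting context is then handed to the LLM to produce the report-like community summary.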

  6、Community Summaries → Community Answers → Global Answer  Given a user query, the community summaries generated in the previous step can be used to generate a final answer in a multi-stage process. The hierarchical nature of the community structure also means that questions can be answered using the community summaries from different levels, raising the question of whether a particular level in the hierarchical community structure offers the best balance of summary detail and scope for general sensemaking questions (evaluated in section 3). For a given community level, the global answer to any user query is generated as follows:

  • Prepare community summaries. Community summaries are randomly shuffled and divided into chunks of pre-specified token size. This ensures relevant information is distributed across chunks, rather than concentrated (and potentially lost) in a single context window.
  • Map community answers. Generate intermediate answers in parallel, one for each chunk. The LLM is also asked to generate a score between 0-100 indicating how helpful the generated answer is in answering the target question. Answers with score 0 are filtered out.
  • Reduce to global answer. Intermediate community answers are sorted in descending order of helpfulness score and iteratively added into a new context window until the token limit is reached. This final context is used to generate the global answer returned to the user.

  The community summaries of different levels and granularities generated in the previous step are used to answer the user's query and produce the Global Answer. The steps for generating the global answer:

  • Prepare: randomly shuffle the community summaries and divide them into chunks of a pre-specified token size. This spreads relevant information across chunks instead of concentrating it (and potentially losing it) in a single context window.
  • Map: generate an intermediate answer for each chunk in parallel, and have the LLM also emit a 0-100 score for how helpful that answer is for the target question; answers scoring 0 are filtered out.
  • Reduce: sort the intermediate community answers by helpfulness score in descending order and add them one by one into a new context window until the token limit is reached; that final context is used to generate the global answer.
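
  A sketch of this map-reduce query stage, with `llm.answer_with_score` as a hypothetical call returning an (answer, 0-100 helpfulness) pair, `llm.complete` as a placeholder client, and `count_tokens` and the 4000-token budget as assumptions:

    import random

    def global_answer(llm, query: str, summaries: list[str],
                      budget: int = 4000, count_tokens=len) -> str:
        # Prepare: shuffle, then pack summaries into fixed-size chunks.
        random.shuffle(summaries)
        chunks, cur, used = [], [], 0
        for s in summaries:
            if cur and used + count_tokens(s) > budget:
                chunks.append("\n".join(cur)); cur, used = [], 0
            cur.append(s); used += count_tokens(s)
        if cur:
            chunks.append("\n".join(cur))
        # Map: one scored intermediate answer per chunk; drop score-0 answers.
        scored = [llm.answer_with_score(query, c) for c in chunks]
        helpful = sorted((p for p in scored if p[1] > 0),
                         key=lambda p: p[1], reverse=True)
        # Reduce: fill a fresh context with the most helpful answers first.
        context, used = [], 0
        for answer, _ in helpful:
            if used + count_tokens(answer) > budget:
                break
            context.append(answer); used += count_tokens(answer)
        return llm.complete("Question: " + query +
                            "\nIntermediate answers:\n" + "\n".join(context))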

 

  Finally, a summary of the latest RAG approaches:

  [Figure: summary of recent RAG approaches]

References:

1. https://www.microsoft.com/en-us/research/project/graphrag/ ; https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/

2. https://www.bilibili.com/video/BV1q7421o72E/?spm_id_from=333.788&vd_source=241a5bcb1c13e6828e519dd1f78f35b2 (using a knowledge graph for LLM RAG enhancement)

3. https://arxiv.org/pdf/2404.16130 ; https://www.microsoft.com/en-us/research/project/graphrag/

4. https://jeongiitae.medium.com/from-rag-to-graphrag-what-is-the-graphrag-and-why-i-use-it-f75a7852c10c (From RAG to GraphRAG: what is GraphRAG and why I use it)

