A Source-Code Walkthrough of Microsoft's GraphRAG Framework (LLMs)

Tags: pipeline run GraphRAG workflow LLMs graph create py source code

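1. Introduction

Microsoft recently open-sourced GraphRAG, a new retrieval-augmented generation (RAG) system built on knowledge graphs. The framework uses large language models (LLMs) to extract structured data from unstructured text and build a labeled knowledge graph, which supports applications such as dataset question generation and summarization-style question answering. A distinctive feature of GraphRAG is that it applies graph machine learning algorithms to aggregate the dataset semantically and analyze it hierarchically, so it can answer relatively high-level abstract or summary questions, which is exactly where conventional RAG systems fall short. I had been following this framework for a while, so I spent the last couple of days studying the source code alongside the existing technical documentation. This post records my reading of the GraphRAG source, and I hope it also helps clarify the system architecture, key concepts, and core workflows.

The GraphRAG source discussed here was pulled at commit ID a22003c302bf4ffeefec76a09533acaf114ae7bb, dated 2024.07.05.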

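2. Framework Overview

2.1 What problem does it solve (What & Why)?

Before getting into the code, let's briefly look at the goals and positioning of the GraphRAG project. In the paper, the authors clearly identify a scenario that conventional RAG cannot handle: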

However, RAG fails on global questions directed at an entire text corpus, such as “What are the main themes in the dataset?”, since this is inherently a query-focused summarization (QFS) task, rather than an explicit retrieval task.

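These are high-level summary questions such as "what are the main themes of this dataset". The authors argue that this scenario is essentially a query-focused summarization (QFS) task, which data retrieval alone cannot solve. The paper also describes their approach clearly: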

In contrast with related work that exploits the structured retrieval and traversal affordances of graph indexes (subsection 4.2), we focus on a previously unexplored quality of graphs in this context: their inherent modularity (Newman, 2006) and the ability of community detection algorithms to partition graphs into modular communities of closely-related nodes (e.g., Louvain, Blondel et al., 2008; Leiden, Traag et al., 2019). LLM-generated summaries of these community descriptions provide complete coverage of the underlying graph index and the input documents it represents. Query-focused summarization of an entire corpus is then made possible using a map-reduce approach: first using each community summary to answer the query independently and in parallel, then summarizing all relevant partial answers into a final global answer.

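In short: a community detection algorithm (such as Leiden) partitions the whole knowledge graph into modular communities of closely related nodes, the LLM then summarizes the communities bottom-up, and finally a map-reduce approach implements QFS: each community summary answers the query independently and in parallel, and the relevant partial answers are summarized into a complete global answer.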
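To make the map-reduce step concrete, here is a minimal Python sketch of the idea as the paper describes it. It is not GraphRAG's actual code: the function names (map_reduce_answer, toy_answer, toy_reduce) are hypothetical, and the two "LLM calls" are replaced by toy stand-ins so that the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def map_reduce_answer(
    query: str,
    community_summaries: List[str],
    answer_fn: Callable[[str, str], str],
    reduce_fn: Callable[[str, List[str]], str],
    max_workers: int = 4,
) -> str:
    """Answer `query` over all community summaries, map-reduce style."""
    # Map: each community summary answers the query independently, in parallel.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(lambda s: answer_fn(query, s), community_summaries))
    # Keep only the non-empty (relevant) partial answers.
    relevant = [p for p in partials if p.strip()]
    # Reduce: combine the relevant partial answers into one global answer.
    return reduce_fn(query, relevant)


# Toy stand-ins for the two LLM calls (assumptions, for illustration only).
def toy_answer(query: str, summary: str) -> str:
    # Treat a summary as relevant only if it shares a word with the query.
    shared = set(query.lower().split()) & set(summary.lower().split())
    return summary if shared else ""


def toy_reduce(query: str, partials: List[str]) -> str:
    return " | ".join(partials)


summaries = [
    "themes of redemption and generosity",
    "weather in London",
    "main themes include family",
]
result = map_reduce_answer("what are the main themes", summaries, toy_answer, toy_reduce)
print(result)
```

In a real system both answer_fn and reduce_fn would be prompts sent to an LLM; the structure (parallel map, filter, single reduce) is the part this sketch is meant to show.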

2.2 How is it implemented (How)?


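The paper lays out the basic approach. Like other RAG systems, the GraphRAG pipeline can be divided into two stages: indexing and query. The indexing stage uses an LLM to extract nodes (such as entities), edges (such as relationships), and covariates (such as claims), then applies community detection to partition the whole knowledge graph, and uses the LLM again to summarize each community. For a specific query, all the community summaries relevant to it can then be combined to generate a global answer.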
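The shape of the indexing stage can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the relationships are hard-coded toy data instead of LLM extractions, connected components are used as a simple stand-in for Leiden community detection (they are not the same algorithm), and the "summary" is a plain string rather than an LLM-written report. All names here are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def build_graph(relationships: List[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build an undirected adjacency map from (source, target) entity pairs."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for source, target in relationships:
        graph[source].add(target)
        graph[target].add(source)
    return graph


def connected_components(graph: Dict[str, Set[str]]) -> List[List[str]]:
    """Group nodes into components (a stand-in for Leiden, NOT Leiden itself)."""
    seen: Set[str] = set()
    communities: List[List[str]] = []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, members = [start], []
        seen.add(start)
        # Depth-first traversal collecting every node reachable from `start`.
        while stack:
            node = stack.pop()
            members.append(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        communities.append(sorted(members))
    return communities


# Toy relationships standing in for what the LLM would extract from text.
relationships = [("A", "B"), ("B", "C"), ("D", "E")]
communities = connected_components(build_graph(relationships))

# Placeholder "summaries"; a real pipeline would ask the LLM to write these.
summaries = {
    i: "Community of " + ", ".join(members) for i, members in enumerate(communities)
}
print(communities)
print(summaries)
```

The point is the data flow: extracted relationships become a graph, the graph is partitioned into communities, and each community gets its own summary that the query stage can later consume.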

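3. Source Code Analysis

The official documentation is honestly already very clear, but to understand some of the implementation details you still have to dig into the source. Let's go through the concrete implementation together. The source tree of the project looks like this: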

├── cache
├── config
├── emit
├── graph
│   ├── embedding
│   ├── extractors
│   │   ├── claims
│   │   ├── community_reports
│   │   ├── graph
│   │   └── summarize
│   ├── utils
│   └── visualization
├── input
├── llm
├── progress
├── reporting
├── storage
├── text_splitting
├── utils
├── verbs
│   ├── covariates
│   │   └── extract_covariates
│   │       └── strategies
│   │           └── graph_intelligence
│   ├── entities
│   │   ├── extraction
│   │   │   └── strategies
│   │   │       └── graph_intelligence
│   │   └── summarize
│   │       └── strategies
│   │           └── graph_intelligence
│   ├── graph
│   │   ├── clustering
│   │   │   └── strategies
│   │   ├── embed
│   │   │   └── strategies
│   │   ├── layout
│   │   │   └── methods
│   │   ├── merge
│   │   └── report
│   │       └── strategies
│   │           └── graph_intelligence
│   ├── overrides
│   └── text
│       ├── chunk
│       │   └── strategies
│       ├── embed
│       │   └── strategies
│       ├── replace
│       └── translate
│           └── strategies
└── workflows
    └── v1

3.1 Demo

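Before studying specific features, let's run the official demo. Getting started is easy: just follow Get Started (microsoft.github.io).
Warning: although this is only a simple demo, token consumption is anything but trivial. I expected this and deleted more than half of the original document beforehand, yet a full run still cost me roughly 3 USD. Running the complete official demo document would probably cost 5 to 10 USD.

The actual run is fairly slow, because the LLM effectively passes over the whole document again and again. The more important items produced by the run are listed below: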

├── cache
│   ├── community_reporting
│   │   ├── create_community_report-chat-v2-0d811a75c6decaf2b0dd7b9edff02389
│   │   ├── create_community_report-chat-v2-1205bcb6546a4379cf7ee841498e5bd4
│   │   ├── create_community_report-chat-v2-1445bd6d097492f734b06a09e579e639
│   │   ├── ...
│   ├── entity_extraction
│   │   ├── chat-010c37f5f6dedff6bd4f1f550867e4ee
│   │   ├── chat-017a1f05c2a23f74212fd9caa4fb7936
│   │   ├── chat-09095013f2caa58755e8a2d87eb66fc1
│   │   ├── ...
│   ├── summarize_descriptions
│   │   ├── summarize-chat-v2-00e335e395c5ae2355ef3185793b440d
│   │   ├── summarize-chat-v2-01c2694ab82c62924080f85e8253bb0a
│   │   ├── summarize-chat-v2-03acd7bc38cf2fb24b77f69b016a288a
│   │   ├── ...
│   └── text_embedding
│       ├── embedding-07cb902a76a26b6f98ca44c17157f47f
│       ├── embedding-3e0be6bffd1c1ac6a091f5264858a2a1
│       ├── ...
├── input
│   └── book.txt
├── output
│   └── 20240705-142536
│       ├── artifacts
│       │   ├── create_base_documents.parquet
│       │   ├── create_base_entity_graph.parquet
│       │   ├── create_base_extracted_entities.parquet
│       │   ├── create_base_text_units.parquet
│       │   ├── create_final_communities.parquet
│       │   ├── create_final_community_reports.parquet
│       │   ├── create_final_documents.parquet
│       │   ├── create_final_entities.parquet
│       │   ├── create_final_nodes.parquet
│       │   ├── create_final_relationships.parquet
│       │   ├── create_final_text_units.parquet
│       │   ├── create_summarized_entities.parquet
│       │   ├── join_text_units_to_entity_ids.parquet
│       │   ├── join_text_units_to_relationship_ids.parquet
│       │   └── stats.json
│       └── reports
│           ├── indexing-engine.log
│           └── logs.json
├── prompts
│   ├── claim_extraction.txt
│   ├── community_report.txt
│   ├── entity_extraction.txt
│   └── summarize_descriptions.txt
└── settings.yaml

Many of the files in this directory are worth studying closely; later sections explain them in detail alongside the code.
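As a starting point for exploring these outputs, the following small sketch lists the parquet artifacts of each run. It assumes only the directory layout shown above (output/<run timestamp>/artifacts/*.parquet); the helper name list_artifacts is my own, and the demo builds a throwaway directory with empty files so that it runs anywhere.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def list_artifacts(root: Path) -> Dict[str, List[str]]:
    """Map each run directory under root/output to its sorted parquet file names."""
    runs: Dict[str, List[str]] = {}
    output_dir = root / "output"
    if not output_dir.is_dir():
        return runs
    for run_dir in sorted(p for p in output_dir.iterdir() if p.is_dir()):
        artifacts = run_dir / "artifacts"
        # Only parquet tables are collected; stats.json and logs are skipped.
        runs[run_dir.name] = sorted(f.name for f in artifacts.glob("*.parquet"))
    return runs


# Demo on a temporary directory that mimics the layout shown above.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    art = root / "output" / "20240705-142536" / "artifacts"
    art.mkdir(parents=True)
    for name in ("create_final_nodes.parquet", "create_final_entities.parquet", "stats.json"):
        (art / name).touch()
    found = list_artifacts(root)

print(found)
```

On a real run you would point root at the demo's working directory; each listed file can then be loaded for inspection, for example with pandas.read_parquet if pandas and a parquet engine are installed.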

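In addition, the console prints a lot of runtime logs. One of the most important is the complete list of workflows, which reflects how the whole pipeline is orchestrated: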

⠹ GraphRAG Indexer 
├── Loading Input (InputFileType.text) - 1 files loaded (0 filtered) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:00
├── create_base_text_units
├── create_base_extracted_entities
├── create_summarized_entities
├── create_base_entity_graph
├── create_final_entities
├── create_final_nodes
├── create_final_communities
├── join_text_units_to_entity_ids
├── create_final_relationships
├── join_text_units_to_relationship_ids
├── create_final_community_reports
├── create_final_text_units
├── create_base_documents
└── create_final_documents


From： https://blog.csdn.net/python12345_/article/details/140841007