
Notes on Understanding the RAG RAPTOR Example Code

0. Source Code File

https://github.com/langchain-ai/langchain/blob/master/cookbook/RAPTOR.ipynb

1. Notes on Understanding the Code

We can explain each part with a simple story in plain language. Suppose we want to process a pile of texts (say, some articles), sort them into different categories, and summarize the content of each category.

The Story

Suppose we have many articles on different topics, such as "animals", "sports", and "technology". We want to automatically sort these articles into categories and summarize the main content of each one.

Importing the Tools

First, we need some tools to help with the task. Just as building a house calls for a hammer and nails, the code imports a few libraries:

from typing import Dict, List, Optional, Tuple

import numpy as np
import pandas as pd
import umap
from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from sklearn.mixture import GaussianMixture

These tools help us process the data, do the numerical work, and generate text summaries.

Fixing the Random Seed

We set a fixed seed so that each run of the code produces the same result, making it easy to reproduce and verify. (Note that the seed is only wired into the GaussianMixture calls below; the UMAP calls never receive a random_state, so the dimensionality-reduction step is not fully deterministic.)

RANDOM_SEED = 224

Global Dimensionality Reduction

This function, global_cluster_embeddings, reduces the high-dimensional embedding vectors (the mathematical representation of each text) to a low-dimensional space, which makes the later steps easier.

def global_cluster_embeddings(
    embeddings: np.ndarray,
    dim: int,
    n_neighbors: Optional[int] = None,
    metric: str = "cosine",
) -> np.ndarray:
    """
    Global dimensionality reduction with UMAP.
    """
    if n_neighbors is None:
        # Default to roughly sqrt(n) neighbors, a common UMAP heuristic
        n_neighbors = int((len(embeddings) - 1) ** 0.5)
    return umap.UMAP(
        n_neighbors=n_neighbors, n_components=dim, metric=metric
    ).fit_transform(embeddings)
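
To see it in action, here is a minimal sketch; the random array below is just a stand-in for real embeddings, and the sizes are made up for illustration:

# 100 fake 1536-dimensional "embeddings" (random stand-ins for real ones)
fake_embeddings = np.random.rand(100, 1536)
reduced = global_cluster_embeddings(fake_embeddings, dim=10)
print(reduced.shape)  # (100, 10)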

Local Dimensionality Reduction

This function, local_cluster_embeddings, is similar to the one above, but it is typically applied after the global reduction to refine the grouping inside each global cluster.

def local_cluster_embeddings(
    embeddings: np.ndarray, dim: int, num_neighbors: int = 10, metric: str = "cosine"
) -> np.ndarray:
    """
    Local dimensionality reduction with UMAP.
    """
    return umap.UMAP(
        n_neighbors=num_neighbors, n_components=dim, metric=metric
    ).fit_transform(embeddings)

Finding the Optimal Number of Clusters

This function, get_optimal_clusters, works out the most suitable number of clusters (that is, how many groups to split the data into).

def get_optimal_clusters(
    embeddings: np.ndarray, max_clusters: int = 50, random_state: int = RANDOM_SEED
) -> int:
    """
    Determine the optimal number of clusters using a Gaussian Mixture
    Model and the Bayesian Information Criterion (BIC).
    """
    max_clusters = min(max_clusters, len(embeddings))
    # Fit a GMM for each candidate cluster count and keep the one
    # that minimizes the BIC
    n_clusters = np.arange(1, max_clusters)
    bics = []
    for n in n_clusters:
        gm = GaussianMixture(n_components=n, random_state=random_state)
        gm.fit(embeddings)
        bics.append(gm.bic(embeddings))
    return n_clusters[np.argmin(bics)]
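
A quick way to see this at work is on synthetic data. A minimal sketch, using scikit-learn's make_blobs (not part of the notebook) to fabricate three obvious groups:

from sklearn.datasets import make_blobs

# Three well-separated 2-D blobs; the BIC curve should usually bottom out at 3
X, _ = make_blobs(n_samples=120, centers=3, random_state=RANDOM_SEED)
print(get_optimal_clusters(X))  # typically prints 3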

GMM Clustering

This function, GMM_cluster, clusters the data with a Gaussian Mixture Model (GMM). Because it keeps every cluster whose membership probability exceeds the threshold, a single text can end up in more than one cluster.

def GMM_cluster(embeddings: np.ndarray, threshold: float, random_state: int = 0):
    """
    Cluster with a GMM and assign labels based on a probability threshold.
    """
    n_clusters = get_optimal_clusters(embeddings)
    gm = GaussianMixture(n_components=n_clusters, random_state=random_state)
    gm.fit(embeddings)
    probs = gm.predict_proba(embeddings)
    # Each point receives every cluster whose membership probability
    # exceeds the threshold, so a point can belong to several clusters
    labels = [np.where(prob > threshold)[0] for prob in probs]
    return labels, n_clusters
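
Note the shape of the result: labels is a list in which each entry holds every cluster index whose probability cleared the threshold, so a point can carry zero, one, or several labels. A minimal sketch on the same kind of synthetic blobs as above:

from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=120, centers=3, random_state=RANDOM_SEED)
labels, k = GMM_cluster(X, threshold=0.1)
print(k)          # typically 3
print(labels[0])  # e.g. array([2]): the cluster(s) the first point belongs to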

Running the Full Clustering Step

This function, perform_clustering, ties the steps together: global dimensionality reduction, global GMM clustering, and then local dimensionality reduction and clustering within each global cluster, sorting all the texts into categories.

def perform_clustering(
    embeddings: np.ndarray,
    dim: int,
    threshold: float,
) -> List[np.ndarray]:
    """
    Global dimensionality reduction and GMM clustering, followed by local
    dimensionality reduction and clustering within each global cluster.
    """
    if len(embeddings) <= dim + 1:
        # Too few points to reduce meaningfully: put everything in cluster 0
        return [np.array([0]) for _ in range(len(embeddings))]

    reduced_embeddings_global = global_cluster_embeddings(embeddings, dim)
    global_clusters, n_global_clusters = GMM_cluster(
        reduced_embeddings_global, threshold
    )

    all_local_clusters = [np.array([]) for _ in range(len(embeddings))]
    total_clusters = 0

    for i in range(n_global_clusters):
        # Select the rows whose global label set contains cluster i
        global_cluster_embeddings_ = embeddings[
            np.array([i in gc for gc in global_clusters])
        ]

        if len(global_cluster_embeddings_) == 0:
            continue
        if len(global_cluster_embeddings_) <= dim + 1:
            # Cluster too small to reduce again: one local cluster
            local_clusters = [np.array([0]) for _ in global_cluster_embeddings_]
            n_local_clusters = 1
        else:
            reduced_embeddings_local = local_cluster_embeddings(
                global_cluster_embeddings_, dim
            )
            local_clusters, n_local_clusters = GMM_cluster(
                reduced_embeddings_local, threshold
            )

        for j in range(n_local_clusters):
            local_cluster_embeddings_ = global_cluster_embeddings_[
                np.array([j in lc for lc in local_clusters])
            ]
            # Map local cluster members back to their row indices in the
            # original embeddings array, then record a globally unique id
            indices = np.where(
                (embeddings == local_cluster_embeddings_[:, None]).all(-1)
            )[1]
            for idx in indices:
                all_local_clusters[idx] = np.append(
                    all_local_clusters[idx], j + total_clusters
                )

        total_clusters += n_local_clusters

    return all_local_clusters
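
A minimal end-to-end sketch with random vectors standing in for real embeddings (dim=10 and threshold=0.1 mirror the values the notebook uses later):

fake_embeddings = np.random.rand(60, 64)
assignments = perform_clustering(fake_embeddings, dim=10, threshold=0.1)
# Each entry lists the cluster id(s) assigned to one "text"
print(assignments[0])  # e.g. array([3.])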

Embedding

This function, embed, turns texts into embedding vectors.

def embed(texts):
    """
    Generate embedding vectors for a list of texts.
    """
    # embd is a global embedding model defined elsewhere in the notebook
    text_embeddings = embd.embed_documents(texts)
    text_embeddings_np = np.array(text_embeddings)
    return text_embeddings_np
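
The notebook defines embd earlier on. A minimal sketch, assuming the OpenAI embedding model from the langchain_openai package; any LangChain embedding class that implements embed_documents would work just as well:

from langchain_openai import OpenAIEmbeddings

# Requires an OPENAI_API_KEY in the environment
embd = OpenAIEmbeddings()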

Embedding and Clustering the Texts

This function, embed_cluster_texts, first embeds the texts, then clusters them, and stores the result in a DataFrame.

def embed_cluster_texts(texts):
    """
    Embed the texts and cluster them, returning a DataFrame with the
    texts, their embeddings, and their cluster labels.
    """
    text_embeddings_np = embed(texts)
    # dim=10 and threshold=0.1 are the hard-coded values the notebook uses
    cluster_labels = perform_clustering(
        text_embeddings_np, 10, 0.1
    )
    df = pd.DataFrame()
    df["text"] = texts
    df["embd"] = list(text_embeddings_np)
    df["cluster"] = cluster_labels
    return df

Formatting the Texts

This function, fmt_txt, joins the texts in a DataFrame into a single delimited string.

def fmt_txt(df: pd.DataFrame) -> str:
    """
    Join the texts in the DataFrame into a single delimited string.
    """
    unique_txt = df["text"].tolist()
    return "--- --- \n --- --- ".join(unique_txt)

Embedding, Clustering, and Summarizing the Texts

This function, embed_cluster_summarize_texts, embeds, clusters, and summarizes the texts. It returns two DataFrames: one with the texts and their cluster labels, and one with a summary of each cluster.

def embed_cluster_summarize_texts(
    texts: List[str], level: int
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """
    Embed, cluster, and summarize the texts.
    """
    df_clusters = embed_cluster_texts(texts)
    # A text can belong to several clusters, so expand to one row per
    # (text, cluster) pair
    expanded_list = []
    for index, row in df_clusters.iterrows():
        for cluster in row["cluster"]:
            expanded_list.append(
                {"text": row["text"], "embd": row["embd"], "cluster": cluster}
            )
    expanded_df = pd.DataFrame(expanded_list)
    all_clusters = expanded_df["cluster"].unique()

    template = """Here is a sub-set of LangChain Expression Language doc. 
    
    LangChain Expression Language provides a way to compose chain in LangChain.
    
    Give a detailed summary of the documentation provided.
    
    Documentation:
    {context}
    """
    prompt = ChatPromptTemplate.from_template(template)
    # model is a global chat model defined elsewhere in the notebook
    chain = prompt | model | StrOutputParser()

    # Summarize the concatenated texts of each cluster
    summaries = []
    for i in all_clusters:
        df_cluster = expanded_df[expanded_df["cluster"] == i]
        formatted_txt = fmt_txt(df_cluster)
        summaries.append(chain.invoke({"context": formatted_txt}))

    df_summary = pd.DataFrame(
        {
            "summaries": summaries,
            "level": [level] * len(summaries),
            "cluster": list(all_clusters),
        }
    )

    return df_clusters, df_summary
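
Like embd, the chat model model is a global that the notebook defines earlier. A minimal sketch, assuming ChatOpenAI from langchain_openai; the model name below is a placeholder, not necessarily the notebook's choice:

from langchain_openai import ChatOpenAI

# Any LangChain chat model can stand in here
model = ChatOpenAI(temperature=0, model="gpt-4o-mini")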

Recursive Embedding, Clustering, and Summarizing

This function, recursive_embed_cluster_summarize, embeds, clusters, and summarizes recursively: the summaries produced at one level become the input texts of the next, until the specified number of levels is reached or the number of clusters drops to 1.

def recursive_embed_cluster_summarize(
    texts: List[str], level: int = 1, n_levels: int = 3
) -> Dict[int, Tuple[pd.DataFrame, pd.DataFrame]]:
    """
    Recursively embed, cluster, and summarize the texts.
    """
    results = {}
    df_clusters, df_summary = embed_cluster_summarize_texts(texts, level)
    results[level] = (df_clusters, df_summary)

    # Recurse only if levels remain and more than one cluster was found;
    # the summaries of this level become the input texts of the next
    unique_clusters = df_summary["cluster"].nunique()
    if level < n_levels and unique_clusters > 1:
        new_texts = df_summary["summaries"].tolist()
        next_level_results = recursive_embed_cluster_summarize(
            new_texts, level + 1, n_levels
        )
        results.update(next_level_results)

    return results
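
A minimal usage sketch; leaf_texts here is a hypothetical list of document chunks (in the notebook it comes from splitting the LangChain Expression Language docs):

leaf_texts = ["chunk one ...", "chunk two ...", "chunk three ..."]
results = recursive_embed_cluster_summarize(leaf_texts, level=1, n_levels=3)

# Flatten the tree for retrieval: the leaf chunks plus the summaries from
# every level can all be indexed together in one vector store
all_texts = leaf_texts.copy()
for lvl in sorted(results.keys()):
    all_texts.extend(results[lvl][1]["summaries"].tolist())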

Summary

The overall idea of the code:

  1. Convert the texts into embedding vectors.
  2. Apply global dimensionality reduction to the embeddings.
  3. Cluster the reduced embeddings with a GMM.
  4. Within each global cluster, apply further local dimensionality reduction and clustering.
  5. Summarize the texts in each cluster.
  6. If more levels remain, recursively embed, cluster, and summarize the summaries, until the specified number of levels is reached or only one cluster is left.
