Late chunking source code analysis - https://github.com/jina-ai/late-chunking

Tags: embedding, chunking, chunk, late, source code, model, text

import bisect
import logging
from typing import Dict, List, Optional, Tuple, Union

from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.core.schema import Document
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from transformers import AutoTokenizer

# Set the logging level to WARNING to suppress INFO and DEBUG messages
logging.getLogger('sentence_transformers').setLevel(logging.WARNING)

CHUNKING_STRATEGIES = ['semantic', 'fixed', 'sentences']


class Chunker:
    def __init__(
        self,
        chunking_strategy: str,
    ):
        if chunking_strategy not in CHUNKING_STRATEGIES:
            raise ValueError(f"Unsupported chunking strategy: {chunking_strategy}")
        self.chunking_strategy = chunking_strategy
        self.embed_model = None
        self.embedding_model_name = None

    def _setup_semantic_chunking(self, embedding_model_name):
        if embedding_model_name:
            self.embedding_model_name = embedding_model_name

        self.embed_model = HuggingFaceEmbedding(
            model_name=self.embedding_model_name,
            trust_remote_code=True,
            embed_batch_size=1,
        )
        self.splitter = SemanticSplitterNodeParser(
            embed_model=self.embed_model,
            show_progress=False,
        )

    def chunk_semantically(
        self,
        text: str,
        tokenizer: 'AutoTokenizer',
        embedding_model_name: Optional[str] = None,
    ) -> List[Tuple[int, int]]:
        if self.embed_model is None:
            self._setup_semantic_chunking(embedding_model_name)

        # Get semantic nodes
        nodes = [
            (node.start_char_idx, node.end_char_idx)
            for node in self.splitter.get_nodes_from_documents(
                [Document(text=text)], show_progress=False
            )
        ]

        # Tokenize the entire text
        tokens = tokenizer.encode_plus(
            text,
            return_offsets_mapping=True,
            add_special_tokens=False,
            padding=True,
            truncation=True,
        )
        token_offsets = tokens.offset_mapping

        chunk_spans = []

        for char_start, char_end in nodes:
            # Convert char indices to token indices
            start_chunk_index = bisect.bisect_left(
                [offset[0] for offset in token_offsets], char_start
            )
            end_chunk_index = bisect.bisect_right(
                [offset[1] for offset in token_offsets], char_end
            )

            # Add the chunk span if it's within the tokenized text
            if start_chunk_index < len(token_offsets) and end_chunk_index <= len(
                token_offsets
            ):
                chunk_spans.append((start_chunk_index, end_chunk_index))
            else:
                break

        return chunk_spans

    def chunk_by_tokens(
        self,
        text: str,
        chunk_size: int,
        tokenizer: 'AutoTokenizer',
    ) -> List[Tuple[int, int]]:
        tokens = tokenizer.encode_plus(
            text, return_offsets_mapping=True, add_special_tokens=False
        )
        token_offsets = tokens.offset_mapping

        chunk_spans = []
        for i in range(0, len(token_offsets), chunk_size):
            chunk_end = min(i + chunk_size, len(token_offsets))
            if chunk_end - i > 0:
                chunk_spans.append((i, chunk_end))

        return chunk_spans

    def chunk_by_sentences(
        self,
        text: str,
        n_sentences: int,
        tokenizer: 'AutoTokenizer',
    ) -> List[Tuple[int, int]]:
        tokens = tokenizer.encode_plus(
            text, return_offsets_mapping=True, add_special_tokens=False
        )
        token_offsets = tokens.offset_mapping

        chunk_spans = []
        chunk_start = 0
        count_chunks = 0
        for i in range(0, len(token_offsets)):
            if tokens.tokens(0)[i] in ('.', '!', '?') and (
                (len(tokens.tokens(0)) == i + 1)
                or (tokens.token_to_chars(i).end != tokens.token_to_chars(i + 1).start)
            ):
                count_chunks += 1
                if count_chunks == n_sentences:
                    chunk_spans.append((chunk_start, i + 1))
                    chunk_start = i + 1
                    count_chunks = 0
        if len(tokens.tokens(0)) - chunk_start > 1:
            chunk_spans.append((chunk_start, len(tokens.tokens(0))))
        return chunk_spans

    def chunk(
        self,
        text: str,
        tokenizer: 'AutoTokenizer',
        chunking_strategy: Optional[str] = None,
        chunk_size: Optional[int] = None,
        n_sentences: Optional[int] = None,
        embedding_model_name: Optional[str] = None,
    ):
        chunking_strategy = chunking_strategy or self.chunking_strategy
        if chunking_strategy == "semantic":
            return self.chunk_semantically(
                text,
                embedding_model_name=embedding_model_name,
                tokenizer=tokenizer,
            )
        elif chunking_strategy == "fixed":
            if chunk_size is None or chunk_size < 4:
                raise ValueError("Chunk size must be >= 4.")
            return self.chunk_by_tokens(text, chunk_size, tokenizer)
        elif chunking_strategy == "sentences":
            return self.chunk_by_sentences(text, n_sentences, tokenizer)
        else:
            raise ValueError("Unsupported chunking strategy") 
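
The character-to-token conversion inside chunk_semantically can be tried in isolation. The sketch below is only an illustration under stated assumptions: whitespace_offsets is a hypothetical stand-in for the tokenizer's offset_mapping (one token per whitespace-separated word), and char_span_to_token_span is a helper name introduced here, not part of the repository. It applies the same bisect_left / bisect_right lookups to turn a character span into a half-open token span.

```python
import bisect
import re
from typing import List, Tuple


def whitespace_offsets(text: str) -> List[Tuple[int, int]]:
    """Toy stand-in for a tokenizer's offset_mapping: one token per word."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def char_span_to_token_span(
    offsets: List[Tuple[int, int]], char_start: int, char_end: int
) -> Tuple[int, int]:
    """Map a character span to a half-open token span [token_start, token_end)."""
    starts = [start for start, _ in offsets]
    ends = [end for _, end in offsets]
    # First token that starts at or after char_start.
    token_start = bisect.bisect_left(starts, char_start)
    # One past the last token that ends at or before char_end.
    token_end = bisect.bisect_right(ends, char_end)
    return token_start, token_end


text = "Berlin is the capital. Its population is large."
offsets = whitespace_offsets(text)
first_sentence_end = text.index(".") + 1

span_1 = char_span_to_token_span(offsets, 0, first_sentence_end)
span_2 = char_span_to_token_span(offsets, first_sentence_end, len(text))
print(span_1, span_2)  # (0, 4) (4, 8)
```

The two spans are contiguous and do not overlap, which is what the later pooling step needs. With a real subword tokenizer the offsets differ, but the lookup logic stays the same.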

Usage example: Chunked Pooling
This notebook explains how chunked pooling can be implemented. First, install the requirements:

!pip install -r requirements.txt
Then we load the model that we want to use for the embedding. We use jinaai/jina-embeddings-v2-base-en, but any other model that supports mean pooling would work. Models with a large maximum context length are preferred.

from transformers import AutoModel
from transformers import AutoTokenizer

from chunked_pooling import chunked_pooling, chunk_by_sentences

# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)
Now we define the text that we want to encode and split it into chunks. The chunk_by_sentences function also returns the span annotations, which specify the token span covered by each chunk and are needed for the chunked pooling.

input_text = "Berlin is the capital and largest city of Germany, both by area and by population. Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits. The city is also one of the states of Germany, and is the third smallest state in the country in terms of area."

# determine chunks
chunks, span_annotations = chunk_by_sentences(input_text, tokenizer)
print('Chunks:\n- "' + '"\n- "'.join(chunks) + '"')
Chunks:
- "Berlin is the capital and largest city of Germany, both by area and by population."
- " Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits."
- " The city is also one of the states of Germany, and is the third smallest state in the country in terms of area."
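
To make the span annotations more concrete, here is a minimal sketch of sentence-based span detection in the spirit of Chunker.chunk_by_sentences. It is an assumption-laden illustration: toy_tokenize is a hypothetical regex tokenizer standing in for the real one, and sentence_spans is a name introduced here. A sentence boundary is a ".", "!" or "?" token that is either the last token or not directly followed by the next token, which is why the period in "3.85" does not end a sentence.

```python
import re
from typing import List, Tuple

Token = Tuple[str, int, int]  # (text, char_start, char_end)


def toy_tokenize(text: str) -> List[Token]:
    """Toy tokenizer (an assumption): words and single punctuation marks."""
    return [
        (m.group(), m.start(), m.end())
        for m in re.finditer(r"\w+|[^\w\s]", text)
    ]


def sentence_spans(text: str, n_sentences: int = 1) -> List[Tuple[int, int]]:
    """Return half-open token spans that each cover n_sentences sentences."""
    tokens = toy_tokenize(text)
    spans: List[Tuple[int, int]] = []
    chunk_start = 0
    sentences_seen = 0
    for i, (token, _, char_end) in enumerate(tokens):
        is_last = i + 1 == len(tokens)
        # A sentence ends at punctuation that is not glued to the next token.
        if token in (".", "!", "?") and (
            is_last or char_end != tokens[i + 1][1]
        ):
            sentences_seen += 1
            if sentences_seen == n_sentences:
                spans.append((chunk_start, i + 1))
                chunk_start = i + 1
                sentences_seen = 0
    # Keep any trailing tokens that did not fill a complete chunk.
    if chunk_start < len(tokens):
        spans.append((chunk_start, len(tokens)))
    return spans


text = "Berlin is big. It has 3.85 million people. Nice!"
spans = sentence_spans(text)
print(spans)  # [(0, 4), (4, 12), (12, 14)]
```

Calling sentence_spans(text, 2) groups two sentences per chunk and leaves the last sentence as a shorter trailing chunk.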
Now we encode the chunks with the traditional and the context-sensitive chunked pooling method:

# chunk before
embeddings_traditional_chunking = model.encode(chunks)

# chunk afterwards (context-sensitive chunked pooling)
inputs = tokenizer(input_text, return_tensors='pt')
model_output = model(**inputs)
embeddings = chunked_pooling(model_output, [span_annotations])[0]
Finally, we compare the similarity of the word "Berlin" with the chunks. The similarity should be higher for the context-sensitive chunked pooling method:

import numpy as np

cos_sim = lambda x, y: np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

berlin_embedding = model.encode('Berlin')

for chunk, new_embedding, trad_embeddings in zip(chunks, embeddings, embeddings_traditional_chunking):
    print(f'similarity_new("Berlin", "{chunk}"):', cos_sim(berlin_embedding, new_embedding))
    print(f'similarity_trad("Berlin", "{chunk}"):', cos_sim(berlin_embedding, trad_embeddings))

similarity_new("Berlin", "Berlin is the capital and largest city of Germany, both by area and by population."): 0.849546
similarity_trad("Berlin", "Berlin is the capital and largest city of Germany, both by area and by population."): 0.84862185
similarity_new("Berlin", " Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits."): 0.82489026
similarity_trad("Berlin", " Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits."): 0.7084338
similarity_new("Berlin", " The city is also one of the states of Germany, and is the third smallest state in the country in terms of area."): 0.84980094
similarity_trad("Berlin", " The city is also one of the states of Germany, and is the third smallest state in the country in terms of area."): 0.7534553

Based on the content above, can you identify the key ideas behind the late chunking implementation? (GPT-4o)

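From the content above, "late chunking" is a method that uses a long-context embedding model to encode the entire text first and only afterwards splits the token representations into chunks. The goal is for each chunk to retain as much of the surrounding context as possible, which improves the accuracy of the chunk embeddings used in later context-sensitive processing.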

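The key ideas and implementation steps are as follows:

Initial chunking strategy:
- A chunking strategy (such as sentence chunking) determines the chunk boundaries. Splitting along sentence boundaries, rather than simply at fixed lengths, keeps each chunk semantically coherent.
Context-sensitive pooling (Chunked Pooling):
- Once the chunk boundaries are fixed, the embedding model encodes the whole text, and the token embeddings are then pooled within each predefined chunk span. Every chunk embedding is therefore computed with awareness of the full text, so long-range information is not lost.
Embedding with context information:
- Embedding vectors are not generated chunk by chunk right after splitting. Instead, the full text is encoded once and the token embeddings are regrouped into per-chunk embeddings according to the chunk boundaries. Compared with the traditional chunk-then-embed approach, this preserves more global context.
Comparing the results:
- The embeddings from traditional chunking are compared with the context-sensitive ones, for example by computing each chunk's similarity to a keyword such as "Berlin". In the notebook output above, late chunking scores higher on the two chunks that refer to Berlin only indirectly (0.825 vs. 0.708 and 0.850 vs. 0.753).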

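Implementation notes:

Jina's embedding model jinaai/jina-embeddings-v2-base-en is used because it accepts long inputs, so the whole text can be encoded in a single pass.
The chunk boundaries are computed first, the full text is then tokenized and encoded, and finally the token outputs are pooled per chunk span.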

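The key to this workflow is that a long-context encoder sees the full text before pooling, so each chunk embedding carries complete context. This helps in information retrieval and similarity measurement tasks.

In the example above, the method improves the similarity between a keyword and text fragments that depend on earlier context. Results still depend on the task and data, so some tuning is needed to get the best performance.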

"Jina's embedding model jinaai/jina-embeddings-v2-base-en is used because it supports context-sensitive embedding of long texts." Can other models be used instead? (GPT-4o)

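The "late chunking" strategy is not limited to Jina's embedding model. As the notebook notes, any embedding model that supports mean pooling can be used, and models with a large maximum context length are preferred. The model must also expose its token-level outputs so that they can be pooled per chunk. Hosted models such as GPT-3.5/4 that do not return token embeddings therefore cannot be used this way, and short-context encoders such as RoBERTa (512 tokens) leave little room for cross-chunk context. When choosing a model, consider text length, application needs, and available compute.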

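For many applications, encoding a whole text document into a single embedding representation is not useful. Many applications need to retrieve smaller parts of the text, and dense vector-based information retrieval systems often perform better with smaller text segments because embedding vectors have limited information capacity.

RAG (Retrieval-Augmented Generation) is one of the best-known applications that requires splitting document collections into smaller text chunks. These chunks are typically stored in a vector database, with vector representations created by a text embedding model. At runtime, the same embedding model encodes a query text into a vector representation, which is used to identify relevant stored text chunks. These are then passed to a large language model (LLM), which synthesizes a response to the query based on the retrieved texts.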
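
The retrieval step described above can be sketched with plain cosine similarity. This is a toy illustration only: the three-dimensional vectors, the chunk texts, and the top_k helper are all made up for the example, whereas a real system would obtain the vectors from an embedding model and store them in a vector database.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    index: List[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the k stored chunks most similar to the query vector."""
    scored = [(chunk, cosine(query_vec, vec)) for chunk, vec in index]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical chunk embeddings; real ones come from an embedding model.
index = [
    ("chunk about Berlin", [0.9, 0.1, 0.0]),
    ("chunk about cooking", [0.0, 0.2, 0.9]),
    ("chunk about Germany", [0.7, 0.6, 0.1]),
]
query = [1.0, 0.2, 0.0]  # hypothetical query embedding

results = top_k(query, index, k=2)
for chunk, score in results:
    print(f"{score:.3f}  {chunk}")
```

The retrieved chunk texts would then be placed in the LLM prompt. The quality of this ranking depends entirely on how well each chunk vector captures the chunk's meaning, which is where late chunking comes in.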

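The context problem

This simple RAG approach is not without challenges. Long-distance contextual dependencies, where the relevant information is spread over multiple chunks and text segments become useless once taken out of context, are handled particularly poorly. Consider the Wikipedia article on Berlin from the example above, split into sentence-sized chunks. Phrases like "its" and "the city" refer to "Berlin", which is mentioned only in the first sentence, so it is hard for the embedding model to link them to the right entity and produce a high-quality embedding representation.

For example, if we split a Wikipedia article into sentence-length segments, as in the example above, a RAG system might not be able to answer a query like "What is the population of Berlin?". The city name and the population never appear together in a single segment, and the segments lack any larger document context. An LLM that is shown one of these segments cannot resolve anaphoric references such as "it" or "the city".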

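Context-sensitive chunking

To solve this problem, we take advantage of the long input sequences that recent embedding models such as jina-embeddings-v2-base-en can process. These models support much longer input texts, for example 8192 tokens for jina-embeddings-v2-base-en, or roughly ten standard pages of text. Text segments of this size are much less likely to have contextual dependencies that can only be resolved with a larger context. However, we still need vector representations of much smaller chunks of text, partly because of the limited input sizes of LLMs but primarily because of the limited information capacity of short embedding vectors.

The simple encoding approach chunks texts before processing them: it uses sentences, paragraphs, and maximum length limits to split the text a priori and then applies the embedding model to the resulting chunks. Late chunking instead first applies the transformer part of the embedding model to the entire text, or to the largest part of it possible. This produces a sequence of vector representations, one per token, each of which incorporates information from the entire text. To generate a single embedding for a text, many embedding models apply mean pooling to these token representations. Late chunking instead applies mean pooling to smaller segments of this sequence of token vectors, producing an embedding for each chunk that takes the entire text into account.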
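
The pooling step described above can be sketched without any model. This is a minimal illustration under stated assumptions, not the repository's chunked_pooling function: the token vectors are hypothetical two-dimensional values standing in for the transformer output over the whole text, and mean_pool and late_chunk_pool are helper names introduced here. Each chunk embedding is simply the mean of the token vectors inside its half-open token span.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Average a non-empty sequence of equal-length vectors component-wise."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def late_chunk_pool(
    token_vectors: Sequence[Vector], spans: Sequence[Tuple[int, int]]
) -> List[Vector]:
    """Mean-pool contextualized token vectors within each (start, end) span."""
    return [
        mean_pool(token_vectors[start:end])
        for start, end in spans
        if end > start  # skip empty spans
    ]


# Hypothetical token vectors; in late chunking these come from ONE forward
# pass over the whole text, so each already reflects the full context.
token_vectors = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 2.0]]
spans = [(0, 2), (2, 4)]  # two chunks of two tokens each

chunk_embeddings = late_chunk_pool(token_vectors, spans)
print(chunk_embeddings)  # [[2.0, 1.0], [1.0, 3.0]]
```

Traditional chunking would instead run the model once per chunk, so the token vectors of the second chunk would never see the first chunk. The pooling arithmetic is identical in both cases. Only the inputs to the pooling differ.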

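The effect of context-sensitive chunking

This has immediately measurable, concrete effects on retrieval. For example, when "the city" and "Berlin" both occur in a Wikipedia article, the vectors representing "the city" contain information connecting them to the earlier mention of "Berlin", which makes the chunk a much better match for queries involving that city name.

You can see this in the numerical results below, which compare the embedding of the string "Berlin" with several sentences from the article about Berlin. The "Traditional Similarity" column gives the similarity values with a priori chunking, and the "Late Chunking Similarity" column gives the values with context-sensitive chunking.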

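Text	Traditional Similarity	Late Chunking Similarity
Berlin is the capital and largest city of Germany, both by area and by population.	0.84862185	0.849546
Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits.	0.7084338	0.82489026
The city is also one of the states of Germany, and is the third smallest state in the country in terms of area.	0.7534553	0.84980094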

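As you can see, the similarity scores for the first chunk, which contains "Berlin", are very close to each other. For the other two chunks they differ markedly, because late chunking substantially improves the match for sentences that do not use the word "Berlin" explicitly but refer to it anaphorically.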

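Evaluation on retrieval tasks

To verify the effectiveness of this approach beyond a few toy examples, we tested it on some of the retrieval benchmarks from BeIR. These retrieval tasks consist of a query set, a corpus of text documents, and a QRels file that stores the IDs of the documents relevant to each query. To identify the relevant documents for a query, one can chunk the documents, encode the chunks into an embedding index, and determine the most similar chunks (kNN) for each query embedding. Since each chunk corresponds to a document, the kNN ranking of chunks can be converted into a kNN ranking of documents (for documents that occur multiple times in the ranking, only the first occurrence is kept). The resulting ranking can then be compared with the ground-truth QRels file to calculate retrieval metrics such as nDCG@10. We ran this evaluation on several BeIR datasets with traditional chunking and with our novel late chunking method. To split texts into chunks, we chose a straightforward method that chunks the texts into strings of 256 tokens. Both the traditional and the late chunking tests used the jina-embeddings-v2-small-en model.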
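
The conversion from a chunk ranking to a document ranking, followed by nDCG@10, can be sketched as below. This is an illustrative sketch rather than the repository's evaluation code: the chunk IDs, the chunk_to_doc mapping, and the relevance judgments are invented, and ndcg_at_k uses the common linear-gain formulation, which is an assumption about the exact metric variant.

```python
import math
from typing import Dict, List


def chunks_to_doc_ranking(
    chunk_ranking: List[str], chunk_to_doc: Dict[str, str]
) -> List[str]:
    """Collapse a ranked list of chunk IDs into a ranked list of document IDs,
    keeping only the first (best-ranked) occurrence of each document."""
    seen = set()
    doc_ranking = []
    for chunk_id in chunk_ranking:
        doc_id = chunk_to_doc[chunk_id]
        if doc_id not in seen:
            seen.add(doc_id)
            doc_ranking.append(doc_id)
    return doc_ranking


def ndcg_at_k(ranking: List[str], relevance: Dict[str, int], k: int = 10) -> float:
    """nDCG@k with linear gain: sum of rel / log2(position + 1), normalized
    by the score of the ideal ordering."""
    dcg = sum(
        relevance.get(doc_id, 0) / math.log2(position + 2)
        for position, doc_id in enumerate(ranking[:k])
    )
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(position + 2) for position, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical data: document d1 was split into two chunks.
chunk_to_doc = {"d1#0": "d1", "d1#1": "d1", "d2#0": "d2", "d3#0": "d3"}
chunk_ranking = ["d2#0", "d1#1", "d1#0", "d3#0"]  # kNN result for one query
qrels = {"d1": 1}  # only d1 is relevant to this query

doc_ranking = chunks_to_doc_ranking(chunk_ranking, chunk_to_doc)
score = ndcg_at_k(doc_ranking, qrels)
print(doc_ranking, round(score, 4))  # ['d2', 'd1', 'd3'] 0.6309
```

Because the only relevant document lands in second place, the score is 1 / log2(3). A chunk embedding that ranks "d1#1" first would raise it to 1.0.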

Dataset	Avg. document length (characters)	Traditional chunking (nDCG@10)	Late chunking (nDCG@10)	No chunking (nDCG@10)
SciFact	1498.4	64.20%	66.10%	63.89%
TRECCOVID	1116.7	63.36%	64.70%	65.18%
FiQA2018	767.2	33.25%	33.84%	33.43%
NFCorpus	1589.8	23.46%	29.98%	30.40%
Quora	62.2	87.19%	87.19%	87.19%

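Late chunking improved the score over traditional chunking on every dataset except Quora, where the very short documents (62.2 characters on average) presumably fit into a single chunk, so all three methods score identically. On SciFact and FiQA2018 it also outperformed encoding the whole document into a single embedding, while on TRECCOVID and NFCorpus no chunking performed best. Skipping chunking, however, is only an option when the chunks themselves do not need to be ranked. The results also suggest that greater average document length goes along with a larger nDCG improvement from late chunking.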

To reproduce the evaluation, install the dependencies with pip install . and run the following script for the tasks "SciFactChunked", "TRECCOVIDChunked", "FiQA2018Chunked", "NFCorpusChunked", and "QuoraChunked":

python3 run_chunked_eval.py --task-name {TASK_NAME}

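Acknowledgements and references

Thanks to Isabelle Mohr (@violenil) for contributing some code and to Scott Martens (@scott-martens) for reviewing the README.

More information about the evaluation tasks can be found in the MTEB repository, and details about training models for long input texts are in our paper "Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents".

If you find Late Chunking useful in your research, you can cite the paper Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models: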

@article{gunther2024late,
  title={Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models},
  author={G{\"u}nther, Michael and Mohr, Isabelle and Williams, Daniel J and Wang, Bo and Xiao, Han},
  journal={arXiv preprint arXiv:2409.04701},
  year={2024}
}

From： https://www.cnblogs.com/bonelee/p/18554421