chunking

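2024-11-20 Late chunking in RAG: experimental results (continued)
The previous post used Jina AI's v2 model; next, let's see how late chunking actually performs with v3. To keep things quick, I call the official API directly! # import requests # url = 'https://api.jina.ai/v1/embeddings' headers = {'Content-Type': 'application/json', 'Authorization': 'Bear…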
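2024-11-20 Late chunking in RAG: experimental results (continued, part 2)
To address the long-text problem in the earlier RAG tests, I added a long-text test (same code as before): context_test_documents = [ # Document 1: History of AI (about 2500 tokens) """The history of artificial intelligence dates back to the 1950s. In 1950, Turing proposed the famous "Turing test", which is regarded as the starting point of AI research. Over the following decades…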
2024-11-20 Late chunking in RAG: experimental results
Code: import os import json import torch import numpy as np import spacy from spacy.tokens import Doc from spacy.language import Language import transformers from transformers import AutoModel from transformers import AutoTokenizer def sentence_chunker…
2024-11-19 Late chunking source code analysis - https://github.com/jina-ai/late-chunking
import bisect import logging from typing import Dict, List, Optional, Tuple, Union from llama_index.core.node_parser import SemanticSplitterNodeParser from llama_index.core.schema import Document from llama_index.embeddings.huggingface import Hugging…
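2024-10-30 Meta-Chunking: a text segmentation technique for improving RAG performance
Although RAG holds promise for LLMs, the text chunking step is often overlooked, even though chunk quality directly affects performance on knowledge-intensive tasks. This paper proposes the concept of Meta-Chunking, a segmentation granularity between the sentence and the paragraph, which aims to improve the efficiency of text segmentation through logical awareness. It designs two LLM-based chunking strategies: Margin Sampling Chunking (MarginSam…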
2024-09-11 5 levels of text splitting
https://github.com/langchain-ai/langchain/blob/master/cookbook/Multi_modal_RAG.ipynb In this tutorial we are reviewing the 5 Levels Of Text Splitting. This is an unofficial list put together for fun and educational purposes. Ever try to put a long piece…
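2024-08-05 "Advanced RAG" 05: Exploring Semantic Chunking
Summary: The article first introduces where semantic chunking sits in RAG and what role it plays, and describes common rule-based chunking methods. It then explains that the goal of semantic chunking is to ensure that each chunk contains as much independent semantic information as possible. Next, it covers the principles and implementation of three semantic chunking methods, with a summary and evaluation of each. Key points: Semantic chunking is R…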
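The goal stated in the last excerpt, keeping each chunk semantically self-contained, can be illustrated with a minimal sketch. The excerpt does not say which three methods the post covers, so this is not a reproduction of any of them: it assumes one common embedding-similarity approach, where a new chunk starts whenever adjacent sentences are dissimilar. The bag-of-words `embed` function is a toy stand-in for a real embedding model, and the function names and the `threshold` value are hypothetical.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter on ., ! or ? followed by whitespace (an assumption of this sketch)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def embed(sentence):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(text, threshold=0.15):
    """Start a new chunk when adjacent sentences fall below the similarity threshold."""
    sentences = split_sentences(text)
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) < threshold:
            chunks.append([cur])      # topic shift: open a new chunk
        else:
            chunks[-1].append(cur)    # same topic: extend the current chunk
    return [" ".join(chunk) for chunk in chunks]


text = (
    "Cats purr when happy. Cats sleep most of the day. "
    "Rust compiles to native code. Rust code avoids data races."
)
chunks = semantic_chunks(text)
for chunk in chunks:
    print(chunk)
```

In practice `embed` would be replaced by a sentence-embedding model and the threshold tuned on real documents; the splitting loop itself stays the same.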