5 levels of text splitting

Date: 2024-09-11 18:15:31

https://github.com/langchain-ai/langchain/blob/master/cookbook/Multi_modal_RAG.ipynb

In this tutorial we are reviewing the 5 Levels Of Text Splitting. This is an unofficial list put together for fun and educational purposes.

Ever tried to paste a long piece of text into ChatGPT, only to be told it's too long? Or tried to give your application better long-term memory, and it still isn't quite working?

One of the most effective strategies for improving the performance of your language model applications is to split your large data into smaller pieces. This is called splitting or chunking (we'll use these terms interchangeably). In multi-modal applications, splitting also applies to images.
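
To make the idea concrete, here is a minimal sketch of the simplest form of chunking: cutting text into fixed-size character windows, with an optional overlap between neighbours. The function name `split_by_characters` and its parameters are illustrative assumptions, not part of LangChain or LlamaIndex.

```python
def split_by_characters(text: str, chunk_size: int = 35, chunk_overlap: int = 0) -> list[str]:
    """Cut text into fixed-size character windows, ignoring its structure.

    Hypothetical helper for illustration only; not a library API.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text,
        # so we do not emit a trailing chunk made only of overlap.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


if __name__ == "__main__":
    for chunk in split_by_characters("abcdefghij", chunk_size=4, chunk_overlap=2):
        print(repr(chunk))
```

The windows ignore word and sentence boundaries entirely, which is why the later levels exist.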

We are going to cover a lot, but if you make it to the end, I guarantee you’ll have a solid grasp on chunking theory, strategies, and resources to learn more.

Levels Of Text Splitting

Level 1: Character Splitting - Simple static character chunks of data
Level 2: Recursive Character Text Splitting - Recursive chunking based on a list of separators
Level 3: Document Specific Splitting - Various chunking methods for different document types (PDF, Python, Markdown)
Level 4: Semantic Splitting - Embedding-walk-based chunking
Level 5: Agentic Splitting - Experimental method of splitting text with an agent-like system. A good fit if you believe that token costs will trend toward $0.00
*Bonus Level:* Alternative Representation Chunking + Indexing - Derivative representations of your raw text that will aid in retrieval and indexing
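
As a rough illustration of the Level 2 idea (recursive chunking based on a list of separators), the sketch below tries the coarsest separator first and only falls back to finer ones for pieces that are still too large. It is a simplified sketch under stated assumptions, not the LangChain implementation: the name `recursive_split`, the default separator list, and the greedy merge rule are all choices made for this example.

```python
def recursive_split(text: str, chunk_size: int,
                    separators: tuple[str, ...] = ("\n\n", "\n", " ")) -> list[str]:
    """Split text on a prioritized list of separators (hypothetical helper)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard character cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # greedily merge small pieces together
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too large: retry with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = ("First paragraph here.\n\n"
              "Second one is a bit longer than the first.\n\n"
              "Third.")
    for chunk in recursive_split(sample, chunk_size=30):
        print(repr(chunk))
```

Paragraphs that fit stay whole, and only the oversized paragraph is broken at word boundaries, so chunks tend to follow the text's natural structure more closely than fixed character windows do.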

Notebook resources:

Video Overview - Walkthrough of this code with commentary
ChunkViz.com - Visual representation of chunk splitting methods
RAGAS - Retrieval evaluation framework

This tutorial was created with ❤️ by Greg Kamradt. MIT license, attribution is always welcome.

This tutorial uses code from LangChain (pip install langchain) and LlamaIndex (pip install llama-index).

From: https://www.cnblogs.com/hansjorn/p/18408676
