首页 > 其他分享 >【langchain】在单个文档知识源的上下文中使用langchain对GPT4All运行查询

【langchain】在单个文档知识源的上下文中使用langchain对GPT4All运行查询

时间:2024-01-18 11:31:28浏览次数:38  
标签:index query Justice langchain 文档 time GPT4All 上下文


In the previous post, Running GPT4All On a Mac Using Python langchain in a Jupyter Notebook, 我发布了一个简单的演练,让GPT4All使用langchain在2015年年中的16GB Macbook Pro上本地运行。在这篇文章中,我将提供一个简单的食谱,展示我们如何运行一个查询,该查询通过从单个基于文档的已知源检索的上下文进行扩展。

I’ve updated the previously shared notebook here to include the following…

基于文档的知识源支持的示例查询

使用langchain文档中的示例进行示例文档查询。

其想法是对文档源运行查询,以检索一些相关的上下文,并将其用作提示上下文的一部分。

#https://python.langchain.com/en/latest/use_cases/question_answering.html

template = """

Question: {question}

Answer: 

"""
 
prompt = PromptTemplate(template=template, input_variables=["question"])
llm_chain = LLMChain(prompt=prompt, llm=llm)

天真的提示给出了一个无关紧要的答案:


%%time
query = "What did the president say about Ketanji Brown Jackson"
llm_chain.run(question)
 
CPU times: user 58.3 s, sys: 3.59 s, total: 1min 1s
Wall time: 9.75 s
 
'\nAnswer: The Pittsburgh Steelers'
 
Now let’s try with a source document.
 
#!wget https://raw.githubusercontent.com/hwchase17/langchainjs/main/examples/state_of_the_union.txt
 
from langchain.document_loaders import TextLoader

# Ideally....
loader = TextLoader('./state_of_the_union.txt')


然而,创建嵌入非常慢,所以我将使用文本片段:


#ish via chatgpt...
def search_context(src, phrase, buffer=100):
  with open(src, 'r') as f:
    txt=f.read()
    words = txt.split()
    index = words.index(phrase)
    start_index = max(0, index - buffer)
    end_index = min(len(words), index + buffer+1)
    return ' '.join(words[start_index:end_index])

fragment = './fragment.txt'
with open(fragment, 'w') as fo:
    _txt = search_context('./state_of_the_union.txt', "Ketanji")
    fo.write(_txt)
 
!cat $fragment

Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence. A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. And if we are to advance liberty and justice, we need to secure the Border and fix the immigration system. We can do both. At our border, we’ve installed new technology like cutting-edge
 
loader = TextLoader('./fragment.txt')


从知识源文本生成索引:

#%pip install chromadb
from langchain.indexes import VectorstoreIndexCreator
 
%time
# Time: ~0.5s per token
# NOTE: "You must specify a persist_directory oncreation to persist the collection."
# TO DO: How do we load in an already generated and persisted index?
index = VectorstoreIndexCreator(embedding=llama_embeddings,
                                vectorstore_kwargs={"persist_directory": "db"}
                               ).from_loaders([loader])
 
Using embedded DuckDB with persistence: data will be stored in: db

CPU times: user 2 µs, sys: 2 µs, total: 4 µs
Wall time: 7.87 µs
 
%time
pass

# The following errors...
#index.query(query, llm=llm)
 
# With the full SOTU text, I got:
# Error: llama_tokenize: too many tokens;
# Also occasionally getting:
# ValueError: Requested tokens exceed context window of 512

# If we do get passed that,
# NotEnoughElementsException

# For the latter, somehow need to set something like search_kwargs={"k": 1}


在默认情况下,检索器似乎期望4个结果文档。我看不出如何通过下限(在这种情况下,可以接受单个响应文档),所以我们必须滚动自己的链…


%%time

# Roll our own....

#https://github.com/hwchase17/langchain/issues/2255
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA

documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

# Again, we should persist the db and figure out how to reuse it
docsearch = Chroma.from_documents(texts, llama_embeddings)


在没有持久性的情况下使用嵌入式DuckDB:数据将是瞬态的 CPU times: user 5min 59s, sys: 1.62 s, total: 6min 1s Wall time: 49.2 s


%%time

# Just getting a single result document from the knowledge lookup is fine...
MIN_DOCS = 1

qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff",
                                 retriever=docsearch.as_retriever(search_kwargs={"k": MIN_DOCS}))
 
CPU times: user 861 µs, sys: 2.97 ms, total: 3.83 ms
Wall time: 7.09 ms


现在在知识源的上下文中运行我们的查询怎么样?


%%time

print(query)

qa.run(query)
 
What did the president say about Ketanji Brown Jackson
CPU times: user 7min 39s, sys: 2.59 s, total: 7min 42s
Wall time: 1min 6s

' The president honored Justice Stephen Breyer and acknowledged his service to this country before introducing Justice Ketanji Brown Jackson, who will be serving as the newest judge on the United States Court of Appeals for the District of Columbia Circuit.'


更精确的查询如何?

%%time
query = "Identify three things the president said about Ketanji Brown Jackson"

qa.run(query)
 
CPU times: user 10min 20s, sys: 4.2 s, total: 10min 24s
Wall time: 1min 35s

' The president said that she was nominated by Barack Obama to become the first African American woman to sit on the United States Court of Appeals for the District of Columbia Circuit. He also mentioned that she was an Army veteran, a Constitutional scholar, and is retiring Justice of the United States Supreme Court.'

嗯……我们是否正在进行对话,并了解以前的输出?在之前的尝试中,我似乎确实得到了非常相关的答案……我们是否可能得到了几个以上的结果文档,并选择了不太好的结果文档?或者,模型在检索内容上是否有遗漏?我们是否可以查看knowledge查找中的示例结果文档,以帮助了解发生了什么?

让我们看看是否可以格式化响应

…
 
%%time

query = """
Identify three things the president said about Ketanji Brown Jackson. Provide the answer in the form: 

- ITEM 1
- ITEM 2
- ITEM 3
"""

qa.run(query)
 
CPU times: user 12min 31s, sys: 4.24 s, total: 12min 35s
Wall time: 1min 45s

"\n\nITEM 1: President Trump honored Justice Breyer for his service to this country, but did not specifically mention Ketanji Brown Jackson.\n\nITEM 2: The president did not identify any specific characteristics about Justice Breyer that would be useful in identifying her.\n\nITEM 3: The president did not make any reference to Justice Breyer's current or past judicial rulings or cases during his speech."

自我介绍

  • 做一个简单介绍,酒研年近48 ,有20多年IT工作经历,目前在一家500强做企业架构.因为工作需要,另外也因为兴趣涉猎比较广,为了自己学习建立了三个博客,分别是【全球IT瞭望】,【架构师研究会】和【开发者开聊】,有更多的内容分享,谢谢大家收藏。
  • 企业架构师需要比较广泛的知识面,了解一个企业的整体的业务,应用,技术,数据,治理和合规。之前4年主要负责企业整体的技术规划,标准的建立和项目治理。最近一年主要负责数据,涉及到数据平台,数据战略,数据分析,数据建模,数据治理,还涉及到数据主权,隐私保护和数据经济。 因为需要,比如数据资源入财务报表,另外数据如何估值和货币化需要财务和金融方面的知识,最近在学习财务,金融和法律。打算先备考CPA,然后CFA,如果可能也想学习法律,备战律考。
  • 欢迎爱学习的同学朋友关注,也欢迎大家交流。全网同号【架构师研究会】

【langchain】在单个文档知识源的上下文中使用langchain对GPT4All运行查询_架构师

欢迎收藏  【全球IT瞭望】,【架构师酒馆】和【开发者开聊】.

标签:index,query,Justice,langchain,文档,time,GPT4All,上下文
From: https://blog.51cto.com/jiagoushipro/9305114

相关文章

  • Mygin实现上下文
    本篇是Mygin的第三篇目的将路由独立出来,方便后续扩展修改上下文Context,对http.ResponseWriter和http.Request进行封装,实现对JSON、HTML等的支持路由新建一个router文件,将Mygin实现简单的路由中将路由部分复制出来新建Mygin/router.gopackagemyginimport( "log"......
  • 机器学习周刊第六期:哈佛大学机器学习课、Chatbot Ul 2.0 、LangChain v0.1.0、Mixtral
    ---date:2024/01/08---吴恩达和Langchain合作开发了JavaScript生成式AI短期课程:《使用LangChain.js构建LLM应用程序》大家好,欢迎收看第六期机器学习周刊本期介绍10个内容,涉及Python、机器学习、大模型等,目录如下:1、哈佛大学机器学习课2、第一个JavaScript生成......
  • 基于langchain和文心一言的检索增强生成(RAG)初级实验
    一、什么是RAG?RAG的架构如图中所示,简单来讲,RAG就是通过检索获取相关的知识并将其融入Prompt,让大模型能够参考相应的知识从而给出合理回答。因此,可以将RAG的核心理解为“检索+生成”,前者主要是利用向量数据库的高效存储和检索能力,召回目标知识;后者则是利用大模型和Prompt工程,将召回......
  • 开发篇1:使用原生api和Langchain调用大模型
    对大模型的调用通常有以下几种方式:方式一、大模型厂商都会定义http风格的请求接口,在代码中可以直接发起http请求调用;方式二、在开发环境中使用大模型厂商提供的api;方式三、使用开发框架Langchain调用,这个就像java对数据库的调用一样,可以直接用jdbc也可以使用第三方框架,第三方框架......
  • 从 ECMAScript 6 角度谈谈执行上下文
    大家好,我是归思君起因是最近了解JS执行上下文的时候,发现很多书籍和资料,包括《JavaScript高级程序设计》、《JavaScript权威指南》和网上的一些博客专栏,都是从ES3角度来谈执行上下文,用ES6规范解读的比较少,所以想从ES6的角度看一下执行上下文。下面我尝试用ECMAScript6规范文......
  • 【虹科分享】用Redis为LangChain定制AI代理——OpenGPTs
    penAI最近推出了OpenAIGPTs——一个构建定制化AI代理的无代码“应用商店”,随后LangChain开发了类似的开源工具OpenGPTs。OpenGPTs是一款低代码的开源框架,专用于构建定制化的人工智能代理。因为Redis具有高速和稳定性的优点,所以LangChain选择了Redis来作为OpenGPTs的默认向量数据......
  • js 执行上下文与作用域
    执行上下文(以下简称“上下文”)的概念在JavaScript中是颇为重要的。变量或函数的上下文决定了它们可以访问哪些数据,以及它们的行为。每个上下文都有一个关联的变量对象(variableobject),而这个上下文中定义的所有变量和函数都存在于这个对象上。全局上下文是最外层的上下文。......
  • Flask-apscheduler获取或修改上下文中config数据
    __init__.py 1fromflask_apschedulerimportAPScheduler2...34scheduler=APScheduler()567fromappimportapp891011121314defcreate_app(config_name):15app=Flask(__name__)1617CORS(app,resources=r'/*......
  • LangChain第一个稳定版本重磅发布
    LangChain的第一个稳定版本,即LangChain0.1.0,于2024年1月8日正式发布,这是一个值得庆祝的里程碑,也是LangChain项目的一个新的起点。在这个版本中,LangChain引入了许多重要的特性、架构变化、版本规范和发展方向,为LLM应用的开发者和用户带来了更好的体验和性能。在本文......
  • JavaScript的闭包、执行上下文、到底是怎么回事?还有必要学吗?
    在上一课,我们了解了JavaScript执行中最粗粒度的任务:传给引擎执行的代码段。并且,我们还根据“由JavaScript引擎发起”还是“由宿主发起”,分成了宏观任务和微观任务,接下来我们继续去看一看更细的执行粒度。一段JavaScript代码可能会包含函数调用的相关内容,从今天开始,我们就用两......