
LangChain 0.2 - Building a Local RAG Application

Date: 2024-06-02 14:00:10
Tags: RAG, llama, 0.2, LangChain, LLM, per, ms, time, print

This article is translated and adapted from: Build a Local RAG Application
https://python.langchain.com/v0.2/docs/tutorials/local_rag/




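I. Project Overview

The popularity of projects such as PrivateGPT, llama.cpp, GPT4All, and llamafile underscores the importance of running LLMs locally.

LangChain integrates with many open-source LLMs that can be run locally.

See the LangChain documentation for setup instructions for these LLMs.

For example, here we show how to run GPT4All or LLaMA2 locally (e.g., on your laptop) using local embeddings and a local LLM.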


II. Document Loading

First, install the packages needed for local embeddings and vector storage.

pip install --upgrade --quiet  langchain langchain-community langchainhub gpt4all langchain-chroma 

Load and split an example document.

We will use a blog post about agents as the example.

from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
data = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
all_splits = text_splitter.split_documents(data)

API Reference: WebBaseLoader | RecursiveCharacterTextSplitter
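To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a minimal standard-library sketch of fixed-size character chunking. It is an illustration only: `split_fixed` is a hypothetical helper, and `RecursiveCharacterTextSplitter` is more careful than this (it tries to split on separators such as paragraph and line breaks before falling back to raw character counts).

```python
def split_fixed(text: str, chunk_size: int = 500, chunk_overlap: int = 0) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters. This is a simplified
    stand-in for a real text splitter, not LangChain's algorithm.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text,
        # so we never emit a trailing chunk that is fully covered by the last one.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks
```

With `chunk_overlap=0`, as in the tutorial, the chunks concatenate back to the original text. A non-zero overlap repeats the tail of each chunk at the head of the next, which can help when an answer straddles a chunk boundary.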

Next, the step below downloads the GPT4All embeddings locally (if you don't already have them).

from langchain_chroma import Chroma
from langchain_community.embeddings import GPT4AllEmbeddings

vectorstore = Chroma.from_documents(documents=all_splits, embedding=GPT4AllEmbeddings())

API Reference: GPT4AllEmbeddings

Test that similarity search works with our local embeddings.

question = "What are the approaches to Task Decomposition?"
docs = vectorstore.similarity_search(question)
len(docs)
#  -> 4

docs[0]

Document(page_content='Task decomposition can be done (1) by LLM with simple prompting like "Steps for XYZ.\\n1.", "What are the subgoals for achieving XYZ?", (2) by using task-specific instructions; e.g. "Write a story outline." for writing a novel, or (3) with human inputs.', metadata={'description': 'Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.\nAgent System Overview In a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:', 'language': 'en', 'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/', 'title': "LLM Powered Autonomous Agents | Lil'Log"})
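Conceptually, a similarity search embeds the question and returns the stored chunks whose vectors are closest to it. The sketch below shows that idea with plain Python lists and cosine similarity. The toy vectors and the `top_k` helper are assumptions for illustration; the distance metric and indexing that Chroma actually uses depend on its configuration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int = 4) -> list[str]:
    """Return the ids of the k stored vectors most similar to query_vec."""
    scored = sorted(
        ((cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()),
        reverse=True,
    )
    return [doc_id for _, doc_id in scored[:k]]
```

The default of `k=4` mirrors the four documents returned by `similarity_search` above.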

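III. Models


1. LLaMA2

Note: newer versions of llama-cpp-python use GGUF model files.

If you have an existing GGML model, it needs to be converted to GGUF; the original tutorial links to conversion instructions.

Alternatively, you can download a model that has already been converted to GGUF.

Finally, install llama-cpp-python following its installation instructions.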

%pip install --upgrade --quiet  llama-cpp-python
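Because newer llama-cpp-python versions expect GGUF files, a quick format check before loading a model can save a confusing error. The sketch below relies on the fact that GGUF files begin with the four ASCII bytes `GGUF`; `looks_like_gguf` is a hypothetical helper, not part of llama-cpp-python or LangChain, and it only inspects the header rather than validating the whole file.

```python
from pathlib import Path

# GGUF files start with this 4-byte magic value.
GGUF_MAGIC = b"GGUF"


def looks_like_gguf(path: str) -> bool:
    """Return True if the file exists and starts with the GGUF magic bytes."""
    model_file = Path(path)
    if not model_file.is_file():
        return False
    with model_file.open("rb") as handle:
        return handle.read(len(GGUF_MAGIC)) == GGUF_MAGIC
```

A `False` result for an older `.bin` file is a hint that it is still in a legacy format and needs conversion.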

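To enable GPU use on Apple Silicon, follow the llama-cpp-python steps for installing the Python bindings with Metal support.

In particular, make sure conda is using the correct virtual environment that you created (miniforge3).

For example, for me: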

conda activate /Users/rlm/miniforge3/envs/llama

Once that is confirmed:

! CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 /Users/rlm/miniforge3/envs/llama/bin/pip install -U llama-cpp-python --no-cache-dir

from langchain_community.llms import LlamaCpp

API Reference: LlamaCpp

Set the model parameters as described in the llama.cpp documentation.

n_gpu_layers = 1  # Metal set to 1 is enough.
n_batch = 512  # Should be between 1 and n_ctx, consider the amount of RAM of your Apple Silicon Chip.

# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path="/Users/rlm/Desktop/Code/llama.cpp/models/llama-2-13b-chat.ggufv3.q4_0.bin",
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    n_ctx=2048,
    f16_kv=True,  # MUST set to True, otherwise you will run into problem after a couple of calls
    verbose=True,
)

Note that the following lines indicate that Metal was enabled properly:

ggml_metal_init: allocating
ggml_metal_init: using MPS

llm.invoke("Simulate a rap battle between Stephen Colbert and John Oliver")

Llama.generate: prefix-match hit

by jonathan

Here’s the hypothetical rap battle:

[Stephen Colbert]: Yo, this is Stephen Colbert, known for my comedy show. I’m here to put some sense in your mind, like an enema do-go. Your opponent? A man of laughter and witty quips, John Oliver! Now let’s see who gets the most laughs while taking shots at each other

[John Oliver]: Yo, this is John Oliver, known for my own comedy show. I’m here to take your mind on an adventure through wit and humor. But first, allow me to you to our contestant: Stephen Colbert! His show has been around since the '90s, but it’s time to see who can out-rap whom

[Stephen Colbert]: You claim to be a witty man, John Oliver, with your British charm and clever remarks. But my knows that I’m America’s funnyman! Who’s the one taking you? Nobody!

[John Oliver]: Hey Stephen Colbert, don’t get too cocky. You may


llama_print_timings:        load time =  4481.74 ms
llama_print_timings:      sample time =   183.05 ms /   256 runs   (    0.72 ms per token,  1398.53 tokens per second)
llama_print_timings: prompt eval time =   456.05 ms /    13 tokens (   35.08 ms per token,    28.51 tokens per second)
llama_print_timings:        eval time =  7375.20 ms /   255 runs   (   28.92 ms per token,    34.58 tokens per second)
llama_print_timings:       total time =  8388.92 ms

"by jonathan \n\nHere's the hypothetical rap battle:\n\n[Stephen Colbert]: Yo, this is Stephen Colbert, known for my comedy show. I'm here to put some sense in your mind, like an enema do-go. Your opponent? A man of laughter and witty quips, John Oliver! Now let's see who gets the most laughs while taking shots at each other\n\n[John Oliver]: Yo, this is John Oliver, known for my own comedy show. I'm here to take your mind on an adventure through wit and humor. But first, allow me to you to our contestant: Stephen Colbert! His show has been around since the '90s, but it's time to see who can out-rap whom\n\n[Stephen Colbert]: You claim to be a witty man, John Oliver, with your British charm and clever remarks. But my knows that I'm America's funnyman! Who's the one taking you? Nobody!\n\n[John Oliver]: Hey Stephen Colbert, don't get too cocky. You may"

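2. GPT4All

Similarly, we can use GPT4All.

Download the GPT4All model binary.

The model explorer on the GPT4All website is a good way to choose and download a model.

Then, specify the path that you downloaded it to.

For example, for me, the model lives here:

/Users/rlm/Desktop/Code/gpt4all/models/nous-hermes-13b.ggmlv3.q4_0.bin

from langchain_community.llms import GPT4All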

gpt4all = GPT4All(
    model="/Users/rlm/Desktop/Code/gpt4all/models/nous-hermes-13b.ggmlv3.q4_0.bin",
    max_tokens=2048,
)

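API Reference: GPT4All


3. llamafile

One of the simplest ways to run an LLM locally is to use a llamafile. All you need to do is:

1) Download a llamafile from HuggingFace

2) Make the file executable

3) Run the file

llamafiles bundle model weights and a specially compiled version of llama.cpp into a single file that can run on most computers without any additional dependencies. They also come with an embedded inference server that provides an API for interacting with your model.

Here is a simple bash script that shows all 3 setup steps: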

# Download a llamafile from HuggingFace
wget https://huggingface.co/jartine/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile

# Make the file executable. On Windows, instead just rename the file to end in ".exe".
chmod +x TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile

# Start the model server. Listens at http://localhost:8080 by default.
./TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile --server --nobrowser

After running the setup steps above, you can interact with the model through LangChain:

from langchain_community.llms.llamafile import Llamafile

llamafile = Llamafile()

llamafile.invoke("Here is my grandmother's beloved recipe for spaghetti and meatballs:")

API Reference: Llamafile

'\n-1 1/2 (8 oz. Pounds) ground beef, browned and cooked until no longer pink\n-3 cups whole wheat spaghetti\n-4 (10 oz) cans diced tomatoes with garlic and basil\n-2 eggs, beaten\n-1 cup grated parmesan cheese\n-1/2 teaspoon salt\n-1/4 teaspoon black pepper\n-1 cup breadcrumbs (16 oz)\n-2 tablespoons olive oil\n\nInstructions:\n1. Cook spaghetti according to package directions. Drain and set aside.\n2. In a large skillet, brown ground beef over medium heat until no longer pink. Drain any excess grease.\n3. Stir in diced tomatoes with garlic and basil, and season with salt and pepper. Cook for 5 to 7 minutes or until sauce is heated through. Set aside.\n4. In a large bowl, beat eggs with a fork or whisk until fluffy. Add cheese, salt, and black pepper. Set aside.\n5. In another bowl, combine breadcrumbs and olive oil. Dip each spaghetti into the egg mixture and then coat in the breadcrumb mixture. Place on baking sheet lined with parchment paper to prevent sticking. Repeat until all spaghetti are coated.\n6. Heat oven to 375 degrees. Bake for 18 to 20 minutes, or until lightly golden brown.\n7. Serve hot with meatballs and sauce on the side. Enjoy!'
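The embedded inference server can also be reached without LangChain. The sketch below builds a plain HTTP request with the standard library. It assumes the server exposes the llama.cpp-style `/completion` endpoint, which accepts a JSON body with `prompt` and `n_predict` and returns the generated text under `content`; treat the endpoint and field names as assumptions and check them against the server version you are running. The helper names are hypothetical.

```python
import json
import urllib.request


def build_completion_request(
    prompt: str,
    base_url: str = "http://localhost:8080",  # default address from the script above
    n_predict: int = 128,
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the assumed /completion endpoint."""
    payload = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/completion",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, **kwargs) -> str:
    """Send the request to a running llamafile server and return the generated text."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.loads(response.read().decode("utf-8"))
    # "content" is the assumed name of the field holding the completion.
    return body.get("content", "")
```

Calling `complete("Here is my grandmother's beloved recipe for spaghetti and meatballs:")` requires the server from step 3 to be running; building the request alone does not touch the network.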

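IV. Using in a Chain

We can create a summarization chain with either model by passing in the retrieved docs and a simple prompt.

The chain formats the prompt template with the input key values provided and passes the formatted string to GPT4All, Llama-V2, or another specified LLM.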

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate

# Prompt
prompt = PromptTemplate.from_template(
    "Summarize the main themes in these retrieved docs: {docs}"
)


# Chain
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


chain = {"docs": format_docs} | prompt | llm | StrOutputParser()

# Run
question = "What are the approaches to Task Decomposition?"
docs = vectorstore.similarity_search(question)
chain.invoke(docs)

API Reference: StrOutputParser | PromptTemplate


Llama.generate: prefix-match hit

Based on the retrieved documents, the main themes are:

  1. Task decomposition: The ability to break down complex tasks into smaller subtasks, which can be handled by an LLM or other components of the agent system.
  2. LLM as the core controller: The use of a large language model (LLM) as the primary controller of an autonomous agent system, complemented by other key components such as a knowledge graph and a planner.
  3. Potentiality of LLM: The idea that LLMs have the potential to be used as powerful general problem solvers, not just for generating well-written copies but also for solving complex tasks and achieving human-like intelligence.
  4. Challenges in long-term planning: The challenges in planning over a lengthy history and effectively exploring the solution space, which are important limitations of current LLM-based autonomous agent systems.

llama_print_timings:        load time =  1191.88 ms
llama_print_timings:      sample time =   134.47 ms /   193 runs   (    0.70 ms per token,  1435.25 tokens per second)
llama_print_timings: prompt eval time = 39470.18 ms /  1055 tokens (   37.41 ms per token,    26.73 tokens per second)
llama_print_timings:        eval time =  8090.85 ms /   192 runs   (   42.14 ms per token,    23.73 tokens per second)
llama_print_timings:       total time = 47943.12 ms

'\nBased on the retrieved documents, the main themes are:\n1. Task decomposition: The ability to break down complex tasks into smaller subtasks, which can be handled by an LLM or other components of the agent system.\n2. LLM as the core controller: The use of a large language model (LLM) as the primary controller of an autonomous agent system, complemented by other key components such as a knowledge graph and a planner.\n3. Potentiality of LLM: The idea that LLMs have the potential to be used as powerful general problem solvers, not just for generating well-written copies but also for solving complex tasks and achieving human-like intelligence.\n4. Challenges in long-term planning: The challenges in planning over a lengthy history and effectively exploring the solution space, which are important limitations of current LLM-based autonomous agent systems.'
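The `|` pipeline above amounts to function composition: each step receives the previous step's output. The standard-library sketch below mimics that flow so the data shapes are easy to see. `Doc`, `pipe`, and `fake_llm` are hypothetical stand-ins (no model is called), not LangChain APIs.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str


def format_docs(docs):
    # Same joining rule as the tutorial's format_docs.
    return "\n\n".join(doc.page_content for doc in docs)


def fill_prompt(inputs: dict) -> str:
    # Stand-in for PromptTemplate: substitute the {docs} key into the template.
    template = "Summarize the main themes in these retrieved docs: {docs}"
    return template.format(**inputs)


def fake_llm(prompt: str) -> str:
    # Stand-in for a local model: reports the prompt size instead of generating text.
    return f"  SUMMARY OF {len(prompt)} CHARS  "


def pipe(*steps):
    """Compose steps left to right, like chaining runnables with |."""
    return lambda value: reduce(lambda acc, step: step(acc), steps, value)


# docs -> {"docs": str} -> prompt string -> raw model text -> cleaned string
chain = pipe(lambda docs: {"docs": format_docs(docs)}, fill_prompt, fake_llm, str.strip)
```

Calling `chain([Doc("first chunk"), Doc("second chunk")])` walks a list of documents through the same four stages as the real chain, with `str.strip` playing the role of the output parser.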

V. Q&A

We can also use the LangChain Prompt Hub to store and fetch model-specific prompts.

Let's try the default RAG prompt, rlm/rag-prompt.

from langchain import hub

rag_prompt = hub.pull("rlm/rag-prompt")
rag_prompt.messages

[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], template="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: {question} \nContext: {context} \nAnswer:"))]

from langchain_core.runnables import RunnablePassthrough, RunnablePick

# Chain
chain = (
    RunnablePassthrough.assign(context=RunnablePick("context") | format_docs)
    | rag_prompt
    | llm
    | StrOutputParser()
)

# Run
chain.invoke({"context": docs, "question": question})

API Reference: RunnablePassthrough | RunnablePick

Llama.generate: prefix-match hit

Task can be done by down a task into smaller subtasks, using simple prompting like “Steps for XYZ.” or task-specific like “Write a story outline” for writing a novel.


llama_print_timings:        load time = 11326.20 ms
llama_print_timings:      sample time =    33.03 ms /    47 runs   (    0.70 ms per token,  1422.86 tokens per second)
llama_print_timings: prompt eval time =  1387.31 ms /   242 tokens (    5.73 ms per token,   174.44 tokens per second)
llama_print_timings:        eval time =  1321.62 ms /    46 runs   (   28.73 ms per token,    34.81 tokens per second)
llama_print_timings:       total time =  2801.08 ms

{'output_text': '\nTask can be done by down a task into smaller subtasks, using simple prompting like "Steps for XYZ." or task-specific like "Write a story outline" for writing a novel.'}
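The first step of this chain, `RunnablePassthrough.assign(context=RunnablePick("context") | format_docs)`, takes the input dict, replaces its `context` value with the formatted text, and passes every other key through unchanged. A minimal plain-Python sketch of that behaviour follows; `pick`, `then`, `assign`, and `join_texts` are hypothetical helpers that use plain strings in place of Document objects.

```python
def pick(key):
    """Return a function that extracts one key from a dict."""
    return lambda data: data[key]


def then(first, second):
    """Compose two functions: run first, feed its result to second."""
    return lambda value: second(first(value))


def assign(**computed):
    """Return a function that copies the input dict and adds or overwrites
    the given keys, each computed from the full input dict."""
    def run(data: dict) -> dict:
        return {**data, **{key: fn(data) for key, fn in computed.items()}}
    return run


def join_texts(texts):
    # Plays the role of format_docs for plain strings.
    return "\n\n".join(texts)


prepare_inputs = assign(context=then(pick("context"), join_texts))
```

The result still contains `question`, which is why the prompt that follows can fill both of its input variables.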

Now, let's try a prompt written specifically for LLaMA, which includes special tokens.

# Prompt
rag_prompt_llama = hub.pull("rlm/rag-prompt-llama")
rag_prompt_llama.messages

ChatPromptTemplate(input_variables=['question', 'context'], output_parser=None, partial_variables={}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['question', 'context'], output_parser=None, partial_variables={}, template="[INST]<<SYS>> You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.<</SYS>> \nQuestion: {question} \nContext: {context} \nAnswer: [/INST]", template_format='f-string', validate_template=True), additional_kwargs={})])

# Chain
chain = (
    RunnablePassthrough.assign(context=RunnablePick("context") | format_docs)
    | rag_prompt_llama
    | llm
    | StrOutputParser()
)

# Run
chain.invoke({"context": docs, "question": question})

Llama.generate: prefix-match hit

Sure, I’d be happy to help! Based on the context, here are some to task:

  1. LLM with simple prompting: This using a large model (LLM) with simple prompts like “Steps for XYZ” or “What are the subgoals for achieving XYZ?” to decompose tasks into smaller steps.
  2. Task-specific: Another is to use task-specific, such as “Write a story outline” for writing a novel, to guide the of tasks.
  3. Human inputs:, human inputs can be used to supplement the process, in cases where the task a high degree of creativity or expertise.

As fores in long-term and task, one major is that LLMs to adjust plans when faced with errors, making them less robust to humans who learn from trial and error.


llama_print_timings:        load time = 11326.20 ms
llama_print_timings:      sample time =   144.81 ms /   207 runs   (    0.70 ms per token,  1429.47 tokens per second)
llama_print_timings: prompt eval time =  1506.13 ms /   258 tokens (    5.84 ms per token,   171.30 tokens per second)
llama_print_timings:        eval time =  6231.92 ms /   206 runs   (   30.25 ms per token,    33.06 tokens per second)
llama_print_timings:       total time =  8158.41 ms

{'output_text': '  Sure, I\'d be happy to help! Based on the context, here are some to task:\n\n1. LLM with simple prompting: This using a large model (LLM) with simple prompts like "Steps for XYZ" or "What are the subgoals for achieving XYZ?" to decompose tasks into smaller steps.\n2. Task-specific: Another is to use task-specific, such as "Write a story outline" for writing a novel, to guide the of tasks.\n3. Human inputs:, human inputs can be used to supplement the process, in cases where the task a high degree of creativity or expertise.\n\nAs fores in long-term and task, one major is that LLMs to adjust plans when faced with errors, making them less robust to humans who learn from trial and error.'}
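The special tokens in this prompt are the `[INST] ... [/INST]` and `<<SYS>> ... <</SYS>>` markers visible in the template above: the system instructions sit inside the `<<SYS>>` pair and the whole turn is wrapped in `[INST]`. The sketch below fills that same template string with plain `str.format`, so the final text sent to the model can be inspected without pulling from the hub. `build_llama2_prompt` is a hypothetical helper.

```python
# Template text copied from the rlm/rag-prompt-llama output shown above.
LLAMA2_RAG_TEMPLATE = (
    "[INST]<<SYS>> You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer the question. "
    "If you don't know the answer, just say that you don't know. "
    "Use three sentences maximum and keep the answer concise.<</SYS>> \n"
    "Question: {question} \n"
    "Context: {context} \n"
    "Answer: [/INST]"
)


def build_llama2_prompt(question: str, context: str) -> str:
    """Fill the LLaMA 2 style RAG template with a question and retrieved context."""
    return LLAMA2_RAG_TEMPLATE.format(question=question, context=context)
```

Printing the result for a short question and context makes it easy to confirm that the instruction markers open and close in the expected places.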

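VI. Q&A with Retrieval

Instead of passing in documents manually, we can retrieve them automatically from the vector store based on the user's question.

This uses the default QA prompt (the rlm/rag-prompt pulled earlier) and retrieves from the vector store.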

retriever = vectorstore.as_retriever()
qa_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
    | llm
    | StrOutputParser()
)

qa_chain.invoke(question)


Llama.generate: prefix-match hit

Sure! Based on the context, here’s my answer to your:

There are several to task,:

  1. LLM-based with simple prompting, such as “Steps for XYZ” or “What are the subgoals for achieving XYZ?”
  2. Task-specific, like “Write a story outline” for writing a novel.
  3. Human inputs to guide the process.

These can be used to decompose complex tasks into smaller, more manageable subtasks, which can help improve the and effectiveness of task. However, long-term and task can being due to the need to plan over a lengthy history and explore the space., LLMs may to adjust plans when faced with errors, making them less robust to human learners who can learn from trial and error.


llama_print_timings:        load time = 11326.20 ms
llama_print_timings:      sample time =   139.20 ms /   200 runs   (    0.70 ms per token,  1436.76 tokens per second)
llama_print_timings: prompt eval time =  1532.26 ms /   258 tokens (    5.94 ms per token,   168.38 tokens per second)
llama_print_timings:        eval time =  5977.62 ms /   199 runs   (   30.04 ms per token,    33.29 tokens per second)
llama_print_timings:       total time =  7916.21 ms

{'query': 'What are the approaches to Task Decomposition?',
 'result': '  Sure! Based on the context, here\'s my answer to your:\n\nThere are several to task,:\n\n1. LLM-based with simple prompting, such as "Steps for XYZ" or "What are the subgoals for achieving XYZ?"\n2. Task-specific, like "Write a story outline" for writing a novel.\n3. Human inputs to guide the process.\n\nThese can be used to decompose complex tasks into smaller, more manageable subtasks, which can help improve the and effectiveness of task. However, long-term and task can being due to the need to plan over a lengthy history and explore the space., LLMs may to adjust plans when faced with errors, making them less robust to human learners who can learn from trial and error.'}
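In the retrieval chain, the dict `{"context": retriever | format_docs, "question": RunnablePassthrough()}` sends the same question down two branches: one retrieves and formats documents, the other passes the question through untouched. The sketch below reproduces that fan-out with the standard library. The keyword-overlap retriever is a deliberately crude, hypothetical stand-in for the embedding-based retriever, and `fan_out` and `identity` are illustrative names.

```python
def keyword_retriever(corpus: list[str], k: int = 2):
    """Build a toy retriever that ranks texts by words shared with the question."""
    def retrieve(question: str) -> list[str]:
        query_words = set(question.lower().split())
        ranked = sorted(
            corpus,
            key=lambda text: len(query_words & set(text.lower().split())),
            reverse=True,  # sorted() is stable, so ties keep corpus order
        )
        return ranked[:k]
    return retrieve


def fan_out(branches: dict):
    """Apply every branch to the same input and collect the results by name."""
    return lambda value: {name: fn(value) for name, fn in branches.items()}


def identity(value):
    # Plays the role of RunnablePassthrough(): return the input unchanged.
    return value


corpus = [
    "task decomposition splits a task into steps",
    "memory stores past observations",
    "tools call external apis",
]
retrieve = keyword_retriever(corpus, k=1)
build_inputs = fan_out({
    "context": lambda q: "\n\n".join(retrieve(q)),
    "question": identity,
})
```

The resulting dict has exactly the two keys the RAG prompt expects, so the caller only needs to supply the question string.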

***
2024-05-24 (Fri)

From: https://blog.csdn.net/lovechris00/article/details/139185816
