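Elasticsearch: Building a RAG application with Ollama, Langchain, and an LLM running on your local machine

Tags: RAG, llama2, Langchain, Elasticsearch, LLM, import, model, ollama, Docker

Privacy-preserving LLMs without a GPU. In this post I demonstrate a RAG application built with a different tool, Ollama. All source code for this article is published on GitHub. Clone the repository to follow along: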

git clone https://github.com/liu-xiao-guo/ollama_es

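What is Ollama?

Ollama is a lightweight, flexible framework designed for deploying LLMs locally on a personal computer. It simplifies developing, running, and managing LLMs through an intuitive API and ships a set of preconfigured models that can be used right away in a variety of applications. At the core of its design, model weights, configuration, and data are bundled into a single package defined by a Modelfile.

The framework offers a curated selection of pre-quantized, optimized models, such as Llama 2, Mistral, and Gemma, that are ready to deploy. These models are built to run on standard consumer hardware (both CPUs and GPUs) and are compatible with macOS, Linux, and Windows. This spares users from having to take on complex model optimization work themselves.

Because of their size, LLMs usually need powerful GPUs to run, so the models supported by Ollama use neural network quantization. This technique greatly reduces the hardware requirements and lets LLMs run efficiently on common computing devices, even without an internet connection. As a result, Ollama makes LLM technology more accessible, allowing individuals and organizations to use these advanced models on consumer-grade hardware.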
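To get a feel for why quantization matters, here is a back-of-the-envelope estimate of the memory needed just to hold the weights of a 7B-parameter model. It is a rough sketch under simplifying assumptions: it ignores activations, the KV cache, and any per-block quantization metadata, and the function name is my own.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


params_7b = 7e9  # nominal parameter count of a "7B" model

fp16_gb = weight_memory_gb(params_7b, 16)  # 16-bit floating point weights
int4_gb = weight_memory_gb(params_7b, 4)   # 4-bit quantized weights

print(f"fp16: {fp16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB")
# prints "fp16: 14.0 GB, 4-bit: 3.5 GB"
```

The 4-bit figure is in the same range as the 3.8 GB llama2 download shown later in this post, which is what makes running the model on an ordinary laptop feasible.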

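The RAG application

This RAG application works with a custom dataset that is scraped dynamically from a website. Users interact with the website's data through an API (here, a REST API). For demonstration purposes I chose the Open5GS documentation site (Open5GS is a C implementation of the 5G Core). The data in the Open5GS documentation is scraped, split, and then stored as vector embeddings in the Elasticsearch vector database. Users can then query the content of the Open5GS documentation through the API.

For the LLM component of this RAG application I chose the Llama2 7B model, run through Ollama. The Llama2 model served by Ollama is Meta's Llama-2-based LLM, quantized to perform well on consumer-grade hardware such as CPUs. In this RAG application, the Llama2 LLM running with Ollama answers user questions based on the content of the Open5GS documentation. Langchain handles the integration between the RAG application and the LLM.

The main functions of the RAG application are listed below. The architecture diagram shows how these components fit together.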

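1. Scrape web data

Langchain provides different types of document loaders for loading data from different sources as documents. RecursiveUrlLoader is one such loader: starting from a given URL, it recursively follows links down to a given max_depth and reads the pages it finds. This step uses RecursiveUrlLoader to scrape the website into documents. The data is later used to create vector embeddings and to answer user questions.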
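The idea of a depth-limited recursive crawl can be sketched with the standard library alone. This is not how RecursiveUrlLoader is implemented; it is a minimal illustration that crawls a hypothetical in-memory site (FAKE_SITE) instead of the network, and the way depth is counted here is my own convention.

```python
from collections import deque
from urllib.parse import urljoin

# Hypothetical in-memory "site": URL -> (page text, links found on that page).
FAKE_SITE = {
    "https://example.org/docs/": ("Docs home", ["guide/", "https://other.example.com/"]),
    "https://example.org/docs/guide/": ("Guide", ["install/"]),
    "https://example.org/docs/guide/install/": ("Install", []),
}


def crawl(root: str, max_depth: int) -> dict:
    """Breadth-first crawl of pages at most `max_depth` levels from `root`,
    never following links that leave the `root` prefix."""
    pages = {}
    queue = deque([(root, 1)])
    while queue:
        url, depth = queue.popleft()
        if url in pages or depth > max_depth or url not in FAKE_SITE:
            continue
        text, links = FAKE_SITE[url]
        pages[url] = text
        for link in links:
            target = urljoin(url, link)  # resolve relative links
            if target.startswith(root):  # stay inside the documentation site
                queue.append((target, depth + 1))
    return pages


print(sorted(crawl("https://example.org/docs/", max_depth=2)))
```

With max_depth=2 the install page, which sits three levels down, is not visited, and the external link is skipped regardless of depth.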

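2. Split documents

Long texts must be split into smaller pieces. The task sounds simple but hides a fair amount of complexity, because the goal is to keep semantically related pieces of text together. Langchain's text splitters handle this well. In essence, a splitter divides the text into small, semantically meaningful units (often sentences), then combines these units into larger chunks until they reach a certain size, as measured by a given function. Once that size is reached, the chunk becomes its own piece of text and a new chunk is started, with some overlap with the previous one. For this scenario I used RecursiveCharacterTextSplitter to split the scraped documents into manageable chunks.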
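The effect of a chunk size and a chunk overlap can be illustrated with a plain sliding window over characters. This is a deliberately simplified sketch: unlike RecursiveCharacterTextSplitter, it pays no attention to sentence or paragraph boundaries, and the function name is made up for this example.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut `text` into windows of at most `chunk_size` characters; each window
    starts `chunk_size - chunk_overlap` characters after the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


chunks = split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(chunks)
# prints ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, so text that straddles a chunk boundary still appears intact in at least one chunk.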

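3. Create vector embeddings

Once the data has been collected and split, the next step is to convert the text into vector embeddings, which are created from the split chunks. Text embeddings are essential to how LLM applications work. Although it is technically possible to work with the raw natural-language text, storing and retrieving it in that form is very inefficient, so the text is converted into vectors. There are dedicated machine learning models designed for creating embeddings from text. In this example I use the open source all-MiniLM-L6-v2 model, loaded through HuggingFaceEmbeddings, to generate the vector embeddings. The text is thereby converted into multi-dimensional vectors: high-dimensional numeric representations that capture semantic meaning and contextual nuance. Once embedded, the data can be grouped, sorted, searched, and so on. We can compute the distance between two sentences to determine how closely related they are. Importantly, these operations go beyond the keyword-based search of traditional databases and capture the semantic closeness between sentences.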
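How "distance between two sentences" turns into a number can be shown with cosine similarity on toy vectors. The three-dimensional vectors below are made up for illustration; real all-MiniLM-L6-v2 embeddings have 384 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up embeddings for three sentences.
amf_setup = [0.9, 0.1, 0.0]    # "How do I set up the AMF?"
amf_config = [0.8, 0.2, 0.1]   # "AMF configuration options"
weather = [0.0, 0.1, 0.9]      # "It will rain tomorrow"

print(cosine_similarity(amf_setup, amf_config))
print(cosine_similarity(amf_setup, weather))
```

The two AMF sentences score much higher against each other than either does against the unrelated one, even though they share few exact words.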

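4. Store the vector embeddings in Elasticsearch

The generated vector embeddings are then stored in the Elasticsearch vector database. Elasticsearch is a free and open vector database that makes it easy to build LLM applications by storing and retrieving embeddings together with their metadata, documents, and queries. Elasticsearch handles these embeddings efficiently, which allows fast retrieval and comparison of text-based data. Traditional databases work well for exact queries but fall short when it comes to understanding the nuances of human language. This is where vector databases come in: they are a game changer for semantic search. Unlike traditional text matching, which relies on exact words or phrases, a vector database retrieves content by semantic similarity.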
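In this project Langchain's ElasticsearchStore creates and queries the index for us, so none of the following is required. Still, it can help to see roughly what a vector index and a kNN query look like at the Elasticsearch level. The sketch below only builds the request bodies as Python dictionaries; the field names text and vector and the values of k and num_candidates are assumptions made for illustration.

```python
import json

EMBEDDING_DIMS = 384  # output size of all-MiniLM-L6-v2

# Index mapping with one text field and one dense_vector field.
index_body = {
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "vector": {
                "type": "dense_vector",
                "dims": EMBEDDING_DIMS,
                "index": True,
                "similarity": "cosine",
            },
        }
    }
}


def knn_search_body(query_vector, k=4, num_candidates=50):
    """Build a search body that asks for the k nearest stored vectors."""
    if len(query_vector) != EMBEDDING_DIMS:
        raise ValueError(f"expected {EMBEDDING_DIMS} dimensions")
    return {
        "knn": {
            "field": "vector",
            "query_vector": list(query_vector),
            "k": k,
            "num_candidates": num_candidates,
        },
        "_source": ["text"],
    }


print(json.dumps(index_body, indent=2))
```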

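5. User asks a question

The system provides an API through which users submit questions. In this use case, users can ask any question related to the content of the Open5GS documentation. The API is the main interface between the user and the chatbot. It takes a user_id parameter that identifies the user session. The user_id is for demonstration purposes; in a real-world scenario this could be managed through the Authorization header of the HTTP request (for example, a JWT Bearer token). The API is designed to be intuitive and accessible, so users can easily submit queries and receive responses.

6. Create a vector embedding of the question

When a user submits a question through the API, the system converts the question into a vector embedding. ConversationalRetrievalChain handles the embedding generation automatically. The embedding makes it possible to run a semantic search in the vector database for documents related to the question.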

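7. Semantic search in the vector database

Once the question's embedding has been created, the system runs a semantic search over the vector database to identify the content most relevant to the user's query. By comparing the question's embedding with the embeddings of the stored data, the system can pinpoint information that is contextually similar or related to the query. Here I used ConversationalRetrievalChain, which performs the semantic search automatically based on the input query. The search results then serve as the context for the LLM.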
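Conceptually, the semantic search ranks the stored chunks by how close their embeddings are to the question's embedding and keeps the best few. Here is a brute-force sketch over a tiny in-memory list; the vectors and texts are invented, and a real vector database uses approximate indexes rather than scanning every entry.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vec, store, k=2):
    """Return the texts of the k entries in `store` most similar to `query_vec`.
    `store` is a list of (text, vector) pairs."""
    scored = [(cosine(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Invented chunks with invented 3-dimensional embeddings.
store = [
    ("How to configure the AMF", [0.9, 0.1, 0.0]),
    ("Building Open5GS from source", [0.1, 0.9, 0.1]),
    ("AMF troubleshooting", [0.8, 0.0, 0.2]),
]

context = top_k([1.0, 0.0, 0.0], store, k=2)
print(context)
# prints ['How to configure the AMF', 'AMF troubleshooting']
```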

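8. Generate the prompt

Next, ConversationalRetrievalChain generates a custom prompt from the user's question and the semantic search results (the context). A prompt is the set of instructions or input given to a language model to guide its response. It helps the model understand the context and produce relevant, coherent output, such as answering a question, completing a sentence, or taking part in a conversation.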
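To make the idea concrete, the sketch below assembles a prompt from retrieved chunks and a question. The template wording is illustrative only and should not be read as the exact prompt that ConversationalRetrievalChain uses; the function name and sample chunks are also made up.

```python
# Illustrative template: instructions, then the retrieved context, then the question.
PROMPT_TEMPLATE = (
    "Use the following pieces of context to answer the question at the end.\n"
    "If you don't know the answer, say that you don't know.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Helpful Answer:"
)


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Join the retrieved chunks and place them, with the question, into the template."""
    context = "\n\n".join(context_chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)


prompt = build_prompt(
    "What is the AMF?",
    ["The AMF handles access and mobility management.", "Open5GS implements the 5G Core."],
)
print(prompt)
```

The larger the chunks and the more of them are retrieved, the longer this prompt becomes, which is why chunk size matters for staying within the model's context window.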

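9. Submit the prompt to the LLM

Once generated, the prompt is sent to the LLM (Llama2 7B in our example) through Langchain's Ollama integration (Langchain officially supports Ollama in langchain_community.llms). The LLM then answers the question based on the provided context. ConversationalRetrievalChain takes care of sending the query to the LLM.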
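Under the hood, talking to Ollama means calling its REST API. The following sketch shows roughly what a non-streaming call to the /api/generate endpoint looks like using only the standard library. It assumes an Ollama server on localhost:11434 with the llama2 model pulled; the helper names are my own, and in the application itself Langchain makes this call for us.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(prompt: str, model: str = "llama2") -> urllib.request.Request:
    """Build (but do not send) a non-streaming generate request."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "llama2", timeout: float = 600) -> str:
    """Send the prompt to a running Ollama server and return the generated text."""
    with urllib.request.urlopen(build_generate_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


request = build_generate_request("What is Open5GS?")
print(request.full_url, request.get_method())
```

Calling generate("What is Open5GS?") only works while the Ollama server is running; on a CPU-only machine the answer can take a while, hence the generous timeout.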

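10. LLM generates the answer

The LLM, drawing on the capabilities of Meta's Llama-2, processes the question in the context of the provided content. It then generates a response and sends it back.

11. Save the query and response in the MongoDB chat history

Langchain provides various components for managing conversational memory. In this chatbot design, MongoDB manages the conversational memory. At this stage, both the user's question and the chatbot's response are recorded in MongoDB as part of the chat history. This ensures that every user's chat history is stored persistently and that previous interactions can be retrieved. The data is stored in MongoDB per user session; to tell sessions apart, the API uses the user_id parameter mentioned earlier. This history is essential for shaping later interactions: when the same user asks a follow-up question, the chat history is forwarded to the LLM together with the new semantic search results (the context). This lets the chatbot maintain context throughout the conversation and produce more precise, tailored responses.

In this example, this part is not implemented. In practice, Elasticsearch could also be used instead of MongoDB for this purpose.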
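For readers who want to experiment with this step, here is one possible shape for per-user history, kept in process memory rather than in MongoDB or Elasticsearch. It is a sketch, not part of the repository: the function names are hypothetical, and the history is lost when the process restarts. It stores each turn as a (question, answer) tuple, on the assumption that the chain accepts chat_history in that form.

```python
from collections import defaultdict

# user_id -> list of (question, answer) tuples, oldest first.
_histories = defaultdict(list)


def get_history(user_id: str) -> list:
    """Return a copy of the stored turns for this user (empty if none)."""
    return list(_histories[user_id])


def save_turn(user_id: str, question: str, answer: str) -> None:
    """Append one question/answer pair to the user's history."""
    _histories[user_id].append((question, answer))


save_turn("alice", "What is Open5GS?", "An open source 5G Core implementation.")
save_turn("alice", "Which language is it written in?", "C.")
save_turn("bob", "What is the AMF?", "The access and mobility management function.")

print(get_history("alice"))
```

In the chat function shown later, the empty chat_history list could be replaced by get_history(user_id), with save_turn called after each answer.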

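12. Send the answer back to the user

Finally, the response received from the LLM is returned to the user through the HTTP API. The user can keep asking further questions in subsequent requests by providing the same user_id. The system then identifies the user's chat history and includes it, together with the new semantic search results, in the information sent to the LLM. This keeps the conversation seamless and contextually relevant.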

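Installation

Install Elasticsearch and Kibana

If you have not yet installed Elasticsearch and Kibana, please install them first by following the official installation guides.

We install Elastic Stack 8.x. Note that ES|QL is only available in Elastic Stack 8.11 and later, so download and install version 8.11 or later.

When Elasticsearch starts for the first time, it prints its security settings to the console, including the generated password for the elastic user. Note the password down; it is needed for the .env file below.

Install the Python dependencies

In the project root directory, create a virtual environment with the following commands: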

python3 -m venv .venv
source .venv/bin/activate

Then, inside the virtual environment, run the following command:

pip3 install -r requirements.txt

Copy the Elasticsearch certificate

Copy the Elasticsearch certificate into the project root directory as follows:

(.venv) $ pwd
/Users/liuxg/python/ollama
(.venv) $ cp ~/elastic/elasticsearch-8.12.0/config/certs/http_ca.crt .
overwrite ./http_ca.crt? (y/n [n])

Adjust the path to match the location of your own certificate.

Create the environment variables

In the project root directory, create a file named .env:

INIT_INDEX=true
TARGET_URL="https://open5gs.org/open5gs/docs/"
HTTP_PORT=7654
ES_USER="elastic"
ES_PASSWORD="hBZtXFwhz3HCtLi-cVc5"
ES_ENDPOINT="localhost"

(.venv) $ pwd
/Users/liuxg/python/ollama
(.venv) $ ls .env
.env
(.venv) $ cat .env
INIT_INDEX=true
TARGET_URL="https://open5gs.org/open5gs/docs/"
HTTP_PORT=7654
ES_USER="elastic"
ES_PASSWORD="hBZtXFwhz3HCtLi-cVc5"
ES_ENDPOINT="localhost"

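Its content is shown above. Modify it to match your own Elasticsearch configuration.

Implementation

The complete implementation of the ChatBot is described below. The full source code of the ChatBot agent is available on GitHub.

1) Configuration

In config.py I define the various configuration settings used by the ChatBot.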

config.py

import os
from dotenv import load_dotenv
from elasticsearch import Elasticsearch, helpers

load_dotenv()

# define init index
INIT_INDEX = os.getenv('INIT_INDEX', 'false').lower() == 'true'

# target url to scrape
TARGET_URL =  os.getenv('TARGET_URL', "https://open5gs.org/open5gs/docs/")

# http api port (environment values are strings, so convert to int)
HTTP_PORT = int(os.getenv('HTTP_PORT', '7654'))

ES_USER = os.getenv("ES_USER")
ES_PASSWORD = os.getenv("ES_PASSWORD")
ES_ENDPOINT = os.getenv("ES_ENDPOINT")

elastic_index_name = "ollama_index"

url = f"https://{ES_USER}:{ES_PASSWORD}@{ES_ENDPOINT}:9200"

connection = Elasticsearch(
    hosts=[url], 
    ca_certs = "./http_ca.crt", 
    verify_certs = True
    )

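2) Model

The model implementation is shown below. It contains a function init_index, which scrapes data from the given web URL and creates the vector store. The environment variable INIT_INDEX determines whether the index is created. The init_conversation function initializes a ConversationalRetrievalChain with Ollama's Llama2 LLM, which is reachable through Ollama's model REST API at <host>:11434 (Ollama provides a REST API for interacting with LLMs; for details on how to use it, see the section on running Llama2 with Ollama later in this post). The chat function is responsible for sending questions to the LLM.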

model.py

from langchain_community.llms import Ollama
from langchain.document_loaders.recursive_url_loader import RecursiveUrlLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import ConversationalRetrievalChain
from langchain_community.vectorstores import ElasticsearchStore
from langchain_community.embeddings import HuggingFaceEmbeddings
from bs4 import BeautifulSoup as Soup
from langchain.utils.html import (PREFIXES_TO_IGNORE_REGEX,
                                  SUFFIXES_TO_IGNORE_REGEX)

from elasticsearch import Elasticsearch, helpers

from config import *
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# module-level state shared by the functions below
conversation = None
vectordb = None

def init_index():
    global vectordb
    
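    if not INIT_INDEX:
        # skip scraping, but still attach to the existing index so that
        # init_conversation() has a vector store to retrieve from
        logging.info("continue without initializing index")
        vectordb = ElasticsearchStore(
            es_connection = connection,
            embedding = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2"),
            index_name = elastic_index_name,
        )
        return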
    
    # scrape data from web
    documents = RecursiveUrlLoader(
        TARGET_URL,
        max_depth=4,
        extractor=lambda x: Soup(x, "html.parser").text,
        prevent_outside=True,
        use_async=True,
        timeout=600,
        check_response_status=True,
        # drop trailing / to avoid duplicate pages.
        link_regex=(
            f"href=[\"']{PREFIXES_TO_IGNORE_REGEX}((?:{SUFFIXES_TO_IGNORE_REGEX}.)*?)"
            r"(?:[\#'\"]|\/[\#'\"])"
        ),
    ).load()

    logging.info("index creating with `%d` documents", len(documents))

    # split text
    # chunk_size and chunk_overlap affect the prompt size; a prompt that is too
    # large fails with `prompt size exceeds the context window size and cannot be processed`
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    documents = text_splitter.split_documents(documents)
    
    # create embeddings with huggingface embedding model `all-MiniLM-L6-v2`
    # then persist the vector index on vector db
    embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
    
    if not connection.indices.exists(index=elastic_index_name):
        print("The index does not exist, going to generate embeddings")   
        vectordb = ElasticsearchStore.from_documents( 
                documents,
                embedding = embeddings, 
                es_url = url, 
                es_connection = connection,
                index_name = elastic_index_name, 
                es_user = ES_USER,
                es_password = ES_PASSWORD
        )
    else: 
        print("The index already existed")    
        vectordb = ElasticsearchStore(
            es_connection = connection,
            embedding = embeddings,
            es_url = url, 
            index_name = elastic_index_name, 
            es_user = ES_USER,
            es_password = ES_PASSWORD    
        )   

def init_conversation():
    global conversation
    global vectordb

    # llama2 llm which runs with ollama
    # ollama exposes an api for the llm at `localhost:11434`
    llm = Ollama(
        model="llama2",
        base_url="http://localhost:11434",
        verbose=True,
    )

    # create conversation
    conversation = ConversationalRetrievalChain.from_llm(
        llm,
        retriever = vectordb.as_retriever(),
        return_source_documents = True,
        verbose = True,
    )

def chat(question, user_id):
    global conversation

    chat_history = []
    response = conversation({"question": question, "chat_history": chat_history})
    answer = response['answer']

    logging.info("got response from llm - %s", answer)

    # TODO save history

    return answer

3) HTTP API

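The HTTP API is implemented in api.py. It exposes an HTTP POST endpoint, api/question, which accepts a JSON object containing the question and a user_id. The user_id is for demonstration purposes; in a real application this could be managed through the Authorization header of the HTTP request (for example, a JWT Bearer token). When a question request arrives, it is forwarded to the chat function in the ChatBot model.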

from flask import Flask
from flask import jsonify
from flask import request
from flask_cors import CORS
import logging
import sys
from model import init_index
from model import init_conversation
from model import chat
from config import *

app = Flask(__name__)
CORS(app)

init_index()
init_conversation()

logging.basicConfig(stream=sys.stdout, level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

@app.route("/")
def index():
    return "<p>Hello, World!</p>"

@app.route('/api/question', methods=['POST'])
def post_question():
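    # get_json(silent=True) returns None on a missing or invalid body
    payload = request.get_json(silent=True)
    if not payload or 'question' not in payload or 'user_id' not in payload:
        return jsonify({'error': 'expected JSON with `question` and `user_id`'}), 400
    question = payload['question']
    user_id = payload['user_id']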
    logging.info("post question `%s` for user `%s`", question, user_id)

    resp = chat(question, user_id)
    data = {'answer':resp}

    return jsonify(data), 200

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=HTTP_PORT, debug=True)

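Running the application

The main steps for running the ChatBot application and interacting with it are described below. Questions are submitted through the HTTP API, which returns the corresponding responses.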
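As a rough illustration of the request format, a question could be posted from Python as follows. The endpoint path, the port 7654, and the answer key come from the code and .env file shown earlier; the helper names and the demo-user id are made up, and the request only succeeds once Elasticsearch, Ollama, and the Flask app are all running.

```python
import json
import urllib.request

API_URL = "http://localhost:7654/api/question"  # port taken from HTTP_PORT in .env


def build_question_request(question: str, user_id: str) -> urllib.request.Request:
    """Build (but do not send) the POST request expected by api/question."""
    payload = {"question": question, "user_id": user_id}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question: str, user_id: str, timeout: float = 600) -> str:
    """Send the question to the running ChatBot API and return its answer."""
    with urllib.request.urlopen(build_question_request(question, user_id), timeout=timeout) as resp:
        return json.loads(resp.read())["answer"]


req = build_question_request("What is Open5GS?", "demo-user")
print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```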

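Running Llama2 with Ollama

Ollama offers several deployment options: it can run as a standalone binary on macOS, Linux, or Windows, or inside a Docker container. This flexibility lets users set up and interact with LLMs on the platform of their choice. Ollama supports both command-line and REST API interaction, so it integrates easily into a variety of workflows and applications. Running the Llama2 model through Ollama is one example of how it hosts and manages LLMs. Below is an example of deploying Ollama with Docker, showing how I ran the Llama2 model this way.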

docker run -d -v $(pwd)/data:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

$ docker run -d -v $(pwd)/data:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
Unable to find image 'ollama/ollama:latest' locally
latest: Pulling from ollama/ollama
f4bb4e8dca02: Pull complete 
aeed191ede92: Pull complete 
920e2a93e5f4: Pull complete 
Digest: sha256:75fb97cdd8f435756d26eb7783599122b814c4b97da2012dc21ebc0efbd90fef
Status: Downloaded newer image for ollama/ollama:latest
a7a4fc0bed77b910ee30a5c200ee9a069f47e3974f91c2eda6cda67f353276bb

docker exec -it ollama bash

$ docker exec -it ollama bash
root@a7a4fc0bed77:/# ollama run llama2
pulling manifest 
pulling 8934d96d3f08... 100% ▕███████████████████████████████▏ 3.8 GB                         
pulling 8c17c2ebb0ea... 100% ▕███████████████████████████████▏ 7.0 KB                         
pulling 7c23fb36d801... 100% ▕███████████████████████████████▏ 4.8 KB                         
pulling 2e0493f67d0c... 100% ▕███████████████████████████████▏   59 B                         
pulling fa304d675061... 100% ▕███████████████████████████████▏   91 B                         
pulling 42ba7f8a01dd... 100% ▕███████████████████████████████▏  557 B                         
verifying sha256 digest 
writing manifest 
removing any unused layers 
success 
>>> /byte
Unknown command '/byte'. Type /? for help
>>> /?
Available Commands:
  /set            Set session variables
  /show           Show model information
  /load <model>   Load a session or model
  /save <model>   Save your current session
  /bye            Exit
  /?, /help       Help for a command
  /? shortcuts    Help for keyboard shortcuts

Use """ to begin a multi-line message.

>>> /bye
root@a7a4fc0bed77:/# ollama list
NAME         	ID          	SIZE  	MODIFIED       
llama2:latest	78e26419b446	3.8 GB	19 minutes ago	
root@a7a4fc0bed77:/# ollama run llama2
>>> what is docker?

Docker is a computer program that creates software containers, which are lightweight and 
portable environments for applications to run in. It was first released in 2013 by Docker
Inc., and has since become one of the most popular and widely-used containerization 
platforms.

Docker allows developers to package an application and its dependencies into a single 
container that can be run on any system that has Docker installed, regardless of the 
underlying architecture or operating system. This makes it easy to deploy applications 
across different environments, such as from development to testing to production, without
worrying about compatibility issues.

Here are some key features of Docker:

1. Containerization: Docker creates lightweight containers for applications, which 
includes everything an application needs to run, including code, libraries, and 
dependencies.
2. Portability: Docker containers are platform-independent, meaning they can be run on 
any system that has Docker installed, regardless of the underlying architecture or 
operating system.
3. Isolation: Docker containers are isolated from each other and from the host system, 
which means they do not interfere with each other or access sensitive information outside
of their own container.
4. Efficiency: Docker containers use less resources than traditional virtualization 
methods, making them faster to spin up and more efficient in terms of memory and CPU 
usage.
5. Security: Docker provides a secure environment for applications by isolating them from
the host system and limiting access to sensitive resources.
6. Networking: Docker provides a built-in networking system that allows containers to 
communicate with each other and with external services.
7. Version control: Docker allows developers to manage different versions of their 
applications by creating and managing different containers for each version.
8. Collaboration: Docker makes it easy for multiple developers to work on the same 
application by allowing them to create and share containers.
9. Automated workflows: Docker provides a range of tools for automating the deployment 
and management of applications, such as continuous integration and continuous deployment 
(CI/CD) pipelines.

Docker is used in a variety of industries and use cases, including:
...
>>> exit
>>> exit

Thank you for using Docker! If you have any questions or need further assistance, feel 
free to ask. Otherwise, it was nice working with you. Have a great day and happy coding! 

From： https://blog.csdn.net/UbuntuTouch/article/details/137007788