The first two articles in this series explored PieCloudVector's applications to image and audio data; this one turns to text, covering the full workflow of vectorizing, storing, and retrieving text data with PieCloudVector, and finally pairing it with an LLM to build a chatbot.
Natural language processing involves a great deal of text processing, analysis, and understanding, and vector databases play an important role in it. As the third installment of the PieCloudVector advanced series, this article shows how to use PieCloudVector to build a chatbot.
Building a Q&A Chatbot with PieCloudVector + LLM
Building a chatbot on PieCloudVector involves three steps:
Vectorize and store the data: use a language embedding model to turn the text into vectors and store them in PieCloudVector.
Build the retrieval flow: when a user submits a query, convert it into a vector and run a similarity search in the vector database to find the most similar vectors, i.e., the best-matching pieces of information.
Build the chatbot: combine a large language model (such as GPT-3.5) with prompt engineering to generate a natural-language answer to the user's question.
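The three steps above can be sketched end to end. The snippet below is a toy illustration only: it uses a bag-of-words "embedding", an in-memory list in place of PieCloudVector, and stops at prompt assembly rather than calling a real LLM; every name in it is hypothetical, not part of any real API.

```python
import numpy as np

# Toy vocabulary for the bag-of-words "embedding" (illustrative only)
VOCAB = {"april": 0, "month": 1, "singer": 2}

def embed(text):
    """Step 1 stand-in: count vocabulary words to produce a vector."""
    vec = np.zeros(len(VOCAB))
    for word in text.lower().split():
        if word in VOCAB:
            vec[VOCAB[word]] += 1
    return vec

def retrieve(query_vec, store, k=2):
    """Step 2: L2-distance similarity search, closest documents first."""
    ranked = sorted(store, key=lambda item: np.linalg.norm(item[1] - query_vec))
    return [doc for doc, _ in ranked[:k]]

# Step 1: vectorize the corpus and "store" it
corpus = ["april is a month", "a famous singer", "may is a month"]
store = [(doc, embed(doc)) for doc in corpus]

# Step 2: retrieve context for the user's question
question = "which month comes after april"
context = retrieve(embed(question), store)
print(context)  # → ['april is a month', 'may is a month']

# Step 3: assemble a prompt for the LLM (the model call itself is omitted)
prompt = "Answer using this context:\n" + "\n".join(context) + "\nQ: " + question
```

A production version would swap `embed` for a real embedding model and `retrieve` for an ORDER BY `<->` query against PieCloudVector, but the data flow is the same.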
Building an LLM-based Chatbot on PieCloudVector
When a user asks a question, the system first retrieves the text relevant to it; the retrieved data is then sent to a large language model, which generates an answer from the content it receives. This article focuses on the first two steps, vectorization and storage and the retrieval flow, with a worked example.
For text data we take the same approach as we did for images: a language embedding model converts the text into vectors, which are then stored in the database. Below we walk through converting text data to vectors and running similarity queries in detail.
- Dataset Preparation
The Wikipedia dataset used in this example again comes from Hugging Face; see Wikipedia [1] for details.
from datasets import load_dataset
dataset = load_dataset("wikipedia", "20220301.simple", split="train[:500]")
The dataset has the following four features:
In: dataset.features
Out: {'id': Value(dtype='string', id=None),
'url': Value(dtype='string', id=None),
'title': Value(dtype='string', id=None),
'text': Value(dtype='string', id=None)}
Taking the first record as an example, each feature holds the following data:
print("text id:", dataset['id'][0])
print("text url:", dataset['url'][0])
print("text title:", dataset['title'][0])
print("text content:", dataset['text'][0])
The output is as follows:
text id: 1
text url: https://simple.wikipedia.org/wiki/April
text title: April
text content: April is the fourth month of the year in the Julian and Gregorian calendars, and comes between March and May. It is one of four months to have 30 days.
April always begins on the same day of week as July, and additionally, January in leap years. April always ends on the same day of the week as December.
April's flowers are the Sweet Pea and Daisy. Its birthstone is the diamond. The meaning of the diamond is innocence.
The Month
April comes between March and May, making it the fourth month of the year. It also comes first in the year out of the four months that have 30 days, as June, September and November are later in the year.
April begins on the same day of the week as July every year and on the same day of the week as January in leap years. April ends on the same day of the week as December every year, as each other's last days are exactly 35 weeks (245 days) apart.
As the output shows, the first record contains the English Wikipedia article about April, with identifier (id) 1. When storing this data in the database we keep every feature except the text field; for that field, we replace the original text with the vector produced by the embedding model.
- Vectorization and Similarity Search
The text embedding model used here is paraphrase-MiniLM-L6-v2 [2] from Hugging Face, a relatively lightweight model well suited to text clustering and similarity queries. Loading it also requires the sentence_transformers package.
Next we load the embedding model and convert the text data into vectors.
from sentence_transformers import SentenceTransformer
emb_model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
embeddings = emb_model.encode(dataset['text'])
embeddings_lst = embeddings.tolist()
Each piece of text fed into the model yields a 384-dimensional vector.
Next we write the data and its vectors into the database; before that, we need to create the corresponding table.
CREATE TABLE wiki_text (id int PRIMARY KEY, url text, title text, embedding vector(384));
Write the data into PieCloudVector with the Postgres driver.
import psycopg2

# Replace the connection string with your own PieCloudVector instance
conn = psycopg2.connect('postgresql://user:password@host:5432/db')
cur = conn.cursor()
# Insert all rows, one record per statement
for i in range(len(embeddings_lst)):
    cur.execute(
        'INSERT INTO wiki_text (id, url, title, embedding) VALUES (%s, %s, %s, %s)',
        (dataset['id'][i], dataset['url'][i], dataset['title'][i], embeddings_lst[i]),
    )
conn.commit()
conn.close()
Use L2 distance to find the 10 most similar documents.
from sqlalchemy import create_engine
import pandas as pd

engine = create_engine('postgresql://user:password@host:5432/db', echo=False)
# pgvector accepts the string form of a Python list as a vector literal
query = ("select id, title from wiki_text "
         f"order by embedding <-> '{embeddings_lst[0]}' limit 10")
text_id = pd.read_sql(query, con=engine)
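The `<->` operator in the ORDER BY clause above is the L2 (Euclidean) distance operator, so the query returns rows sorted by Euclidean distance to the query vector. The ordering it produces can be reproduced in plain numpy; the vectors below are made up for illustration:

```python
import numpy as np

# A numpy equivalent of "ORDER BY embedding <-> query LIMIT 3"
query = np.array([1.0, 0.0])
rows = {
    "A": np.array([1.0, 0.1]),   # distance 0.1, closest
    "B": np.array([0.0, 0.0]),   # distance 1.0
    "C": np.array([3.0, 4.0]),   # distance ~4.47, farthest
}
ranked = sorted(rows, key=lambda k: np.linalg.norm(rows[k] - query))
print(ranked)  # ['A', 'B', 'C']
```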
The results are as follows.
We can see that every returned article except the seventh relates to months and years. The seventh, titled "Alanis Morissette", is a biography of the singer; its content is shown below:
In: data_14 = dataset.filter(lambda x: x['id']=='14')
print(data_14['text'][0])
Out: Alanis Nadine Morissette (born June 1, 1974) is a Grammy Award-winning Canadian-American singer and songwriter. She was born in Ottawa, Canada. She began singing in Canada as a teenager in 1990. In 1995, she became popular all over the world.
As a young child in Canada, Morissette began to act on television, including 5 episodes of the long-running series, You Can't Do That on Television. Her first album was released only in Canada in 1990.
Her first international album was Jagged Little Pill, released in 1995. It was a rock-influenced album. Jagged has sold more than 33 million units globally. It became the best-selling debut album in music history. Her next album, Supposed Former Infatuation Junkie, was released in 1998. It was a success as well. Morissette took up producing duties for her next albums, which include Under Rug Swept, So-Called Chaos and Flavors of Entanglement. Morissette has sold more than 60 million albums worldwide.
She also acted in several movies, including Kevin Smith's Dogma, where she played God.
About her life
Alanis Morissette was born in Riverside Hospital of Ottawa in Ottawa, Ontario. Her father is French-Canadian. Her mother is from Hungary. She has an older brother, Chad, and a twin brother, Wade, who is 12 minutes younger than she is. Her parents had worked as teachers at a military base in Lahr, Germany.
Morissette became an American citizen in 2005. She is still Canadian citizen.
On May 22, 2010, Morissette married rapper Mario "MC Souleye" Treadway.
Jagged Little Pill
Morissette has had many albums. Her 1995 album Jagged Little Pill became a very popular album. It has sold over 30 million copies worldwide. The album caused Morissette to win four Grammy Awards. The album Jagged Little Pill touched many people.
This article was likely matched because the biography mentions dates frequently, leading the model to judge it closely related to months and years.
- Vector Indexes: Approximate vs. Exact Queries
In the examples so far the dataset was small, mainly for demonstration. As the data grows, however, an exact query must compare the input vector against every record in the database, so the computational cost grows with the data volume. A vector index captures the approximate relationships among vectors ahead of time and can speed queries up significantly, possibly at some cost in precision; this kind of query is also called an approximate query.
The core idea of approximate querying is to build an index with an approximate nearest neighbor (ANN) algorithm, so that the relationships between vectors are determined in advance and a scan of the whole table is avoided at query time. PieCloudVector provides two ANN algorithms for index creation, IVFFlat and HNSW, and we can pick whichever suits the data's characteristics best.
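To make the inverted-file idea behind IVFFlat concrete, here is a toy sketch: real IVFFlat learns centroids with k-means and can probe several lists, whereas this version uses fixed centroids, a single probe, and made-up data purely for illustration.

```python
import numpy as np

# Two fixed "cluster centers" and four data vectors (illustrative only)
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
vectors = np.array([[0.1, 0.2], [0.3, 0.1], [9.8, 10.1], [10.2, 9.9]])

# Build phase: assign each vector to its nearest centroid (inverted lists)
assign = np.argmin(
    np.linalg.norm(vectors[:, None] - centroids[None, :], axis=2), axis=1
)

def ivf_search(query, k=1):
    """Probe only the nearest cluster, then rank its members exactly."""
    cluster = np.argmin(np.linalg.norm(centroids - query, axis=1))
    idx = np.where(assign == cluster)[0]
    order = idx[np.argsort(np.linalg.norm(vectors[idx] - query, axis=1))]
    return order[:k].tolist()

print(ivf_search(np.array([0.2, 0.2])))  # → [0]
```

Because only one cluster is scanned, the query touches a fraction of the data; the trade-off is that a true nearest neighbor sitting in another cluster would be missed, which is exactly the speed-for-precision exchange described above.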
Taking the Wikipedia data above as an example: for the months demo we used the first 500 records of the train split; below we take the first 8,000 records to demonstrate approximate queries.
Note that once an index has been built on the data, PieCloudVector enables approximate (index-based) queries by default. The following command disables them for the current session:
set enable_indexscan to off;
First, reload the Wikipedia dataset and the embedding model, and convert the text data into vectors.
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
dataset_large = load_dataset("wikipedia", "20220301.simple", split="train[:8000]")
emb_model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
embeddings_large = emb_model.encode(dataset_large['text'])
Next, create the target table in the database.
CREATE TABLE wiki_text_8000 (id int PRIMARY KEY, url text, title text, embedding vector(384));
Write the processed vectors into the database. Since the dataset is larger, we insert in batches.
import psycopg2
from psycopg2.extras import execute_values

# Replace the connection string with your own PieCloudVector instance
conn = psycopg2.connect('postgresql://user:password@host:5432/db')
cur = conn.cursor()
embeddings_list = embeddings_large.tolist()
data_list = [
    (dataset_large['id'][i], dataset_large['url'][i], dataset_large['title'][i], embeddings_list[i])
    for i in range(len(embeddings_list))
]
execute_values(cur, 'INSERT INTO wiki_text_8000 (id, url, title, embedding) values %s', data_list)
conn.commit()
conn.close()
Next we choose an index algorithm. PieCloudVector supports two ANN algorithms:
IVFFlat: groups the data in advance, so an approximate query can quickly scan just the group(s) nearest the target vector. It is fast and memory-efficient, but its accuracy is moderate.
HNSW: builds a "relationship graph" among the vectors. The index takes longer to build and uses more memory, but its accuracy is better than IVFFlat's.
When the data volume is not especially large, we usually favor query accuracy, so here we choose HNSW. When building an index, we must specify a distance metric for it. For example, the article search above used L2 distance, so we need an HNSW index created for L2 distance; likewise, if we chose cosine distance instead, we would need an index created for cosine distance.
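A separate index per metric is needed because L2 and cosine distance can rank the same vectors differently, so an index built for one ordering cannot serve the other. A small numpy example, with vectors chosen purely for illustration:

```python
import numpy as np

q = np.array([1.0, 0.0])   # query vector
b = np.array([5.0, 0.0])   # same direction as q, but far away
c = np.array([1.0, 1.0])   # close to q in space, different direction

def l2(x):
    """Euclidean distance to the query (the <-> operator)."""
    return np.linalg.norm(x - q)

def cos_dist(x):
    """Cosine distance to the query: 1 - cosine similarity."""
    return 1 - np.dot(x, q) / (np.linalg.norm(x) * np.linalg.norm(q))

print(l2(b) > l2(c))          # True: by L2 distance, c is the closer vector
print(cos_dist(b) < cos_dist(c))  # True: by cosine distance, b is closer
```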
Create an HNSW index for L2 distance.
CREATE INDEX ON wiki_text_8000 USING pdb_nn (embedding vector_l2_ops) WITH (dimension = '384', index_key = 'HNSW32', search_k=10, hnsw_efsearch = 16, hnsw_efconstruction = 32);
Run an approximate query.
set enable_indexscan to on; -- ensure approximate (index) scans are enabled; on by default
select id, title from wiki_text_8000 where id != 2 order by embedding <-> (select embedding from wiki_text_8000 where id = 2) limit 10;
The results are as follows:
Disable approximate scans and run the exact query.
set enable_indexscan to off;
select id, title from wiki_text_8000 where id != 2 order by embedding <-> (select embedding from wiki_text_8000 where id = 2) limit 10;
The results are as follows:
As we can see, the approximate query quickly returns the same results as the exact query, improving query efficiency while preserving accuracy. In practice this means we can serve users fast, accurate search results, and that balance of speed and accuracy matters in many business scenarios. As the data grows, we can adjust the indexing strategy to keep queries efficient and accurate and ensure the system continues to meet users' needs.
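When validating an ANN index against exact search like this, a common measure is recall@k: the fraction of the exact top-k results that the approximate query also returned. The id lists below are placeholders for illustration, not actual query output.

```python
def recall_at_k(exact_ids, ann_ids):
    """Fraction of the exact top-k ids that the ANN query also returned."""
    return len(set(exact_ids) & set(ann_ids)) / len(exact_ids)

exact = [3, 7, 12, 25, 40]   # ids from the exact (enable_indexscan off) query
approx = [3, 7, 12, 25, 40]  # ids from the HNSW (enable_indexscan on) query
print(recall_at_k(exact, approx))  # → 1.0 when the two result sets match
```

Tracking this metric as the table grows makes it easy to tell when index parameters (such as hnsw_efsearch) need retuning.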
From: https://www.cnblogs.com/XX-SHE/p/18535920