
openAI cookbook - embedding


https://github.com/openai/openai-cookbook

 

I won't go over what an embedding is here.

Embeddings from large models carry more meaning than the raw text itself, because they encode a great deal of relatedness.

As for how embeddings are used, the basic logic is text similarity.

So semantic search is the simplest application: store the embeddings in a vector database, then search against them.

Recommendations work in much the same way.

Question answering takes one extra step: feed the retrieved text to the large model as input, so that it can give a more accurate answer.

Semantic search

Embeddings can be used for search either by themselves or as a feature in a larger system.

The simplest way to use embeddings for search is as follows:

  • Before the search (precompute):
    • Split your text corpus into chunks smaller than the token limit (8,191 tokens for text-embedding-ada-002)
    • Embed each chunk of text
    • Store those embeddings in your own database or in a vector search provider like Pinecone, Weaviate, or Qdrant
  • At the time of the search (live compute):
    • Embed the search query
    • Find the closest embeddings in your database
    • Return the top results

An example of how to use embeddings for search is shown in Semantic_text_search_using_embeddings.ipynb.

In more advanced search systems, the cosine similarity of embeddings can be used as one feature among many in ranking search results.
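
A minimal sketch of this precompute/query flow, assuming the legacy openai Python SDK (<1.0) and a plain in-memory list standing in for the database (the corpus chunks are placeholders):

import numpy as np
import openai

EMBEDDING_MODEL = "text-embedding-ada-002"

def embed(text: str) -> np.ndarray:
    response = openai.Embedding.create(input=text, model=EMBEDDING_MODEL)
    return np.array(response["data"][0]["embedding"])

# precompute: embed each chunk of the corpus
corpus = ["first chunk of text ...", "second chunk of text ..."]
index = [(chunk, embed(chunk)) for chunk in corpus]

# live: embed the query and rank chunks by cosine similarity
def search(query: str, top_k: int = 3):
    q = embed(query)
    scored = [
        (chunk, float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v))))
        for chunk, v in index
    ]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:top_k]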

Question answering

The best way to get reliably honest answers from GPT-3 is to give it source documents in which it can locate correct answers. Using the semantic search procedure above, you can cheaply search a corpus of documents for relevant information and then give that information to GPT-3, via the prompt, to answer a question. We demonstrate in Question_answering_using_embeddings.ipynb.

Recommendations

Recommendations are quite similar to search, except that instead of a free-form text query, the inputs are items in a set.

An example of how to use embeddings for recommendations is shown in Recommendation_using_embeddings.ipynb.

Similar to search, these cosine similarity scores can either be used on their own to rank items or as features in larger ranking algorithms.

Customizing Embeddings

Although OpenAI's embedding model weights cannot be fine-tuned, you can nevertheless use training data to customize embeddings to your application.

In Customizing_embeddings.ipynb, we provide an example method for customizing your embeddings using training data. The idea of the method is to train a custom matrix to multiply embedding vectors by in order to get new customized embeddings. With good training data, this custom matrix will help emphasize the features relevant to your training labels. You can equivalently consider the matrix multiplication as (a) a modification of the embeddings or (b) a modification of the distance function used to measure the distances between embeddings.
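
The core of the idea fits in a few lines; a sketch assuming numpy, with the matrix M standing in for whatever the training procedure actually learns:

import numpy as np

d = 1536  # dimension of text-embedding-ada-002 vectors
M = np.eye(d)  # placeholder: in practice M is learned from labeled pairs

def customize(embedding: np.ndarray) -> np.ndarray:
    # view (a): modify the embedding itself
    return embedding @ M

def custom_distance(a: np.ndarray, b: np.ndarray) -> float:
    # view (b): equivalently, a new distance function on the raw embeddings
    return float(np.linalg.norm(a @ M - b @ M))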

 

Getting an embedding is simple: just pass the input text and the model name.

import openai

# legacy openai-python SDK (<1.0); the response carries the vector as a list of floats
embedding = openai.Embedding.create(
    input="Your text goes here", model="text-embedding-ada-002"
)["data"][0]["embedding"]
len(embedding)
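
For text-embedding-ada-002 the vector has 1,536 dimensions, so len(embedding) evaluates to 1536.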

 

Let's focus on the question-answering scenario:

what should you do when you want GPT to answer questions about knowledge it doesn't have?

What should you do if you want GPT to answer questions about unfamiliar topics? E.g.,

  • Recent events after Sep 2021
  • Your non-public documents
  • Information from past conversations
  • etc.

This notebook demonstrates a two-step Search-Ask method for enabling GPT to answer questions using a library of reference text.

  1. Search: search your library of text for relevant text sections
  2. Ask: insert the retrieved text sections into a message to GPT and ask it the question

 

This passage explains clearly why the embedding approach is better than fine-tuning.

Fine-tuning is better suited to teaching a task or a style, i.e., a pattern; for knowledge, my understanding is that the fine-tuning data is too small in volume to noticeably shift what the model already holds.

So supplying knowledge through the input is better. The problem is that the model's input is limited: gpt-3.5 has only 4,096 tokens. How do we get around that?

Why search is better than fine-tuning

GPT can learn knowledge in two ways:

  • Via model weights (i.e., fine-tune the model on a training set)
  • Via model inputs (i.e., insert the knowledge into an input message)

Although fine-tuning can feel like the more natural option—training on data is how GPT learned all of its other knowledge, after all—we generally do not recommend it as a way to teach the model knowledge. Fine-tuning is better suited to teaching specialized tasks or styles, and is less reliable for factual recall.

As an analogy, model weights are like long-term memory. When you fine-tune a model, it's like studying for an exam a week away. When the exam arrives, the model may forget details, or misremember facts it never read.

In contrast, message inputs are like short-term memory. When you insert knowledge into a message, it's like taking an exam with open notes. With notes in hand, the model is more likely to arrive at correct answers.

One downside of text search relative to fine-tuning is that each model is limited by a maximum amount of text it can read at once:

Model           Maximum text length
gpt-3.5-turbo   4,096 tokens (~5 pages)
gpt-4           8,192 tokens (~10 pages)
gpt-4-32k       32,768 tokens (~40 pages)

Continuing the analogy, you can think of the model like a student who can only look at a few pages of notes at a time, despite potentially having shelves of textbooks to draw upon.

Therefore, to build a system capable of drawing upon large quantities of text to answer questions, we recommend using a Search-Ask approach.

 

The answer given here is to filter the relevant text out of the corpus via search.

Text can be searched in many ways. E.g.,

  • Lexical-based search
  • Graph-based search
  • Embedding-based search

This example notebook uses embedding-based search. Embeddings are simple to implement and work especially well with questions, as questions often don't lexically overlap with their answers.

Consider embeddings-only search as a starting point for your own system.
Better search systems might combine multiple search methods, along with features like popularity, recency, user history, redundancy with prior search results, click rate data, etc.
Q&A retrieval performance may also be improved with techniques like HyDE, in which questions are first transformed into hypothetical answers before being embedded.
Similarly, GPT can also potentially improve search results by automatically transforming questions into sets of keywords or search terms.
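
A sketch of the HyDE trick mentioned above, assuming the legacy SDK; the prompt wording here is illustrative, not taken from the paper:

import openai

def hyde_embedding(question: str) -> list[float]:
    # 1) let the chat model write a plausible (possibly wrong) answer passage
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"Write a short passage that plausibly answers: {question}",
        }],
    )
    hypothetical_answer = completion["choices"][0]["message"]["content"]
    # 2) embed the hypothetical answer instead of the raw question;
    #    answer-like text tends to sit closer to answer-like corpus text
    response = openai.Embedding.create(
        input=hypothetical_answer, model="text-embedding-ada-002"
    )
    return response["data"][0]["embedding"]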

 

So the full QA procedure looks like this:

Full procedure

Specifically, this notebook demonstrates the following procedure:

  1. Prepare search data (once)
    1. Collect: We'll download a few hundred Wikipedia articles about the 2022 Olympics
    2. Chunk: Documents are split into short, mostly self-contained sections to be embedded
    3. Embed: Each section is embedded with the OpenAI API
    4. Store: Embeddings are saved (for large datasets, use a vector database)
  2. Search (once per query)
    1. Given a user question, generate an embedding for the query from the OpenAI API
    2. Using the embeddings, rank the text sections by relevance to the query
  3. Ask (once per query)
    1. Insert the question and the most relevant sections into a message to GPT
    2. Return GPT's answer

Now look at the example.

It first runs a small experiment:

ask GPT directly,

Which athletes won the gold medal in curling at the 2022 Winter Olympics?

and it doesn't know the answer.

The right approach is to give GPT the relevant background material in the prompt; with that context it can answer:

query = f"""Use the below article on the 2022 Winter Olympics to answer the subsequent question. If the answer cannot be found, write "I don't know."

Article:
\"\"\"
{wikipedia_article_on_curling}
\"\"\"

Question: Which athletes won the gold medal in curling at the 2022 Winter Olympics?"""
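
The notebook then sends this prompt to the chat model, roughly as follows (GPT_MODEL is assumed to be gpt-3.5-turbo, as in the notebook):

response = openai.ChatCompletion.create(
    messages=[
        {"role": "system", "content": "You answer questions about the 2022 Winter Olympics."},
        {"role": "user", "content": query},
    ],
    model=GPT_MODEL,
    temperature=0,
)
print(response["choices"][0]["message"]["content"])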

Now let's look at the code for the ask step in the full pipeline.

Here the search does not use a vector database; the cosine distances are computed in memory with scipy's spatial.distance.cosine, mainly to keep the example simple.

One thing to note: when passing the retrieved material to OpenAI, it's best to tokenize it first and check it against the token budget.

The other steps are straightforward.

# imports and constants (the notebook defines these earlier)
import openai
import pandas as pd
import tiktoken
from scipy import spatial

EMBEDDING_MODEL = "text-embedding-ada-002"
GPT_MODEL = "gpt-3.5-turbo"


# search function
def strings_ranked_by_relatedness(
    query: str,
    df: pd.DataFrame,
    relatedness_fn=lambda x, y: 1 - spatial.distance.cosine(x, y),
    top_n: int = 100
) -> tuple[list[str], list[float]]:
    """Returns a list of strings and relatednesses, sorted from most related to least."""
    query_embedding_response = openai.Embedding.create(
        model=EMBEDDING_MODEL,
        input=query,
    )
    query_embedding = query_embedding_response["data"][0]["embedding"]
    strings_and_relatednesses = [
        (row["text"], relatedness_fn(query_embedding, row["embedding"]))
        for i, row in df.iterrows()
    ]
    strings_and_relatednesses.sort(key=lambda x: x[1], reverse=True)
    strings, relatednesses = zip(*strings_and_relatednesses)
    return strings[:top_n], relatednesses[:top_n]


def num_tokens(text: str, model: str = GPT_MODEL) -> int:
    """Return the number of tokens in a string."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))


def query_message(
    query: str,
    df: pd.DataFrame,
    model: str,
    token_budget: int
) -> str:
    """Return a message for GPT, with relevant source texts pulled from a dataframe."""
    strings, relatednesses = strings_ranked_by_relatedness(query, df)
    introduction = 'Use the below articles on the 2022 Winter Olympics to answer the subsequent question. If the answer cannot be found in the articles, write "I could not find an answer."'
    question = f"\n\nQuestion: {query}"
    message = introduction
    for string in strings:
        next_article = f'\n\nWikipedia article section:\n"""\n{string}\n"""'
        if (
            num_tokens(message + next_article + question, model=model)
            > token_budget
        ):
            break
        else:
            message += next_article
    return message + question
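
For completeness, the final ask step wraps query_message and calls the chat API; a sketch along the notebook's lines:

def ask(
    query: str,
    df: pd.DataFrame,
    model: str = GPT_MODEL,
    token_budget: int = 4096 - 500,
) -> str:
    """Answers a query using GPT and a dataframe of relevant texts and embeddings."""
    message = query_message(query, df, model=model, token_budget=token_budget)
    response = openai.ChatCompletion.create(
        model=model,
        messages=[
            {"role": "system", "content": "You answer questions about the 2022 Winter Olympics."},
            {"role": "user", "content": message},
        ],
        temperature=0,
    )
    return response["choices"][0]["message"]["content"]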

 

As for using vector databases,

https://github.com/openai/openai-cookbook/blob/main/examples/vector_databases/Using_vector_databases_for_embeddings_search.ipynb

gives an example for each provider.
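
As one illustration of the general pattern, a minimal sketch using the Pinecone client of that era (the index name, API key, and environment are placeholders; other providers follow the same upsert/query shape):

import openai
import pinecone

pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENVIRONMENT")

# one-time setup: ada-002 embeddings have 1,536 dimensions
pinecone.create_index("articles", dimension=1536, metric="cosine")
index = pinecone.Index("articles")

# precompute: upsert (id, vector) pairs
emb = openai.Embedding.create(input="some chunk of text",
                              model="text-embedding-ada-002")["data"][0]["embedding"]
index.upsert(vectors=[("chunk-1", emb)])

# live: embed the query and fetch nearest neighbors
q = openai.Embedding.create(input="a user question",
                            model="text-embedding-ada-002")["data"][0]["embedding"]
results = index.query(vector=q, top_k=3)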

 

What about generating an embedding when the text is too long?

Two approaches are given. Truncating is the obvious one.

The other is chunking:

split the text into multiple segments and generate an embedding for each.

Then there are two options. Either use the chunk embeddings separately, which seems fine for search,

or merge them into a single embedding; the example does this by averaging.

https://github.com/openai/openai-cookbook/blob/main/examples/Embedding_long_inputs.ipynb
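
A minimal sketch of the chunk-and-average idea, assuming tiktoken for the token-level split; the notebook's helper differs in detail but likewise weights by chunk length and renormalizes:

import numpy as np
import openai
import tiktoken

EMBEDDING_MODEL = "text-embedding-ada-002"
MAX_TOKENS = 8191  # the model's input limit

def embed_long_text(text: str) -> list[float]:
    enc = tiktoken.encoding_for_model(EMBEDDING_MODEL)
    tokens = enc.encode(text)
    # split the token sequence into chunks that fit the model's limit
    chunks = [tokens[i:i + MAX_TOKENS] for i in range(0, len(tokens), MAX_TOKENS)]
    embeddings, weights = [], []
    for chunk in chunks:
        resp = openai.Embedding.create(input=enc.decode(chunk), model=EMBEDDING_MODEL)
        embeddings.append(resp["data"][0]["embedding"])
        weights.append(len(chunk))  # weight each chunk by its token count
    avg = np.average(embeddings, axis=0, weights=weights)
    avg = avg / np.linalg.norm(avg)  # renormalize to unit length
    return avg.tolist()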
