
ES Search Ranking and Document Relevance Scoring: Vector Space Model

Published: 2023-05-30 22:34:42
Tags: happy, Vector, Space, vectors, vector, query, Model, document, hippopotamus

Vector Space Model

The vector space model provides a way of comparing a multiterm query against a document. The output is a single score that represents how well the document matches the query. In order to do this, the model represents both the document and the query as vectors.

A vector is really just a one-dimensional array containing numbers, for example:

[1,2,5,22,3,8]

In the vector space model, each number in the vector is the weight of a term, as calculated with term frequency/inverse document frequency.

While TF/IDF is the classic way of calculating term weights for the vector space model, it is not the only way. Other models like Okapi BM25 exist and are available in Elasticsearch. TF/IDF was the default in Elasticsearch versions before 5.0 because it is a simple, efficient algorithm that produces high-quality search results and has stood the test of time; from version 5.0 onward, the default similarity is BM25.

Imagine that we have a query for “happy hippopotamus.” A common word like happy will have a low weight, while an uncommon term like hippopotamus will have a high weight. Let’s assume that happy has a weight of 2 and hippopotamus has a weight of 5. We can plot this simple two-dimensional vector, [2,5], as a line on a graph starting at point (0,0) and ending at point (2,5), as shown in Figure 27, “A two-dimensional query vector for “happy hippopotamus” represented”.


Figure 27. A two-dimensional query vector for “happy hippopotamus” represented


 

Now, imagine we have three documents:

  1. I am happy in summer.
  2. After Christmas I’m a hippopotamus.
  3. The happy hippopotamus helped Harry.

We can create a similar vector for each document, consisting of the weight of each query term—happy and hippopotamus—that appears in the document, and plot these vectors on the same graph, as shown in Figure 28, “Query and document vectors for “happy hippopotamus””:

  • Document 1: (happy, ____________) [2,0]
  • Document 2: (_____, hippopotamus) [0,5]
  • Document 3: (happy, hippopotamus) [2,5]
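As a minimal sketch of how these vectors could be derived, the snippet below assigns each query term its weight when the term appears in a document and 0 otherwise. The weights 2 and 5 are the assumed values from the example above, not real TF/IDF output, and the names `QUERY_WEIGHTS` and `to_vector` are hypothetical helpers introduced only for illustration. The tokenizer is a simple lowercase split, not an Elasticsearch analyzer.

```python
import re

# Assumed term weights taken from the example in the text
# (illustrative values, not computed by TF/IDF).
QUERY_WEIGHTS = {"happy": 2, "hippopotamus": 5}


def to_vector(text, weights=QUERY_WEIGHTS):
    """Return one entry per query term: its weight if present, else 0."""
    # Naive tokenization: lowercase runs of ASCII letters.
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return [weight if term in tokens else 0 for term, weight in weights.items()]


documents = [
    "I am happy in summer.",
    "After Christmas I’m a hippopotamus.",
    "The happy hippopotamus helped Harry.",
]

query_vector = to_vector("happy hippopotamus")
doc_vectors = [to_vector(doc) for doc in documents]

for number, vector in enumerate(doc_vectors, start=1):
    print(f"Document {number}: {vector}")
```

Each position in a vector corresponds to one query term, so the vectors stay comparable regardless of how many other words a document contains.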


Figure 28. Query and document vectors for “happy hippopotamus”


 

The nice thing about vectors is that they can be compared. By measuring the angle between the query vector and the document vector, it is possible to assign a relevance score to each document. The angle between document 1 and the query is large, so it is of low relevance. Document 2 is closer to the query, meaning that it is reasonably relevant, and document 3 is a perfect match.
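The comparison described above can be sketched with plain cosine similarity. This is an illustration of the vector space idea using the example vectors from the text, not the scoring formula Lucene actually applies; the function names `cosine_similarity` and `angle_degrees` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector shares no terms with anything; treat it as no match.
        return 0.0
    return dot / (norm_a * norm_b)


def angle_degrees(a, b):
    """Angle between two vectors in degrees."""
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cosine = max(-1.0, min(1.0, cosine_similarity(a, b)))
    return math.degrees(math.acos(cosine))


query = [2, 5]
docs = {1: [2, 0], 2: [0, 5], 3: [2, 5]}

for doc_id, vector in docs.items():
    similarity = cosine_similarity(query, vector)
    angle = angle_degrees(query, vector)
    print(f"Document {doc_id}: cosine={similarity:.3f}, angle={angle:.1f} degrees")
```

With these vectors, document 1 sits at roughly 68 degrees from the query, document 2 at roughly 22 degrees, and document 3 at 0 degrees, which matches the ranking described above. Because the functions iterate over all components, they work unchanged for vectors with more than two dimensions.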

In practice, only two-dimensional vectors (queries with two terms) can be plotted easily on a graph. Fortunately, linear algebra—the branch of mathematics that deals with vectors—provides tools to compare the angle between multidimensional vectors, which means that we can apply the same principles explained above to queries that consist of many terms.

You can read more about how to compare two vectors by using cosine similarity.
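For reference, the standard definition of cosine similarity between a query vector q and a document vector d with n terms is given below, followed by a worked value for document 2 using the assumed weights from the example. The symbols q, d, and n are notation introduced here, not taken from the original text.

```latex
\cos\theta
  = \frac{\mathbf{q}\cdot\mathbf{d}}{\lVert\mathbf{q}\rVert\,\lVert\mathbf{d}\rVert}
  = \frac{\sum_{i=1}^{n} q_i d_i}
         {\sqrt{\sum_{i=1}^{n} q_i^{2}}\;\sqrt{\sum_{i=1}^{n} d_i^{2}}}

% Query [2,5] against document 2, [0,5]:
\cos\theta
  = \frac{2\cdot 0 + 5\cdot 5}{\sqrt{2^{2}+5^{2}}\;\sqrt{0^{2}+5^{2}}}
  = \frac{25}{5\sqrt{29}}
  \approx 0.928
```

A value of 1 means the vectors point in the same direction (a perfect match, as for document 3), and values closer to 0 mean a larger angle and lower relevance.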

Now that we have talked about the theoretical basis of scoring, we can move on to see how scoring is implemented in Lucene.

From: https://blog.51cto.com/u_11908275/6382344
