55_Introduction to Search Engines_Relevance Scoring: The TF&IDF Algorithm Demystified

Date: 2024-10-02 13:04:04

Course outline

1. Algorithm overview

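The relevance score algorithm, put simply, computes how closely a text stored in the index matches the search text.

Elasticsearch scoring is based on the term frequency/inverse document frequency algorithm, TF/IDF for short. Since version 5.0 the default similarity is BM25, a refinement of TF/IDF, which is why the explain output in section 2 shows the k1 and b parameters.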

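Term frequency: how many times each term of the search text appears in the field text. The more occurrences, the more relevant the document.

Search request: hello world

doc1: hello you, and world is very good
doc2: hello, how are you

doc1 contains both hello and world, while doc2 contains only hello, so doc1 is more relevant.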
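The term-frequency idea can be sketched in a few lines of Python. This is only an illustration of raw counting, not how Elasticsearch analyzes text: the regex tokenizer and the helper names `term_counts` and `raw_tf` are assumptions made for this example.

```python
import re
from collections import Counter

def term_counts(text):
    """Lowercase the text, split it into words, and count each word."""
    return Counter(re.findall(r"\w+", text.lower()))

def raw_tf(query, field_text):
    """Return {term: occurrences in field_text} for every term of the query."""
    counts = term_counts(field_text)
    # Counter returns 0 for terms that never occur in the field.
    return {term: counts[term] for term in term_counts(query)}

query = "hello world"
doc1 = "hello you, and world is very good"
doc2 = "hello, how are you"

print("doc1:", raw_tf(query, doc1))
print("doc2:", raw_tf(query, doc2))
```

doc1 gets a non-zero count for both query terms, while doc2 has a zero count for world.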

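Inverse document frequency: how often each term of the search text appears across all documents in the index. The more common a term is, the less it contributes to relevance.

Search request: hello world

doc1: hello, today is very good
doc2: hi world, how are you

Suppose the index holds 10,000 documents, and across all of them hello appears 1,000 times while world appears only 100 times.

doc2 is more relevant, because it matches world, the rarer term.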
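To put rough numbers on this example, the sketch below applies the idf formula that appears in the explain output in section 2. One assumption is needed: the example gives total occurrence counts, and here they are treated as the number of documents containing each term (docFreq).

```python
import math

def idf(doc_freq, doc_count):
    """idf as printed in the explain output: log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

doc_count = 10_000
idf_hello = idf(1_000, doc_count)  # assumption: hello occurs in 1,000 documents
idf_world = idf(100, doc_count)    # assumption: world occurs in 100 documents

print(f"idf(hello) = {idf_hello:.3f}")
print(f"idf(world) = {idf_world:.3f}")
```

Under these assumptions world gets roughly twice the weight of hello, which is why a match on world counts for more.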

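Field-length norm: the length of the field. The longer the field, the weaker the relevance.

Search request: hello world

doc1: { "title": "hello article", "content": "babaaba (10,000 words)" }
doc2: { "title": "my article", "content": "blablabala (10,000 words), hi world" }

Assume hello and world appear equally often across the whole index.

doc1 is more relevant, because its match is in the title field, which is much shorter than the content field.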
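As a rough illustration, the classic Lucene TF/IDF similarity computes the field-length norm as 1 / sqrt(number of terms in the field). The sketch below uses that formula; it is not the BM25 length normalization shown in the explain output of section 2, and the term counts are approximations taken from the example.

```python
import math

def field_length_norm(num_terms):
    """Classic TF/IDF field-length norm: 1 / sqrt(number of terms in the field)."""
    return 1.0 / math.sqrt(num_terms)

title_norm = field_length_norm(2)         # "hello article" has 2 terms
content_norm = field_length_norm(10_000)  # assumption: about 10,000 terms in content

print(f"title norm   = {title_norm:.4f}")
print(f"content norm = {content_norm:.4f}")
```

A match in the two-term title is weighted far more heavily than a match buried in a field of about 10,000 terms.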

2. How _score is calculated

GET /test_index/test_type/_search?explain
{
  "query": {
    "match": {
      "test_field": "test hello"
    }
  }
}

{
"took": 6,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 1.595089,
"hits": [
{
"_shard": "[test_index][2]",
"_node": "4onsTYVZTjGvIj9_spWz2w",
"_index": "test_index",
"_type": "test_type",
"_id": "20",
"_score": 1.595089,
"_source": {
"test_field": "test hello"
},
"_explanation": {
"value": 1.595089,
"description": "sum of:",
"details": [
{
"value": 1.595089,
"description": "sum of:",
"details": [
{
"value": 0.58279467,
"description": "weight(test_field:test in 0) [PerFieldSimilarity], result of:",
"details": [
{
"value": 0.58279467,
"description": "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
"details": [
{
"value": 0.6931472,
"description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
"details": [
{
"value": 2,
"description": "docFreq",
"details": []
},
{
"value": 4,
"description": "docCount",
"details": []
}
]
},
{
"value": 0.840795,
"description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details": [
{
"value": 1,
"description": "termFreq=1.0",
"details": []
},
{
"value": 1.2,
"description": "parameter k1",
"details": []
},
{
"value": 0.75,
"description": "parameter b",
"details": []
},
{
"value": 1.75,
"description": "avgFieldLength",
"details": []
},
{
"value": 2.56,
"description": "fieldLength",
"details": []
}
]
}
]
}
]
},
{
"value": 1.0122943,
"description": "weight(test_field:hello in 0) [PerFieldSimilarity], result of:",
"details": [
{
"value": 1.0122943,
"description": "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
"details": [
{
"value": 1.2039728,
"description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
"details": [
{
"value": 1,
"description": "docFreq",
"details": []
},
{
"value": 4,
"description": "docCount",
"details": []
}
]
},
{
"value": 0.840795,
"description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details": [
{
"value": 1,
"description": "termFreq=1.0",
"details": []
},
{
"value": 1.2,
"description": "parameter k1",
"details": []
},
{
"value": 0.75,
"description": "parameter b",
"details": []
},
{
"value": 1.75,
"description": "avgFieldLength",
"details": []
},
{
"value": 2.56,
"description": "fieldLength",
"details": []
}
]
}
]
}
]
}
]
},
{
"value": 0,
"description": "match on required clause, product of:",
"details": [
{
"value": 0,
"description": "# clause",
"details": []
},
{
"value": 1,
"description": "*:*, product of:",
"details": [
{
"value": 1,
"description": "boost",
"details": []
},
{
"value": 1,
"description": "queryNorm",
"details": []
}
]
}
]
}
]
}
},
{
"_shard": "[test_index][2]",
"_node": "4onsTYVZTjGvIj9_spWz2w",
"_index": "test_index",
"_type": "test_type",
"_id": "6",
"_score": 0.58279467,
"_source": {
"test_field": "tes test"
},
"_explanation": {
"value": 0.58279467,
"description": "sum of:",
"details": [
{
"value": 0.58279467,
"description": "sum of:",
"details": [
{
"value": 0.58279467,
"description": "weight(test_field:test in 0) [PerFieldSimilarity], result of:",
"details": [
{
"value": 0.58279467,
"description": "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
"details": [
{
"value": 0.6931472,
"description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
"details": [
{
"value": 2,
"description": "docFreq",
"details": []
},
{
"value": 4,
"description": "docCount",
"details": []
}
]
},
{
"value": 0.840795,
"description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details": [
{
"value": 1,
"description": "termFreq=1.0",
"details": []
},
{
"value": 1.2,
"description": "parameter k1",
"details": []
},
{
"value": 0.75,
"description": "parameter b",
"details": []
},
{
"value": 1.75,
"description": "avgFieldLength",
"details": []
},
{
"value": 2.56,
"description": "fieldLength",
"details": []
}
]
}
]
}
]
}
]
},
{
"value": 0,
"description": "match on required clause, product of:",
"details": [
{
"value": 0,
"description": "# clause",
"details": []
},
{
"value": 1,
"description": "*:*, product of:",
"details": [
{
"value": 1,
"description": "boost",
"details": []
},
{
"value": 1,
"description": "queryNorm",
"details": []
}
]
}
]
}
]
}
},
{
"_shard": "[test_index][3]",
"_node": "4onsTYVZTjGvIj9_spWz2w",
"_index": "test_index",
"_type": "test_type",
"_id": "7",
"_score": 0.5565415,
"_source": {
"test_field": "test client 2"
},
"_explanation": {
"value": 0.5565415,
"description": "sum of:",
"details": [
{
"value": 0.5565415,
"description": "sum of:",
"details": [
{
"value": 0.5565415,
"description": "weight(test_field:test in 0) [PerFieldSimilarity], result of:",
"details": [
{
"value": 0.5565415,
"description": "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
"details": [
{
"value": 0.6931472,
"description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
"details": [
{
"value": 1,
"description": "docFreq",
"details": []
},
{
"value": 2,
"description": "docCount",
"details": []
}
]
},
{
"value": 0.8029196,
"description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details": [
{
"value": 1,
"description": "termFreq=1.0",
"details": []
},
{
"value": 1.2,
"description": "parameter k1",
"details": []
},
{
"value": 0.75,
"description": "parameter b",
"details": []
},
{
"value": 2.5,
"description": "avgFieldLength",
"details": []
},
{
"value": 4,
"description": "fieldLength",
"details": []
}
]
}
]
}
]
}
]
},
{
"value": 0,
"description": "match on required clause, product of:",
"details": [
{
"value": 0,
"description": "# clause",
"details": []
},
{
"value": 1,
"description": "_type:test_type, product of:",
"details": [
{
"value": 1,
"description": "boost",
"details": []
},
{
"value": 1,
"description": "queryNorm",
"details": []
}
]
}
]
}
]
}
},
{
"_shard": "[test_index][1]",
"_node": "4onsTYVZTjGvIj9_spWz2w",
"_index": "test_index",
"_type": "test_type",
"_id": "8",
"_score": 0.25316024,
"_source": {
"test_field": "test client 2"
},
"_explanation": {
"value": 0.25316024,
"description": "sum of:",
"details": [
{
"value": 0.25316024,
"description": "sum of:",
"details": [
{
"value": 0.25316024,
"description": "weight(test_field:test in 0) [PerFieldSimilarity], result of:",
"details": [
{
"value": 0.25316024,
"description": "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
"details": [
{
"value": 0.2876821,
"description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
"details": [
{
"value": 1,
"description": "docFreq",
"details": []
},
{
"value": 1,
"description": "docCount",
"details": []
}
]
},
{
"value": 0.88,
"description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details": [
{
"value": 1,
"description": "termFreq=1.0",
"details": []
},
{
"value": 1.2,
"description": "parameter k1",
"details": []
},
{
"value": 0.75,
"description": "parameter b",
"details": []
},
{
"value": 3,
"description": "avgFieldLength",
"details": []
},
{
"value": 4,
"description": "fieldLength",
"details": []
}
]
}
]
}
]
}
]
},
{
"value": 0,
"description": "match on required clause, product of:",
"details": [
{
"value": 0,
"description": "# clause",
"details": []
},
{
"value": 1,
"description": "*:*, product of:",
"details": [
{
"value": 1,
"description": "boost",
"details": []
},
{
"value": 1,
"description": "queryNorm",
"details": []
}
]
}
]
}
]
}
}
]
}
}
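The numbers in this response can be checked by hand. The sketch below plugs the values from the explanations into the two formulas printed in the description fields (idf and tfNorm). The function names are chosen for this example only, and because Elasticsearch computes with 32-bit floats the results agree only to about six decimal places.

```python
import math

def idf(doc_freq, doc_count):
    """log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)), as in the explain output."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

def tf_norm(freq, field_length, avg_field_length, k1=1.2, b=0.75):
    """(freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength))."""
    return (freq * (k1 + 1)) / (
        freq + k1 * (1 - b + b * field_length / avg_field_length)
    )

def term_score(doc_freq, doc_count, freq, field_length, avg_field_length):
    """Score of one term in one document: idf * tfNorm."""
    return idf(doc_freq, doc_count) * tf_norm(freq, field_length, avg_field_length)

# Document 20 (shard 2): both "test" and "hello" match, and their scores are summed.
score_test = term_score(doc_freq=2, doc_count=4, freq=1,
                        field_length=2.56, avg_field_length=1.75)
score_hello = term_score(doc_freq=1, doc_count=4, freq=1,
                         field_length=2.56, avg_field_length=1.75)
score_doc20 = score_test + score_hello

# Documents 7 and 8 hold the same text but live on different shards.
score_doc7 = term_score(doc_freq=1, doc_count=2, freq=1,
                        field_length=4, avg_field_length=2.5)
score_doc8 = term_score(doc_freq=1, doc_count=1, freq=1,
                        field_length=4, avg_field_length=3)

for name, value in [("doc 20", score_doc20), ("doc 7", score_doc7), ("doc 8", score_doc8)]:
    print(f"{name}: {value:.6f}")
```

Documents 7 and 8 contain identical text yet score differently. The explanations show why: docCount, docFreq, and avgFieldLength are taken from the shard each document lives on, so the same term receives a different idf and tfNorm on shard 3 than on shard 1.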

3. Analyzing how a document is matched

GET /test_index/test_type/6/_explain
{
  "query": {
    "match": {
      "test_field": "test hello"
    }
  }
}
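Explanation trees are deeply nested, so reading them as an indented outline can help. The helper below is a small sketch that assumes the explanation for a single document uses the same value/description/details structure as the _explanation objects in section 2. The name `format_explanation` and the abbreviated sample tree are made up for this example.

```python
def format_explanation(node, depth=0):
    """Flatten an explanation tree into indented 'value  description' lines."""
    # Descriptions may contain line breaks; collapse all whitespace runs.
    description = " ".join(node["description"].split())
    lines = [f"{'  ' * depth}{node['value']}  {description}"]
    for child in node.get("details", []):
        lines.extend(format_explanation(child, depth + 1))
    return lines

# Abbreviated copy of the explanation for document 6 (descriptions shortened here).
sample = {
    "value": 0.58279467,
    "description": "weight(test_field:test in 0) [PerFieldSimilarity], result of:",
    "details": [
        {
            "value": 0.58279467,
            "description": "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
            "details": [
                {"value": 0.6931472, "description": "idf", "details": []},
                {"value": 0.840795, "description": "tfNorm", "details": []},
            ],
        }
    ],
}

for line in format_explanation(sample):
    print(line)
```

Each level of indentation corresponds to one level of details, so the factors that multiply or add up to a score appear directly beneath it.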

From: https://www.cnblogs.com/siben/p/18444580
