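Introduction
With MySQL already in place, why use Elasticsearch as well?
MySQL is used mainly to store data; once the data volume grows large, ES is used to search it, because it is fast.
Basic ES concepts
Index (comparable to a database) -> type (comparable to a table) -> document (comparable to one row)
Why ES searches data so fast
The core: the inverted index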
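For example, suppose these records are stored:
- Record 1: 红海行动 (Red Sea operation)
- Record 2: 探索红海行动 (exploring the Red Sea operation)
- Record 3: 红海特别行动 (Red Sea special operation)
- Record 4: 红海纪录片 (Red Sea documentary)
- Record 5: 特工红海特别探索 (agent's special Red Sea exploration)
Each record is split into terms, and the terms are recorded in the index:
Term | Records |
---|---|
红海 (Red Sea) | 1, 2, 3, 4, 5 |
行动 (operation) | 1, 2, 3 |
探索 (explore) | 2, 5 |
特别 (special) | 3, 5 |
纪录片 (documentary) | 4 |
特工 (agent) | 5 |
Searching for 红海特工行动 (terms: 红海, 特工, 行动): ES looks up the records listed for each term and then computes a relevance score. As a simplified measure, divide the number of hits by the number of terms in the record: record 3 is hit twice and has three terms, so it scores 2/3, ahead of record 5 (2/4) and record 4 (1/2). By the same measure record 1 is hit on both of its two terms (2/2), so it is the closest match. The real scoring in ES is more elaborate (BM25 by default); this ratio is only meant to give the intuition.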
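The lookup described above can be sketched in a few lines of Python. This is only an illustration of the idea: the records are assumed to be already split into terms, the function names (`build_inverted_index`, `naive_scores`) are made up for this example, and the hits-divided-by-length score is the simplified measure from the text, not the scoring ES really uses.

```python
from collections import defaultdict

# The five records from the example, already segmented into terms (assumed).
records = {
    1: ["红海", "行动"],
    2: ["探索", "红海", "行动"],
    3: ["红海", "特别", "行动"],
    4: ["红海", "纪录片"],
    5: ["特工", "红海", "特别", "探索"],
}

def build_inverted_index(docs):
    """Map each term to the set of record ids that contain it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index

def naive_scores(query_terms, docs, index):
    """Score = number of query terms hit / number of terms in the record."""
    hits = defaultdict(int)
    for term in query_terms:
        for doc_id in index.get(term, ()):
            hits[doc_id] += 1
    return {doc_id: n / len(docs[doc_id]) for doc_id, n in hits.items()}

index = build_inverted_index(records)
scores = naive_scores(["红海", "特工", "行动"], records, index)
for doc_id, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(doc_id, round(score, 3))
```

The point of the structure is that a query only touches the posting lists of its own terms; no record is scanned in full.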
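Installing ES
- Pull ES (data storage and search, comparable to MySQL) and Kibana (visual search UI, comparable to Navicat)
docker pull elasticsearch:7.17.6
docker pull kibana:7.17.6
The two versions must match.
- Container configuration
# Directories under /home/docker/elasticsearch on the Linux host are mounted into the container,
# so editing files under that host directory changes what the container sees
mkdir -p /home/docker/elasticsearch/config
mkdir -p /home/docker/elasticsearch/data
# Allow ES to be reached from any remote machine
echo "http.host: 0.0.0.0" >/home/docker/elasticsearch/config/elasticsearch.yml
# Recursively open up permissions, because ES needs access to these directories
chmod -R 777 /home/docker/elasticsearch/
- Start the containers
# 9200 is the HTTP port that clients talk to; 9300 is the transport port used between cluster nodes
# -e "discovery.type=single-node" runs ES as a single node
# -e ES_JAVA_OPTS sets the JVM heap size; in production it can be set much higher, e.g. up to 32G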
docker run --name elasticsearch -p 9200:9200 -p 9300:9300 \
-e "discovery.type=single-node" \
-e ES_JAVA_OPTS="-Xms64m -Xmx512m" \
-v /home/docker/elasticsearch/config/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml \
-v /home/docker/elasticsearch/data:/usr/share/elasticsearch/data \
-v /home/docker/elasticsearch/plugins:/usr/share/elasticsearch/plugins \
-d elasticsearch:7.17.6
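# Restart elasticsearch automatically whenever Docker starts (e.g. after a reboot)
docker update elasticsearch --restart=always
# Kibana is pointed at the ES HTTP port 9200 (replace ip with the address of the ES host); 5601 is the port of the Kibana UI
docker run --name kibana -e ELASTICSEARCH_HOSTS=http://ip:9200 -p 5601:5601 -d kibana:7.17.6
# Restart kibana automatically whenever Docker starts (e.g. after a reboot)
docker update kibana --restart=always
Docker tip:
If a container fails to start, or exits right after starting, check its startup logs:
docker logs 'container id or container name'
- Verify the startup
# Check that ES started correctly
# Open in a browser: http://ip:9200
# (the sample response below was captured on a 7.4.2 node; with the 7.17.6 image, version.number should read 7.17.6)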
{
"name": "66718a266132",
"cluster_name": "elasticsearch",
"cluster_uuid": "xhDnsLynQ3WyRdYmQk5xhQ",
"version": {
"number": "7.4.2",
"build_flavor": "default",
"build_type": "docker",
"build_hash": "2f90bbf7b93631e52bafb59b3b049cb44ec25e96",
"build_date": "2019-10-28T20:40:44.881551Z",
"build_snapshot": false,
"lucene_version": "8.2.0",
"minimum_wire_compatibility_version": "6.8.0",
"minimum_index_compatibility_version": "6.0.0-beta1"
},
"tagline": "You Know, for Search"
}
# Check that Kibana started correctly
# Open in a browser: http://ip:5601/app/kibana
Basic ES operations: batch operations with bulk
Run the following in Kibana's Dev Tools:
POST /_bulk
{"delete":{"_index":"website","_type":"blog","_id":"123"}}
{"create":{"_index":"website","_type":"blog","_id":"123"}}
{"title":"my first blog post"}
{"index":{"_index":"website","_type":"blog"}}
{"title":"my second blog post"}
{"update":{"_index":"website","_type":"blog","_id":"123"}}
{"doc":{"title":"my updated blog post"}}
#! Deprecation: [types removal] Specifying types in bulk requests is deprecated.
{
"took" : 304,
"errors" : false,
"items" : [
{
"delete" : { # the delete action
"_index" : "website",
"_type" : "blog",
"_id" : "123",
"_version" : 1,
"result" : "not_found", # no such document
"_shards" : {
"total" : 2,
"successful" : 1,
"failed" : 0
},
"_seq_no" : 0,
"_primary_term" : 1,
"status" : 404 # not found
}
},
{
"create" : { # the create action
"_index" : "website",
"_type" : "blog",
"_id" : "123",
"_version" : 2,
"result" : "created",
"_shards" : {
"total" : 2,
"successful" : 1,
"failed" : 0
},
"_seq_no" : 1,
"_primary_term" : 1,
"status" : 201
}
},
{
"index" : { # the index (save) action
"_index" : "website",
"_type" : "blog",
"_id" : "5sKNvncBKdY1wAQmeQNo",
"_version" : 1,
"result" : "created",
"_shards" : {
"total" : 2,
"successful" : 1,
"failed" : 0
},
"_seq_no" : 2,
"_primary_term" : 1,
"status" : 201
}
},
{
"update" : { # the update action
"_index" : "website",
"_type" : "blog",
"_id" : "123",
"_version" : 3,
"result" : "updated",
"_shards" : {
"total" : 2,
"successful" : 1,
"failed" : 0
},
"_seq_no" : 3,
"_primary_term" : 1,
"status" : 200
}
}
]
}
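A bulk body is newline-delimited JSON: one action line, optionally followed by one source line, and the body has to end with a newline. Below is a minimal sketch of building such a body and reading back the per-item statuses from a program. The helper names `build_bulk_body` and `item_statuses` are hypothetical, and `_type` is left out of the metadata because the response above already flags types in bulk requests as deprecated.

```python
import json

def build_bulk_body(actions):
    """Serialize (action, metadata, source) triples into a _bulk request body."""
    lines = []
    for action, meta, source in actions:
        lines.append(json.dumps({action: meta}))
        if source is not None:  # a delete action has no source line
            lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"  # the body must end with a newline

def item_statuses(bulk_response):
    """Return (action, _id, status) for every item of a _bulk response."""
    statuses = []
    for item in bulk_response["items"]:
        for action, result in item.items():
            statuses.append((action, result["_id"], result["status"]))
    return statuses

body = build_bulk_body([
    ("delete", {"_index": "website", "_id": "123"}, None),
    ("create", {"_index": "website", "_id": "123"}, {"title": "my first blog post"}),
    ("index", {"_index": "website"}, {"title": "my second blog post"}),
    ("update", {"_index": "website", "_id": "123"}, {"doc": {"title": "my updated blog post"}}),
])
print(body)
```

Each item in the response carries its own status, so one failing action does not hide the result of the others.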
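Advanced ES search
ES supports two basic ways to search:
- Search documents via URI + query parameters
- Search documents via URI + request body
Searching via URI + query parameters
Example request:
GET bank/_search?q=*&sort=account_number:asc
# Parameters
q=*: match everything
sort: the field to sort by
asc: ascending order
# Retrieve everything under bank, including its type and docs
GET bank/_search
Searching via URI + request body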
GET /bank/_search
{
"query": { "match_all": {} },
"sort": [
{ "account_number": "asc" },
{ "balance":"desc"}
]
}
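Contents of the search response
- took: how many milliseconds the search took
- timed_out: whether the search timed out
- _shards: how many shards were searched, and how many of them succeeded or failed
- max_score: the highest relevance score among the matching documents
- hits.total.value: how many matching documents were found
- hits.sort: the sort key of each hit (absent when results are ordered by score)
- hits._score: the relevance score (not applicable when using match_all)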
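When the response is consumed from a program rather than read in Kibana, the fields listed above can be pulled out of the parsed JSON directly. The sketch below assumes the response has already been decoded into a Python dict; `summarize_search` is a hypothetical helper, and the sample dict is a trimmed-down response in the shape ES 7 returns.

```python
def summarize_search(response):
    """Collect the commonly inspected fields of a search response."""
    hits = response["hits"]
    return {
        "took_ms": response["took"],
        "timed_out": response["timed_out"],
        "shards_successful": response["_shards"]["successful"],
        "shards_failed": response["_shards"]["failed"],
        "total": hits["total"]["value"],
        "max_score": hits["max_score"],
        "ids": [hit["_id"] for hit in hits["hits"]],
    }

sample_response = {
    "took": 18,
    "timed_out": False,
    "_shards": {"total": 1, "successful": 1, "skipped": 0, "failed": 0},
    "hits": {
        "total": {"value": 1000, "relation": "eq"},
        "max_score": None,
        "hits": [{"_index": "bank", "_id": "999", "_score": None, "sort": [999]}],
    },
}
summary = summarize_search(sample_response)
print(summary)
```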
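The ES query DSL
ES provides a JSON-style DSL (domain-specific language) for running queries.
Basic syntax
Typical query structure
{
QUERY_NAME:{ # the query type to use
FIELD_NAME:{ # the field it applies to, followed by its arguments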
ARGUMENT:VALUE,
ARGUMENT:VALUE,...
}
}
}
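Example:
GET bank/_search
{
"query" : { # the query clause
"match_all":{}
},
"from":0, # offset of the first document to return
"size":5,
"_source":["firstname","balance"], # the fields to return
"sort":[
{
"account_number":{ # the field the results are sorted by
"order":"desc"
}
}
]
}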
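Parameter notes:
- match_all: the query type (here: match every document in the index); ES lets you combine many query types inside query to build complex queries.
- Besides query, other parameters can be passed to shape the results.
- from + size: limit the result window, which implements pagination.
- sort: sorting; with multiple sort fields, a later field only orders documents that tie on the earlier fields, otherwise the earlier field decides.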
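The from + size pagination above comes down to simple arithmetic: page n with k results per page starts at offset (n - 1) * k. A small sketch of building such a request body follows; `page_query` and its parameters are names invented for this example.

```python
def page_query(page, page_size, sort_fields):
    """Build a match_all search body for a 1-based page number."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be at least 1")
    return {
        "query": {"match_all": {}},
        "from": (page - 1) * page_size,  # offset of the first hit on this page
        "size": page_size,
        # earlier entries take precedence; later ones only break ties
        "sort": [{field: {"order": order}} for field, order in sort_fields],
    }

body = page_query(3, 5, [("account_number", "desc"), ("balance", "asc")])
print(body)
```

Note that, with default index settings, ES rejects requests where from + size exceeds 10000, so this style suits shallow paging only.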
Query result:
{
"took" : 18, # took 18 ms
"timed_out" : false, # did not time out
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1000, # 1000 documents matched
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "bank",
"_type" : "account",
"_id" : "999", # the id of the first hit is 999
"_score" : null, # score information
"_source" : {
"firstname" : "Dorothy",
"balance" : 6087
},
"sort" : [ # the value of the sort field
999
]
},
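(remaining hits omitted)
query/match: match queries
For a non-string field, match performs an exact match; for a string (text) field, it performs a full-text search.
- Basic (non-string) types: exact match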
GET bank/_search
{
"query":{
"match":{
"account_number":"20"
}
}
}
Query result: the document with account_number=20 is returned
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1, # one hit
"relation" : "eq"
},
"max_score" : 1.0, # the highest score
"hits" : [
{
"_index" : "bank",
"_type" : "account",
"_id" : "20",
"_score" : 1.0,
"_source" : { # the document itself
"account_number" : 20,
"balance" : 16418,
"firstname" : "Elinor",
"lastname" : "Ratliff",
"age" : 36,
"gender" : "M",
"address" : "282 Kings Place",
"employer" : "Scentric",
"email" : "elinorratliff@scentric.com",
"city" : "Ribera",
"state" : "WA"
}
}
]
}
}
- Strings: full-text search
GET bank/_search
{
"query": {
"match": {
"address":"kings"
}
}
}
Query result: the search text is analyzed into terms for matching, and the hits are sorted by score
{
"took" : 30,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 5.990829,
"hits" : [
{
"_index" : "bank",
"_type" : "account",
"_id" : "20",
"_score" : 5.990829,
"_source" : {
"account_number" : 20,
"balance" : 16418,
"firstname" : "Elinor",
"lastname" : "Ratliff",
"age" : 36,
"gender" : "M",
"address" : "282 Kings Place",
"employer" : "Scentric",
"email" : "elinorratliff@scentric.com",
"city" : "Ribera",
"state" : "WA"
}
},
{
"_index" : "bank",
"_type" : "account",
"_id" : "722",
"_score" : 5.990829,
"_source" : {
"account_number" : 722,
"balance" : 27256,
"firstname" : "Roberts",
"lastname" : "Beasley",
"age" : 34,
"gender" : "F",
"address" : "305 Kings Hwy",
"employer" : "Quintity",
"email" : "robertsbeasley@quintity.com",
"city" : "Hayden",
"state" : "PA"
}
}
]
}
}
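query/match_phrase [phrase matching]
The value to match is treated as one whole phrase: its terms must appear together and in order, rather than being matched independently.
- match_phrase: searches for the string without breaking it apart
- field.keyword: the search succeeds only when the whole field value matches
The difference between the two:
With keyword, the search condition must equal the entire value of the field: an exact match.
match_phrase does phrase matching: a document matches as long as its text contains the phrase.
Usage example: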
GET bank/_search
{
"query": {
"match_phrase": {
"address": "mill road" # do not match documents with only mill or only road; mill road must appear as one phrase
}
}
}
{
"took" : 32,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 8.926605,
"hits" : [
{
"_index" : "bank",
"_type" : "account",
"_id" : "970",
"_score" : 8.926605,
"_source" : {
"account_number" : 970,
"balance" : 19648,
"firstname" : "Forbes",
"lastname" : "Wallace",
"age" : 28,
"gender" : "M",
"address" : "990 Mill Road", # "mill road"
"employer" : "Pheast",
"email" : "forbeswallace@pheast.com",
"city" : "Lopezo",
"state" : "AK"
}
}
]
}
}
GET bank/_search
{
"query": {
"match": {
"address.keyword": "990 Mill" # append .keyword to the field name
}
}
}
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0, # no hits, because the whole value must be equal
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
}
}
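The difference between the two lookups can be imitated in plain Python. This is only a rough model: `analyze` is a crude stand-in for the standard analyzer (lowercase, split on non-alphanumerics), and `phrase_matches` and `keyword_matches` are hypothetical names, not ES APIs.

```python
import re

def analyze(text):
    """Crude stand-in for the standard analyzer: lowercase and split into terms."""
    return re.findall(r"[0-9a-z]+", text.lower())

def phrase_matches(field_value, phrase):
    """True when the phrase's terms occur consecutively and in order in the field."""
    doc, want = analyze(field_value), analyze(phrase)
    return any(doc[i:i + len(want)] == want for i in range(len(doc) - len(want) + 1))

def keyword_matches(field_value, value):
    """True only when the whole, unanalyzed field value is equal."""
    return field_value == value

print(phrase_matches("990 Mill Road", "mill road"))   # phrase found inside the text
print(keyword_matches("990 Mill Road", "990 Mill"))   # only part of the value
```

In this model "mill road" is found inside "990 Mill Road", while the keyword comparison fails unless the complete value "990 Mill Road" is supplied.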
query/multi_match [matching multiple fields]
For example: state or address contains mill (the query text is analyzed into terms during the search)
GET bank/_search
{
"query": {
"multi_match": { # the earlier match queries named only one field
"query": "mill",
"fields": [ # mill may appear in state or in address; it need not be in both
"state",
"address"
]
}
}
}
{
"took" : 28,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 4,
"relation" : "eq"
},
"max_score" : 5.4032025,
"hits" : [
{
"_index" : "bank",
"_type" : "account",
"_id" : "970",
"_score" : 5.4032025,
"_source" : {
"account_number" : 970,
"balance" : 19648,
"firstname" : "Forbes",
"lastname" : "Wallace",
"age" : 28,
"gender" : "M",
"address" : "990 Mill Road", # contains mill
"employer" : "Pheast",
"email" : "forbeswallace@pheast.com",
"city" : "Lopezo",
"state" : "AK" # no mill
}
},
{
"_index" : "bank",
"_type" : "account",
"_id" : "136",
"_score" : 5.4032025,
"_source" : {
"account_number" : 136,
"balance" : 45801,
"firstname" : "Winnie",
"lastname" : "Holland",
"age" : 38,
"gender" : "M",
"address" : "198 Mill Lane", # contains mill
"employer" : "Neteria",
"email" : "winnieholland@neteria.com",
"city" : "Urie",
"state" : "IL" # no mill
}
},
{
"_index" : "bank",
"_type" : "account",
"_id" : "345",
"_score" : 5.4032025,
"_source" : {
"account_number" : 345,
"balance" : 9812,
"firstname" : "Parker",
"lastname" : "Hines",
"age" : 38,
"gender" : "M",
"address" : "715 Mill Avenue", # contains mill
"employer" : "Baluba",
"email" : "parkerhines@baluba.com",
"city" : "Blackgum",
"state" : "KY" # no mill
}
},
{
"_index" : "bank",
"_type" : "account",
"_id" : "472",
"_score" : 5.4032025,
"_source" : {
"account_number" : 472,
"balance" : 25571,
"firstname" : "Lee",
"lastname" : "Long",
"age" : 32,
"gender" : "F",
"address" : "288 Mill Street", # contains mill
"employer" : "Comverges",
"email" : "leelong@comverges.com",
"city" : "Movico",
"state" : "MT" # no mill
}
}
]
}
}
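query/bool/must: compound queries
- must: every condition listed under must has to match
- must_not: none of the conditions listed under must_not may match
- should: the conditions listed under should ought to match; the more of them match, the higher the score
Usage example: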
GET bank/_search
{
"query":{
"bool":{
"must":[ # every one of these conditions must match
{"match":{"address":"mill"}},
{"match":{"gender":"M"}}
]
}
}
}
GET bank/_search
{
"query": {
"bool": {
"must": [
{ "match": { "gender": "M" }},
{ "match": {"address": "mill"}}
],
"must_not": [ # the field must not have this value
{ "match": { "age": "38" }}
]
}
}
}
GET bank/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"gender": "M"
}
},
{
"match": {
"address": "mill"
}
}
],
"must_not": [
{
"match": {
"age": "18"
}
}
],
"should": [
{
"match": {
"lastname": "Wallace"
}
}
]
}
}
}
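Since a bool query is just nested JSON, it is easy to assemble from a program. The following sketch builds the same shape as the requests above from lists of (field, value) pairs; `bool_query` is a hypothetical helper, and it only covers match clauses.

```python
import json

def bool_query(must=(), must_not=(), should=()):
    """Build a bool query whose clauses are all simple match queries."""
    clauses = {"must": must, "must_not": must_not, "should": should}
    body = {
        name: [{"match": {field: value}} for field, value in pairs]
        for name, pairs in clauses.items()
        if pairs  # leave out empty clause lists
    }
    return {"query": {"bool": body}}

query = bool_query(
    must=[("gender", "M"), ("address", "mill")],
    must_not=[("age", "18")],
    should=[("lastname", "Wallace")],
)
print(json.dumps(query, indent=2))
```

The printed JSON can be pasted after GET bank/_search in Dev Tools.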
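query/filter [filtering results]
- must: contributes to the score
- should: contributes to the score
- must_not: does not contribute to the score
- filter: does not contribute to the score
Not every query needs to produce a score, in particular clauses that are used only to filter documents.
To avoid computing scores unnecessarily, ES detects these cases automatically and optimizes how the query is executed.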
GET bank/_search
{
"query": {
"bool": {
"must": [
{ "match": {"address": "mill" } }
],
"filter": { # query.bool.filter
"range": {
"balance": { # the field to filter on
"gte": "10000",
"lte": "20000"
}
}
}
}
}
}
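This first finds all documents whose address matches mill, then filters those results by 10000 <= balance <= 20000.
query/term
Like match, term matches the value of a field.
The difference:
- use match for full-text (text) fields
- use term for other, non-text fields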
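Why term is a poor fit for text fields can be seen in a toy model. A text field is analyzed at index time, so the index holds lowercased single terms; term looks up its value as-is, while match analyzes the search text first. The model below makes simplifying assumptions: `analyze` is a crude stand-in for the standard analyzer, and `term_query` and `match_query` are hypothetical functions over an in-memory index of two sample addresses.

```python
import re

def analyze(text):
    """Crude stand-in for the standard analyzer: lowercase and split into terms."""
    return re.findall(r"[0-9a-z]+", text.lower())

addresses = {"970": "990 Mill Road", "136": "198 Mill Lane"}

# Index time: every address is analyzed and only its terms are stored.
index = {}
for doc_id, address in addresses.items():
    for token in analyze(address):
        index.setdefault(token, set()).add(doc_id)

def term_query(value):
    """Look the value up exactly as given, without analyzing it."""
    return index.get(value, set())

def match_query(text):
    """Analyze the search text, then return documents containing any of its terms."""
    result = set()
    for token in analyze(text):
        result |= index.get(token, set())
    return result

print(term_query("mill Road"))   # no such single term in the index
print(match_query("mill Road"))  # documents containing mill or road
```

The index contains "mill" and "road" but no term "mill Road", so the term lookup comes back empty while match finds both documents.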
Usage example:
GET bank/_search
{
"query": {
"term": {
"address": "mill Road"
}
}
}
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0, # no hits
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
}
}
Switching to match for the same search:
{
"took" : 5,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 32,
"relation" : "eq"
},
"max_score" : 8.926605,
"hits" : [
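aggs/agg1: aggregations
Aggregations provide the ability to group data and extract statistics from it.
- aggs: run an aggregation
"aggs":{ # aggregations
"aggs_name":{ # the name of this aggregation, shown in the result set
"AGG_TYPE": {} # the aggregation type (avg, terms, ...)
}
}
# terms: shows the distribution of values; documents are grouped by the field's value and each group is counted
# avg: shows the average of the values
Usage example:
Find the age distribution and average age of everyone whose address contains mill, without showing their details
# The request has three parts: the mill query, the age distribution, and the averages
GET bank/_search
{
"query": { # find the documents that contain mill
"match": {
"address": "Mill"
}
},
"aggs": { # aggregate over the query results
"ageAgg": { # the name of the aggregation, chosen freely
"terms": { # the distribution of values
"field": "age",
"size": 10
}
},
"ageAvg": {
"avg": { # the average of age
"field": "age"
}
},
"balanceAvg": {
"avg": { # the average of balance
"field": "balance"
}
}
},
"size": 0 # do not return the hits themselves
}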
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 4, # 4 hits
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"ageAgg" : { # result of the first aggregation
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 38, # 2 documents have age 38
"doc_count" : 2
},
{
"key" : 28,
"doc_count" : 1
},
{
"key" : 32,
"doc_count" : 1
}
]
},
"ageAvg" : { # result of the second aggregation
"value" : 34.0 # the average of the age field is 34
},
"balanceAvg" : {
"value" : 25208.0
}
}
}
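What the terms and avg aggregations compute can be reproduced by hand. The sketch below uses the ages and balances of the four Mill documents returned by the multi_match example earlier (ids 970, 136, 345, 472); `terms_agg` and `avg_agg` are illustrative helpers, not ES APIs, and ties between equally sized buckets are broken by ascending key here.

```python
from collections import Counter

# age and balance of the four documents whose address contains Mill
people = [
    {"age": 28, "balance": 19648},
    {"age": 38, "balance": 45801},
    {"age": 38, "balance": 9812},
    {"age": 32, "balance": 25571},
]

def terms_agg(docs, field, size=10):
    """Count documents per distinct value, largest buckets first."""
    counts = Counter(doc[field] for doc in docs)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [{"key": key, "doc_count": n} for key, n in ordered[:size]]

def avg_agg(docs, field):
    """Arithmetic mean of a numeric field."""
    return sum(doc[field] for doc in docs) / len(docs)

print(terms_agg(people, "age"))
print(avg_agg(people, "age"))
print(avg_agg(people, "balance"))
```

These are the same buckets and averages as in the aggregations part of the response above.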
aggs/aggName/aggs/aggName: sub-aggregations
Aggregate by age, and compute the average balance of the people in each age group
GET bank/_search
{
"query": {
"match_all": {}
},
"aggs": {
"ageAgg": {
"terms": { # the distribution of ages
"field": "age",
"size": 100
},
"aggs": { # sibling of terms: a sub-aggregation run inside each age bucket
"ageAvg": { # the average balance
"avg": {
"field": "balance"
}
}
}
}
},
"size": 0
}
{
"took" : 49,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1000,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"ageAgg" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 31,
"doc_count" : 61,
"ageAvg" : {
"value" : 28312.918032786885
}
},
{
"key" : 39,
"doc_count" : 60,
"ageAvg" : {
"value" : 25269.583333333332
}
},
{
"key" : 26,
"doc_count" : 59,
"ageAvg" : {
"value" : 23194.813559322032
}
},
{
"key" : 32,
"doc_count" : 52,
"ageAvg" : {
"value" : 23951.346153846152
}
},
{
"key" : 35,
"doc_count" : 52,
"ageAvg" : {
"value" : 22136.69230769231
}
},
{
"key" : 36,
"doc_count" : 52,
"ageAvg" : {
"value" : 22174.71153846154
}
},
{
"key" : 22,
"doc_count" : 51,
"ageAvg" : {
"value" : 24731.07843137255
}
},
{
"key" : 28,
"doc_count" : 51,
"ageAvg" : {
"value" : 28273.882352941175
}
},
{
"key" : 33,
"doc_count" : 50,
"ageAvg" : {
"value" : 25093.94
}
},
{
"key" : 34,
"doc_count" : 49,
"ageAvg" : {
"value" : 26809.95918367347
}
},
{
"key" : 30,
"doc_count" : 47,
"ageAvg" : {
"value" : 22841.106382978724
}
},
{
"key" : 21,
"doc_count" : 46,
"ageAvg" : {
"value" : 26981.434782608696
}
},
{
"key" : 40,
"doc_count" : 45,
"ageAvg" : {
"value" : 27183.17777777778
}
},
{
"key" : 20,
"doc_count" : 44,
"ageAvg" : {
"value" : 27741.227272727272
}
},
{
"key" : 23,
"doc_count" : 42,
"ageAvg" : {
"value" : 27314.214285714286
}
},
{
"key" : 24,
"doc_count" : 42,
"ageAvg" : {
"value" : 28519.04761904762
}
},
{
"key" : 25,
"doc_count" : 42,
"ageAvg" : {
"value" : 27445.214285714286
}
},
{
"key" : 37,
"doc_count" : 42,
"ageAvg" : {
"value" : 27022.261904761905
}
},
{
"key" : 27,
"doc_count" : 39,
"ageAvg" : {
"value" : 21471.871794871793
}
},
{
"key" : 38,
"doc_count" : 39,
"ageAvg" : {
"value" : 26187.17948717949
}
},
{
"key" : 29,
"doc_count" : 35,
"ageAvg" : {
"value" : 29483.14285714286
}
}
]
}
}
}
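A sub-aggregation simply runs once per bucket of its parent. As a hand-rolled illustration, the snippet below groups four sample documents (the Mill documents from the earlier multi_match example, not the full 1000-document index) by age and averages the balance inside each group; `avg_balance_by_age` is a hypothetical helper that mimics the bucket layout of the response.

```python
from collections import defaultdict

people = [
    {"age": 28, "balance": 19648},
    {"age": 38, "balance": 45801},
    {"age": 38, "balance": 9812},
    {"age": 32, "balance": 25571},
]

def avg_balance_by_age(docs):
    """terms on age, with an avg(balance) sub-aggregation in every bucket."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc["age"]].append(doc["balance"])
    buckets = [
        {"key": age, "doc_count": len(b), "ageAvg": {"value": sum(b) / len(b)}}
        for age, b in groups.items()
    ]
    # largest buckets first; ties broken by ascending key
    buckets.sort(key=lambda bucket: (-bucket["doc_count"], bucket["key"]))
    return buckets

for bucket in avg_balance_by_age(people):
    print(bucket)
```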
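Aggregating nested objects
The field is mapped with "type": "nested", because the search runs on the properties of inner objects.
An array of objects is flattened by default (the values of each property are stored together, property by property):
user.name=["aaa","bbb"]
user.addr=["ccc","ddd"]
Storing the data this way can produce the following error:
a search wrongly finds {aaa,ddd}, a combination that does not exist.
Because flattening lets a search match combinations that were never stored, the nested type was introduced to solve the problem: use nested when the array holds objects (arrays of non-objects do not need it).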
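The false match is easy to reproduce outside ES. In this sketch `flatten` mimics the flattened storage shown above, `flattened_match` searches each property list independently, and `nested_match` checks each object as a unit; all three names are made up for the illustration.

```python
users = [{"name": "aaa", "addr": "ccc"}, {"name": "bbb", "addr": "ddd"}]

def flatten(objects):
    """Store the values of each property together, losing the object boundaries."""
    flat = {}
    for obj in objects:
        for key, value in obj.items():
            flat.setdefault("user." + key, []).append(value)
    return flat

def flattened_match(flat, name, addr):
    """Each condition is checked against its own list, independently."""
    return name in flat["user.name"] and addr in flat["user.addr"]

def nested_match(objects, name, addr):
    """Both conditions must hold inside the same object."""
    return any(o["name"] == name and o["addr"] == addr for o in objects)

flat = flatten(users)
print(flat)
print(flattened_match(flat, "aaa", "ddd"))  # matches a pair that was never stored
print(nested_match(users, "aaa", "ddd"))    # correctly rejects it
```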
GET articles/_search
{
"size": 0,
"aggs": {
"nested": { #
"nested": { #
"path": "payment"
},
"aggs": {
"amount_avg": {
"avg": {
"field": "payment.amount"
}
}
}
}
}
}
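Mapping: field mappings
ES field types
Reference: https://www.elastic.co/guide/en/elasticsearch/reference/7.16/mapping-types.html
A mapping defines how a document, and the fields it contains, are stored and indexed.
For example, a mapping is used to define:
- which string fields should be treated as full-text fields
- which fields contain numbers, dates, or geolocations
- whether all fields in the document can be indexed
- the format of date values
- custom rules that control the mapping of dynamically added fields
View the mapping information: GET bank/_mapping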
{
"bank" : {
"mappings" : {
"properties" : {
"account_number" : {
"type" : "long" # long type
},
"address" : {
"type" : "text", # text type: analyzed into terms for full-text search
"fields" : {
"keyword" : { # address.keyword
"type" : "keyword", # matches only on the whole field value
"ignore_above" : 256
}
}
},
"age" : {
"type" : "long"
},
"balance" : {
"type" : "long"
},
"city" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"email" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"employer" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"firstname" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"gender" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"lastname" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"state" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
}
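Differences between ES versions
Elasticsearch 7: the type concept is removed
In a relational database two tables are independent: even if they contain columns with the same name, that causes no problem. ES is different. Elasticsearch is a search engine built on Lucene, and fields with the same name under different types of one ES index end up being handled the same way in Lucene.
- Two user_name fields under two different types of the same ES index are in fact treated as the same field, so you must define the same field mapping in both types. Otherwise the same field name in different types conflicts during processing, which lowers Lucene's efficiency.
- Removing types is meant to make ES process data more efficiently.
In Elasticsearch 7.x the type parameter in the URL is optional. For example, indexing a document no longer requires a document type.
Elasticsearch 8.x no longer supports the type parameter in the URL.
Solution: migrate indices from multiple types to a single type, with a separate index for each type of document.
Move all the type data under the existing index to the designated location. See the data migration section for details.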
Tokenization
A tokenizer receives a stream of characters, splits it into individual tokens (usually individual words), and outputs a stream of tokens.
For example:
POST _analyze
{
"analyzer": "standard",
"text": "The 2 Brown-Foxes bone."
}
{
"tokens" : [
{
"token" : "the",
"start_offset" : 0,
"end_offset" : 3,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "2",
"start_offset" : 4,
"end_offset" : 5,
"type" : "<NUM>",
"position" : 1
},
{
"token" : "brown",
"start_offset" : 6,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "foxes",
"start_offset" : 12,
"end_offset" : 17,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "bone",
"start_offset" : 18,
"end_offset" : 22,
"type" : "<ALPHANUM>",
"position" : 4
}
]
}
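For plain English text like this, the token stream can be approximated with a regular expression. The function below is a rough imitation only: it splits on anything that is not an ASCII letter or digit and lowercases the result, whereas the real standard analyzer follows Unicode text segmentation rules; `simple_tokenize` is a name invented for this sketch, and it does not report a token type.

```python
import re

def simple_tokenize(text):
    """Split text into lowercased alphanumeric tokens with offsets and positions."""
    tokens = []
    for position, match in enumerate(re.finditer(r"[A-Za-z0-9]+", text)):
        tokens.append({
            "token": match.group().lower(),
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return tokens

for token in simple_tokenize("The 2 Brown-Foxes bone."):
    print(token)
```

For this sentence the tokens, offsets, and positions agree with the _analyze output above.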
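For Chinese, an additional analyzer has to be installed.
The ik analyzer
Installation
By default every language is analyzed with the "standard analyzer", which handles Chinese poorly, so a Chinese analyzer needs to be installed.
- Download
Download from: https://github.com/medcl/elasticsearch-analysis-ik/releases
- Create an ik folder, unzip the downloaded zip package, and place its contents in the ik folder
Note: if you unzip directly into the plugins directory without creating the ik folder, the service fails to start and reports an error.
# Go to the plugins directory of the ES installation and create the ik folder
cd /home/docker/elasticsearch/plugins
mkdir ik
# Put the downloaded zip package into the ik folder, then unzip it there
cd ik
unzip elasticsearch-analysis-ik-7.17.6.zip
- Restart the service
- Verify
http://192.168.106.130:9200/_cat/plugins
f70ffa187733 analysis-ik 7.17.6
Testing the analyzers
Using the default analyzer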
GET _analyze
{
"text":"我是中国人"
}
{
"tokens" : [
{
"token" : "我",
"start_offset" : 0,
"end_offset" : 1,
"type" : "<IDEOGRAPHIC>",
"position" : 0
},
{
"token" : "是",
"start_offset" : 1,
"end_offset" : 2,
"type" : "<IDEOGRAPHIC>",
"position" : 1
},
{
"token" : "中",
"start_offset" : 2,
"end_offset" : 3,
"type" : "<IDEOGRAPHIC>",
"position" : 2
},
{
"token" : "国",
"start_offset" : 3,
"end_offset" : 4,
"type" : "<IDEOGRAPHIC>",
"position" : 3
},
{
"token" : "人",
"start_offset" : 4,
"end_offset" : 5,
"type" : "<IDEOGRAPHIC>",
"position" : 4
}
]
}
GET _analyze
{
"analyzer": "ik_smart",
"text":"我是中国人"
}
{
"tokens" : [
{
"token" : "我",
"start_offset" : 0,
"end_offset" : 1,
"type" : "CN_CHAR",
"position" : 0
},
{
"token" : "是",
"start_offset" : 1,
"end_offset" : 2,
"type" : "CN_CHAR",
"position" : 1
},
{
"token" : "中国人",
"start_offset" : 2,
"end_offset" : 5,
"type" : "CN_WORD",
"position" : 2
}
]
}
GET _analyze
{
"analyzer": "ik_max_word",
"text":"我是中国人"
}
{
"tokens" : [
{
"token" : "我",
"start_offset" : 0,
"end_offset" : 1,
"type" : "CN_CHAR",
"position" : 0
},
{
"token" : "是",
"start_offset" : 1,
"end_offset" : 2,
"type" : "CN_CHAR",
"position" : 1
},
{
"token" : "中国人",
"start_offset" : 2,
"end_offset" : 5,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "中国",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 3
},
{
"token" : "国人",
"start_offset" : 3,
"end_offset" : 5,
"type" : "CN_WORD",
"position" : 4
}
]
}
Custom dictionary
- Edit IKAnalyzer.cfg.xml in /usr/share/elasticsearch/plugins/ik/config
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
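<comment>IK Analyzer extension configuration</comment>
<!-- users can configure their own extension dictionary here -->
<entry key="ext_dict"></entry>
<!-- users can configure their own extension stopword dictionary here -->
<entry key="ext_stopwords"></entry>
<!-- users can configure a remote extension dictionary here -->
<entry key="remote_ext_dict">http://ip/es/fenci.txt</entry>
<!-- users can configure a remote extension stopword dictionary here -->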
<!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
After making the change, restart the elasticsearch container.
Once the dictionary is updated, ES applies the new segmentation only to newly added data; existing data is not re-analyzed. To re-analyze existing data, run:
POST my_index/_update_by_query?conflicts=proceed
From: https://www.cnblogs.com/l12138h/p/16716351.html