1. Built-in Analyzers
- Analysis steps
1). character filter: preprocess the text before tokenization, e.g. strip HTML tags (<b>hello</b> -> hello) or map characters (& -> and: I & you -> I and you)
2). tokenizer: split the text into tokens, e.g. hello you and me -> hello, you, and, me
3). token filter: normalize each token, e.g. lowercase, stop word removal (了/的/呢 in Chinese, a/the in English), dogs -> dog (plural to singular), liked -> like (tense normalization), Tom -> tom (case folding), a/the/an -> dropped, mother -> mom (abbreviation), small -> little (synonym).
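The three stages above can be sketched in plain Python. This is an illustrative approximation, not Elasticsearch's actual implementation; all function names here are made up for the sketch:

```python
import re

def char_filter(text):
    # Stage 1: preprocess raw text - strip HTML tags, map & -> and
    text = re.sub(r"<[^>]+>", "", text)
    return text.replace("&", " and ")

def tokenizer(text):
    # Stage 2: split on word boundaries
    return re.findall(r"\w+", text)

def token_filter(tokens):
    # Stage 3: normalize tokens - lowercase, drop stop words
    stop_words = {"a", "an", "the"}
    return [t.lower() for t in tokens if t.lower() not in stop_words]

def analyze(text):
    return token_filter(tokenizer(char_filter(text)))

print(analyze("I & you liked <b>the</b> dog"))
# ['i', 'and', 'you', 'liked', 'dog']
```

Each stage hands its output to the next, which is exactly how the components are chained in a real analyzer.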
standard analyzer
An analyzer is built from three kinds of components: character filters (preprocessing), a tokenizer (splitting), and token filters (normalization). The standard analyzer is composed of:
- standard tokenizer: splits on word boundaries
- standard token filter: does nothing
- lowercase token filter: converts all letters to lowercase
- stop token filter (disabled by default): removes stop words, e.g. a, in, the, is
- Modifying analyzer settings
Enable the English stop word list:
DELETE my_index
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "es_std": {                  # name of the custom analyzer
          "type": "standard",
          "stopwords": "_english_"   # enable the English stop word list
        }
      }
    }
  }
}
Analyze with the default standard analyzer:
GET /my_index/_analyze
{
"analyzer": "standard",
"text": "a dog in the house"
}
Analyze with the stop-word-enabled analyzer:
GET /my_index/_analyze
{
"analyzer": "es_std",
"text": "a dog in the house"
}
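The difference between the two requests is only the stop word filter. A Python sketch of the expected behavior (the `ENGLISH_STOPS` set below is an assumption approximating Lucene's `_english_` list):

```python
import re

# Approximation of Lucene's default English stop word list (assumption).
ENGLISH_STOPS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such",
    "that", "the", "their", "then", "there", "these", "they", "this",
    "to", "was", "will", "with",
}

def standard(text, stopwords=()):
    # Lowercase word-boundary tokenization, then optional stop word removal.
    tokens = [t.lower() for t in re.findall(r"\w+", text)]
    return [t for t in tokens if t not in stopwords]

print(standard("a dog in the house"))
# ['a', 'dog', 'in', 'the', 'house']
print(standard("a dog in the house", ENGLISH_STOPS))
# ['dog', 'house']
```

With the plain standard analyzer every word survives; with `_english_` stop words enabled, only `dog` and `house` remain.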
2. Custom Analyzers
DELETE my_index
PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {               # custom character filters (preprocessing)
        "&_to_and": {                # filter name
          "type": "mapping",
          "mappings": ["&=>and"]     # map & to and
        }
      },
      "filter": {                    # custom token filters (normalization)
        "my_stopwords": {            # filter name
          "type": "stop",
          "stopwords": ["the", "a"]  # stop words to remove
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip", "&_to_and"],
          "tokenizer": "standard",
          "filter": ["lowercase", "my_stopwords"]
        }
      }
    }
  }
}
Verify:
GET my_index/_analyze
{
"analyzer": "my_analyzer",
"text": "tom&jerry are a friend in the house, <a> HAHA!!!"
}
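Tracing the sample text through my_analyzer's stages can be sketched in Python (an approximation of the pipeline, not the real Lucene code):

```python
import re

def my_analyzer(text):
    text = re.sub(r"<[^>]*>", "", text)   # char_filter: html_strip
    text = text.replace("&", "and")       # char_filter: &_to_and ("&=>and")
    tokens = re.findall(r"\w+", text)     # tokenizer: standard (approximated)
    tokens = [t.lower() for t in tokens]  # filter: lowercase
    stops = {"the", "a"}                  # filter: my_stopwords
    return [t for t in tokens if t not in stops]

print(my_analyzer("tom&jerry are a friend in the house, <a> HAHA!!!"))
# ['tomandjerry', 'are', 'friend', 'in', 'house', 'haha']
```

`<a>` is stripped, `&` becomes `and` (so `tom&jerry` is tokenized as one word, `tomandjerry`), `HAHA` is lowercased, and `the`/`a` are dropped.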
From: https://www.cnblogs.com/l-zl/p/18119214