Anatomy of an Analyzer
An analyzer consists of three parts: Character Filters, a Tokenizer, and Token Filters.
Character Filters
- Process the text before it reaches the Tokenizer; multiple Character Filters can be chained.
- Built into ES: HTML Strip, Mapping, Pattern Replace
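For a quick look at a character filter on its own, the _analyze API lets you run one without creating an index. A minimal illustration with the built-in html_strip filter (the keyword tokenizer keeps the text as a single token, so the filter's effect is easy to see):

POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": [ "html_strip" ],
  "text": "<p>I'm so <b>happy</b>!</p>"
}
## Yields the single token "\nI'm so happy!\n"; the tags are removed before tokenization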
Tokenizer
- Splits the original text into individual terms according to a set of rules.
- Built into ES: whitespace, standard, uax_url_email, pattern, keyword, path_hierarchy
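A tokenizer can be tried out the same way. For example, the standard tokenizer splits on word boundaries (illustrative snippet):

POST _analyze
{
  "tokenizer": "standard",
  "text": "The 2 QUICK Brown-Foxes"
}
## Produces the terms: The, 2, QUICK, Brown, Foxes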
Token Filters
- Add, modify, or remove the terms produced by the Tokenizer.
- Built into ES: lowercase, stop, synonym
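Token filters are appended after a tokenizer; for instance, the lowercase filter normalizes case (illustrative snippet):

POST _analyze
{
  "tokenizer": "standard",
  "filter": [ "lowercase" ],
  "text": "THE Quick FoX"
}
## Produces the terms: the, quick, fox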
Custom Analyzers
If the built-in analyzers do not meet your needs, you can define a custom analyzer on an index. A complete analyzer is made up of the following three parts:
- zero or more character filters
- exactly one tokenizer
- zero or more token filters
The analyzer runs in that order as well: first the character filters strip or rewrite characters matching their rules, then the tokenizer splits the text, and finally the token filters transform the resulting terms.
Example 1: a sentence containing HTML tags and non-ASCII characters, processed with the standard tokenizer, then lowercased and ASCII-folded
- Character Filter - HTML Strip Character Filter
- Tokenizer - Standard Tokenizer
- Token Filters - Lowercase Token Filter, ASCII-Folding Token Filter
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "Is this <b>déjà vu</b>?"
}
## The output is:
{
  "tokens" : [
    {
      "token" : "is",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "this",
      "start_offset" : 3,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "deja",
      "start_offset" : 11,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "vu",
      "start_offset" : 16,
      "end_offset" : 22,
      "type" : "<ALPHANUM>",
      "position" : 3
    }
  ]
}
Example 2: map emoticon symbols in the sentence with a character filter, split on punctuation with the pattern tokenizer, lowercase, and remove stop words
- Character Filter - Mapping Character Filter
- Tokenizer - Pattern Tokenizer
- Token Filters - Lowercase Token Filter, Stop Token Filter
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": [
            "emoticons"
          ],
          "tokenizer": "punctuation",
          "filter": [
            "lowercase",
            "english_stop"
          ]
        }
      },
      "tokenizer": {
        "punctuation": {
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "char_filter": {
        "emoticons": {
          "type": "mapping",
          "mappings": [
            ":) => _happy_",
            ":( => _sad_"
          ]
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "I'm a :) person, and you?"
}
## The output is:
{
  "tokens" : [
    {
      "token" : "i'm",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "_happy_",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "person",
      "start_offset" : 9,
      "end_offset" : 15,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "you",
      "start_offset" : 21,
      "end_offset" : 24,
      "type" : "word",
      "position" : 5
    }
  ]
}
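To actually use the custom analyzer, reference it by name in a field mapping so it runs at index time (and, by default, at search time). A minimal sketch; the field name "title" here is illustrative:

PUT my_index/_mapping
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "my_custom_analyzer"
    }
  }
}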
From: https://www.cnblogs.com/tenic/p/16795906.html