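Purpose: normalize terms.
Compared purely as strings, mom and mother are different, so a search for one will not match the other.
In meaning, however, they are the same word and should match.
This is where normalization comes in: it reduces terms to a standard form. Typical steps include: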
- Remove filler words (stop words)
- Normalize letter case
- Convert tenses (reduce inflected forms to a base form)
- ...
Different analyzers process text differently. Compare standard and english:
Default analyzer: standard
Request
GET _analyze
{
"analyzer": "standard",
"text": "hello Mr.Li, my name is Hanmeimei. I'm a students"
}
Response
{
"tokens" : [
{
"token" : "hello",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "mr.li",
"start_offset" : 6,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "my",
"start_offset" : 13,
"end_offset" : 15,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "name",
"start_offset" : 16,
"end_offset" : 20,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "is",
"start_offset" : 21,
"end_offset" : 23,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "hanmeimei",
"start_offset" : 24,
"end_offset" : 33,
"type" : "<ALPHANUM>",
"position" : 5
},
{
"token" : "i'm",
"start_offset" : 35,
"end_offset" : 38,
"type" : "<ALPHANUM>",
"position" : 6
},
{
"token" : "a",
"start_offset" : 39,
"end_offset" : 40,
"type" : "<ALPHANUM>",
"position" : 7
},
{
"token" : "students",
"start_offset" : 41,
"end_offset" : 49,
"type" : "<ALPHANUM>",
"position" : 8
}
]
}
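- Uppercase letters are converted to lowercase
- Stop words such as is and a are not removed
- The text ends with i'm a students, which has an extra s, but it is left unchanged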
English analyzer: english
Request
GET _analyze
{
"analyzer": "english",
"text": "hello Mr.Li, my name is Hanmeimei. I'm a students"
}
Response
{
"tokens" : [
{
"token" : "hello",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "mr.li",
"start_offset" : 6,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "my",
"start_offset" : 13,
"end_offset" : 15,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "name",
"start_offset" : 16,
"end_offset" : 20,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "hanmeimei",
"start_offset" : 24,
"end_offset" : 33,
"type" : "<ALPHANUM>",
"position" : 5
},
{
"token" : "i'm",
"start_offset" : 35,
"end_offset" : 38,
"type" : "<ALPHANUM>",
"position" : 6
},
{
"token" : "student",
"start_offset" : 41,
"end_offset" : 49,
"type" : "<ALPHANUM>",
"position" : 8
}
]
}
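- Uppercase letters are converted to lowercase
- The stop words is and a are removed (positions 4 and 7 are missing from the response)
- The text ends with i'm a students; the trailing s in students is stemmed away, giving student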
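To make the difference between the two responses easier to see, the token lists can be compared by position. The following is a small sketch in Python; the `diff_tokens` helper is hypothetical, and the `(position, token)` pairs are copied by hand from the two responses above rather than fetched from a cluster.

```python
# (position, token) pairs copied from the two responses above.
standard = [(0, "hello"), (1, "mr.li"), (2, "my"), (3, "name"), (4, "is"),
            (5, "hanmeimei"), (6, "i'm"), (7, "a"), (8, "students")]
english = [(0, "hello"), (1, "mr.li"), (2, "my"), (3, "name"),
           (5, "hanmeimei"), (6, "i'm"), (8, "student")]


def diff_tokens(before, after):
    """Return tokens that were removed and tokens that were rewritten."""
    after_by_pos = dict(after)
    # A position missing from `after` means the token was dropped.
    removed = [tok for pos, tok in before if pos not in after_by_pos]
    # Same position but different text means the token was rewritten.
    changed = [(tok, after_by_pos[pos]) for pos, tok in before
               if pos in after_by_pos and after_by_pos[pos] != tok]
    return removed, changed


removed, changed = diff_tokens(standard, english)
print("removed:", removed)  # removed: ['is', 'a']
print("changed:", changed)  # changed: [('students', 'student')]
```

Matching on position rather than on list index matters here: the english analyzer keeps the original position numbers, so the gaps at 4 and 7 are what reveal the removed stop words.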