Chinese Tokenizer: IK

Date: 2022-10-09 22:22:54

Tags: Chinese, end, CN, start, ik, tokenizer, offset, type

Installing IK

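1  docker exec -it CONTAINER_ID /bin/bash

2 # Download the release that matches your Elasticsearch version (I am using ES 8.4.1). Releases: https://github.com/medcl/elasticsearch-analysis-ik/releases
   ./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v8.4.1/elasticsearch-analysis-ik-8.4.1.zip
3 # Once the plugin is installed, exit the container and restart ES from the host
   docker restart CONTAINER_ID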
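The download URL in step 2 contains the version number twice, and it has to match the Elasticsearch version. As a small sketch, assuming other releases follow the same URL pattern as the 8.4.1 link above (this is inferred from that single link, so check the releases page first), the hypothetical helpers `ik_plugin_url` and `install_command` build the command from a version string:

```python
# Hypothetical helpers, not part of Elasticsearch or the IK plugin.
# Assumption: the URL pattern is inferred from the single 8.4.1 link above
# and may not hold for every release; confirm on the releases page.
RELEASES = "https://github.com/medcl/elasticsearch-analysis-ik/releases"


def ik_plugin_url(es_version: str) -> str:
    """Return the zip URL of the IK release matching es_version."""
    return (
        f"{RELEASES}/download/v{es_version}/"
        f"elasticsearch-analysis-ik-{es_version}.zip"
    )


def install_command(es_version: str) -> str:
    """Return the install command to run inside the ES container."""
    return f"./bin/elasticsearch-plugin install {ik_plugin_url(es_version)}"


if __name__ == "__main__":
    # Prints the same command as step 2 for ES 8.4.1.
    print(install_command("8.4.1"))
```

Passing the version once keeps the tag and the file name in sync, which avoids installing a plugin build that does not match the running ES version.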

Testing IK

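ik_smart: coarse-grained segmentation
ik_max_word: finest-grained segmentation (emits overlapping words, as the response to the request below shows)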

POST _analyze
{
  "analyzer": "ik_max_word",
  "text": "中华人民共和国"
}

Response:

{
  "tokens": [
    {
      "token": "中华人民共和国",
      "start_offset": 0,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "中华人民",
      "start_offset": 0,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "中华",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "华人",
      "start_offset": 1,
      "end_offset": 3,
      "type": "CN_WORD",
      "position": 3
    },
    {
      "token": "人民共和国",
      "start_offset": 2,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 4
    },
    {
      "token": "人民",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 5
    },
    {
      "token": "共和国",
      "start_offset": 4,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 6
    },
    {
      "token": "共和",
      "start_offset": 4,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 7
    },
    {
      "token": "国",
      "start_offset": 6,
      "end_offset": 7,
      "type": "CN_CHAR",
      "position": 8
    }
  ]
}
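In the response, start_offset and end_offset give each token's position in the input text, and ik_max_word returns tokens that overlap one another. A minimal sketch for reading it, with the tokens copied from the response above into a plain Python list; `mismatched_tokens` is a hypothetical helper name, and treating the offsets as Python string indices is an assumption that holds for this text because every character is in the Basic Multilingual Plane:

```python
# Tokens copied from the ik_max_word response above:
# (token, start_offset, end_offset, type)
TEXT = "中华人民共和国"
TOKENS = [
    ("中华人民共和国", 0, 7, "CN_WORD"),
    ("中华人民", 0, 4, "CN_WORD"),
    ("中华", 0, 2, "CN_WORD"),
    ("华人", 1, 3, "CN_WORD"),
    ("人民共和国", 2, 7, "CN_WORD"),
    ("人民", 2, 4, "CN_WORD"),
    ("共和国", 4, 7, "CN_WORD"),
    ("共和", 4, 6, "CN_WORD"),
    ("国", 6, 7, "CN_CHAR"),
]


def mismatched_tokens(text, tokens):
    """Return the tokens whose text differs from text[start:end]."""
    return [t for t in tokens if text[t[1]:t[2]] != t[0]]


if __name__ == "__main__":
    for token, start, end, kind in TOKENS:
        # Show each token next to the slice of the input it covers.
        print(f"{start}-{end} {kind} {token} -> {TEXT[start:end]}")
    print("mismatches:", mismatched_tokens(TEXT, TOKENS))
```

Every token is a substring of the input at its stated offsets, so one character can belong to several tokens; for example, 人 appears in both 华人 (1-3) and 人民 (2-4).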

Source: https://www.cnblogs.com/mister-liu/p/16773900.html
