How the IK Analyzer Commonly Used with Elasticsearch Works

Tags: lexeme, hit, tmpHits, analyzer, Lexeme, IK, Elasticsearch, context, word segmentation

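IK Analyzer is an open-source, lightweight Chinese word segmentation toolkit written in Java. Since version 1.0 was released in December 2006, IKAnalyzer has gone through four major versions. It started out as a Chinese segmentation component built around the open-source project Lucene, combining dictionary-based segmentation with grammar analysis algorithms. Starting with version 3.0, IK became a general-purpose segmentation component for Java, independent of the Lucene project, while still shipping a default optimized implementation for Lucene. The 2012 version added a simple disambiguation algorithm, which marks IK's evolution from pure dictionary-based segmentation toward simulated semantic segmentation.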

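Features of the 2012 version:

Uses its own "forward-iteration finest-granularity segmentation algorithm" and supports two segmentation modes: fine-grained and smart.
Tested on an ordinary PC (Core2 i7 3.4 GHz dual-core, 4 GB RAM, Windows 7 64-bit, Sun JDK 1.6_29 64-bit), IK2012 reaches a processing speed of 1.6 million characters per second (3000 KB/s).
The smart mode in the 2012 version supports simple disambiguation and merged output of numerals and measure words.
Uses a multi-sub-processor analysis model that handles English letters, digits, and Chinese words, and is compatible with Korean and Japanese characters.
Optimized dictionary storage with a smaller memory footprint. User-defined dictionary extensions are supported. Notably, in the 2012 version the dictionary supports words that mix Chinese, English, and digits.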
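
The "multi-sub-processor" idea in the list above depends on knowing what kind of character the cursor is looking at, so that the letter, digit, and CJK processors can each claim their own input. Here is a minimal sketch of such a classifier. The class name CharTypeSketch, the CharType enum, and the exact set of Unicode blocks are my own assumptions for illustration, not IK's actual code.

```java
// Hypothetical sketch: the type names and the Unicode blocks chosen here are assumptions.
class CharTypeSketch {

    enum CharType { USELESS, ARABIC, ENGLISH, CHINESE, OTHER_CJK }

    // Classifies one UTF-16 char so that the matching sub-processor can pick it up.
    static CharType identify(char c) {
        if (c >= '0' && c <= '9') {
            return CharType.ARABIC;
        }
        if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) {
            return CharType.ENGLISH;
        }
        Character.UnicodeBlock block = Character.UnicodeBlock.of(c);
        if (block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
                || block == Character.UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS
                || block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A) {
            return CharType.CHINESE;
        }
        if (block == Character.UnicodeBlock.HIRAGANA
                || block == Character.UnicodeBlock.KATAKANA
                || block == Character.UnicodeBlock.HANGUL_SYLLABLES
                || block == Character.UnicodeBlock.HANGUL_JAMO) {
            return CharType.OTHER_CJK;
        }
        // Whitespace, punctuation and everything else is not segmented.
        return CharType.USELESS;
    }

    public static void main(String[] args) {
        // Samples: a letter, a digit, U+4E2D (Chinese), U+3042 (Hiragana),
        // U+D55C (Hangul syllable) and a space.
        char[] samples = { 'I', '7', '\u4e2d', '\u3042', '\ud55c', ' ' };
        for (char c : samples) {
            System.out.println(identify(c));
        }
    }
}
```

In the CJKSegmenter excerpt further down, the check against CharacterUtil.CHAR_USELESS plays this role: a "useless" character ends every pending match.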

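I tried to look this algorithm up but could not find any serious paper on it, and the material online is pretty thin, so I had to go and read the source on GitHub: https://github.com/infinilabs/analysis-ik. I was not confident at first, but then I figured: how advanced could a technique from 12 years ago really be?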

How it works

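At its core, IK works like the lexical analyzer of a compiler: it segments text using a dictionary and a set of rules. Its central concept is the Lexeme.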

Token, Patterns, and Lexemes

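A compiler is system software that translates a source program written in a high-level language into a low-level language. Compilation is split into several phases to simplify design and development. The phases run in sequence, and the output of each phase is the input to the next. The phase that matters here is the first one: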

Lexical Analysis Phase:

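In this phase, the input is the source program, read from left to right, and the output is a sequence of tokens to be analyzed by the next phase, syntax analysis. While scanning the source code, whitespace, comments, carriage returns, preprocessor directives, macros, line feeds, spaces, and tabs are removed. The lexical analyzer, or scanner, also helps with error detection: for example, if the source contains an invalid constant or a misspelled keyword, the lexical analysis phase deals with it. Regular expressions are the standard notation for specifying the tokens of a programming language.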

Token

Basically a sequence of characters that is treated as a single unit because it cannot be broken down any further.

Lexeme

A sequence of characters in the source code that matches one of the predefined language rules; every lexeme is classified as a valid token.

Patterns

The set of rules the scanner follows when it creates a token.
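
To make the three terms concrete, here is a small sketch of a regex-based lexer for a toy expression language. The patterns are the regular expressions, the lexeme is the exact text each pattern matched, and the token is the lexeme together with its category. TinyLexer, its Token class, and the four categories are all made up for this illustration; none of them come from IK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy lexer: every name in here is hypothetical and only serves the illustration.
class TinyLexer {

    // A token pairs a category with the lexeme (the matched text).
    static final class Token {
        final String type;
        final String lexeme;

        Token(String type, String lexeme) {
            this.type = type;
            this.lexeme = lexeme;
        }

        @Override
        public String toString() {
            return type + "(" + lexeme + ")";
        }
    }

    // Categories that produce tokens; SKIP (whitespace) is matched but discarded.
    private static final String[] TYPES = { "NUMBER", "IDENT", "OP" };

    // One named group per pattern.
    private static final Pattern PATTERNS = Pattern.compile(
            "(?<NUMBER>\\d+)|(?<IDENT>[A-Za-z_]\\w*)|(?<OP>[-+*/=])|(?<SKIP>\\s+)");

    static List<Token> lex(String source) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = PATTERNS.matcher(source);
        int pos = 0;
        while (pos < source.length()) {
            m.region(pos, source.length());
            // The next lexeme must start exactly at pos, otherwise the input is invalid.
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("Unexpected character at index " + pos);
            }
            for (String type : TYPES) {
                if (m.group(type) != null) {
                    tokens.add(new Token(type, m.group(type)));
                }
            }
            pos = m.end();
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [IDENT(x), OP(=), NUMBER(42), OP(+), IDENT(y1)]
        System.out.println(lex("x = 42 + y1"));
    }
}
```

A dictionary-driven segmenter like IK replaces most of the regular expressions with dictionary lookups, but the idea of scanning left to right and emitting typed lexemes stays the same.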

Code

core/IKSegmenter.java

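/**
 * Segment the input and get the next lexeme.
 * @return the next Lexeme, or null once the input is exhausted
 * @throws java.io.IOException
 */
public synchronized Lexeme next() throws IOException {
    Lexeme l = null;
    while ((l = context.getNextLexeme()) == null) {
        /*
         * Read data from the reader to fill the buffer.
         * If the reader is consumed in several passes, the buffer has to be shifted
         * so that data read last time but not yet processed is carried over.
         */
        int available = context.fillBuffer(this.input);
        if (available <= 0) {
            // The reader has been fully consumed
            context.reset();
            return null;

        } else {
            // Initialize the cursor
            context.initCursor();
            do {
                // Run every sub-segmenter at the current cursor position
                for (ISegmenter segmenter : segmenters) {
                    segmenter.analyze(context);
                }
                // The character buffer is almost used up; new characters must be read in
                if (context.needRefillBuffer()) {
                    break;
                }
                // Move the cursor forward
            } while (context.moveCursor());
            // Reset the sub-segmenters to prepare for the next round
            for (ISegmenter segmenter : segmenters) {
                segmenter.reset();
            }
        }
        // Resolve ambiguities among the lexemes
        this.arbitrator.process(context, configuration.isUseSmart());
        // Output the lexemes to the result set and handle single CJK characters that were not segmented
        context.outputToResult();
        // Record the buffer offset of this round
        context.markBufferOffset();
    }
    return l;
}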
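
The shape of next() is a pull loop: hand out a queued result if there is one, otherwise refill the buffer, scan it character by character, queue whatever was found, and try again. The sketch below reproduces only that control flow with a much simpler task, splitting on whitespace. PullTokenizerSketch and all of its members are hypothetical, and it carries an unfinished word across refills in a StringBuilder, whereas IK shifts the unprocessed data inside its buffer.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of the "poll queue, else refill and scan" loop. Not IK code.
class PullTokenizerSketch {
    private final Reader input;
    private final char[] buffer;
    private final Deque<String> results = new ArrayDeque<>();
    // A word that started in an earlier chunk and may continue in the next one.
    private final StringBuilder pending = new StringBuilder();
    private boolean exhausted = false;

    PullTokenizerSketch(Reader input, int bufferSize) {
        this.input = input;
        this.buffer = new char[bufferSize];
    }

    // Returns the next word, or null once the input is exhausted.
    String next() throws IOException {
        String word;
        while ((word = results.poll()) == null) {
            if (exhausted) {
                return null;
            }
            int available = input.read(buffer);
            if (available <= 0) {
                // Reader is done: emit the last unfinished word, if any.
                exhausted = true;
                flushPending();
                continue;
            }
            // Scan the freshly read chunk one character at a time.
            for (int cursor = 0; cursor < available; cursor++) {
                char c = buffer[cursor];
                if (Character.isWhitespace(c)) {
                    flushPending();
                } else {
                    pending.append(c);
                }
            }
        }
        return word;
    }

    private void flushPending() {
        if (pending.length() > 0) {
            results.add(pending.toString());
            pending.setLength(0);
        }
    }

    public static void main(String[] args) throws IOException {
        // A 4-char buffer forces words to span several refills.
        PullTokenizerSketch tokenizer =
                new PullTokenizerSketch(new StringReader("hello big world"), 4);
        String word;
        while ((word = tokenizer.next()) != null) {
            System.out.println(word);
        }
    }
}
```

The caller never sees the refills. It just keeps calling next() until it gets null, which is how IKSegmenter.next() is meant to be consumed as well.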

core/QuickSortSet.java

// A node of the doubly linked list: wraps exactly one Lexeme.
class Cell implements Comparable<Cell> {
    private Cell prev;
    private Cell next;
    private Lexeme lexeme;

    Cell(Lexeme lexeme) {
        if (lexeme == null) {
            throw new IllegalArgumentException("lexeme must not be null");
        }
        this.lexeme = lexeme;
    }

    // Cells are ordered by the Lexemes they hold.
    public int compareTo(Cell o) {
        return this.lexeme.compareTo(o.lexeme);
    }

    public Cell getPrev() {
        return this.prev;
    }

    public Cell getNext() {
        return this.next;
    }

    public Lexeme getLexeme() {
        return this.lexeme;
    }
}

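Cell is what stores a Lexeme: it is a doubly linked node (prev/next) whose ordering is delegated to Lexeme.compareTo.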
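
To see why a linked cell with prev and next pointers is convenient here, the following sketch keeps a doubly linked list sorted on insertion and rejects duplicates. It stores plain ints instead of Lexemes, and SortedCellListSketch with its add and toList methods is a hypothetical stand-in, not the real QuickSortSet. The assumption is that new elements mostly arrive near the end, so the search starts from the tail.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a sorted, duplicate-free linked set. Not the real QuickSortSet.
class SortedCellListSketch {

    private static final class Cell {
        Cell prev;
        Cell next;
        final int value;

        Cell(int value) {
            this.value = value;
        }
    }

    private Cell head;
    private Cell tail;

    // Inserts value in ascending order; returns false if it is already present.
    boolean add(int value) {
        Cell cell = new Cell(value);
        if (head == null) {
            head = cell;
            tail = cell;
            return true;
        }
        if (value > tail.value) {
            // Common case: append at the end.
            tail.next = cell;
            cell.prev = tail;
            tail = cell;
            return true;
        }
        if (value < head.value) {
            // New smallest element: prepend.
            head.prev = cell;
            cell.next = head;
            head = cell;
            return true;
        }
        // Walk backward from the tail to the last cell that is <= value.
        // head.value <= value holds here, so the walk always stops on a cell.
        Cell index = tail;
        while (index.value > value) {
            index = index.prev;
        }
        if (index.value == value) {
            return false; // duplicate
        }
        // index is not the tail here, so index.next is never null.
        cell.prev = index;
        cell.next = index.next;
        index.next.prev = cell;
        index.next = cell;
        return true;
    }

    List<Integer> toList() {
        List<Integer> out = new ArrayList<>();
        for (Cell c = head; c != null; c = c.next) {
            out.add(c.value);
        }
        return out;
    }

    public static void main(String[] args) {
        SortedCellListSketch set = new SortedCellListSketch();
        for (int v : new int[] { 5, 1, 3, 3, 4 }) {
            set.add(v);
        }
        // Prints [1, 3, 4, 5]
        System.out.println(set.toList());
    }
}
```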

core/CJKSegmenter.java

package org.wltea.analyzer.core;

import java.util.LinkedList;
import java.util.List;

import org.wltea.analyzer.dic.Dictionary;
import org.wltea.analyzer.dic.Hit;

/**
 * Sub-segmenter for Chinese, Japanese, and Korean text
 */
class CJKSegmenter implements ISegmenter {

    // Label of this sub-segmenter
    static final String SEGMENTER_NAME = "CJK_SEGMENTER";
    // Queue of hits that are still waiting to be completed
    private List<Hit> tmpHits;


    CJKSegmenter() {
        this.tmpHits = new LinkedList<Hit>();
    }

    /* (non-Javadoc)
     * @see org.wltea.analyzer.core.ISegmenter#analyze(org.wltea.analyzer.core.AnalyzeContext)
     */
    public void analyze(AnalyzeContext context) {
        if (CharacterUtil.CHAR_USELESS != context.getCurrentCharType()) {

            // First, process the hits already in tmpHits
            if (!this.tmpHits.isEmpty()) {
                // Iterate over a copy, because hits may be removed from the queue inside the loop
                Hit[] tmpArray = this.tmpHits.toArray(new Hit[this.tmpHits.size()]);
                for (Hit hit : tmpArray) {
                    hit = Dictionary.getSingleton().matchWithHit(context.getSegmentBuff(), context.getCursor(), hit);
                    if (hit.isMatch()) {
                        // Output the current word
                        Lexeme newLexeme = new Lexeme(context.getBufferOffset(), hit.getBegin(), context.getCursor() - hit.getBegin() + 1, Lexeme.TYPE_CNWORD);
                        context.addLexeme(newLexeme);

                        if (!hit.isPrefix()) { // Not a prefix of a longer word: no further matching needed, remove it
                            this.tmpHits.remove(hit);
                        }

                    } else if (hit.isUnmatch()) {
                        // The hit is not a word: remove it
                        this.tmpHits.remove(hit);
                    }
                }
            }

            //*********************************
            // Then try a single-character match at the current cursor position
            Hit singleCharHit = Dictionary.getSingleton().matchInMainDict(context.getSegmentBuff(), context.getCursor(), 1);
            if (singleCharHit.isMatch()) { // The first character is a word by itself
                // Output the current word
                Lexeme newLexeme = new Lexeme(context.getBufferOffset(), context.getCursor(), 1, Lexeme.TYPE_CNWORD);
                context.addLexeme(newLexeme);

                // It is also the prefix of a longer word
                if (singleCharHit.isPrefix()) {
                    // Prefix match: put it into the hit list
                    this.tmpHits.add(singleCharHit);
                }
            } else if (singleCharHit.isPrefix()) { // The first character is only a word prefix
                // Prefix match: put it into the hit list
                this.tmpHits.add(singleCharHit);
            }


        } else {
            // Hit a CHAR_USELESS character
            // Clear the queue
            this.tmpHits.clear();
        }

        // Check whether the buffer has been fully consumed
        if (context.isBufferConsumed()) {
            // Clear the queue
            this.tmpHits.clear();
        }

        // Decide whether to lock the buffer
        if (this.tmpHits.size() == 0) {
            context.unlockBuffer(SEGMENTER_NAME);

        } else {
            context.lockBuffer(SEGMENTER_NAME);
        }
    }

    /* (non-Javadoc)
     * @see org.wltea.analyzer.core.ISegmenter#reset()
     */
    public void reset() {
        // Clear the queue
        this.tmpHits.clear();
    }

}
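
The interesting part of analyze() is the tmpHits queue: every dictionary prefix that is still "alive" gets extended by the current character, and then the current character starts a new match of its own. Below is a minimal, self-contained sketch of that mechanism on top of a plain trie. Node, Hit, buildTrie, and segment are hypothetical names of mine and only imitate the behavior seen in the excerpt. The example uses ASCII words so it runs anywhere, but Chinese text works the same way, one character per step.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of prefix-hit tracking over a trie. Not IK's Dictionary or Hit.
class PrefixHitSketch {

    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    // A partial match: where it began and which trie node it has reached so far.
    static final class Hit {
        final int begin;
        final Node node;

        Hit(int begin, Node node) {
            this.begin = begin;
            this.node = node;
        }
    }

    static Node buildTrie(String... words) {
        Node root = new Node();
        for (String word : words) {
            Node node = root;
            for (char c : word.toCharArray()) {
                node = node.children.computeIfAbsent(c, k -> new Node());
            }
            node.isWord = true;
        }
        return root;
    }

    // Emits every dictionary word found in text, in the order the matches complete.
    static List<String> segment(String text, Node root) {
        List<String> out = new ArrayList<>();
        List<Hit> tmpHits = new LinkedList<>();
        for (int cursor = 0; cursor < text.length(); cursor++) {
            char c = text.charAt(cursor);

            // 1. Extend every pending hit with the current character.
            List<Hit> stillAlive = new LinkedList<>();
            for (Hit hit : tmpHits) {
                Node next = hit.node.children.get(c);
                if (next == null) {
                    continue; // no such word: drop the hit
                }
                if (next.isWord) {
                    out.add(text.substring(hit.begin, cursor + 1));
                }
                if (!next.children.isEmpty()) {
                    stillAlive.add(new Hit(hit.begin, next)); // still a prefix
                }
            }
            tmpHits = stillAlive;

            // 2. Try a single-character match at the cursor.
            Node single = root.children.get(c);
            if (single != null) {
                if (single.isWord) {
                    out.add(String.valueOf(c));
                }
                if (!single.children.isEmpty()) {
                    tmpHits.add(new Hit(cursor, single));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Node dict = buildTrie("cat", "catalog", "at", "a", "log");
        // Prints [a, cat, at, a, catalog, log]
        System.out.println(segment("catalog", dict));
    }
}
```

Note that the sketch emits words in the order their matches complete. In IK the lexemes go through context.addLexeme and are sorted and disambiguated afterwards, so the final order and selection can differ.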

The Dictionary is matched against the input to produce Lexemes, and core/AnalyzeContext.java ties everything together to complete the Lexical Analysis Phase.

From: https://blog.csdn.net/weixin_41446370/article/details/141551060