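Download the IK analyzer source for your version from GitHub and modify the CharacterUtil.identifyCharType method so that special symbols and punctuation marks are treated as Chinese characters instead of falling through to CHAR_USELESS.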
Add the following branches:

    else if (ub == Character.UnicodeBlock.GREEK                  // Greek
            || ub == Character.UnicodeBlock.GREEK_EXTENDED       // Greek Extended
            || ub == Character.UnicodeBlock.BASIC_LATIN          // Basic Latin
            || ub == Character.UnicodeBlock.LATIN_1_SUPPLEMENT   // Latin-1 Supplement
            || ub == Character.UnicodeBlock.LATIN_EXTENDED_A     // Latin Extended-A
            || ub == Character.UnicodeBlock.LATIN_EXTENDED_B) {  // Latin Extended-B
        return CHAR_CHINESE;
    } else if (ub == Character.UnicodeBlock.GENERAL_PUNCTUATION
            || ub == Character.UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION
            // Note: HALFWIDTH_AND_FULLWIDTH_FORMS is already matched by the
            // existing CHAR_OTHER_CJK branch, so this test never fires here.
            || ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS
            || ub == Character.UnicodeBlock.CJK_COMPATIBILITY_FORMS
            || ub == Character.UnicodeBlock.VERTICAL_FORMS) {
        // Punctuation
        return CHAR_CHINESE;
    }
The complete CharacterUtil.identifyCharType method:

    static int identifyCharType(char input) {
        if (input >= '0' && input <= '9') {
            return CHAR_ARABIC;
        } else if ((input >= 'a' && input <= 'z') || (input >= 'A' && input <= 'Z')) {
            return CHAR_ENGLISH;
        } else {
            Character.UnicodeBlock ub = Character.UnicodeBlock.of(input);
            if (ub == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
                    || ub == Character.UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS
                    || ub == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A) {
                // Currently known Unicode blocks of Chinese characters
                return CHAR_CHINESE;
            } else if (ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS // full-width digits, Japanese and Korean characters
                    // Korean
                    || ub == Character.UnicodeBlock.HANGUL_SYLLABLES
                    || ub == Character.UnicodeBlock.HANGUL_JAMO
                    || ub == Character.UnicodeBlock.HANGUL_COMPATIBILITY_JAMO
                    // Japanese
                    || ub == Character.UnicodeBlock.HIRAGANA  // hiragana
                    || ub == Character.UnicodeBlock.KATAKANA  // katakana
                    || ub == Character.UnicodeBlock.KATAKANA_PHONETIC_EXTENSIONS) {
                return CHAR_OTHER_CJK;
            } else if (ub == Character.UnicodeBlock.GREEK                  // Greek
                    || ub == Character.UnicodeBlock.GREEK_EXTENDED       // Greek Extended
                    || ub == Character.UnicodeBlock.BASIC_LATIN          // Basic Latin
                    || ub == Character.UnicodeBlock.LATIN_1_SUPPLEMENT   // Latin-1 Supplement
                    || ub == Character.UnicodeBlock.LATIN_EXTENDED_A     // Latin Extended-A
                    || ub == Character.UnicodeBlock.LATIN_EXTENDED_B) {  // Latin Extended-B
                return CHAR_CHINESE;
            } else if (ub == Character.UnicodeBlock.GENERAL_PUNCTUATION
                    || ub == Character.UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION
                    // Note: already matched by the CHAR_OTHER_CJK branch above,
                    // so this test never fires here.
                    || ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS
                    || ub == Character.UnicodeBlock.CJK_COMPATIBILITY_FORMS
                    || ub == Character.UnicodeBlock.VERTICAL_FORMS) {
                // Punctuation
                return CHAR_CHINESE;
            }
        }
        // All other characters are not processed
        return CHAR_USELESS;
    }
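To check which symbols the modified method actually covers before rebuilding the plugin, the branch order can be copied into a small standalone class and run against sample characters. This is only a sketch: the class name CharTypeDemo and the constant values are assumptions made for illustration, not taken from the IK source, and the branches are written as early returns rather than an else-if chain. Only the order of the checks mirrors the method above.

```java
import java.lang.Character.UnicodeBlock;

public class CharTypeDemo {
    // Stand-in constants for this sketch; the values are assumed, not copied from IK.
    static final int CHAR_USELESS = 0;
    static final int CHAR_ARABIC = 1;
    static final int CHAR_ENGLISH = 2;
    static final int CHAR_CHINESE = 4;
    static final int CHAR_OTHER_CJK = 8;

    // Same branch order as the modified identifyCharType method.
    static int identifyCharType(char input) {
        if (input >= '0' && input <= '9') {
            return CHAR_ARABIC;
        }
        if ((input >= 'a' && input <= 'z') || (input >= 'A' && input <= 'Z')) {
            return CHAR_ENGLISH;
        }
        UnicodeBlock ub = UnicodeBlock.of(input);
        if (ub == UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
                || ub == UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS
                || ub == UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A) {
            return CHAR_CHINESE;
        }
        if (ub == UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS
                || ub == UnicodeBlock.HANGUL_SYLLABLES
                || ub == UnicodeBlock.HANGUL_JAMO
                || ub == UnicodeBlock.HANGUL_COMPATIBILITY_JAMO
                || ub == UnicodeBlock.HIRAGANA
                || ub == UnicodeBlock.KATAKANA
                || ub == UnicodeBlock.KATAKANA_PHONETIC_EXTENSIONS) {
            return CHAR_OTHER_CJK;
        }
        // Added: Greek and Latin blocks are treated as Chinese.
        if (ub == UnicodeBlock.GREEK
                || ub == UnicodeBlock.GREEK_EXTENDED
                || ub == UnicodeBlock.BASIC_LATIN
                || ub == UnicodeBlock.LATIN_1_SUPPLEMENT
                || ub == UnicodeBlock.LATIN_EXTENDED_A
                || ub == UnicodeBlock.LATIN_EXTENDED_B) {
            return CHAR_CHINESE;
        }
        // Added: punctuation blocks are treated as Chinese.
        if (ub == UnicodeBlock.GENERAL_PUNCTUATION
                || ub == UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION
                || ub == UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS // unreachable, caught above
                || ub == UnicodeBlock.CJK_COMPATIBILITY_FORMS
                || ub == UnicodeBlock.VERTICAL_FORMS) {
            return CHAR_CHINESE;
        }
        return CHAR_USELESS;
    }

    public static void main(String[] args) {
        // Greek alpha, '@', em dash, ideographic full stop,
        // full-width comma, euro sign, and a CJK ideograph.
        char[] samples = {'\u03B1', '@', '\u2014', '\u3002', '\uFF0C', '\u20AC', '\u4E2D'};
        for (char c : samples) {
            System.out.printf("U+%04X %-32s type=%d%n",
                    (int) c, UnicodeBlock.of(c), identifyCharType(c));
        }
    }
}
```

Two things show up when running it. A full-width comma (U+FF0C) is caught by the CHAR_OTHER_CJK branch, so it is only treated as Chinese if IK has already normalized it to its half-width form before this method is called, which is worth verifying in your version. Symbols outside the listed blocks, such as the euro sign (U+20AC, in CURRENCY_SYMBOLS), still end up as CHAR_USELESS; their blocks would have to be added to one of the new branches in the same way.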
Then build the jar with Maven and replace the corresponding jar in the IK package you are currently using.
From: https://www.cnblogs.com/niuyourou/p/17056179.html