Making the IK Analyzer tokenize special symbols and punctuation

Date: 2023-01-16 19:44:39
Tags: LATIN, Character, IK, punctuation, tokenizer, UnicodeBlock, input, special symbols, ub

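  Download the IK Analyzer source code for the version you use from GitHub and modify the CharacterUtil.identifyCharType method so that special symbols and punctuation are handled the same way as Chinese characters, that is, the method returns CHAR_CHINESE for them instead of CHAR_USELESS.

  Add the following branches to the if/else chain: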

            } else if (ub == Character.UnicodeBlock.GREEK // Greek
                    // Greek Extended
                    || ub == Character.UnicodeBlock.GREEK_EXTENDED
                    // Basic Latin (ASCII punctuation and symbols; digits and
                    // letters are already handled earlier in the method)
                    || ub == Character.UnicodeBlock.BASIC_LATIN
                    // Latin-1 Supplement
                    || ub == Character.UnicodeBlock.LATIN_1_SUPPLEMENT
                    // Latin Extended-A
                    || ub == Character.UnicodeBlock.LATIN_EXTENDED_A
                    // Latin Extended-B
                    || ub == Character.UnicodeBlock.LATIN_EXTENDED_B) {
                return CHAR_CHINESE;
            } else if (ub == Character.UnicodeBlock.GENERAL_PUNCTUATION
                    || ub == Character.UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION
                    // Note: this block is already matched by the earlier
                    // CHAR_OTHER_CJK branch, so this condition never fires
                    || ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS
                    || ub == Character.UnicodeBlock.CJK_COMPATIBILITY_FORMS
                    || ub == Character.UnicodeBlock.VERTICAL_FORMS) {
                // Punctuation
                return CHAR_CHINESE;
            }

  The complete CharacterUtil.identifyCharType method after the change:

    static int identifyCharType(char input) {
        if (input >= '0' && input <= '9') {
            return CHAR_ARABIC;
        } else if ((input >= 'a' && input <= 'z')
                || (input >= 'A' && input <= 'Z')) {
            return CHAR_ENGLISH;
        } else {
            Character.UnicodeBlock ub = Character.UnicodeBlock.of(input);
            if (ub == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
                    || ub == Character.UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS
                    || ub == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A) {
                // Currently known Chinese character blocks
                return CHAR_CHINESE;
            } else if (ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS // Full-width digits/characters and Japanese/Korean characters
                    // Korean character blocks
                    || ub == Character.UnicodeBlock.HANGUL_SYLLABLES
                    || ub == Character.UnicodeBlock.HANGUL_JAMO
                    || ub == Character.UnicodeBlock.HANGUL_COMPATIBILITY_JAMO
                    // Japanese character blocks
                    || ub == Character.UnicodeBlock.HIRAGANA // Hiragana
                    || ub == Character.UnicodeBlock.KATAKANA // Katakana
                    || ub == Character.UnicodeBlock.KATAKANA_PHONETIC_EXTENSIONS) {
                return CHAR_OTHER_CJK;
            } else if (ub == Character.UnicodeBlock.GREEK // Greek
                    // Greek Extended
                    || ub == Character.UnicodeBlock.GREEK_EXTENDED
                    // Basic Latin (ASCII punctuation and symbols; digits and
                    // letters are already handled above)
                    || ub == Character.UnicodeBlock.BASIC_LATIN
                    // Latin-1 Supplement
                    || ub == Character.UnicodeBlock.LATIN_1_SUPPLEMENT
                    // Latin Extended-A
                    || ub == Character.UnicodeBlock.LATIN_EXTENDED_A
                    // Latin Extended-B
                    || ub == Character.UnicodeBlock.LATIN_EXTENDED_B) {
                return CHAR_CHINESE;
            } else if (ub == Character.UnicodeBlock.GENERAL_PUNCTUATION
                    || ub == Character.UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION
                    // Note: this block is already matched by the
                    // CHAR_OTHER_CJK branch above, so this condition never fires
                    || ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS
                    || ub == Character.UnicodeBlock.CJK_COMPATIBILITY_FORMS
                    || ub == Character.UnicodeBlock.VERTICAL_FORMS) {
                // Punctuation
                return CHAR_CHINESE;
            }
        }
        // All other characters are not processed
        return CHAR_USELESS;
    }
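
  To check which symbols the added branches actually cover, it can help to look up the Unicode block of a few sample characters before rebuilding the plugin. The following is a minimal standalone sketch and not part of the IK source: the class name UnicodeBlockProbe and the helper inAddedBlocks are made up for illustration, and the set simply mirrors the blocks listed in the two added branches (HALFWIDTH_AND_FULLWIDTH_FORMS is left out because the earlier CHAR_OTHER_CJK branch matches it first).

```java
import java.lang.Character.UnicodeBlock;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class, not part of IK: reports whether a character
// falls into one of the Unicode blocks handled by the two added branches.
class UnicodeBlockProbe {

    // Mirrors the blocks from the added branches (assumption: kept in sync by hand).
    static final Set<UnicodeBlock> ADDED_BLOCKS = new HashSet<>(Arrays.asList(
            UnicodeBlock.GREEK,
            UnicodeBlock.GREEK_EXTENDED,
            UnicodeBlock.BASIC_LATIN,
            UnicodeBlock.LATIN_1_SUPPLEMENT,
            UnicodeBlock.LATIN_EXTENDED_A,
            UnicodeBlock.LATIN_EXTENDED_B,
            UnicodeBlock.GENERAL_PUNCTUATION,
            UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION,
            UnicodeBlock.CJK_COMPATIBILITY_FORMS,
            UnicodeBlock.VERTICAL_FORMS));

    // True if the character's block is one of the blocks listed above.
    static boolean inAddedBlocks(char c) {
        return ADDED_BLOCKS.contains(UnicodeBlock.of(c));
    }

    public static void main(String[] args) {
        // Samples: '#', '@', ',', ideographic full stop, em dash,
        // Greek alpha, rightwards arrow.
        String samples = "#@,\u3002\u2014\u03B1\u2192";
        for (char c : samples.toCharArray()) {
            System.out.println(String.format("U+%04X", (int) c)
                    + " block=" + UnicodeBlock.of(c)
                    + " covered=" + inAddedBlocks(c));
        }
    }
}
```

  Characters from blocks that are not listed, such as the arrow U+2192 (block ARROWS), still fall through to CHAR_USELESS; if such symbols need to be tokenized, the corresponding block would have to be added to one of the branches as well.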

  Then build the project into a jar with Maven and replace the corresponding jar in the IK package you were using.

From: https://www.cnblogs.com/niuyourou/p/17056179.html
