Filtering Sensitive Words in Java with the DFA Algorithm

Date: 2024-03-09 19:13:31
Tags: Map, JAVA, String, NUMBER, algorithm, CommonConstant, length, DFA, Numbers

The code example is as follows:

import cn.hutool.core.collection.CollUtil;
import cn.hutool.core.util.ReUtil;
import cn.hutool.core.util.StrUtil;
import com.google.common.collect.Lists;
import com.google.common.collect.Maps;
import java.util.*;

public class SensitiveWordUtils {
    // Minimum-match mode: stop at the first (shortest) complete word
    public static int minMatchTYpe = 1;

    // Maximum-match mode: keep going to the longest complete word
    public static int maxMatchType = 2;

    // Regex for English letters (the upper bound must be 'Z'; "A-z" would also match [ \ ] ^ _ `)
    public static final String englishLletter = "[a-zA-Z]+";

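    /** 
     * @description: Initialize the word lexicon
     * @date: 2024/3/9 10:50
     * @param sensitiveWords raw word list (sensitive words plus whitelist entries)
     * @return java.util.Map the lexicon map, or null if the list is empty or building fails
     */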
    public static Map initKeyWordAndWhiteList(List<String> sensitiveWords) {
        if(CollUtil.isEmpty(sensitiveWords)){
            return null;
        }
        try{
            Set<String> keyWordSet = new HashSet<String>();
            for(String s: sensitiveWords){
                keyWordSet.add(s.trim());
            }
            return addSensitiveWordAndWhiteListToHashMap(keyWordSet);
        } catch (Exception e) {
            e.printStackTrace();
        }
        return null;
    }

    /** 
     * @description: Build the lexicon as a nested map, one level per character
     * @date: 2024/3/9 10:51
     * @param keyWordSet de-duplicated, trimmed words
     * @return java.util.HashMap
     */
    private static HashMap addSensitiveWordAndWhiteListToHashMap(Set<String> keyWordSet){
        HashMap sensitiveWordMap = new HashMap(keyWordSet.size());
        String key = null;
        Map nowMap = null;
        Map<String, String> newWorMap = null;
        Iterator<String> iterator = keyWordSet.iterator();
        while (iterator.hasNext()) {
            key = iterator.next();
            nowMap = sensitiveWordMap;
            for (int i = 0; i < key.length(); i++) {
                char keyChar = key.charAt(i);
                Object wordMap = nowMap.get(keyChar);
                if(wordMap != null){
                    nowMap = (Map) wordMap;
                }else{
                    newWorMap = new HashMap<String, String>();
                    newWorMap.put("isEnd", "0");
                    nowMap.put(keyChar, newWorMap);
                    nowMap = newWorMap;
                }
                if(i == key.length() - 1){
                    nowMap.put("isEnd", "1");
                }
            }
        }
        return sensitiveWordMap;
    }

    /** 
     * @description: Sensitive-word matching
     * @date: 2024/3/9 10:52
     * @param text text to check
     * @param sensitiveWordMap built sensitive-word lexicon map
     * @param wordMap processed sensitive-word map
     * @param wordWhiteMap processed whitelist map
     * @param ignoreCase whether to ignore case: 1 = yes, 0 = no
     * @param ignoreSpace whether to ignore spaces: 1 = yes, 0 = no
     * @param specialScanWay whether to use exact matching: 1 = yes, 0 = no
     * @return java.util.Map<java.lang.String,java.util.Set<java.lang.String>>
     */
    public static Map<String, Set<String>> findAllNew(String text, Map sensitiveWordMap, Map<String, String> wordMap, Map<String, String> wordWhiteMap, Integer ignoreCase, Integer ignoreSpace, Integer specialScanWay) {
        Map<String, Set<String>> result = Maps.newHashMap();
        Set<String> allSensitiveWordList = new HashSet<String>();
        long txtLength = text.length();
        for (int i = 0; i < txtLength; i++) {
            int length = checkSensitiveWordNew(text, i, maxMatchType, sensitiveWordMap, ignoreCase, ignoreSpace);
            // Exact matching: reject hits that are adjacent to other English letters
            if (null != specialScanWay && specialScanWay == CommonConstant.Numbers.NUMBER_1 && length > CommonConstant.Numbers.NUMBER_0) {
                String subStr = StrUtil.sub(text, i, i + length);
                if (ReUtil.count(englishLletter, subStr) > CommonConstant.Numbers.NUMBER_0) {
                    // Character immediately before the hit
                    String beforeSubStr = StrUtil.sub(text, i - 1, i);
                    // Character immediately after the hit
                    String afterSubStr = StrUtil.sub(text, i + length, i + length + 1);
                    // The hit starts at the beginning of the text and the next character is an English letter: not a sensitive-word hit
                    if(i == CommonConstant.Numbers.NUMBER_0 && ReUtil.count(englishLletter, afterSubStr) > CommonConstant.Numbers.NUMBER_0){
                        length = CommonConstant.Numbers.NUMBER_0;
                        // The character after the hit is an English letter: not a sensitive-word hit
                    }else if(ReUtil.count(englishLletter, afterSubStr) > CommonConstant.Numbers.NUMBER_0){
                        length = CommonConstant.Numbers.NUMBER_0;
                        // The character before the hit is "n" and the one before that is not '\': not a sensitive-word hit
                    }else if((i - 1) >= CommonConstant.Numbers.NUMBER_0 && ReUtil.count(englishLletter, beforeSubStr) > CommonConstant.Numbers.NUMBER_0 && StrUtil.equals(beforeSubStr, "n") && !StrUtil.equals(StrUtil.sub(text, i - 2, i - 1), "\\")){
                        length = CommonConstant.Numbers.NUMBER_0;
                        // The character before the hit is any English letter: not a sensitive-word hit
                    }else if((i - 1) >= CommonConstant.Numbers.NUMBER_0 && ReUtil.count(englishLletter, beforeSubStr) > CommonConstant.Numbers.NUMBER_0) {
                        length = CommonConstant.Numbers.NUMBER_0;
                    }
                }
            }
            if (length > 0) {
                String keyWord = text.substring(i, i + length);
                String newKeyWord = "";
                if (CommonConstant.Numbers.NUMBER_1 == ignoreCase && CommonConstant.Numbers.NUMBER_1 == ignoreSpace) {
                    newKeyWord = keyWord.toLowerCase();
                    newKeyWord = StrUtil.cleanBlank(newKeyWord);
                } else if (CommonConstant.Numbers.NUMBER_1 == ignoreCase) {
                    newKeyWord = keyWord.toLowerCase();
                } else if (CommonConstant.Numbers.NUMBER_1 == ignoreSpace) {
                    newKeyWord = StrUtil.cleanBlank(keyWord);
                } else {
                    newKeyWord = keyWord;
                }
                if(wordMap.containsKey(newKeyWord) && !wordWhiteMap.containsKey(newKeyWord)){
                    allSensitiveWordList.add(wordMap.get(newKeyWord));
                }
                i = i + length - 1;
            }
        }
        result.put("allHitWord", allSensitiveWordList);
        return result;
    }

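    /** 
     * @description: Match against the lexicon map
     * @date: 2024/3/9 10:49
     * @param txt text to check
     * @param beginIndex index in the text at which matching starts
     * @param matchType minMatchTYpe or maxMatchType
     * @param sensitiveWordMap built sensitive-word lexicon map
     * @param ignoreCase whether to ignore case: 1 = yes, 0 = no
     * @param ignoreSpace whether to ignore spaces: 1 = yes, 0 = no
     * @return int number of characters matched (0 if no word matches)
     */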
    private static int checkSensitiveWordNew(String txt, int beginIndex, int matchType, Map sensitiveWordMap, Integer ignoreCase, Integer ignoreSpace) {
        boolean flag = false;
        int matchFlag = 0;
        int firstMatchFlag = 0;
        char word = 0;
        Map nowMap = sensitiveWordMap;
        for(int i = beginIndex; i < txt.length(); i++){
            word = txt.charAt(i);
            if(CommonConstant.Numbers.NUMBER_1 == ignoreSpace && Character.isSpaceChar(word)){
                matchFlag++;
                continue;
            }
            if(CommonConstant.Numbers.NUMBER_1 == ignoreCase){
                word = Character.toLowerCase(word);
            }
            nowMap = (Map)nowMap.get(word);
            if(nowMap != null){
                matchFlag++;
                if ("1".equals(nowMap.get("isEnd"))){
                    flag = true;
                    firstMatchFlag = matchFlag;
                    if(minMatchTYpe == matchType){
                        break;
                    }
                }
            }else{
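                // Handle nested sensitive words: with both "Xinjiang" and "Xinjiang Independenc" in the lexicon, the text "Xinjiang Inefb" would otherwise match nothing, although it should match "Xinjiang"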
                if(matchFlag > firstMatchFlag){
                    matchFlag = firstMatchFlag;
                }
                break;
            }
        }
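        // flag is false when no complete word was matched. Otherwise return the length of the
        // last complete word seen, which also covers text that ends partway through a longer word.
        return flag ? firstMatchFlag : 0;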
    }


    public static void main(String[] args) {
        // Exact matching
        int specialScanWay = 1;
        // Ignore case
        int ignoreCase = 1;
        // Raw sensitive-word list
        List<String> wordList = new ArrayList<>();
        wordList.add("台独");
        wordList.add("Xinjiang");
        wordList.add("Xinjiang production and construction Corps");
        // Raw whitelist
        List<String> allWhiteWordList = new ArrayList<>();
        allWhiteWordList.add("一台独立");
        // Processed sensitive-word map (normalized word -> original word)
        Map<String, String> wordMap = Maps.newHashMap();
        // Processed whitelist map (normalized word -> original word)
        Map<String, String> wordWhiteMap = Maps.newHashMap();
        // Combined word list (raw words plus whitelist entries, case-normalized)
        List<String> newWordList = Lists.newArrayList();
        wordList.forEach(item->{
            String word = item;
            // Normalize case
            if(1 == ignoreCase){
                word = item.toLowerCase();
            }
            wordMap.put(word, item);
            newWordList.add(word);
        });
        if(CollUtil.isNotEmpty(allWhiteWordList)){
            allWhiteWordList.forEach(item->{
                String word = item;
                // Normalize case
                if(1 == ignoreCase){
                    word = item.toLowerCase();
                }
                wordWhiteMap.put(word, item);
                newWordList.add(word);
            });
        }
        String text = "这是一段测试文本,xiNJiang production,大胆台独分子,这是一台独立的计算机";
        Map sensitiveWordMap = SensitiveWordUtils.initKeyWordAndWhiteList(newWordList);
        Map<String, Set<String>> resultMap = SensitiveWordUtils.findAllNew(text, sensitiveWordMap, wordMap, wordWhiteMap, ignoreCase, 0, specialScanWay);
        System.out.println("resultMap = " + resultMap.toString());
    }
}
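
The listing refers to `CommonConstant.Numbers.NUMBER_0` and `CommonConstant.Numbers.NUMBER_1`, but that class is not shown. A minimal stand-in, assuming the constants are plain `int` values 0 and 1 (inferred from their names and from how they are compared, not confirmed by the source), might look like this:

```java
// Hypothetical stand-in for the project-specific CommonConstant class.
// The listing uses it but does not include it; the values below are assumptions.
final class CommonConstant {
    private CommonConstant() {}

    static final class Numbers {
        private Numbers() {}

        // Assumed: used as "no" / "zero length" in the listing
        static final int NUMBER_0 = 0;
        // Assumed: used as "yes" / "flag enabled" in the listing
        static final int NUMBER_1 = 1;
    }
}
```

Placed in the same package as `SensitiveWordUtils`, this should be enough for the listing to compile, given Hutool and Guava on the classpath.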

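The nested raw `Map` in the listing is in effect a character trie used as a DFA: each map level is a state, each character key is a transition, and `"isEnd"` marks an accepting state. As a rough illustration of the same idea with typed nodes, here is a small sketch. `TrieSketch`, `Node`, `add`, `longestMatch` and `findAll` are hypothetical names, not part of the original class, and the sketch leaves out the case, whitespace, whitelist and exact-match handling.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; all names here are hypothetical.
final class TrieSketch {
    // One DFA state: outgoing transitions plus an end-of-word marker.
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean isEnd;
    }

    private final Node root = new Node();

    // Add one word, creating states along its character path as needed.
    void add(String word) {
        if (word.isEmpty()) {
            return;
        }
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.next.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.isEnd = true;
    }

    // Length of the longest word starting at 'begin', or 0 if none starts there.
    int longestMatch(String text, int begin) {
        Node node = root;
        int matched = 0;
        for (int i = begin; i < text.length(); i++) {
            node = node.next.get(text.charAt(i));
            if (node == null) {
                break;
            }
            if (node.isEnd) {
                // Remember the last complete word; a longer one may still follow.
                matched = i - begin + 1;
            }
        }
        return matched;
    }

    // Scan left to right, jumping past each hit.
    List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int len = longestMatch(text, i);
            if (len > 0) {
                hits.add(text.substring(i, i + len));
                i += len;
            } else {
                i++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TrieSketch trie = new TrieSketch();
        trie.add("spam");
        trie.add("spam mail");
        // prints [spam, spam mail]
        System.out.println(trie.findAll("no spam ma, just spam mail"));
    }
}
```

Recording the length of the last accepting state, instead of the number of characters consumed, is what lets a shorter word still count when the text only partly follows a longer one.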
 

From: https://www.cnblogs.com/guliang/p/18063150
