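Integrating sensitive-word into a SpringBoot Project for Sensitive-Word Filtering (DFA Algorithm, Tagging Sensitive Words, Ignoring Meaningless Characters, Further Checks on Matched Words, Custom Sensitive Words and Whitelists)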

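Contents

1. Preface
2. Common Approaches to Sensitive-Word Filtering
3. The DFA Algorithm
- 3.1 What Is the DFA Algorithm
- 3.2 How the DFA Algorithm Works
  - 3.2.1 How the Data Is Stored
  - 3.2.2 How the Data Is Searched
- 3.3 Where the DFA Algorithm Is Used
4. Introduction to sensitive-word
- 4.1 What Is sensitive-word
- 4.2 The sensitive-word Project Site
- 4.3 Performance of sensitive-word
5. Getting Started with sensitive-word
- 5.1 Adding the Maven Dependency
- 5.2 Core Methods
- 5.3 Example Code
6. The Default Sensitive Words in sensitive-word
- 6.1 Default Chinese Sensitive Words
- 6.2 Default English Sensitive Words
7. What the SensitiveWordHelper Utility Class Really Is
8. Tagging Sensitive Words
- 8.1 Tagging Sensitive Words through the IWordTag Interface
- 8.2 Tagging Sensitive Words through a Configuration File
9. Processing the Sensitive-Word Results
- 9.1 Built-in Implementations
- 9.2 Custom Implementation
  - 9.2.1 WordResultHandlerWordRawTags.java
  - 9.2.2 WordRawTagsDto.java
10. Ignoring Meaningless Characters
11. Further Checks on Matched Sensitive Words
12. Custom Sensitive Words and Custom Whitelists
- 12.1 Custom Sensitive Words
- 12.2 Custom Whitelists
- 12.3 Using Custom Sensitive Words and Whitelists
- 12.4 Configuring Multiple Sensitive-Word Sources and Whitelists
13. Dynamically Modifying Sensitive Words and Whitelists
- 13.1 Adding or Removing a Single Sensitive Word without Full Re-initialization
  - 13.1.1 Method Description
  - 13.1.2 Example Code
- 13.2 Adding or Removing a Single Whitelist Entry without Full Re-initialization
  - 13.2.1 Method Description
  - 13.2.2 Example Code
14. Other sensitive-word Configuration Options
15. Integrating sensitive-word with SpringBoot
16. Complete Source Code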

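1. Preface

In this age of information explosion, the internet has become an indispensable part of our lives. User-generated content is everywhere, whether on social platforms, forums, or e-commerce sites.

With it comes a growing problem of sensitive vocabulary. Filtering and managing such content effectively without hurting the user experience has become a major challenge for developers.

It is against this background that sensitive-word, a high-performance sensitive-word filtering tool based on the DFA algorithm, came into being.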

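2. Common Approaches to Sensitive-Word Filtering

Approach	Advantages	Disadvantages
Database fuzzy query	Simple to implement; queries use the database's built-in features directly	Query efficiency is poor, and the performance bottleneck is obvious with large data volumes
String.indexOf("") lookup	Easy to use; suited to quick lookups in small data sets	Inefficient on large amounts of data; unsuitable for large-scale text search
Full-text search	Supports word segmentation, which improves matching accuracy and efficiency	Text must be preprocessed (segmented), which adds system complexity
DFA algorithm	Efficient matching without backtracking; suited to fast sensitive-word detection in large volumes of text	Building the DFA model can be fairly complex, and a frequently changing word list is costly to maintain
Third-party filtering service	Quick to adopt with no in-house development; stable and reliable; handles large-scale data	Service fees apply, and there may be data security and privacy risks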

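3. The DFA Algorithm

Readers who have studied compiler theory should find the DFA algorithm easier to understand.

3.1 What Is the DFA Algorithm

DFA stands for Deterministic Finite Automaton.

The DFA algorithm is a string-matching algorithm based on the theory of deterministic finite automata. It is mainly used to find one or more specific patterns (strings) in a text.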

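3.2 How the DFA Algorithm Works

The principle behind the DFA algorithm is fairly involved, so only a brief introduction is given here. For a deeper treatment, look up dedicated material on the DFA algorithm.

To understand how the DFA algorithm works, we need to understand two things:

How the data is stored
How the data is searched

3.2.1 How the Data Is Stored

For storage, the DFA algorithm uses a nested Map (similar in shape to a JSON object).

Suppose there are three sensitive words: 冰毒, 大麻, and 大坏蛋. After initialization, the word library has two top-level keys, 冰 and 大. Under 冰 sits 毒; under 大 sit 麻 and 坏, and under 坏 sits 蛋.

Every character serves as a key in a Map, and the corresponding value carries an attribute named isEnd. If isEnd is 1, the current character ends a sensitive word; if isEnd is 0, it does not.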
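
To make the nested structure concrete, here is a minimal Java sketch that builds it for the three example words. The names DfaStorageSketch, Node and build are made up for illustration; they are not part of sensitive-word, whose internal data structure may well differ.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DfaStorageSketch {

    /** One level of the nested map: child nodes keyed by character, plus the isEnd flag. */
    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isEnd; // true when the path from the root to this node spells a complete word
    }

    /** Builds the nested structure from a list of sensitive words. */
    static Node build(List<String> words) {
        Node root = new Node();
        for (String word : words) {
            Node current = root;
            for (char c : word.toCharArray()) {
                // Reuse the existing child for a shared prefix, otherwise create it
                current = current.children.computeIfAbsent(c, key -> new Node());
            }
            current.isEnd = true; // the last character of the word closes it
        }
        return root;
    }

    public static void main(String[] args) {
        Node root = build(List.of("冰毒", "大麻", "大坏蛋"));
        // 大麻 and 大坏蛋 share the top-level key 大
        System.out.println(root.children.keySet());
    }
}
```

In this sketch the words 大麻 and 大坏蛋 share a single 大 node, which is why the structure stays compact even when many words start with the same characters.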

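3.2.2 How the Data Is Searched

Suppose we have the text: 我是一个好人，并不买卖冰毒

The text is scanned character by character. Each character is looked up among the top-level keys of the word library. The characters from 我 through 卖 are not found there, so they are skipped. The character 冰 is a top-level key whose isEnd is 0, so the search moves into its nested Map and looks up the next character, 毒, which is found with isEnd equal to 1.

The search therefore determines that 冰毒 is a sensitive word.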
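
The lookup can be sketched in Java as follows. This is an illustration rather than the library's implementation: DfaSearchSketch, build and findAll are hypothetical names, and the sketch assumes that the longest word starting at a given position wins.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DfaSearchSketch {

    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isEnd;
    }

    /** Builds the nested map for the given words. */
    static Node build(List<String> words) {
        Node root = new Node();
        for (String word : words) {
            Node current = root;
            for (char c : word.toCharArray()) {
                current = current.children.computeIfAbsent(c, key -> new Node());
            }
            current.isEnd = true;
        }
        return root;
    }

    /** Returns every sensitive word found in the text, scanning left to right. */
    static List<String> findAll(Node root, String text) {
        List<String> found = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            Node current = root;
            int matchEnd = -1; // exclusive end index of the longest word starting at i
            for (int j = i; j < text.length(); j++) {
                current = current.children.get(text.charAt(j));
                if (current == null) {
                    break; // no word continues with this character
                }
                if (current.isEnd) {
                    matchEnd = j + 1;
                }
            }
            if (matchEnd > 0) {
                found.add(text.substring(i, matchEnd));
                i = matchEnd; // continue after the matched word
            } else {
                i++; // no word starts here, move on by one character
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Node root = build(List.of("冰毒", "大麻", "大坏蛋"));
        System.out.println(findAll(root, "我是一个好人，并不买卖冰毒")); // [冰毒]
    }
}
```

Each character of the text is visited without re-scanning the word list, which is what makes this approach attractive when the word list is large.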

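3.3 Where the DFA Algorithm Is Used

Text search: finding specific strings in a text editor or search engine
Lexical analysis: splitting source code into lexical tokens in compiler design
Sensitive-word filtering: detecting and filtering sensitive vocabulary in online content moderation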

4. Introduction to sensitive-word

4.1 What Is sensitive-word

sensitive-word is a high-performance Java sensitive-word filtering framework based on the DFA algorithm.

4.2 The sensitive-word Project Site

sensitive-word: https://github.com/houbb/sensitive-word

It is worth mentioning that the author of sensitive-word also has a CSDN account: 老马啸西风

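4.3 Performance of sensitive-word

The following figures are taken from the sensitive-word project site.

Test data: 100+ strings, looped 100,000 times

No.	Note	Scenario	Time
1	Configuration for maximum performance	Sensitive-word check only, no format conversion	1470 ms, about 72,000 QPS
2	Suitable for most scenarios	Sensitive-word check only, all format conversions enabled	2744 ms, about 37,000 QPS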

5. Getting Started with sensitive-word

The environment used for this demonstration is JDK 17 + SpringBoot 3.0.2.

5.1 Adding the Maven Dependency

<dependency>
    <groupId>com.github.houbb</groupId>
    <artifactId>sensitive-word</artifactId>
    <version>0.23.0</version>
</dependency>

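5.2 Core Methods

SensitiveWordHelper is the utility class for sensitive words. Its core methods are as follows.

Method	Parameters	Return value	Description
contains(String)	String to check	boolean	Checks whether the string contains a sensitive word
replace(String, IWordReplace)	String to check and the replacement strategy to use	String	Returns the string with sensitive words replaced
replace(String, char)	String to check and the char used as the replacement	String	Returns the string with sensitive words replaced
replace(String)	String to check; sensitive words are replaced with `*`	String	Returns the string with sensitive words replaced
findAll(String)	String to check	List of strings	Returns all sensitive words in the string
findFirst(String)	String to check	String	Returns the first sensitive word in the string
findAll(String, IWordResultHandler)	String to check and an IWordResultHandler result handler	List of handler results	Returns all sensitive words in the string
findFirst(String, IWordResultHandler)	String to check and an IWordResultHandler result handler	Handler result	Returns the first sensitive word in the string
tags(String)	Sensitive word	Set of strings	Returns the tags of the sensitive word

5.3 Example Code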

CoreApiTests.java

import cn.edu.scau.sensitive.replace.CustomWordReplace;
import com.github.houbb.sensitive.word.api.IWordReplace;
import com.github.houbb.sensitive.word.core.SensitiveWordHelper;
import org.junit.jupiter.api.Test;

public class CoreApiTests {

    /**
     * Checks whether the text contains a sensitive word
     */
    @Test
    public void testContains() {
        final String text = "五星红旗迎风飘扬，毛主席的画像屹立在天安门前";

        System.out.println(SensitiveWordHelper.contains(text));
    }

    /**
     * Returns the first sensitive word
     */
    @Test
    public void testFindFirst() {
        final String text = "五星红旗迎风飘扬，毛主席的画像屹立在天安门前";

        System.out.println(SensitiveWordHelper.findFirst(text));
    }

    /**
     * Returns all sensitive words
     */
    @Test
    public void testFindAll() {
        final String text = "五星红旗迎风飘扬，毛主席的画像屹立在天安门前";

        System.out.println(SensitiveWordHelper.findAll(text));
    }

    /**
     * Default replacement strategy
     */
    @Test
    public void testReplaceDefault() {
        final String text = "五星红旗迎风飘扬，毛主席的画像屹立在天安门前";

        System.out.println(SensitiveWordHelper.replace(text));
    }

    /**
     * Specifies the character used to replace sensitive words
     */
    @Test
    public void testReplace() {
        final String text = "五星红旗迎风飘扬，毛主席的画像屹立在天安门前";

        System.out.println(SensitiveWordHelper.replace(text, '0'));
    }

    /**
     * Custom replacement strategy (the strategy class must implement the IWordReplace interface)
     */
    @Test
    public void defineReplaceTest() {
        final String text = "五星红旗迎风飘扬，毛主席的画像屹立在天安门前";

        IWordReplace customWordReplace = new CustomWordReplace();
        System.out.println(SensitiveWordHelper.replace(text, customWordReplace));
    }

}

CustomWordReplace.java

import com.github.houbb.sensitive.word.api.IWordContext;
import com.github.houbb.sensitive.word.api.IWordReplace;
import com.github.houbb.sensitive.word.api.IWordResult;
import com.github.houbb.sensitive.word.utils.InnerWordCharUtils;

public class CustomWordReplace implements IWordReplace {

    @Override
    public void replace(StringBuilder stringBuilder, final char[] rawChars, IWordResult wordResult, IWordContext wordContext) {
        String sensitiveWord = InnerWordCharUtils.getString(rawChars, wordResult);

        // Use a different replacement per sensitive word; the mapping could be loaded from a database
        if ("五星红旗".equals(sensitiveWord)) {
            stringBuilder.append("国家旗帜");
        } else if ("毛主席".equals(sensitiveWord)) {
            stringBuilder.append("教员");
        } else {
            // Everything else is replaced with * by default
            int wordLength = wordResult.endIndex() - wordResult.startIndex();
            stringBuilder.append("*".repeat(Math.max(0, wordLength)));
        }
    }

}

6. The Default Sensitive Words in sensitive-word

The default sensitive words (64,419 in total) can be found inside the imported jar.

6.1 Default Chinese Sensitive Words

sensitive_word_dict.txt

6.2 Default English Sensitive Words

sensitive_word_dict_en.txt

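7. What the SensitiveWordHelper Utility Class Really Is

To customize the sensitive-word configuration or to integrate with the SpringBoot framework, you need to create an instance of the SensitiveWordBs class.

SensitiveWordHelper, the sensitive-word utility class, essentially delegates to the methods of SensitiveWordBs and supplies a default configuration for it. The source code of SensitiveWordHelper confirms this.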

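8. Tagging Sensitive Words

Sometimes we want to attach a category tag to a sensitive word, such as pornography or violence.

Later processing can then work by tag, for example handling only the words that carry a certain tag.

8.1 Tagging Sensitive Words through the IWordTag Interface

IWordTag is just an abstract interface, so users can provide their own implementation (for example, one that looks up a word's tags in a database).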

public interface IWordTag {

    /**
     * @param sensitiveWord the sensitive word
     * @return the tags associated with the sensitive word
     */
    Set<String> getTag(String sensitiveWord);

}

Example code

import com.github.houbb.sensitive.word.api.IWordTag;

import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class WordTagFromDatabase implements IWordTag {

    @Override
    public Set<String> getTag(String sensitiveWord) {
        Set<String> hashSet = new HashSet<>();

        // Look up the tags for the sensitive word in the database
        // select tag from sensitive_word where word = ${sensitiveWord}
        String tag = "政治,国家";
        if ("五星红旗".equals(sensitiveWord)) {
            Collections.addAll(hashSet, tag.split(","));
        }

        return hashSet;
    }

}

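8.2 Tagging Sensitive Words through a Configuration File

We can define our own dict tag file and create an IWordTag implementation from it with WordTags.file().

Each line of the tag file contains the sensitive word, a space, and then its comma-separated tags:

sensitiveWord tag1,tag2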
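
As a rough illustration of this format, the sketch below parses such lines into a word-to-tags map. It only mirrors the line layout described above. TagLineParserSketch is a hypothetical helper, and nothing here is meant to show how WordTags.file() is actually implemented.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class TagLineParserSketch {

    /** Parses lines of the form "word tag1,tag2" into a word -> tags map. */
    static Map<String, Set<String>> parse(List<String> lines) {
        Map<String, Set<String>> tagsByWord = new LinkedHashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            int space = trimmed.indexOf(' ');
            if (space <= 0) {
                continue; // skip blank lines and lines that carry no tags
            }
            String word = trimmed.substring(0, space);
            String[] tags = trimmed.substring(space + 1).trim().split(",");
            tagsByWord.put(word, new LinkedHashSet<>(Arrays.asList(tags)));
        }
        return tagsByWord;
    }

    public static void main(String[] args) {
        System.out.println(parse(List.of("五星红旗 政治,国家"))); // {五星红旗=[政治, 国家]}
    }
}
```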

Example code

If the text being checked contains no sensitive word, sensitiveWordBs.tags() returns null.

@Test
public void testWordTagsFromFile() {
    String filePath = "dict_tag.txt";
    IWordTag wordTag = WordTags.file(filePath);

    SensitiveWordBs sensitiveWordBs = SensitiveWordBs.newInstance()
            .wordTag(wordTag)
            .init();

    String tags = sensitiveWordBs.tags("五星红旗").toString();
    System.out.println("[政治, 国家]".equals(tags)); // true

    tags = "[]";
    if (sensitiveWordBs.tags("练习时长两年半") != null) {
        tags = sensitiveWordBs.tags("练习时长两年半").toString();
    }
    System.out.println("[]".equals(tags)); // true
}

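Note that WordTags.file(filePath) is essentially built on the File class, so a tag file placed on the classpath will most likely not be found.
The file path must be absolute or relative to the working directory. In IDEA the working directory is usually the project root.
If the project is deployed by launching the jar with java -jar, the working directory is the directory from which the command is run (the jar's own directory when the command is run from there).

The code for getting the current working directory is as follows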

import java.io.File;

public class GetCurrentWorkingDirectory {

    public static void main(String[] args) {
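        // File object for the current working directory
        File currentDirectory = new File("./");

        // Absolute path of the current working directory
        String absolutePath = currentDirectory.getAbsolutePath();

        // Print the absolute path of the current working directory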
        System.out.println("Current working directory is: " + absolutePath);
    }

}

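9. Processing the Sensitive-Word Results

The IWordResultHandler interface lets users process the matched results in their own way.

9.1 Built-in Implementations

There are three built-in implementations:

Name	Description
WordResultHandlers.word()	Keeps only the sensitive word itself
WordResultHandlers.raw()	Keeps the match information, including the start and end indexes of the sensitive word
WordResultHandlers.wordTags()	Keeps both the word and its tag information

Example code: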

import com.github.houbb.sensitive.word.api.IWordResult;
import com.github.houbb.sensitive.word.core.SensitiveWordHelper;
import com.github.houbb.sensitive.word.support.result.WordResultHandlers;
import com.github.houbb.sensitive.word.support.result.WordTagsDto;
import org.junit.jupiter.api.Test;

import java.util.List;

public class WordResultHandlerTests {

    /**
     * Keeps only the sensitive word itself
     * This is the default strategy, so WordResultHandlers.word() can be omitted
     */
    @Test
    public void testWordResultHandlersWord() {
        final String text = "五星红旗迎风飘扬，毛主席的画像屹立在天安门前";

        List<String> wordList = SensitiveWordHelper.findAll(text, WordResultHandlers.word());
        System.out.println(wordList);
    }

    /**
     * Keeps the match information, including the start and end indexes of the sensitive word
     */
    @Test
    public void testWordResultHandlersRaw() {
        final String text = "五星红旗迎风飘扬，毛主席的画像屹立在天安门前";

        List<IWordResult> wordResults = SensitiveWordHelper.findAll(text, WordResultHandlers.raw());
        wordResults.forEach(System.out::println);
    }

    /**
     * Keeps the sensitive word and its tag information
     */
    @Test
    public void testWordResultHandlersWordTags() {
        final String text = "五星红旗迎风飘扬，毛主席的画像屹立在天安门前";

        // By default a sensitive word has no tags
        List<WordTagsDto> wordList = SensitiveWordHelper.findAll(text, WordResultHandlers.wordTags());
        wordList.forEach(System.out::println);
    }

}

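9.2 Custom Implementation

To customize how results are processed, implement the IWordResultHandler interface. The following example keeps the sensitive word itself, its start and end indexes, and its tag information.

/**
 * Custom result handler that keeps the sensitive word, its start and end indexes, and its tags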
 */
@Test
public void testCustomWordResultHandler() {
    final String text = "五星红旗迎风飘扬，毛主席的画像屹立在天安门前";

    List<WordRawTagsDto> wordList = SensitiveWordHelper.findAll(text, new WordResultHandlerWordRawTags());
    wordList.forEach(System.out::println);

    // WordRawTagsDto{word='五星红旗', tags=[], startIndex=0, endIndex=4, type='WORD'}
    // WordRawTagsDto{word='毛主席', tags=[], startIndex=9, endIndex=12, type='WORD'}
    // WordRawTagsDto{word='天安门', tags=[], startIndex=18, endIndex=21, type='WORD'}
}

9.2.1 WordResultHandlerWordRawTags.java

Override the doHandle method to customize how the results are processed.

The abstract class AbstractWordResultHandler implements the IWordResultHandler interface. WordResultHandlerWordRawTags extends AbstractWordResultHandler and therefore implements IWordResultHandler indirectly.

import com.github.houbb.sensitive.word.api.IWordContext;
import com.github.houbb.sensitive.word.api.IWordResult;
import com.github.houbb.sensitive.word.support.result.AbstractWordResultHandler;
import com.github.houbb.sensitive.word.utils.InnerWordCharUtils;

import java.util.Set;

public class WordResultHandlerWordRawTags extends AbstractWordResultHandler<WordRawTagsDto> {

    @Override
    protected WordRawTagsDto doHandle(IWordResult wordResult, IWordContext wordContext, String originalText) {
        String word = InnerWordCharUtils.getString(originalText.toCharArray(), wordResult);
        Set<String> wordTags = wordContext.wordTag().getTag(word);

        WordRawTagsDto wordRawTagsDto = new WordRawTagsDto();
        wordRawTagsDto.setWord(word);
        wordRawTagsDto.setTags(wordTags);
        wordRawTagsDto.setType(wordResult.type());
        wordRawTagsDto.setEndIndex(wordResult.endIndex());
        wordRawTagsDto.setStartIndex(wordResult.startIndex());

        return wordRawTagsDto;
    }

}

9.2.2 WordRawTagsDto.java

import com.github.houbb.sensitive.word.api.IWordResult;

import java.util.Set;

public class WordRawTagsDto implements IWordResult {

    private String word;

    private Set<String> tags;

    private int startIndex;

    private int endIndex;

    private String type;

    public WordRawTagsDto() {

    }

    public int startIndex() {
        return this.startIndex;
    }


    public int endIndex() {
        return this.endIndex;
    }

    public String type() {
        return this.type;
    }

    public String getWord() {
        return word;
    }

    public void setWord(String word) {
        this.word = word;
    }

    public Set<String> getTags() {
        return tags;
    }

    public void setTags(Set<String> tags) {
        this.tags = tags;
    }

    public int getStartIndex() {
        return startIndex;
    }

    public void setStartIndex(int startIndex) {
        this.startIndex = startIndex;
    }

    public int getEndIndex() {
        return endIndex;
    }

    public void setEndIndex(int endIndex) {
        this.endIndex = endIndex;
    }

    public String getType() {
        return type;
    }

    public void setType(String type) {
        this.type = type;
    }

    @Override
    public String toString() {
        return "WordRawTagsDto{" +
                "word='" + word + '\'' +
                ", tags=" + tags +
                ", startIndex=" + startIndex +
                ", endIndex=" + endIndex +
                ", type='" + type + '\'' +
                '}';
    }

}

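10. Ignoring Meaningless Characters

Sensitive words are usually contiguous, for example 【傻帽】. Some clever users then discover that inserting characters in the middle, as in 【傻!@#$帽】, slips past detection while the insult loses none of its force.

To handle cases like this, we can specify a set of special characters to skip, so that these meaningless characters are ignored.

You can use a built-in ignore strategy (via the SensitiveWordCharIgnores utility class) or define your own set of characters to ignore.
For how to write a custom ignore strategy, see the source of the SensitiveWordCharIgnores.specialChars() method.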

import com.github.houbb.sensitive.word.bs.SensitiveWordBs;
import com.github.houbb.sensitive.word.support.ignore.SensitiveWordCharIgnores;
import org.junit.jupiter.api.Test;

import java.util.List;

public class CharIgnoreTests {

    @Test
    public void testCharIgnore() {
        final String text = "傻@冒，狗+东西";

        // By default the special characters split the words, so nothing is detected
        List<String> wordList = SensitiveWordBs.newInstance().init().findAll(text);
        System.out.println("[]".equals(wordList.toString())); // true

        // Specify the character-ignore strategy; you can also implement your own
        wordList = SensitiveWordBs.newInstance()
                .charIgnore(SensitiveWordCharIgnores.specialChars())
                .init()
                .findAll(text);
        System.out.println("[傻@冒, 狗+东西]".equals(wordList.toString())); // true
    }

}
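
The idea behind skipping characters can also be sketched without the library. The following simplified illustration handles a single word only. CharIgnoreSketch and its rule that anything other than a letter or digit is ignorable are assumptions made for this example; they do not describe what SensitiveWordCharIgnores.specialChars() actually ignores.

```java
class CharIgnoreSketch {

    /** Hypothetical rule: every character that is not a letter or digit counts as noise. */
    static boolean isIgnorable(char c) {
        return !Character.isLetterOrDigit(c);
    }

    /** Returns true if word occurs in text once noise characters inside the match are skipped. */
    static boolean containsIgnoring(String text, String word) {
        if (word.isEmpty()) {
            return false;
        }
        for (int start = 0; start < text.length(); start++) {
            int matched = 0; // number of characters of word matched so far
            for (int i = start; i < text.length() && matched < word.length(); i++) {
                char c = text.charAt(i);
                if (c == word.charAt(matched)) {
                    matched++;
                } else if (matched > 0 && isIgnorable(c)) {
                    continue; // skip noise inside a partial match
                } else {
                    break; // a real character that does not belong to the word
                }
            }
            if (matched == word.length()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(containsIgnoring("傻!@#$帽", "傻帽")); // true
    }
}
```

Noise is skipped only after a match has started, so ordinary punctuation elsewhere in the text does not affect the result.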

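11. Further Checks on Matched Sensitive Words

Sometimes we want to restrict matches further. For example, we may define 【av】 as a sensitive word but not want 【have】 to match.

You can implement the IWordResultCondition interface to define your own strategy.

Of the built-in strategies, alwaysTrue() always returns true, while englishWordMatch() requires English text to match as a whole word.

Matching strategies are obtained through the WordResultConditions utility class.

Implementation	Description	Supported since
alwaysTrue	Always true	
englishWordMatch	Whole-word match for English words	v0.13.0
englishWordNumMatch	Whole-word match for English words and numbers	v0.20.0
wordTags	Matches only words with specific tags, for example only the 【广告】 (advertising) tag	v0.23.0
chains(IWordResultCondition... conditions)	Combines multiple conditions, all of which must hold	v0.23.0

Default behavior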

import com.github.houbb.sensitive.word.bs.SensitiveWordBs;
import com.github.houbb.sensitive.word.support.resultcondition.WordResultConditions;
import org.junit.jupiter.api.Test;

import java.util.Collections;
import java.util.List;

public class WordResultConditionTests {

    @Test
    public void testWordResultCondition() {
        final String text = "I have a nice day";

        List<String> wordList = SensitiveWordBs.newInstance()
                .wordDeny(() -> Collections.singletonList("av"))
                .wordResultCondition(WordResultConditions.alwaysTrue())
                .init()
                .findAll(text);

        System.out.println("[av]".equals(wordList.toString())); // true
    }

}

Requiring English to match as a whole word

import com.github.houbb.sensitive.word.bs.SensitiveWordBs;
import com.github.houbb.sensitive.word.support.resultcondition.WordResultConditions;
import org.junit.jupiter.api.Test;

import java.util.Collections;
import java.util.List;

public class WordResultConditionTests {

    @Test
    public void testWordResultCondition() {
        final String text = "I have a nice day";

        List<String> wordList = SensitiveWordBs.newInstance()
                .wordDeny(() -> Collections.singletonList("av"))
                .wordResultCondition(WordResultConditions.englishWordMatch())
                .init()
                .findAll(text);

        System.out.println("[av]".equals(wordList.toString())); // false
    }

}
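
A whole-word condition boils down to looking at the characters on either side of the match. The sketch below shows that check on its own. WholeWordSketch is a hypothetical helper, and the library's englishWordMatch() may treat more characters than plain ASCII letters as part of a word.

```java
class WholeWordSketch {

    static boolean isAsciiLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    /** True when text[start, end) is not glued to an ASCII letter on either side. */
    static boolean isWholeEnglishWord(String text, int start, int end) {
        boolean letterBefore = start > 0 && isAsciiLetter(text.charAt(start - 1));
        boolean letterAfter = end < text.length() && isAsciiLetter(text.charAt(end));
        return !letterBefore && !letterAfter;
    }

    public static void main(String[] args) {
        String text = "I have a nice day";
        int start = text.indexOf("av");
        // The match sits inside "have", so it is not a whole word
        System.out.println(isWholeEnglishWord(text, start, start + 2)); // false
    }
}
```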

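12. Custom Sensitive Words and Custom Whitelists

Sometimes we want sensitive words to be loaded dynamically, for example edited in an admin console and taking effect in real time.

12.1 Custom Sensitive Words

Custom sensitive words are supplied by implementing the IWordDeny interface, whose deny() method returns the list of sensitive words.

Example code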

import com.github.houbb.sensitive.word.api.IWordDeny;

import java.util.List;

public class CustomWordDeny implements IWordDeny {

    @Override
    public List<String> deny() {
        return List.of("我的自定义敏感词");
    }

}

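12.2 Custom Whitelists

A custom whitelist is supplied by implementing the IWordAllow interface, whose allow() method returns the list of whitelisted words.

Example code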

import com.github.houbb.sensitive.word.api.IWordAllow;

import java.util.List;

public class CustomWordAllow implements IWordAllow {

    @Override
    public List<String> allow() {
        return List.of("五星红旗");
    }

}

12.3 Using Custom Sensitive Words and Whitelists

/**
 * Uses the custom sensitive words and whitelist
 * The built-in sensitive words and whitelist are not included
 */
@Test
public void testCustomDenyAndCustomAllow() {
    String text = "这是一个测试，我的自定义敏感词";

    SensitiveWordBs sensitiveWordBs = SensitiveWordBs.newInstance()
            .wordDeny(new CustomWordDeny())
            .wordAllow(new CustomWordAllow())
            .init();

    String sensitiveWords = sensitiveWordBs.findAll(text).toString();
    System.out.println("[我的自定义敏感词]".equals(sensitiveWords)); // true
}

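12.4 Configuring Multiple Sensitive-Word Sources and Whitelists

Multiple sensitive-word sources

The WordDenys.chains() method merges several IWordDeny implementations into a single IWordDeny.

Multiple whitelists

The WordAllows.chains() method merges several IWordAllow implementations into a single IWordAllow.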
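
Conceptually, chaining concatenates whatever each source returns. The sketch below shows the idea with plain java.util.function.Supplier objects standing in for IWordDeny or IWordAllow. It is an analogy under that assumption, not the library's code, and WordSourceChainSketch is a made-up name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

class WordSourceChainSketch {

    /** Combines several word sources into one that returns all of their words in order. */
    @SafeVarargs
    static Supplier<List<String>> chain(Supplier<List<String>>... sources) {
        return () -> {
            List<String> merged = new ArrayList<>();
            for (Supplier<List<String>> source : sources) {
                merged.addAll(source.get()); // each source is asked only when the chain is used
            }
            return merged;
        };
    }

    public static void main(String[] args) {
        Supplier<List<String>> builtIn = () -> List.of("default-word");
        Supplier<List<String>> custom = () -> List.of("我的自定义敏感词");
        System.out.println(chain(builtIn, custom).get()); // [default-word, 我的自定义敏感词]
    }
}
```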

Example code (uses the default sensitive words and adds custom ones)

/**
 * Merges the custom sensitive words and whitelist with the built-in ones
 */
@Test
public void testMultipleDenyAndMultipleAllow() {
    String text = "这是一个测试，我的自定义敏感词";

    IWordDeny wordDeny = WordDenys.chains(WordDenys.defaults(), new CustomWordDeny());
    IWordAllow wordAllow = WordAllows.chains(WordAllows.defaults(), new CustomWordAllow());

    SensitiveWordBs sensitiveWordBs = SensitiveWordBs.newInstance()
            .wordDeny(wordDeny)
            .wordAllow(wordAllow)
            .init();

    String sensitiveWords = sensitiveWordBs.findAll(text).toString();
    System.out.println("[我的自定义敏感词]".equals(sensitiveWords)); // true
}

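13. Dynamically Modifying Sensitive Words and Whitelists

13.1 Adding or Removing a Single Sensitive Word without Full Re-initialization

Use case: after initialization, we want to add or remove individual sensitive words instead of re-initializing everything.

Supported since: v0.19.0

13.1.1 Method Description

addWord(word) adds sensitive words; it accepts a single word or a collection of words.

removeWord(word) removes sensitive words; it accepts a single word or a collection of words.

13.1.2 Example Code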

import com.github.houbb.sensitive.word.bs.SensitiveWordBs;
import com.github.houbb.sensitive.word.support.allow.WordAllows;
import com.github.houbb.sensitive.word.support.deny.WordDenys;
import org.junit.jupiter.api.Test;

import java.util.Arrays;

public class SensitiveWordAddAndRemoveTests {

    @Test
    public void testAddAndRemove() {
        final String text = "测试一下新增敏感词，验证一下删除和新增对不对";

        SensitiveWordBs sensitiveWordBs =
                SensitiveWordBs.newInstance()
                        .wordAllow(WordAllows.empty())
                        .wordDeny(WordDenys.empty())
                        .init();

        System.out.println("[]".equals(sensitiveWordBs.findAll(text).toString())); // true

        // Add single sensitive words
        sensitiveWordBs.addWord("测试");
        sensitiveWordBs.addWord("新增");
        System.out.println("[测试, 新增, 新增]".equals(sensitiveWordBs.findAll(text).toString())); // true

        // Remove single sensitive words
        sensitiveWordBs.removeWord("新增");
        System.out.println("[测试]".equals(sensitiveWordBs.findAll(text).toString())); // true
        sensitiveWordBs.removeWord("测试");
        System.out.println("[]".equals(sensitiveWordBs.findAll(text).toString()));

        // Add a collection of sensitive words
        sensitiveWordBs.addWord(Arrays.asList("新增", "测试"));
        System.out.println("[测试, 新增, 新增]".equals(sensitiveWordBs.findAll(text).toString()));

        // Remove a collection of sensitive words
        sensitiveWordBs.removeWord(Arrays.asList("新增", "测试"));
        System.out.println("[]".equals(sensitiveWordBs.findAll(text).toString()));

        // Add an array of sensitive words
        sensitiveWordBs.addWord("新增", "测试");
        System.out.println("[测试, 新增, 新增]".equals(sensitiveWordBs.findAll(text).toString()));

        // Remove an array of sensitive words
        sensitiveWordBs.removeWord("新增", "测试");
        System.out.println("[]".equals(sensitiveWordBs.findAll(text).toString()));
    }

}
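
Why single-word updates can avoid a full rebuild is easier to see on a toy version of the nested map. The sketch below is an assumption-laden illustration, not the library's implementation: DynamicTrieSketch is a made-up class, and removal here simply clears the end flag instead of pruning unused nodes.

```java
import java.util.HashMap;
import java.util.Map;

class DynamicTrieSketch {

    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isEnd;
    }

    private final Node root = new Node();

    /** Adds a word by creating only the nodes that are still missing on its path. */
    void addWord(String word) {
        Node current = root;
        for (char c : word.toCharArray()) {
            current = current.children.computeIfAbsent(c, key -> new Node());
        }
        current.isEnd = true;
    }

    /** Removes a word by clearing its end flag; leftover nodes are kept for simplicity. */
    void removeWord(String word) {
        Node current = root;
        for (char c : word.toCharArray()) {
            current = current.children.get(c);
            if (current == null) {
                return; // the word was never added
            }
        }
        current.isEnd = false;
    }

    /** Returns true if exactly this word is currently registered. */
    boolean containsWord(String word) {
        Node current = root;
        for (char c : word.toCharArray()) {
            current = current.children.get(c);
            if (current == null) {
                return false;
            }
        }
        return current.isEnd;
    }

    public static void main(String[] args) {
        DynamicTrieSketch trie = new DynamicTrieSketch();
        trie.addWord("测试");
        trie.addWord("新增");
        trie.removeWord("新增");
        System.out.println(trie.containsWord("测试") + " " + trie.containsWord("新增")); // true false
    }
}
```

Both operations touch only the path of the word in question, so their cost depends on the length of that word rather than on the size of the whole word list.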

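13.2 Adding or Removing a Single Whitelist Entry without Full Re-initialization

Use case: after initialization, we want to add or remove individual whitelist entries instead of re-initializing everything.

Supported since: v0.21.0

13.2.1 Method Description

addWordAllow(word) adds whitelist entries; it accepts a single word or a collection of words.

removeWordAllow(word) removes whitelist entries; it accepts a single word or a collection of words.

13.2.2 Example Code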

import com.github.houbb.sensitive.word.bs.SensitiveWordBs;
import com.github.houbb.sensitive.word.support.allow.WordAllows;
import org.junit.jupiter.api.Test;

import java.util.Arrays;

public class WhiteListAddAndRemoveTests {

    @Test
    public void testAddAndRemove() {
        final String text = "测试一下新增白名单，验证一下删除和新增对不对";

        SensitiveWordBs sensitiveWordBs =
                SensitiveWordBs.newInstance()
                        .wordAllow(WordAllows.empty())
                        .wordDeny(() -> Arrays.asList("测试", "新增"))
                        .init();

        System.out.println("[测试, 新增, 新增]".equals(sensitiveWordBs.findAll(text).toString()));

        // Add single whitelist entries
        sensitiveWordBs.addWordAllow("测试");
        sensitiveWordBs.addWordAllow("新增");
        System.out.println("[]".equals(sensitiveWordBs.findAll(text).toString()));

        // Remove single whitelist entries
        sensitiveWordBs.removeWordAllow("测试");
        System.out.println("[测试]".equals(sensitiveWordBs.findAll(text).toString()));
        sensitiveWordBs.removeWordAllow("新增");
        System.out.println("[测试, 新增, 新增]".equals(sensitiveWordBs.findAll(text).toString()));

        // Add a collection of whitelist entries
        sensitiveWordBs.addWordAllow(Arrays.asList("新增", "测试"));
        System.out.println("[]".equals(sensitiveWordBs.findAll(text).toString()));

        // Remove a collection of whitelist entries
        sensitiveWordBs.removeWordAllow(Arrays.asList("新增", "测试"));
        System.out.println("[测试, 新增, 新增]".equals(sensitiveWordBs.findAll(text).toString()));

        // Add an array of whitelist entries
        sensitiveWordBs.addWordAllow("新增", "测试");
        System.out.println("[]".equals(sensitiveWordBs.findAll(text).toString()));

        // Remove an array of whitelist entries
        sensitiveWordBs.removeWordAllow("新增", "测试");
        System.out.println("[测试, 新增, 新增]".equals(sensitiveWordBs.findAll(text).toString()));
    }

}

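14. Other sensitive-word Configuration Options

The configuration options are described below:

No.	Method	Description	Default
1	ignoreCase	Ignore case	true
2	ignoreWidth	Ignore half-width/full-width differences	true
3	ignoreNumStyle	Ignore how numbers are written	true
4	ignoreChineseStyle	Ignore Chinese writing style	true
5	ignoreEnglishStyle	Ignore English writing style	true
6	ignoreRepeat	Ignore repeated words	false
7	enableNumCheck	Whether to enable number detection	false
8	enableEmailCheck	Whether to enable email detection	false
9	enableUrlCheck	Whether to enable URL detection	false
10	enableIpv4Check	Whether to enable IPv4 detection	false
11	enableWordCheck	Whether to enable sensitive-word detection	true
12	numCheckLen	Number detection with a custom length	8
13	wordTag	Tags associated with words	none
14	charIgnore	Characters to ignore	none
15	wordResultCondition	Extra processing of matched words, for example requiring English words to match in full	Always true

Example (chained calls, fluent API)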

import com.github.houbb.sensitive.word.bs.SensitiveWordBs;
import com.github.houbb.sensitive.word.support.ignore.SensitiveWordCharIgnores;
import com.github.houbb.sensitive.word.support.resultcondition.WordResultConditions;
import com.github.houbb.sensitive.word.support.tag.WordTags;
import org.junit.jupiter.api.Test;

public class CustomConfigurationTests {

    @Test
    public void testCustomConfiguration() {
        SensitiveWordBs sensitiveWordBs = SensitiveWordBs.newInstance()
                .ignoreCase(true)
                .ignoreWidth(true)
                .ignoreNumStyle(true)
                .ignoreChineseStyle(true)
                .ignoreEnglishStyle(true)
                .ignoreRepeat(false)
                .enableNumCheck(false)
                .enableEmailCheck(false)
                .enableUrlCheck(false)
                .enableIpv4Check(false)
                .enableWordCheck(true)
                .numCheckLen(8)
                .wordTag(WordTags.none())
                .charIgnore(SensitiveWordCharIgnores.defaults())
                .wordResultCondition(WordResultConditions.alwaysTrue())
                .init();

        final String text = "五星红旗迎风飘扬，毛主席的画像屹立在天安门前";

        System.out.println("[五星红旗, 毛主席, 天安门]".equals(sensitiveWordBs.findAll(text).toString()));
    }

}
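
To give a feel for what options such as ignoreCase and ignoreWidth involve, here is a small sketch of character normalization. It is only an assumption about the general technique (map full-width forms to half-width, then lowercase) and uses the made-up class CharNormalizeSketch; the library's own conversion rules may differ.

```java
class CharNormalizeSketch {

    /** Maps full-width ASCII variants (U+FF01 to U+FF5E) to half-width and lowercases the result. */
    static char normalize(char c) {
        if (c >= '\uFF01' && c <= '\uFF5E') {
            c = (char) (c - 0xFEE0); // full-width forms sit at a fixed offset from ASCII
        } else if (c == '\u3000') {
            c = ' '; // ideographic space becomes a regular space
        }
        return Character.toLowerCase(c);
    }

    /** Normalizes every character of the text. */
    static String normalize(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            sb.append(normalize(text.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("ＡＢＣ　ｄｅｆ")); // abc def
    }
}
```

If both the word list and the incoming text pass through the same normalization, variants that differ only in case or width end up on the same path of the nested map.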

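15. Integrating sensitive-word with SpringBoot

Building the sensitive-word DFA in init() is relatively time-consuming, so it is generally recommended to initialize once at application startup rather than repeatedly.

Hand SensitiveWordBs over to Spring to manage. WordAllowFromDatabase and WordDenyFromDatabase are custom implementations backed by a database.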

import cn.edu.scau.sensitive.WordAllowFromDatabase;
import cn.edu.scau.sensitive.WordDenyFromDatabase;
import com.github.houbb.sensitive.word.bs.SensitiveWordBs;
import com.github.houbb.sensitive.word.support.allow.WordAllows;
import com.github.houbb.sensitive.word.support.deny.WordDenys;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class SpringSensitiveWordConfig {

    private final WordAllowFromDatabase wordAllowFromDatabase;

    private final WordDenyFromDatabase wordDenyFromDatabase;

    public SpringSensitiveWordConfig(WordAllowFromDatabase wordAllowFromDatabase, WordDenyFromDatabase wordDenyFromDatabase) {
        this.wordAllowFromDatabase = wordAllowFromDatabase;
        this.wordDenyFromDatabase = wordDenyFromDatabase;
    }

    @Bean
    public SensitiveWordBs sensitiveWordBs() {
        return SensitiveWordBs.newInstance()
                .wordAllow(WordAllows.chains(WordAllows.defaults(), wordAllowFromDatabase))
                .wordDeny(WordDenys.chains(WordDenys.defaults(), wordDenyFromDatabase))
                // Other configuration options
                .init();
    }

}

WordAllowFromDatabase.java

import com.github.houbb.sensitive.word.api.IWordAllow;
import org.springframework.stereotype.Component;

import java.util.List;

@Component
public class WordAllowFromDatabase implements IWordAllow {

    @Override
    public List<String> allow() {
        return List.of();
    }

}

WordDenyFromDatabase.java

import com.github.houbb.sensitive.word.api.IWordDeny;
import org.springframework.stereotype.Component;

import java.util.List;

@Component
public class WordDenyFromDatabase implements IWordDeny {

    @Override
    public List<String> deny() {
        return List.of();
    }

}

16. Complete Source Code

The source code used in this demonstration has been uploaded to Gitee: sensitive-word

From： https://blog.csdn.net/m0_62128476/article/details/144548205
