[Data Protection] Microsoft's Open-Source Data Protection Project Presidio: Scanning Chinese Text and Pitfalls to Watch For

Tags: recognizer, NLP, Chinese (zh), Presidio, open source, data protection

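Presidio, whose name comes from Latin and means "protection" or "garrison", is an open-source data protection project from Microsoft. It helps companies and developers quickly identify and anonymize sensitive information when processing data. It can detect many kinds of sensitive data in text and images, including credit card numbers, personal names, locations, and phone numbers, and it anonymizes them in customizable formats to improve data security.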

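This article covers how Presidio scans Chinese text for sensitive information. For a broader introduction to Presidio, see the other articles on the subject.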

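Presidio uses spaCy as its default NLP engine, so supporting Chinese starts with downloading a Chinese language model. The download steps are described on the spaCy language model download page.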

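Step 1: Download and install a Chinese language model. My environment already has the English model, the Spanish model, and two Chinese models installed, which can be checked with the spacy info command.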

$ spacy info
============================== Info about spaCy ==============================

spaCy version    3.7.5
Location         g:\python项目\django项目\pythonproject1\venv\lib\site-packages\spacy
Platform         Windows-10-10.0.19045-SP0
Python version   3.8.10
Pipelines        en_core_web_lg (3.7.1), es_core_news_md (3.7.0), zh_core_web_lg (3.7.0), zh_core_web_sm (3.7.0)

Step 2: Write the configuration that tells Presidio to use the Chinese model.

configure = {
    "nlp_engine_name": "spacy",
    "models": [
        {"lang_code": "zh", "model_name": "zh_core_web_lg"}
    ]
}

Step 3: Create the NLP engine object with Presidio.

provider = NlpEngineProvider(nlp_configuration=configure)
nlp_engine_with_chinese = provider.create_engine()

Step 4: Create a regex-based Chinese recognizer with Presidio. The recognizer in this example detects two Chinese words, "帽衫" and "忠诚".

chinese_pattern = Pattern(name="chinese_pattern", regex="(帽衫|忠诚)", score=0.5)
chinese_recognizer = PatternRecognizer(supported_entity="CHINESE", patterns=[chinese_pattern],
                                       supported_language="zh")

Step 5: Create the Presidio analyzer, passing in the NLP engine, and add the Chinese recognizer to the analyzer's recognizer registry.

analyze = AnalyzerEngine(
     nlp_engine=nlp_engine_with_chinese, supported_languages=["zh"])
analyze.registry.add_recognizer(chinese_recognizer)

Step 6: Scan the Chinese text.

result = analyze.analyze(text=text, language="zh", entities=["CHINESE"])
print("result = ", result)

Putting it all together, the complete scanning code is as follows:

from presidio_analyzer import Pattern, AnalyzerEngine, PatternRecognizer
from presidio_analyzer.nlp_engine import NlpEngineProvider


text = "你好我有一个帽衫，我想在网上问问，为什么我带着显得这么忠诚。"
# Step 1: download and install the Chinese language model, zh_core_web_lg.
# Step 2: write the configuration that tells Presidio to use the Chinese model.
configure = {
    "nlp_engine_name": "spacy",
    "models": [
        {"lang_code": "zh", "model_name": "zh_core_web_lg"}
    ]
}

# Step 3: create the NLP engine object with Presidio.
provider = NlpEngineProvider(nlp_configuration=configure)
nlp_engine_with_chinese = provider.create_engine()

# Step 4: create a regex-based Chinese recognizer with Presidio.
chinese_pattern = Pattern(name="chinese_pattern", regex="(帽衫|忠诚)", score=0.5)
chinese_recognizer = PatternRecognizer(supported_entity="CHINESE", patterns=[chinese_pattern],
                                       supported_language="zh")

# Step 5: create the analyzer with the NLP engine and register the Chinese recognizer.
analyze = AnalyzerEngine(
     nlp_engine=nlp_engine_with_chinese, supported_languages=["zh"])
analyze.registry.add_recognizer(chinese_recognizer)

# Step 6: scan the Chinese text.
result = analyze.analyze(text=text, language="zh", entities=["CHINESE"])
print("result = ", result)

Running the code above produces the following output:

result =  [type: CHINESE, start: 6, end: 8, score: 0.5, type: CHINESE, start: 27, end: 29, score: 0.5]
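The printed result carries only character offsets. As a minimal sketch, the offsets can be sliced back out of the input to confirm which words were matched. The Match tuple below is a hypothetical stand-in for Presidio's result objects (which expose start and end attributes), used so that the snippet runs without Presidio installed; its values are copied from the output above.

```python
from typing import NamedTuple


class Match(NamedTuple):
    """Hypothetical stand-in for one analyzer result (not a Presidio class)."""
    entity_type: str
    start: int
    end: int
    score: float


text = "你好我有一个帽衫，我想在网上问问，为什么我带着显得这么忠诚。"

# Values copied from the printed result above.
matches = [
    Match("CHINESE", 6, 8, 0.5),
    Match("CHINESE", 27, 29, 0.5),
]

# start is inclusive and end is exclusive, so a plain slice recovers the word.
matched_words = [text[m.start:m.end] for m in matches]
print(matched_words)
```

With real results, the same slice would be text[r.start:r.end] for each r in result.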

When scanning Chinese text with Presidio, you may run into the following problems:

Problem 1: ValueError: No matching recognizers were found to serve the request.

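This error means the analyzer found no recognizer for the requested language when scanning the Chinese text. Three settings have to agree. When creating the recognizer object (PatternRecognizer), the supported_language parameter must be set to "zh"; otherwise it defaults to "en" and Presidio treats the recognizer as an English one. When creating the analyzer object (AnalyzerEngine), the supported_languages parameter must include "zh". And when calling the analyzer's analyze function, the language parameter must also be set to "zh".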

Problem 2: A recognizer built from a deny list of sensitive words has no effect.

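This is an area where Presidio's Chinese support is currently incomplete. Reading the Presidio source shows that the PatternRecognizer constructor, when given a deny list, converts that list into a regular expression. For example, if the deny list contains the two words "帽衫" and "忠诚", Presidio converts it into the regular expression "(?:^|(?<=\W))(帽衫|忠诚)(?:(?=\W)|$)". The conversion code is shown below.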

def _deny_list_to_regex(self, deny_list: List[str]) -> Pattern:
    """
    Convert a list of words to a matching regex.
    To be analyzed by the analyze method as any other regex patterns.
    :param deny_list: the list of words to detect
    :return:the regex of the words for detection
    """

    # Escape deny list elements as preparation for regex
    escaped_deny_list = [re.escape(element) for element in deny_list]
    regex = r"(?:^|(?<=\W))(" + "|".join(escaped_deny_list) + r")(?:(?=\W)|$)"
    return Pattern(name="deny_list", regex=regex, score=self.deny_list_score)

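Take the text "你好我有一个帽衫，我想在网上问问，为什么我带着显得这么忠诚。": the deny-list recognizer finds neither "帽衫" nor "忠诚". Yet it does find them in "你好我有一个 帽衫 ，我想在网上问问，为什么我带着显得这么 忠诚 。". The only difference is the spaces added before and after "帽衫" and "忠诚". The cause is the regular expression "(?:^|(?<=\W))(帽衫|忠诚)(?:(?=\W)|$)", which matches a term only when it is bounded on each side by a non-word character (\W) or by the start or end of the text. English words are separated by spaces, so Presidio assumes that a deny-list term is delimited this way; otherwise the match might be part of a longer word. Chinese text has no spaces between words, and Chinese characters themselves count as word characters, so a term embedded in running text fails the boundary check.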
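The behaviour can be reproduced with Python's standard re module alone, without Presidio. The sketch below rebuilds the regular expression the same way _deny_list_to_regex does and runs it against both versions of the sentence. The variable names are illustrative and not taken from Presidio.

```python
import re

deny_list = ["帽衫", "忠诚"]

# Same construction as in _deny_list_to_regex above.
escaped = [re.escape(word) for word in deny_list]
deny_list_regex = r"(?:^|(?<=\W))(" + "|".join(escaped) + r")(?:(?=\W)|$)"

plain_text = "你好我有一个帽衫，我想在网上问问，为什么我带着显得这么忠诚。"
spaced_text = "你好我有一个 帽衫 ，我想在网上问问，为什么我带着显得这么 忠诚 。"

# Chinese characters are word characters, so \W does not match them.
is_word_char = re.fullmatch(r"\w", "个") is not None

hits_plain = re.findall(deny_list_regex, plain_text)
hits_spaced = re.findall(deny_list_regex, spaced_text)

print(is_word_char)   # whether "个" counts as a word character
print(hits_plain)     # matches in the original sentence
print(hits_spaced)    # matches once spaces surround the terms
```

In the original sentence each term is directly preceded by another Chinese character ("个" and "么"), so the lookbehind for \W fails and nothing is reported.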

To work around this, use a regex-based recognizer or a custom recognizer instead, and avoid deny-list recognizers for Chinese text where possible.
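One way to keep the convenience of a word list while using a regex recognizer is to build the alternation yourself, without the \W boundary checks. The helper below is a hypothetical sketch (build_plain_regex is not part of Presidio); the string it returns could be passed as the regex argument of Pattern, as in step 4. It is checked here with the standard re module only.

```python
import re
from typing import List


def build_plain_regex(words: List[str]) -> str:
    """Join words into one alternation with no boundary lookarounds."""
    # Longest first, so a longer term wins over a shorter one that prefixes it;
    # ties are broken alphabetically to keep the output deterministic.
    ordered = sorted(set(words), key=lambda w: (-len(w), w))
    return "(" + "|".join(re.escape(w) for w in ordered) + ")"


text = "你好我有一个帽衫，我想在网上问问，为什么我带着显得这么忠诚。"
pattern = build_plain_regex(["帽衫", "忠诚"])

# Offsets of every match in the unspaced Chinese sentence.
spans = [(m.start(), m.end()) for m in re.finditer(pattern, text)]
print(pattern)
print(spans)
```

The trade-off is that, without boundaries, a term can also match inside a longer word, so the word list should be chosen with that in mind.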

From: https://blog.csdn.net/qq_29490749/article/details/140938216
