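Python: best way to replace punctuation while keeping special words intact

Time: 2024-07-24 09:36:45 Views: 20

Tags: python python-re

I am writing a function that takes keywords containing special characters (@&*%) and keeps them intact while removing all other punctuation from a sentence. I came up with a solution, but it is bulky and probably more complicated than it needs to be. Is there a simpler way to do this?

In short, my code matches every instance of the special words to find their spans. It then matches the punctuation characters to find their spans, loops over that list of matches, and drops any match that falls inside the span of a found word, so that only the remaining punctuation is removed.

Code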

import re
from string import punctuation
from typing import List

sentence = "I am going to run over to Q&A and ask them a ton of questions about this & that & that & this while surfacing the internet! with my raccoon buddy @ the bar."

# my attempt to remove punctuation
class SentenceHolder:
    protected_words = ["Q&A"]

    def __init__(self, sentence):
        self.sentence = sentence

    def remove_punctuation(self):
        # Collect every punctuation match that lies outside a protected word.
        removable_matches: List[re.Match] = []
        for punct in punctuation:
            # re.escape is needed because characters such as "*" or "(" are regex metacharacters.
            symbol_matches = list(re.finditer(re.escape(punct), self.sentence))
            removable_matches.extend(self._protected_word_overlap(symbol_matches))

        # Replace from right to left so that earlier offsets stay valid.
        for match in sorted(removable_matches, key=lambda m: m.start(), reverse=True):
            self.sentence = self.sentence[:match.start()] + " " + self.sentence[match.end():]

    def _protected_word_overlap(self, symbol_matches):
        # Locate every occurrence of every protected word.
        protected_word_locations = []
        for protected_word in self.protected_words:
            protected_word_locations.extend(re.finditer(re.escape(protected_word), self.sentence))

        # A symbol is protected if its span overlaps the span of a protected word.
        protected_matches = []
        for protected_word in protected_word_locations:
            protected_word_set = set(range(protected_word.start(), protected_word.end()))
            for symbol_inst in symbol_matches:
                symbol_range = range(symbol_inst.start(), symbol_inst.end())
                if protected_word_set.intersection(symbol_range):
                    protected_matches.append(symbol_inst)

        removable_matches = [sm for sm in symbol_matches if sm not in protected_matches]

        return removable_matches

Output of the code

my_string = SentenceHolder(sentence)
my_string.remove_punctuation()

Result

"I am going to run over to Q&A and ask them a ton of questions about this  that   that   this while surfacing the internet  with my raccoon buddy   the bar"

I tried to use regex and pattern to identify all the locations of the punctuation but the pattern I use in re.sub does not work similarly in re.match.
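The mismatch between re.sub and re.match described above is most likely due to anchoring: re.match only tries the pattern at the very start of the string, whereas re.sub, re.search and re.finditer scan the whole string. The sketch below illustrates this on a short made-up string; `text` and `punct_class` are names chosen for this example and are not part of the original code.

```python
import re
from string import punctuation

text = "this & that!"
# Character class matching any single punctuation character.
# re.escape keeps characters such as "]" and "^" from breaking the class.
punct_class = "[{}]".format(re.escape(punctuation))

# re.match only tries the pattern at position 0, so it finds nothing here.
print(re.match(punct_class, text))  # None

# re.search scans forward to the first hit; re.finditer yields every hit.
first_hit = re.search(punct_class, text)
all_spans = [m.span() for m in re.finditer(punct_class, text)]
print(first_hit.span())  # (5, 6)
print(all_spans)         # [(5, 6), (11, 12)]

# re.sub scans the whole string, like re.finditer.
replaced = re.sub(punct_class, " ", text)
print(repr(replaced))    # 'this   that '
```

To collect spans, re.finditer is therefore the counterpart of re.sub; re.match is only appropriate when the pattern must match at the beginning of the string.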

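This can be done more concisely with a single regular-expression substitution. A lookahead alone cannot tell whether a punctuation character sits in the middle of a protected word, so the pattern below instead matches protected words first and keeps them unchanged, and replaces any other punctuation character with a space:

import re
import string

def remove_punctuation(text, protected_words):
  """Remove punctuation but keep protected words.

  Args:
    text: The string to remove punctuation from.
    protected_words: The list of words to keep intact.

  Returns:
    The string with unprotected punctuation replaced by spaces.
  """
  # Alternation of the protected words, longest first so that a longer word
  # wins over a shorter one that is its prefix.
  protected = '|'.join(re.escape(word)
                       for word in sorted(protected_words, key=len, reverse=True))
  # Character class matching any single punctuation character.
  punct = "[{}]".format(re.escape(string.punctuation))
  # Try the protected words first, then fall back to plain punctuation.
  pattern = "({})|{}".format(protected, punct) if protected else punct
  # Keep a protected word unchanged; replace any other punctuation with a space.
  return re.sub(pattern, lambda m: m.group(1) if m.lastindex else ' ', text)

sentence = "I am going to run over to Q&A and ask them a ton of questions about this & that & that & this while surfacing the internet! with my raccoon buddy @ the bar."
protected_words = ["Q&A", "@"]
result = remove_punctuation(sentence, protected_words)
print(result)

The code works as follows:

Define a function: it takes text (the string to clean) and protected_words (the list of words to keep) as input.
Build the regular expression pattern:
({}) filled with '|'.join(re.escape(word) for word in ...): a capturing group that matches any one of the protected words, with each word escaped so that characters such as & or @ are taken literally.
[{}] filled with re.escape(string.punctuation): a character class that matches any single punctuation character. The regex engine tries the alternatives from left to right at each position, so this branch is reached only when no protected word starts there.
Perform the substitution: re.sub calls the replacement function for every match. The function returns the protected word unchanged when the capturing group matched, and a space otherwise.
Return the result: the function returns the cleaned string.

Because "@" is listed in protected_words in this example, the standalone @ before "the bar" is kept as well. This solution avoids explicit loops and index bookkeeping, which makes it shorter and easier to read.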
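For comparison, the span-based idea from the question can also be written compactly without a class. The following is a minimal sketch, assuming that protected words are matched literally and that every unprotected punctuation character should become a single space; `strip_unprotected` is a hypothetical name introduced here for illustration.

```python
import re
from string import punctuation

def strip_unprotected(text, protected_words):
    """Replace punctuation with spaces, except inside protected words."""
    # Indices of every character covered by an occurrence of a protected word.
    protected_idx = set()
    for word in protected_words:
        for m in re.finditer(re.escape(word), text):
            protected_idx.update(range(m.start(), m.end()))

    punct = set(punctuation)
    # Rebuild the string, swapping unprotected punctuation for a space.
    return "".join(
        " " if ch in punct and i not in protected_idx else ch
        for i, ch in enumerate(text)
    )

example = strip_unprotected("Ask Q&A about this & that!", ["Q&A"])
print(repr(example))  # 'Ask Q&A about this   that '
```

This keeps the "find spans, then filter" logic of the original attempt but stores the protected positions in a set, so each character needs only one membership check.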

From: 78786100
