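I'm building a function that takes keywords containing special characters (@&*%) and keeps them intact while removing all other punctuation from a sentence. I've come up with a solution, but it is bulky and probably more complicated than it needs to be. Is there a simpler way to do this?
In short, my code matches every instance of the special words to find their spans. It then matches the punctuation characters to find their spans, loops over that list of matches, and drops any character whose span overlaps a protected word's span.
Code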
import re
from string import punctuation
from typing import List

sentence = "I am going to run over to Q&A and ask them a ton of questions about this & that & that & this while surfacing the internet! with my raccoon buddy @ the bar."

# my attempt to remove punctuation
class SentenceHolder:
    sentence = None
    protected_words = ["Q&A"]

    def __init__(self, sentence):
        self.sentence = sentence

    def remove_punctuation(self):
        for punct in punctuation:
            # escape the symbol so regex metacharacters such as * or ( are matched literally
            symbol_matches: List[re.Match] = [i for i in re.finditer(re.escape(punct), self.sentence)]
            remove_able_matches = self._protected_word_overlap(symbol_matches)
            # replace from the end of the string backwards, one space per symbol
            for word in reversed(remove_able_matches):
                self.sentence = (self.sentence[:word.start()] + " " + self.sentence[word.end():])

    def _protected_word_overlap(self, symbol_matches):
        # locate every occurrence of every protected word
        protected_word_locations = []
        for protected_word in self.protected_words:
            protected_word_locations.extend([i for i in re.finditer(re.escape(protected_word), self.sentence)])
        # a symbol is protected if its span overlaps the span of a protected word
        protected_matches = []
        for protected_word in protected_word_locations:
            for symbol_inst in symbol_matches:
                symbol_range: range = range(symbol_inst.start(), symbol_inst.end())
                protected_word_set = set(range(protected_word.start(), protected_word.end()))
                if len(protected_word_set.intersection(symbol_range)) != 0:
                    protected_matches.append(symbol_inst)
        remove_able_matches = [sm for sm in symbol_matches if sm not in protected_matches]
        return remove_able_matches
Running the code
my_string = SentenceHolder(sentence)
my_string.remove_punctuation()
Result
"I am going to run over to Q&A and ask them a ton of questions about this that that this while surfacing the internet with my raccoon buddy the bar"
I tried to use regex and pattern to identify all the locations of the punctuation but the pattern I use in re.sub does not work similarly in re.match.
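That difference is expected: re.match only tries the pattern at the very start of the string, whereas re.sub, re.search, and re.finditer scan the whole string. A minimal sketch of the contrast, where the sample string and the pattern are made up purely for illustration:

```python
import re

text = "this & that!"   # hypothetical sample string
pattern = r"[&!]"       # hypothetical pattern: an ampersand or an exclamation mark

# re.match anchors at position 0, so it finds nothing here ("t" is not punctuation).
print(re.match(pattern, text))

# re.search scans forward and returns the first match anywhere in the string.
print(re.search(pattern, text).span())

# re.finditer yields every match, which is what collecting positions requires.
print([m.span() for m in re.finditer(pattern, text)])

# re.sub also scans the whole string and replaces each match.
print(re.sub(pattern, " ", text))
```

So for locating punctuation, re.finditer (or re.search for a single hit) is the counterpart of re.sub, not re.match.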
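This can be done more concisely with a single regex substitution. A negative lookahead only inspects the text after a punctuation mark, while a protected word such as Q&A also has characters before it, so the code below uses alternation instead: protected words are tried first and put back unchanged, and any other punctuation mark is replaced with a space:
import re
import string

def remove_punctuation(text, protected_words):
    """Remove punctuation but keep protected words intact.

    Args:
        text: The string to remove punctuation from.
        protected_words: List of words to keep unchanged.

    Returns:
        The string with unprotected punctuation replaced by spaces.
    """
    # Build a pattern that tries the protected words first (group 1),
    # then falls back to any single punctuation character.
    protected = '|'.join(re.escape(word) for word in protected_words)
    pattern = r"({})|[{}]".format(protected, re.escape(string.punctuation))
    # Put a protected word back unchanged; replace other punctuation with a space.
    return re.sub(pattern, lambda m: m.group(1) or ' ', text)

sentence = "I am going to run over to Q&A and ask them a ton of questions about this & that & that & this while surfacing the internet! with my raccoon buddy @ the bar."
protected_words = ["Q&A", "@"]
result = remove_punctuation(sentence, protected_words)
print(result)
Here is what the code does:
- Define a function: it takes `text` (the string to clean) and `protected_words` (the list of words to keep) as input.
- Build the regex pattern:
  - `({})` filled with `'|'.join(re.escape(word) for word in protected_words)`: a capturing group that matches any one of the protected words. `re.escape` makes characters such as `&` or `*` match literally.
  - `[{}]` filled with `re.escape(string.punctuation)`: a character class that matches any single punctuation character.
- Perform the substitution: `re.sub` scans the string from left to right. At each position the protected words are tried before the punctuation class, so the `&` in `Q&A` is consumed as part of the protected word and is never seen on its own. The replacement function returns the protected word unchanged when group 1 matched, and a space otherwise.
- Return the result: the function returns the cleaned string.
Because "@" is listed in protected_words here, the standalone @ before "the bar" is kept; remove it from the list to have it replaced as in the result shown in the question.
This solution avoids explicit loops and index bookkeeping, which makes it shorter and easier to read.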
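Two edge cases may be worth guarding against with the alternation approach: an empty list of protected words, and protected words where one is a prefix of another. The sketch below is one possible variant, not a definitive implementation; the function name `remove_punctuation_safe` and the sample words are hypothetical and chosen only for illustration:

```python
import re
import string


def remove_punctuation_safe(text, protected_words):
    """Replace punctuation with spaces, leaving protected words untouched."""
    punct_class = "[{}]".format(re.escape(string.punctuation))
    # Drop empty strings and try longer words first so "Q&A*" wins over "Q&A".
    words = sorted((w for w in protected_words if w), key=len, reverse=True)
    if not words:
        # Nothing to protect: a plain character-class substitution is enough.
        return re.sub(punct_class, " ", text)
    protected = "|".join(re.escape(w) for w in words)
    pattern = re.compile("({})|{}".format(protected, punct_class))
    # Group 1 is set only when a protected word matched; otherwise it is None.
    return pattern.sub(lambda m: m.group(1) or " ", text)


print(remove_punctuation_safe("Ask Q&A* or Q&A, not R&D!", ["Q&A", "Q&A*"]))
```

Sorting longest first matters because regex alternation takes the first alternative that matches, not the longest one. Without the sort, "Q&A" would match inside "Q&A*" and the trailing * would then be replaced as ordinary punctuation.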
Tags: python, python-re From: 78786100