I'm trying to remove the legal suffixes from some company names. The expected result is as follows:
Original names:
Apple Inc.
Sony Corporation
Fiat Chrysler Automobiles S.p.A.
Samsung Electronics Co., Ltd.
Cleaned names:
Apple
Sony
Fiat Chrysler Automobiles
Samsung Electronics
What I have done so far:
import re

def remove_company_suffixes(company_name):
    suffix_pattern = r"\s*(?:co(?:rp(?:oration)?|mpany)?|ltd\.|llc|gmbh|sa|sp\.a\.|s\.r\.l\.|ag|nv|bv|inc\.|s\.a\.s\.|e\.u\.|s\.l\.|s\.a\.l\.|doo|dooel|d.o.o.|szr|ltd|inc|llc|corp|ag|sa|sp|sl)\.?$"
    # The pattern is written in lowercase, so match case-insensitively
    return re.sub(suffix_pattern, '', company_name.strip(), flags=re.IGNORECASE)

company_names = ["Apple Inc.", "Sony Corporation", "Fiat Chrysler Automobiles S.p.A.", "Samsung Electronics Co., Ltd.", "Plasticos SA", "ABC GmbH"]
for company_name in company_names:
    cleaned_name = remove_company_suffixes(company_name)
    print(cleaned_name)
The result is:
Apple
Sony
Fiat Chrysler Automobiles S.p.A.
Samsung Electronics Co.,
Plasticos
ABC
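Your regex is close and already catches most of the suffixes, but two things go wrong. First, "S.p.A." is never matched: the alternation contains sp\.a\., which matches "sp.a.", rather than s\.p\.a\. (the amount of whitespace before the suffix is not the issue, because \s* already matches any number of spaces). Second, "Co., Ltd." is two suffixes separated by a comma, and the pattern only removes the last one.
To fix both, add s\.p\.a\. to the alternation and let the suffix group, preceded by an optional comma, repeat. Using \s+ instead of \s* is still worthwhile: it requires the suffix to be a separate word, so a name such as "Tesco" is not cut down to "Tes". The unescaped dots in d.o.o. are escaped as well:
suffix_pattern = r"(?:,?\s+(?:co(?:rp(?:oration)?|mpany)?|ltd\.|llc|gmbh|sa|s\.p\.a\.|s\.r\.l\.|ag|nv|bv|inc\.|s\.a\.s\.|e\.u\.|s\.l\.|s\.a\.l\.|doo|dooel|d\.o\.o\.|szr|ltd|inc|llc|corp|ag|sa|sp|sl)\.?)+$"
Here is the updated code:
import re

def remove_company_suffixes(company_name):
    # One or more trailing suffixes, each preceded by an optional comma
    # and at least one whitespace character, e.g. " Co., Ltd."
    suffix_pattern = r"(?:,?\s+(?:co(?:rp(?:oration)?|mpany)?|ltd\.|llc|gmbh|sa|s\.p\.a\.|s\.r\.l\.|ag|nv|bv|inc\.|s\.a\.s\.|e\.u\.|s\.l\.|s\.a\.l\.|doo|dooel|d\.o\.o\.|szr|ltd|inc|llc|corp|ag|sa|sp|sl)\.?)+$"
    # The pattern is lowercase, so re.IGNORECASE is needed to match "Inc." or "GmbH"
    return re.sub(suffix_pattern, '', company_name.strip(), flags=re.IGNORECASE)

company_names = ["Apple Inc.", "Sony Corporation", "Fiat Chrysler Automobiles S.p.A.", "Samsung Electronics Co., Ltd.", "Plasticos SA", "ABC GmbH"]
for company_name in company_names:
    cleaned_name = remove_company_suffixes(company_name)
    print(cleaned_name)
This outputs the desired result: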
Apple
Sony
Fiat Chrysler Automobiles
Samsung Electronics
Plasticos
ABC
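As the suffix list grows, a hand-written alternation gets hard to audit, and a single misplaced dot is easy to overlook. One possible alternative, shown here only as a minimal sketch, is to keep the suffixes in a plain list and build the pattern from it. The names SUFFIXES, build_suffix_regex, and strip_suffixes are hypothetical, the list is just an illustrative subset, and treating every dot as optional is an assumption that may be too permissive for some data sets.

```python
import re

# Illustrative subset of legal suffixes, written without dots (assumption:
# extend this list to cover the forms that occur in your own data).
SUFFIXES = ["inc", "corp", "corporation", "co", "company", "ltd", "llc",
            "gmbh", "ag", "nv", "bv", "sa", "spa", "srl", "sas", "sl", "doo"]


def build_suffix_regex(suffixes):
    """Compile a regex matching one or more trailing suffixes."""
    # Longest first, so "corporation" is tried before "corp" and "co".
    ordered = sorted(suffixes, key=len, reverse=True)
    # Allow an optional dot after every letter: "spa" also matches "S.p.A.".
    alternatives = [r"\.?".join(re.escape(ch) for ch in s) + r"\.?"
                    for s in ordered]
    body = "|".join(alternatives)
    # Each suffix must be preceded by an optional comma and whitespace.
    return re.compile(rf"(?:,?\s+(?:{body}))+$", re.IGNORECASE)


SUFFIX_RE = build_suffix_regex(SUFFIXES)


def strip_suffixes(name):
    """Return the name with trailing legal suffixes removed."""
    return SUFFIX_RE.sub("", name.strip())


for name in ["Apple Inc.", "Fiat Chrysler Automobiles S.p.A.",
             "Samsung Electronics Co., Ltd.", "Tesco"]:
    print(strip_suffixes(name))
```

Because the pattern requires whitespace before each suffix, "Tesco" is left untouched, while "Co., Ltd." is removed as a chain of two suffixes.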
Tags: python, regex, string
From: 78800653