I have a function, used in a script I wrote, that removes redundant blocking keywords from a list. Basically, given the input (in any order):
{"apple", "bapple", "banana", "cherry", "bananaman", "sweetherrypie", "sweet", "b"}
it should output a reduced array of strings (in any order):
{"apple", "cherry", "sweet", "b"}
This is done to reduce the number of keywords a matching algorithm has to pass over, as part of a personal learning project (if a string contains "banana", it necessarily also contains "b", so "banana" is redundant and can be discarded).
It needs to be extremely performant, as it has to run on lists of over 10,000,000 strings.
Initially, I wrote some Python code to do this:
def deduplicate_keywords(keywords):
    # Remove duplicates and sort the keywords by length in ascending order
    keywords = sorted(set(keywords), key=len)
    unique_keywords = []
    for keyword in keywords:
        not_redundant = True
        for unique_keyword in unique_keywords:
            # Check if the current keyword includes any of the unique keywords
            if unique_keyword in keyword:
                not_redundant = False
                break
        # If the keyword does not include any of the shorter keywords, add it to unique_keywords
        if not_redundant:
            unique_keywords.append(keyword)
    return unique_keywords
along with a performance benchmark to measure it:
import time
import unittest

class TestDeduplicateKeywords(unittest.TestCase):
    def test_performance_intensive_case(self):
        keywords = [
            "abcdef",
            "def",
            "ghidef",
            "jkl",
            "mnop",
            "qrst",
            "uvwxyz",
            "ghijkl",
            "abcdefgh",
            "ijklmnop",
            "mnopqrstuv",
            "rstuvwx",
            "mnopqrstuvwx",
            "uvwx",
        ]
        # Extending the list to make it very long and complex
        base_keywords = keywords[:]
        for i in range(200000):
            base_keywords.extend([f"{k}{i}" for k in keywords])
            base_keywords.extend([f"{i}{k}" for k in keywords])
            base_keywords.extend([f"{i}{k}{i}" for k in keywords])
        keywords = base_keywords

        old_function_time = 0
        new_function_time = 0
        for repeat in range(1):
            # This loop (not present in the cpp code) is used to loop it multiple times to find an average
            start_time = time.time()
            result = deduplicate_keywords(keywords)
            end_time = time.time()
            old_function_time = (
                repeat * old_function_time + (end_time - start_time)
            ) / (repeat + 1)

            start_time = time.time()
            result = deduplicate_keywords_new(keywords)
            # Above is a function that can be edited to check whether changes speed up the code
            end_time = time.time()
            new_function_time = (
                repeat * new_function_time + (end_time - start_time)
            ) / (repeat + 1)

        # As the list is extended with numbers, the result will be unique non-redundant keywords
        expected_result = ["def", "jkl", "mnop", "qrst", "uvwx"]
        self.assertEqual(set(result), set(expected_result))
        print(
            f"Test performance_intensive_case for old function took {old_function_time} seconds\n"
            f"Test performance_intensive_case for new function took {new_function_time} seconds\n"
        )
        if old_function_time < new_function_time:
            print("Old is faster than new")
        else:
            print("New is faster than old")

if __name__ == "__main__":
    unittest.main()
This was too slow for my needs: about 12 seconds on my weak laptop. So I tried writing the same thing in cpp to see if I would get a performance gain, but it was much slower, so I added some timing code to the function itself to see where the slowdown was:
#include <iostream>
#include <vector>
#include <string>
#include <set>
#include <algorithm>
#include <chrono>
#include <cassert>

std::vector<std::string> deduplicate_keywords(const std::vector<std::string>& keywords) {
    auto t_start = std::chrono::high_resolution_clock::now();

    // Remove duplicates using a set
    auto t1 = std::chrono::high_resolution_clock::now();
    std::cout << "started old function!" << std::endl;
    std::set<std::string> unique_set(keywords.begin(), keywords.end());
    auto t2 = std::chrono::high_resolution_clock::now();
    std::cout << "removed duplicates in " << std::chrono::duration<double>(t2 - t1).count() << " seconds" << std::endl;

    // Convert set back to vector and sort by length
    auto t3 = std::chrono::high_resolution_clock::now();
    std::vector<std::string> sorted_keywords(unique_set.begin(), unique_set.end());
    std::sort(sorted_keywords.begin(), sorted_keywords.end(), [](const std::string& a, const std::string& b) {
        return a.length() < b.length();
    });
    auto t4 = std::chrono::high_resolution_clock::now();
    std::cout << "Sorted by length in " << std::chrono::duration<double>(t4 - t3).count() << " seconds" << std::endl;

    // Filter out redundant keywords
    auto t5 = std::chrono::high_resolution_clock::now();
    std::vector<std::string> unique_keywords;
    for (const auto& keyword : sorted_keywords) {
        for (const auto& unique_keyword : unique_keywords) {
            // Check if the current keyword includes any of the unique keywords
            if (keyword.find(unique_keyword) != std::string::npos) {
                goto next_iteration;
            }
        }
        // If the keyword does not include any of the shorter keywords, add it to unique_keywords
        unique_keywords.push_back(keyword);
    next_iteration:
        continue;
    }
    auto t6 = std::chrono::high_resolution_clock::now();
    std::cout << "Filtered in " << std::chrono::duration<double>(t6 - t5).count() << " seconds" << std::endl;

    auto t7 = std::chrono::high_resolution_clock::now();
    sorted_keywords.clear();
    sorted_keywords.shrink_to_fit();
    unique_keywords.shrink_to_fit();
    auto t8 = std::chrono::high_resolution_clock::now();
    std::cout << "Destructed in " << std::chrono::duration<double>(t8 - t7).count() << " seconds" << std::endl;

    auto t9 = std::chrono::high_resolution_clock::now();
    auto total_time = std::chrono::duration<double>(t9 - t_start).count();
    std::cout << "Total function time: " << total_time << " seconds" << std::endl;
    return unique_keywords;
}
// Experimental variant (currently an empty stub), defined elsewhere
std::vector<std::string> deduplicate_keywords_new(const std::vector<std::string>& keywords);

void test_performance_intensive_case() {
    std::vector<std::string> keywords = {
        "abcdef", "def", "ghidef", "jkl", "mnop", "qrst", "uvwxyz",
        "ghijkl", "abcdefgh", "ijklmnop", "mnopqrstuv", "rstuvwx",
        "mnopqrstuvwx", "uvwx"
    };
    // Extending the list to make it very long and complex
    std::vector<std::string> base_keywords = keywords;
    for (int i = 0; i < 20000; ++i) {
        for (const auto& k : keywords) {
            base_keywords.push_back(k + std::to_string(i));
            base_keywords.push_back(std::to_string(i) + k);
            base_keywords.push_back(std::to_string(i) + k + std::to_string(i));
        }
    }
    auto keywords_extended = base_keywords;
    std::set<std::string> expected_result = {"def", "jkl", "mnop", "qrst", "uvwx"};

    // there is no looping to calculate an average because it is so much slower
    auto start_time_total = std::chrono::high_resolution_clock::now();
    auto start_time = std::chrono::high_resolution_clock::now();
    std::vector<std::string> result_old = deduplicate_keywords(keywords_extended);
    auto end_time = std::chrono::high_resolution_clock::now();
    double old_function_time = std::chrono::duration<double>(end_time - start_time).count();
    auto end_time_total = std::chrono::high_resolution_clock::now();
    double total_test_time = std::chrono::duration<double>(end_time_total - start_time_total).count();
    assert(std::set<std::string>(result_old.begin(), result_old.end()) == expected_result);
    std::cout << "old function time: " << old_function_time << " seconds" << std::endl;
    std::cout << "Total test case time: " << total_test_time << " seconds" << std::endl;

    auto start_time_total_new = std::chrono::high_resolution_clock::now();
    auto start_time_new = std::chrono::high_resolution_clock::now();
    std::vector<std::string> result_new = deduplicate_keywords_new(keywords_extended); // same as in the python code, useful for comparison
    auto end_time_new = std::chrono::high_resolution_clock::now();
    double new_function_time = std::chrono::duration<double>(end_time_new - start_time_new).count();
    auto end_time_total_new = std::chrono::high_resolution_clock::now();
    double total_test_time_new = std::chrono::duration<double>(end_time_total_new - start_time_total_new).count();
    assert(std::set<std::string>(result_new.begin(), result_new.end()) == expected_result);
    std::cout << "new function time: " << new_function_time << " seconds" << std::endl;
    std::cout << "Total test case time: " << total_test_time_new << " seconds" << std::endl;

    std::cout << "Test performance_intensive_case for old function took " << old_function_time
              << " seconds but new function took " << new_function_time << " seconds" << std::endl;
    if (old_function_time < new_function_time) {
        std::cout << "Old is faster than new" << std::endl;
    } else {
        std::cout << "New is faster than old" << std::endl;
    }
}

int main() {
    test_performance_intensive_case();
    return 0;
}
Besides the overall slowdown, the logs seem to indicate that the return statement (`return unique_keywords`) takes extremely long. This is the terminal output (shortened; logs for the currently empty deduplicate_keywords_new function removed):
started old function!
removed duplicates in 9.64173 seconds
Sorted by length in 3.36138 seconds
Filtered in 0.366933 seconds
Destructed in 7.63208 seconds
Total function time: 21.0037 seconds
old function time: 405.213 seconds
Total test case time: 405.213 seconds
Why is this? And what can I do to stop it? I am fairly new to cpp coding and development, so it would be great if your explanations were detailed and thorough. Thanks!
The performance problem you're hitting in the C++ code has nothing to do with the return statement itself; it comes from how the vector is copied and deallocated. The problem is this line in the test_performance_intensive_case function:

std::vector<std::string> result_old = deduplicate_keywords(keywords_extended);

What happens here is that the unique_keywords vector returned by deduplicate_keywords is copied by value into the result_old vector. That means the entire contents of the vector, including all of its strings, are copied, which is a costly operation, especially when the vector is large.
The reason the C++ code takes so much longer than the Python code is that Python never copies here at all: returning a list in Python simply hands back a reference to the same list object.
To fix this, avoid the copy when handing the result back. One caveat first: you cannot simply change the function to return a reference to unique_keywords, because unique_keywords is a local variable that is destroyed when the function returns, so the reference would dangle. A safe way to get the same effect is to pass the result vector in by reference as an output parameter:

void deduplicate_keywords(const std::vector<std::string>& keywords, std::vector<std::string>& unique_keywords) {
    // ... [rest of the function unchanged, but filling the caller's unique_keywords] ...
}

and call it from test_performance_intensive_case like this:

std::vector<std::string> result_old;
deduplicate_keywords(keywords_extended, result_old);

Because the function fills a vector the caller already owns, no copy of the vector or its contents is ever made, which should significantly improve performance. (In practice, modern compilers often avoid the copy even when returning by value, via move semantics and return-value optimization, so profile before and after.)
Beyond this change, here are a few additional tips that can further optimize the C++ code:
- Pre-allocate vector capacity with reserve(): before adding elements to the unique_keywords vector, pre-allocate its capacity with unique_keywords.reserve(sorted_keywords.size()). This minimizes the number of times the vector has to resize while elements are added.
- Consider std::unordered_set instead of std::set: if you don't need the keywords sorted, std::unordered_set gives faster duplicate removal.
- Use a profiler to identify bottlenecks: a profiler can pinpoint the specific parts of the code that take the most time, letting you focus your optimization effort on those areas.
By applying these changes and profiling your code further, you can significantly improve its performance.
Tags: python, c++, algorithm, optimization, string-matching From: 78812814