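I wrote some Python code that takes a product title as input and looks for the most similar title among those shown on a product page. Everything runs fine, but it identifies some titles incorrectly. I think it gets the titles that contain numbers wrong.
Explanation: the input to the get_price(myProductTitle) function is a title, for example: Razer Gold PIN Malaysia 7 MYR
It is checked against the list below:
product_spans = soup.find_all('span', class_='variable-item-span')
Sample output:
['Razer Gold MY - 5 MYR', 'Razer Gold MY - 500 MYR', 'Razer Gold MY - 20 MYR', 'Razer Gold MY - 200 MYR', 'Razer Gold MY - 7 MYR', 'Razer Gold MY - 100 MYR', 'Razer Gold MY - 30 MYR', 'Razer Gold MY - 10 MYR', 'Razer Gold MY - 300 MYR', 'Razer Gold MY - 50 MYR', 'Razer Gold MY - 40 MYR', 'Razer Gold MY - 3 MYR']
For example, for my title it should find Razer Gold MY - 7 MYR in the list above, but the result it returns is wrong.
My code: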
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import time
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from woocommerce import API
import re

wcapi = API(
    url="http://example.com",
    consumer_key="",
    consumer_secret="",
    version="wc/v3"
)
response = wcapi.get("products", params={"per_page": 99, "category": 74})

chrome_options = Options()
chrome_options.add_argument("--headless")

def preprocess(text):
    # Strip everything except letters, digits and whitespace, then lowercase.
    text = re.sub(r'[^A-Za-z0-9\s]', '', text).lower()
    return text

def get_price(myProductTitle):
    driver = webdriver.Chrome(options=chrome_options)
    print(myProductTitle)
    url = "http://example.com/product/razer-gold-myr/"
    driver.get(url)
    time.sleep(2)
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, 'variable-item-span'))
    )
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    # product_spans = soup.find_all('span', class_='variable-item-span')
    # Hardcoded here as plain strings: the texts of the spans found on the page.
    product_spans = [
        'Razer Gold MY - 5 MYR', 'Razer Gold MY - 500 MYR',
        'Razer Gold MY - 20 MYR', 'Razer Gold MY - 200 MYR',
        'Razer Gold MY - 7 MYR', 'Razer Gold MY - 100 MYR',
        'Razer Gold MY - 30 MYR', 'Razer Gold MY - 10 MYR',
        'Razer Gold MY - 300 MYR', 'Razer Gold MY - 50 MYR',
        'Razer Gold MY - 40 MYR', 'Razer Gold MY - 3 MYR',
    ]
    if not product_spans:
        print("No product spans found.")
        driver.quit()
        return
    # The hardcoded items are strings; with the find_all() result above,
    # use span.get_text().strip() instead.
    texts = [span.strip() for span in product_spans]
    print("Extracted texts:", texts)
    if not texts:
        print("No texts to compare.")
        driver.quit()
        return
    preprocessed_title = preprocess(myProductTitle)
    preprocessed_texts = [preprocess(text) for text in texts]
    vectorizer = TfidfVectorizer()
    # Row 0 is the query title; the remaining rows are the candidate titles.
    tfidf_matrix = vectorizer.fit_transform([preprocessed_title] + preprocessed_texts)
    cosine_similarities = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:])
    print("Cosine similarities:", cosine_similarities)
    if cosine_similarities.max() == 0:
        print("No similar product found.")
        driver.quit()
        return
    most_similar_index = cosine_similarities.argmax()
    most_similar_text = texts[most_similar_index]
    most_similar_span = product_spans[most_similar_index]
    print(most_similar_text)
    print(most_similar_span)
    driver.quit()

if response.status_code == 200:
    products = response.json()
    for product in products:
        get_price(product['name'])
else:
    print('error product not found')
Compare titles in a list and find the similar title
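The problem with the code is that it leans too heavily on TF-IDF and cosine similarity to find similar titles. These methods work well for natural-language text, but they are more machinery than this task needs and can give inaccurate results here. Specifically, TF-IDF weights terms by how often they occur across the set of documents, and with its default settings TfidfVectorizer only keeps tokens of two or more characters, so a single-digit amount such as the "7" in "7 MYR" is dropped before any weighting happens. The titles for 3, 5 and 7 MYR then look identical to the vectorizer.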
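One concrete thing worth checking: scikit-learn documents the default `token_pattern` of `TfidfVectorizer` as `(?u)\b\w\w+\b`, which keeps only tokens of two or more characters. The sketch below applies that same regular expression with the standard `re` module to the titles from the question. It does not call scikit-learn itself, and `tokenize` is a hypothetical helper name, so treat it as an illustration of the tokenization step rather than of the full pipeline.

```python
import re

# Default token_pattern documented for scikit-learn's TfidfVectorizer:
# it only keeps tokens made of two or more word characters.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"


def tokenize(text, pattern=DEFAULT_TOKEN_PATTERN):
    """Hypothetical helper: lowercase the text and return the tokens the pattern keeps."""
    return re.findall(pattern, text.lower())


query_tokens = tokenize("Razer Gold PIN Malaysia 7 MYR")
seven_tokens = tokenize("Razer Gold MY - 7 MYR")
five_tokens = tokenize("Razer Gold MY - 5 MYR")
fifty_tokens = tokenize("Razer Gold MY - 50 MYR")

print(query_tokens)   # the single-character "7" is not among the tokens
print(seven_tokens)
print(five_tokens)    # same tokens as the 7 MYR title
print(fifty_tokens)   # the two-character "50" is kept

# A pattern that also accepts single-character tokens retains the amount.
print(tokenize("Razer Gold MY - 7 MYR", pattern=r"(?u)\b\w+\b"))
```

If this is what happens in your run, the 3, 5 and 7 MYR entries produce identical vectors and `argmax` simply returns the first of them. Passing `token_pattern=r"(?u)\b\w+\b"` to `TfidfVectorizer` is one way to keep the single-digit amounts if you want to stay with TF-IDF.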
Rather than using TF-IDF and cosine similarity, try a simpler rule-based approach to comparing the titles. For example, you could try the following:
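- Match on the parsed name and value: since the product titles in the list seem to follow a consistent structure ("product name - value"), you can simply extract the product name and the value from the two titles being compared. In your example every entry in the list shares the same name, and your own title spells it differently ("Razer Gold PIN Malaysia" vs "Razer Gold MY"), so the value is the part that identifies the match. If the values are equal, you have found the matching item without any complex comparison. Here is an example of how to do this:
import re

def extract_product_info(title):
    """Extract the product name and value from a title."""
    parts = title.split(' - ')
    if len(parts) == 2:
        return parts[0].strip(), parts[1].strip()
    # No ' - ' separator (e.g. 'Razer Gold PIN Malaysia 7 MYR'):
    # treat a trailing "<amount> <currency>" as the value.
    match = re.search(r'(\d+(?:\.\d+)?\s+[A-Za-z]+)\s*$', title)
    if match:
        return title[:match.start()].strip(), match.group(1)
    return title, None

def get_price(myProductTitle):
    # ... (existing code that extracts product_spans)
    my_product_name, my_product_value = extract_product_info(myProductTitle)
    for span in product_spans:
        span_text = span.get_text().strip()
        span_product_name, span_product_value = extract_product_info(span_text)
        # Compare the values: the names are written differently in the two sources.
        if my_product_value is not None and my_product_value == span_product_value:
            print("Match found:", span_text)
            # ... (add your price extraction logic here)
            return
    print("No match found.")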
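- If you need a more fine-grained comparison: if strict matching is not robust enough, you can use difflib.SequenceMatcher to compute a similarity ratio between the two titles. It compares the strings character by character, so a single-digit amount still contributes to the score, and it tolerates small differences in how the name is written.
Here is an example using SequenceMatcher: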
from difflib import SequenceMatcher

def get_price(myProductTitle):
    # ... (existing code that extracts product_spans)
    best_match = None
    best_similarity = 0
    for span in product_spans:
        span_text = span.get_text().strip()
        # ratio() returns a similarity score between 0 and 1.
        similarity = SequenceMatcher(None, myProductTitle, span_text).ratio()
        if similarity > best_similarity:
            best_similarity = similarity
            best_match = span_text
    if best_match:
        print("Best match found:", best_match)
        # ... (add your price extraction logic here)
    else:
        print("No match found.")
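To try the SequenceMatcher idea without a browser or a WooCommerce connection, here is a minimal standalone sketch that runs the same loop over the sample list from the question. The `best_match` helper name is hypothetical, and plain strings stand in for the scraped spans.

```python
from difflib import SequenceMatcher

# Sample titles copied from the example output in the question.
titles = [
    'Razer Gold MY - 5 MYR', 'Razer Gold MY - 500 MYR',
    'Razer Gold MY - 20 MYR', 'Razer Gold MY - 200 MYR',
    'Razer Gold MY - 7 MYR', 'Razer Gold MY - 100 MYR',
    'Razer Gold MY - 30 MYR', 'Razer Gold MY - 10 MYR',
    'Razer Gold MY - 300 MYR', 'Razer Gold MY - 50 MYR',
    'Razer Gold MY - 40 MYR', 'Razer Gold MY - 3 MYR',
]


def best_match(query, candidates):
    """Hypothetical helper: return (title, ratio) with the highest SequenceMatcher ratio."""
    best_title, best_ratio = None, 0.0
    for candidate in candidates:
        ratio = SequenceMatcher(None, query, candidate).ratio()
        if ratio > best_ratio:
            best_title, best_ratio = candidate, ratio
    return best_title, best_ratio


query = 'Razer Gold PIN Malaysia 7 MYR'
match, score = best_match(query, titles)
print(match, round(score, 2))
```

Because the score is fuzzy, some title is returned even when the requested amount is not in the list, so it may be worth rejecting matches below a threshold or checking the amount separately.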
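These rule-based approaches should give more accurate and more efficient results than TF-IDF and cosine similarity when comparing product titles like these.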
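As a sketch of how the two ideas might be combined, the snippet below first filters the list by exact amount and currency, and only then uses SequenceMatcher to choose among titles that share that value. The helper names `extract_value` and `find_match`, the regular expression, and the assumption that every title ends in "<amount> <currency>" are illustrative and not part of the original code.

```python
import re
from difflib import SequenceMatcher

# Sample titles copied from the example output in the question.
TITLES = [
    'Razer Gold MY - 5 MYR', 'Razer Gold MY - 500 MYR',
    'Razer Gold MY - 20 MYR', 'Razer Gold MY - 200 MYR',
    'Razer Gold MY - 7 MYR', 'Razer Gold MY - 100 MYR',
    'Razer Gold MY - 30 MYR', 'Razer Gold MY - 10 MYR',
    'Razer Gold MY - 300 MYR', 'Razer Gold MY - 50 MYR',
    'Razer Gold MY - 40 MYR', 'Razer Gold MY - 3 MYR',
]

# Assumption: a title ends with "<amount> <currency>", e.g. "7 MYR".
VALUE_RE = re.compile(r'(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*$')


def extract_value(title):
    """Return (amount, currency) parsed from the end of a title, or None."""
    match = VALUE_RE.search(title)
    if match is None:
        return None
    return float(match.group(1)), match.group(2).upper()


def find_match(query, titles):
    """Hypothetical helper: return the title whose value equals the query's value."""
    query_value = extract_value(query)
    if query_value is None:
        return None
    candidates = [t for t in titles if extract_value(t) == query_value]
    if not candidates:
        return None
    # Several titles may share a value; break the tie by character similarity.
    return max(
        candidates,
        key=lambda t: SequenceMatcher(None, query.lower(), t.lower()).ratio(),
    )


print(find_match('Razer Gold PIN Malaysia 7 MYR', TITLES))  # Razer Gold MY - 7 MYR
```

Filtering on the parsed value first means a title with the wrong amount can never win, and a query whose amount is missing from the list returns `None` instead of a near miss.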
Tags: python, scikit-learn, tfidfvectorizer From: 78802387