Compare titles in a list and find similar titles

Tags: python scikit-learn tfidfvectorizer

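I wrote a Python script that takes a product title as input and looks for a similar title in a list of titles scraped from a page. Everything runs fine, but some titles are matched incorrectly. I think the titles that go wrong are the ones containing numbers.

Explanation: the input to get_price(myProductTitle) is a title, for example: Razer Gold PIN Malaysia 7 MYR

The list it is checked against: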

product_spans = soup.find_all('span', class_='variable-item-span')

Sample output:

['Razer Gold MY - 5 MYR', 'Razer Gold MY - 500 MYR', 'Razer Gold MY - 20 MYR', 'Razer Gold MY - 200 MYR', 'Razer Gold MY - 7 MYR', 'Razer Gold MY - 100 MYR', 'Razer Gold MY - 30 MYR', 'Razer Gold MY - 10 MYR', 'Razer Gold MY - 300 MYR', 'Razer Gold MY - 50 MYR', 'Razer Gold MY - 40 MYR', 'Razer Gold MY - 3 MYR']

For example, my title should match Razer Gold MY - 7 MYR from the list above, but a different entry is returned.

My code:

from selenium import webdriver

from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import time
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from woocommerce import API
import re

wcapi = API(
    url="http://example.com",
    consumer_key="",
    consumer_secret="",
    version="wc/v3"
)

response = wcapi.get("products", params={"per_page": 99, "category": 74 })

chrome_options = Options()
chrome_options.add_argument("--headless")

def preprocess(text):
    # Drop everything except letters, digits and whitespace, then lowercase.
    text = re.sub(r'[^A-Za-z0-9\s]', '', text).lower()
    return text


def get_price(myProductTitle):
    driver = webdriver.Chrome(options=chrome_options)
    print(myProductTitle)

    url = "http://example.com/product/razer-gold-myr/"
    driver.get(url)
    time.sleep(2)
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, 'variable-item-span'))
    )
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    # product_spans = soup.find_all('span', class_='variable-item-span')

    # Hard-coded sample of the scraped titles, used here in place of find_all().
    product_spans = [
        'Razer Gold MY - 5 MYR', 'Razer Gold MY - 500 MYR',
        'Razer Gold MY - 20 MYR', 'Razer Gold MY - 200 MYR',
        'Razer Gold MY - 7 MYR', 'Razer Gold MY - 100 MYR',
        'Razer Gold MY - 30 MYR', 'Razer Gold MY - 10 MYR',
        'Razer Gold MY - 300 MYR', 'Razer Gold MY - 50 MYR',
        'Razer Gold MY - 40 MYR', 'Razer Gold MY - 3 MYR',
    ]

    if not product_spans:
        print("No product spans found.")
        driver.quit()
        return

    # The sample list holds plain strings; with the find_all() result,
    # use span.get_text().strip() instead.
    texts = [span.strip() for span in product_spans]
    print("Extracted texts:", texts)
    if not texts:
        print("No texts to compare.")
        driver.quit()
        return

    preprocessed_title = preprocess(myProductTitle)
    preprocessed_texts = [preprocess(text) for text in texts]

    # Row 0 of the matrix is the query title; the remaining rows are the candidates.
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform([preprocessed_title] + preprocessed_texts)
    cosine_similarities = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:])
    print("Cosine similarities:", cosine_similarities)

    if cosine_similarities.max() == 0:
        print("No similar product found.")
        driver.quit()
        return

    most_similar_index = cosine_similarities.argmax()
    most_similar_text = texts[most_similar_index]
    most_similar_span = product_spans[most_similar_index]

    print(most_similar_text)
    print(most_similar_span)
    driver.quit()


if response.status_code == 200:
    products = response.json()
    for product in products:
        get_price(product['name'])
else:
    print('error product not found')


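The problem is that the code relies entirely on TF-IDF and cosine similarity to find the matching title. These techniques work well for free text, but they are a poor fit for this task. With its default settings, TfidfVectorizer only keeps tokens of two or more characters, so a single-digit amount such as the "7" in "7 MYR" is dropped before any weighting happens. After that, "Razer Gold MY - 5 MYR", "Razer Gold MY - 7 MYR" and "Razer Gold MY - 3 MYR" all reduce to the same tokens and get the same similarity score, and argmax() simply returns the first of them. Even when the amount survives tokenization, it is only one term among several shared ones, so it has limited influence on the score.

Instead of TF-IDF and cosine similarity, try a simpler rule-based approach to comparing the titles. For example: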
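To see how a single-digit amount can disappear before any similarity is computed, the sketch below applies two token patterns with plain `re`. It assumes that `(?u)\b\w\w+\b` is the default `token_pattern` documented for scikit-learn's `TfidfVectorizer`; the `tokenize` helper and the constant names are made up for this illustration.

```python
import re

# Assumption: this is the documented default token_pattern of TfidfVectorizer;
# it only selects tokens made of two or more word characters.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"
# Alternative pattern that also keeps single-character tokens such as "7".
KEEP_SINGLE_CHARS = r"(?u)\b\w+\b"


def tokenize(text, pattern):
    """Lowercase the text and return the tokens the pattern selects."""
    return re.findall(pattern, text.lower())


seven = "Razer Gold MY - 7 MYR"
five = "Razer Gold MY - 5 MYR"

# With the default pattern both titles produce the same tokens.
print(tokenize(seven, DEFAULT_TOKEN_PATTERN))
print(tokenize(five, DEFAULT_TOKEN_PATTERN))

# Keeping single characters makes the amount visible again.
print(tokenize(seven, KEEP_SINGLE_CHARS))
print(tokenize(five, KEEP_SINGLE_CHARS))
```

If you want to stay with TF-IDF, passing `token_pattern=r"(?u)\b\w+\b"` to `TfidfVectorizer` should keep the single digits, although the amount would still be just one term among several shared ones.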

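Match on the structured parts of the title: the scraped titles follow a consistent pattern ("product name - value"), so you can extract the name and the value from each title and compare them directly, without any similarity scoring. Note that comparing the name alone is not enough here: every entry on the page shares the name "Razer Gold MY", and the input title ("Razer Gold PIN Malaysia 7 MYR") is worded differently and has no " - " separator. The value (for example "7 MYR") is what tells the entries apart. Here is an example:

import re

def extract_product_info(title):
    """Split a title into (product name, value), e.g. ('Razer Gold MY', '7 MYR')."""
    # Scraped titles look like "Razer Gold MY - 7 MYR".
    if ' - ' in title:
        name, value = title.rsplit(' - ', 1)
        return name.strip(), value.strip()
    # Titles without a separator, e.g. "Razer Gold PIN Malaysia 7 MYR":
    # treat a trailing "<number> <currency>" as the value.
    match = re.search(r'(\d+(?:\.\d+)?\s+[A-Za-z]{3})\s*$', title)
    if match:
        return title[:match.start()].strip(), match.group(1)
    return title.strip(), None

def get_price(myProductTitle):
    # ... (existing code that builds product_spans)

    _, my_product_value = extract_product_info(myProductTitle)

    for span in product_spans:
        span_text = span.get_text().strip()
        _, span_product_value = extract_product_info(span_text)

        # All entries share the same product name, so compare the value.
        if my_product_value is not None and my_product_value == span_product_value:
            print("Match found:", span_text)
            # ... (add your price extraction logic here)
            return

    print("No match found.")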
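The value-based matching can also be tried without Selenium or BeautifulSoup. The following is a minimal sketch that works on plain strings and assumes every title ends in an amount followed by a three-letter currency code; `VALUE_RE`, `extract_value` and `find_by_value` are hypothetical names introduced for this illustration.

```python
import re

# Hypothetical pattern: a trailing "<amount> <currency>" such as "7 MYR".
VALUE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*([A-Za-z]{3})\s*$")


def extract_value(title):
    """Return (amount, currency) taken from the end of a title, or None."""
    match = VALUE_RE.search(title)
    if match is None:
        return None
    return float(match.group(1)), match.group(2).upper()


def find_by_value(my_title, candidates):
    """Return the first candidate with the same amount and currency as my_title."""
    wanted = extract_value(my_title)
    if wanted is None:
        return None
    for candidate in candidates:
        if extract_value(candidate) == wanted:
            return candidate
    return None


# Sample titles taken from the question.
titles = [
    "Razer Gold MY - 5 MYR", "Razer Gold MY - 500 MYR",
    "Razer Gold MY - 20 MYR", "Razer Gold MY - 200 MYR",
    "Razer Gold MY - 7 MYR", "Razer Gold MY - 100 MYR",
    "Razer Gold MY - 30 MYR", "Razer Gold MY - 10 MYR",
    "Razer Gold MY - 300 MYR", "Razer Gold MY - 50 MYR",
    "Razer Gold MY - 40 MYR", "Razer Gold MY - 3 MYR",
]

print(find_by_value("Razer Gold PIN Malaysia 7 MYR", titles))
# prints: Razer Gold MY - 7 MYR
```

Comparing the parsed amount as a number means "7" and "7.0" are treated as equal, while "7" and "70" are not. If a page mixes several products, you would also need to compare the product name.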

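If you need a fuzzier comparison: when the titles are too irregular for exact matching, you can use difflib.SequenceMatcher to compute a similarity ratio between two titles. It compares the strings character by character, so digits are never dropped. Keep in mind, though, that amounts differing by a single character (such as "7 MYR" and "70 MYR") still score almost the same, so treat this as a fallback rather than a replacement for comparing the value.

Here is an example using SequenceMatcher: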

from difflib import SequenceMatcher

def get_price(myProductTitle):
    # ... (existing code that builds product_spans)

    best_match = None
    best_similarity = 0

    for span in product_spans:
        span_text = span.get_text().strip()
        # ratio() returns a float in [0, 1]; 1.0 means the strings are identical.
        similarity = SequenceMatcher(None, myProductTitle, span_text).ratio()

        if similarity > best_similarity:
            best_similarity = similarity
            best_match = span_text

    if best_match:
        print("Best match found:", best_match)
        # ... (add your price extraction logic here)
    else:
        print("No match found.")
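A self-contained variant of the SequenceMatcher approach, operating on plain strings, might look like the sketch below. The `best_match` helper and the `min_ratio` cut-off of 0.6 are assumptions made for this illustration; the cut-off is arbitrary and would need tuning against real titles.

```python
from difflib import SequenceMatcher


def best_match(my_title, candidates, min_ratio=0.6):
    """Return (title, ratio) for the most similar candidate.

    The title is None when even the best ratio is below min_ratio
    (an arbitrary cut-off chosen for this sketch).
    """
    best_title, best_ratio = None, 0.0
    for candidate in candidates:
        # Compare case-insensitively; ratio() is a float in [0, 1].
        ratio = SequenceMatcher(None, my_title.lower(), candidate.lower()).ratio()
        if ratio > best_ratio:
            best_title, best_ratio = candidate, ratio
    if best_ratio < min_ratio:
        return None, best_ratio
    return best_title, best_ratio


# A few of the sample titles from the question.
titles = [
    "Razer Gold MY - 5 MYR", "Razer Gold MY - 500 MYR",
    "Razer Gold MY - 7 MYR", "Razer Gold MY - 100 MYR",
    "Razer Gold MY - 3 MYR",
]

title, ratio = best_match("Razer Gold PIN Malaysia 7 MYR", titles)
print(title, round(ratio, 3))
```

Returning the ratio alongside the title makes it easier to log borderline matches and to reject titles that resemble nothing on the page.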

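For comparing product titles like these, rule-based approaches should give more accurate results than TF-IDF with cosine similarity, and they are cheaper to compute.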

From: 78802387
