How to select a section HTML tag with beautifulsoup4 for web scraping

Tags: python html python-3.x web-scraping beautifulsoup

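This is the link to the website I am trying to scrape data from: https://www.fotmob.com/leagues/47/stats/season/20720/players/goals/premier-league I want to use beautifulsoup4 to select the section with class = 'css-653rx1-StatsContainer eozqs6r5'.

Before you mention find() and find_all(): I have used both, but for some reason it is as if the section tag does not exist. When I try section = soup.find('section', class_='css-653rx1-StatsContainer eozqs6r5'), it returns None. When I try section = soup.find_all('section', class_='css-653rx1-StatsContainer eozqs6r5'), it returns an empty list.

I then traversed the DOM and was able to select every div leading up to the section. Once I try to access the section itself, it returns None.

Here is my code: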

import requests
from bs4 import BeautifulSoup
import pandas as pd

# URL of the webpage you want to scrape
url = 'https://www.fotmob.com/leagues/47/stats/season/20720/players/goals/premier-league'

# Send HTTP request to the URL
response = requests.get(url)

# Parse the HTML content of the page
soup = BeautifulSoup(response.content, 'html.parser')

# Remove <style> tags
for style in soup.find_all('style'):
    style.decompose()

# Remove <script> tags
for script in soup.find_all('script'):
    script.decompose()

outer_main = soup.find('main', class_='css-1cyagd9-PageContainerStyles e19hkjx10')


section = soup.find_all('section', class_='css-653rx1-StatsContainer eozqs6r5')
for sec in section:
    print("T")


#print(soup.prettify())
# Print the HTML of the outer <main> for debugging
if outer_main:
    print("Outer <main> found.")
else:
    print("The outer <main> tag with the specified class was not found.")


# Navigate through the HTML structure to find the target div
main_div = outer_main.find('main') if outer_main else None  # Adjust this if nested <main> exists
if main_div:
    print("Found inner <main>.")
else:
    print("Inner <main> not found.")
   
div1 = main_div.find('div', class_='css-xxmbx0-LeagueSeasonStatsColumn eozqs6r0') if main_div else None
if div1:
    print("Found div1 with class 'css-xxmbx0-LeagueSeasonStatsColumn eozqs6r0'.")
else:
    print("div1 with class 'css-xxmbx0-LeagueSeasonStatsColumn eozqs6r0' not found.")


div1 = div1.find('div', class_='css-1wb2t24-CardCSS e1mlfzv61') if div1 else None
if div1:
    print("Found div1 with class 'css-1wb2t24-CardCSS e1mlfzv61'.")
else:
    print("div1 with class 'css-1wb2t24-CardCSS e1mlfzv61' not found.")


div1 = div1.find('div', class_='css-1yndnk3-LeagueSeasonStatsContainerCSS eozqs6r1') if div1 else None
if div1:
    print("Found div1 with class 'css-1yndnk3-LeagueSeasonStatsContainerCSS eozqs6r1'.")
else:
    print("div1 with class 'css-1yndnk3-LeagueSeasonStatsContainerCSS eozqs6r1' not found.")


div1 = div1.find('section', class_='css-653rx1-StatsContainer eozqs6r5') if div1 else None
if div1:
    print("Found section with class 'css-653rx1-StatsContainer eozqs6r5'.")
else:
    print("section with class 'css-653rx1-StatsContainer eozqs6r5' not found.")


div1 = div1.find('div', class_='css-fvfi51-LeagueSeasonStatsTableCSS e15r3kn20') if div1 else None
if div1:
    print("Found div with class 'css-fvfi51-LeagueSeasonStatsTableCSS e15r3kn20'.")
else:
    print("div with class 'css-fvfi51-LeagueSeasonStatsTableCSS e15r3kn20' not found.")

Result:

Outer <main> found.
Found inner <main>.
Found div1 with class 'css-xxmbx0-LeagueSeasonStatsColumn eozqs6r0'.
Found div1 with class 'css-1wb2t24-CardCSS e1mlfzv61'.
Found div1 with class 'css-1yndnk3-LeagueSeasonStatsContainerCSS eozqs6r1'.
section with class 'css-653rx1-StatsContainer eozqs6r5' not found.
div with class 'css-fvfi51-LeagueSeasonStatsTableCSS e15r3kn20' not found.

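I tried removing the script and style tags because I could not see them in the HTML inside Developer Tools.

Basically, I select each div by its class, but for some reason, when I try to select the section, it does not work. I also tried iterating over every element after the parent div that contains the section, but it just skips the section and moves on to the next HTML element, as if the section did not exist.

I am not even sure what to do at this point. When I run print(soup.prettify()), the section tag does not show up. This is very confusing, because I can clearly see the section tag in Developer Tools. Any help on how to select the section tag would be greatly appreciated!

Also, for what it's worth, I tried Selenium, but it is driving me crazy. There does not seem to be a ChromeDriver for my Chrome version (Version 127.0.6533.100); the latest ChromeDriver is (Version: 127.0.6533.99). At least that is my best explanation.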

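The reason you cannot scrape the contents of that section is that this particular HTML element (the section tag) is added to the page by JavaScript, whereas the requests library only fetches the raw HTML sent by the server. requests does not execute the JavaScript on the page, so the section never appears in response.content.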
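One way to confirm this diagnosis is to count the tag in the HTML that the server actually returns and compare it with what the browser shows. The sketch below uses only the standard library; RAW_HTML and RENDERED_HTML are made-up stand-ins for the two versions of the page, not content copied from the real site.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts opening tags with a given name in an HTML string."""

    def __init__(self, tag_name):
        super().__init__()
        self.tag_name = tag_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this hook.
        if tag == self.tag_name:
            self.count += 1


def count_tags(html, tag_name):
    """Return how many times <tag_name> opens in the given HTML."""
    parser = TagCounter(tag_name)
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical stand-in for what the server sends before any script runs.
RAW_HTML = "<main><div id='stats'></div><script>/* fills the div later */</script></main>"
# Hypothetical stand-in for the DOM after the browser has run the script.
RENDERED_HTML = "<main><div id='stats'><section class='StatsContainer'>...</section></div></main>"

print(count_tags(RAW_HTML, "section"))
print(count_tags(RENDERED_HTML, "section"))
```

For the real page you would pass requests.get(url).text to count_tags. If it reports no section tags while Developer Tools shows one, the element is being created in the browser after the page loads.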

To solve this, you need a tool such as Selenium or Playwright, which drives a real browser that renders the page and executes its JavaScript.
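For the Playwright route, a minimal sketch might look like the following. It assumes Playwright is installed (pip install playwright, then playwright install chromium). The helper names build_selector and fetch_rendered_html are invented for this example. Matching on the 'StatsContainer' fragment rather than the full class string is also an assumption: prefixes and suffixes such as css-653rx1 and eozqs6r5 look auto-generated and may change between site builds.

```python
def build_selector(tag, class_fragment):
    """Build a CSS selector for `tag` whose class attribute contains the fragment."""
    return f"{tag}[class*='{class_fragment}']"


def fetch_rendered_html(url, selector, timeout_ms=10_000):
    """Load `url` in headless Chromium, wait for `selector`, return the rendered HTML."""
    # Imported lazily so build_selector stays usable without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url)
            # Block until the JavaScript-built element exists, or raise on timeout.
            page.wait_for_selector(selector, timeout=timeout_ms)
            return page.content()
        finally:
            browser.close()
```

A call such as fetch_rendered_html(url, build_selector("section", "StatsContainer")) should return HTML that can be passed to BeautifulSoup exactly like response.content in the question, provided the site still uses that class fragment.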

Here is an example of scraping the data with Selenium:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

# URL of the page to scrape
url = "https://www.fotmob.com/leagues/47/stats/season/20720/players/goals/premier-league"

# Start Chrome through Selenium (on older Selenium versions ChromeDriver must be on your PATH)
driver = webdriver.Chrome()

try:
    # Open the page
    driver.get(url)

    # Wait up to 10 seconds for the target section to be present in the DOM
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "section.css-653rx1-StatsContainer.eozqs6r5"))
    )

    # Parse the rendered HTML with BeautifulSoup
    soup = BeautifulSoup(driver.page_source, 'html.parser')

    # Find the target section
    target_section = soup.find('section', class_='css-653rx1-StatsContainer eozqs6r5')

    # Print the contents of the section
    print(target_section)
finally:
    # Close the browser even if the wait above times out
    driver.quit()

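This code will:

Initialize the Selenium WebDriver and open a Chrome browser.
Open the target page.
Wait for the target section element to load. WebDriverWait and expected_conditions are used here so that the next step only runs once JavaScript has added the section to the page.
Get the page source and parse it with BeautifulSoup.
Find the target section and print its contents.
Close the browser.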

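With Selenium 4.6 or later, the bundled Selenium Manager normally finds or downloads a matching ChromeDriver for you. On older Selenium versions, make sure you have downloaded ChromeDriver and added it to your system PATH. You can download ChromeDriver here: https://chromedriver.chromium.org/downloads (for Chrome 115 and later, that page points to the Chrome for Testing downloads).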
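On the version numbers mentioned in the question: as far as I know, ChromeDriver's version-selection rule only requires the driver and the browser to agree on the major, minor and build numbers, not on the final patch number. The sketch below encodes that rule as an assumption. The function names are made up for illustration and are not part of Selenium or ChromeDriver.

```python
def build_prefix(version):
    """Return the (major, minor, build) part of a Chrome-style version string."""
    parts = version.strip().split(".")
    if len(parts) < 3:
        raise ValueError(f"expected at least three dot-separated numbers, got {version!r}")
    return tuple(int(part) for part in parts[:3])


def driver_matches_browser(driver_version, browser_version):
    """True when both versions share major.minor.build (assumed compatibility rule)."""
    return build_prefix(driver_version) == build_prefix(browser_version)


# Versions quoted in the question: they differ only in the last (patch) number.
print(driver_matches_browser("127.0.6533.99", "127.0.6533.100"))
# A driver from a different major release does not match.
print(driver_matches_browser("126.0.6478.126", "127.0.6533.100"))
```

Under that assumption, ChromeDriver 127.0.6533.99 should be usable with Chrome 127.0.6533.100, so the version difference in the question may not be the real obstacle.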

Hopefully this example helps you scrape the data you are after!

From: 78850195
