Trying to apply a bs4 approach to a Wikipedia page: results are not stored in the df

Tags: python pandas web-scraping beautifulsoup

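Scraping Wikipedia is a very common technique, and with the right approach it can handle many different jobs. Even so, I am having trouble getting the results and storing them in a df.

As an example of a very common Wikipedia/bs4 job, take this one: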

This page lists more than 600 results, each with its own subpage: url = "https://de.wikipedia.org/wiki/Liste_der_Genossenschaftsbanken_in_Deutschland"

For a first experimental script I proceed as follows: first I scrape the table from the Wikipedia page, then I convert it into a Pandas DataFrame.

So first I install the necessary packages: make sure requests, beautifulsoup4, and pandas are installed. If they are not, install them with pip:

pip install requests beautifulsoup4 pandas

Then I run the following script, which scrapes the table and builds the DataFrame:

import requests
from bs4 import BeautifulSoup
import pandas as pd

# URL of the Wikipedia page
url = "https://de.wikipedia.org/wiki/Liste_der_Genossenschaftsbanken_in_Deutschland"

# Send a GET request to the URL
response = requests.get(url)

# Parse the HTML content of the page with BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')

# Find the first table in the page
table = soup.find('table', {'class': 'wikitable'})

# Initialize an empty list to store the data
data = []

# Iterate over the rows of the table
for row in table.find_all('tr'):
    # Get the columns in each row
    cols = row.find_all('td')
    # If there are columns in the row, get the text from each column and store it in the data list
    if cols:
        data.append([col.get_text(strip=True) for col in cols])

# Convert the data list to a Pandas DataFrame
df = pd.DataFrame(data, columns=["Bank Name", "Location", "Website"])

# Display the DataFrame
print(df)

# Optionally, save the DataFrame to a CSV file
df.to_csv('genossenschaftsbanken.csv', index=False)

Here is what I get back:


  Bank Name                                           Location  \
0      BWGV  Baden-Württembergischer Genossenschaftsverband...   
1       GVB                 Genossenschaftsverband Bayerne. V.   
2        GV                                   Genoverbande. V.   
3      GVWE              Genossenschaftsverband Weser-Emse. V.   
4       FGV                Freier Genossenschaftsverband e. V.   
5       PDG     PDG Genossenschaftlicher Prüfungsverband e. V.   
6                              Verband der Sparda-Banken e. V.   
7                                 Verband der PSD Banken e. V.   

             Website  
0          Karlsruhe  
1            München  
2  Frankfurt am Main  
3          Oldenburg  
4         Düsseldorf  
5             Erfurt  
6  Frankfurt am Main  
7               Bonn

Well, I guess I have to rework the end of the script...

Update: the goal is to get the data out of the block on a bank's page; see this example:

https://de.wikipedia.org/wiki/Abtsgm%C3%BCnder_Bank

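The code in the question scrapes only the first table on the Wikipedia page, which lists the associations of German cooperative banks, not the individual banks that are wanted.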
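One quick way to confirm this is to look at every table with class wikitable instead of only the first one. The sketch below runs on a small made-up HTML sample, not on the real page, and assumes the page holds more than one such table; the helper name summarize_wikitables and the sample contents are hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up sample standing in for the real page: an associations table
# followed by a bank table. The real markup may differ.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Abk.</th><th>Verband</th><th>Sitz</th></tr>
  <tr><td>BWGV</td><td>Verband A</td><td>Karlsruhe</td></tr>
</table>
<table class="wikitable sortable">
  <tr><th>Bank</th><th>Sitz</th></tr>
  <tr><td><a href="/wiki/Bank_A">Bank A</a></td><td>Ort A</td></tr>
  <tr><td><a href="/wiki/Bank_B">Bank B</a></td><td>Ort B</td></tr>
</table>
"""


def summarize_wikitables(html):
    """Return (header texts, number of data rows) for each wikitable."""
    soup = BeautifulSoup(html, "html.parser")
    summary = []
    for table in soup.find_all("table", class_="wikitable"):
        rows = table.find_all("tr")
        # Header names come from the <th> cells of the first row, if any
        headers = [th.get_text(strip=True) for th in rows[0].find_all("th")] if rows else []
        # Data rows are the rows that contain at least one <td> cell
        data_rows = [row for row in rows if row.find_all("td")]
        summary.append((headers, len(data_rows)))
    return summary


for headers, n_rows in summarize_wikitables(SAMPLE_HTML):
    print(headers, n_rows)
```

Run against the real page (pass response.content instead of the sample), this shows which table carries the bank rows and what its header cells are called, so the column names do not have to be guessed.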

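To extract data for every cooperative bank from that link, the code has to be changed so that it visits each bank's subpage and extracts the information there.

Here is how to do that: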

import requests
from bs4 import BeautifulSoup
import pandas as pd

# URL of the main Wikipedia page
url = "https://de.wikipedia.org/wiki/Liste_der_Genossenschaftsbanken_in_Deutschland"

# Send a GET request to the URL
response = requests.get(url)

# Parse the HTML content of the page with BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')

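# Find links to article subpages (this still includes non-bank articles linked from the page)
bank_links = []
for link in soup.find_all('a'):
    href = link.get('href')
    # Keep only internal article links; skip namespace pages such as "Datei:" or "Spezial:"
    if href and href.startswith("/wiki/") and ":" not in href:
        full_url = "https://de.wikipedia.org" + href
        # Avoid requesting the same page twice
        if full_url not in bank_links:
            bank_links.append(full_url)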

# Initialize an empty list to store the data
data = []

# Iterate over the bank links and extract data from each subpage
for link in bank_links:
    response = requests.get(link)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Extract bank name, location, and website
    bank_name = soup.find('h1', id='firstHeading').text.strip()

    infobox = soup.find('table', class_='infobox')
    location = ""
    website = ""

    if infobox:
        for row in infobox.find_all('tr'):
            cells = row.find_all('td')
            if len(cells) == 2:
                key = cells[0].text.strip()
                value = cells[1].text.strip()
                if key == "Sitz":
                    location = value
                elif key == "Website":
                    website = value

    data.append([bank_name, location, website])

# Convert the data list to a Pandas DataFrame
df = pd.DataFrame(data, columns=["Bank Name", "Location", "Website"])

# Display the DataFrame
print(df)

# Optionally, save the DataFrame to a CSV file
df.to_csv('genossenschaftsbanken.csv', index=False)

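This code will:

Extract all links to subpages from the Wikipedia page.
Visit each subpage and extract the bank name, location, and website (where available).
Store the extracted data in a list, then convert it to a Pandas DataFrame.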
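The link-collection step can be narrowed so that only links inside the tables are followed, rather than every link on the page. This is a minimal sketch under the assumption that each bank's article link sits in the first cell of its table row; the helper name collect_first_column_links and the sample HTML are made up for illustration, and the real table layout may differ.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://de.wikipedia.org"

# Made-up sample: one navigation link outside the table, two bank rows
# with article links, and one row whose link is a red link (no article).
SAMPLE_HTML = """
<a href="/wiki/Wikipedia:Hauptseite">Hauptseite</a>
<table class="wikitable">
  <tr><th>Bank</th><th>Sitz</th></tr>
  <tr><td><a href="/wiki/Bank_A">Bank A</a></td><td><a href="/wiki/Ort_A">Ort A</a></td></tr>
  <tr><td><a href="/wiki/Bank_B">Bank B</a></td><td>Ort B</td></tr>
  <tr><td><a href="/w/index.php?title=Bank_C&amp;action=edit&amp;redlink=1">Bank C</a></td><td>Ort C</td></tr>
</table>
"""


def collect_first_column_links(html, base_url=BASE_URL):
    """Collect article links from the first cell of each wikitable row."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for table in soup.find_all("table", class_="wikitable"):
        for row in table.find_all("tr"):
            first_cell = row.find("td")
            if first_cell is None:
                continue  # header rows contain only <th> cells
            anchor = first_cell.find("a", href=True)
            if anchor is None:
                continue  # no link in the first cell
            href = anchor["href"]
            # Keep internal article links; skip red links and namespace pages
            if href.startswith("/wiki/") and ":" not in href:
                full_url = urljoin(base_url, href)
                if full_url not in links:
                    links.append(full_url)
    return links


print(collect_first_column_links(SAMPLE_HTML))
```

Restricting the search to table cells keeps sidebar, footer, and location links out of the crawl, which matters when the loop issues one request per collected link.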

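This should help you get the data you want from the Wikipedia page. Note that the code depends on the page structure; if the structure of the Wikipedia pages changes, the code may need adjusting.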
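Because the result depends on page structure, it can help to parse the infobox into a dictionary and look fields up with a default, so a missing infobox or a missing row yields an empty string instead of an error. The sketch below is an assumption-laden illustration: the helper name parse_infobox and the sample markup are hypothetical, and whether label cells on the real pages are th or td elements has not been checked, so both are accepted.

```python
from bs4 import BeautifulSoup

# Made-up sample of a bank subpage; the real infobox markup may differ.
SAMPLE_HTML = """
<h1 id="firstHeading">Beispielbank</h1>
<table class="infobox">
  <tr><th colspan="2">Beispielbank eG</th></tr>
  <tr><td>Sitz</td><td>Musterstadt</td></tr>
  <tr><th>Website</th><td><a href="https://www.example.com">www.example.com</a></td></tr>
</table>
"""


def parse_infobox(html):
    """Return the two-cell infobox rows as a {label: value} dict."""
    soup = BeautifulSoup(html, "html.parser")
    infobox = soup.find("table", class_="infobox")
    fields = {}
    if infobox is None:
        return fields  # page has no infobox
    for row in infobox.find_all("tr"):
        # Accept both <th> and <td> cells, since label cells may use either
        cells = row.find_all(["th", "td"], recursive=False)
        if len(cells) == 2:
            label = cells[0].get_text(" ", strip=True)
            value = cells[1].get_text(" ", strip=True)
            fields[label] = value
    return fields


fields = parse_infobox(SAMPLE_HTML)
# Missing keys fall back to an empty string instead of raising
record = [fields.get("Sitz", ""), fields.get("Website", "")]
print(record)
```

The same dictionary can feed further columns later (for example other infobox labels) without changing the parsing loop.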

From: 78788573
