bs4 approach for Wikipedia pages: getting the infobox

Tags: python pandas beautifulsoup wikipedia

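I am currently trying to apply a bs4 approach to Wikipedia pages, but the results are not being stored in a DataFrame (df). Scraping Wikipedia is a very common technique, and the right approach can serve many different tasks, yet I am having trouble fetching the results and storing them in a df. As an example of a typical Wikipedia/bs4 task, take this page: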

This page links to more than 600 results on subpages: url = "https://de.wikipedia.org/wiki/Liste_der_St%C3%A4dte_in_Deutschland#Liste_der_St%C3%A4dte_in_Deutschland"

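As a first experiment I followed a script like this: first scrape the table from the Wikipedia page, then convert it to a pandas DataFrame. I started by installing the necessary packages; make sure requests, beautifulsoup4 and pandas are installed. If they are not installed yet, you can install them with pip: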

pip install requests beautifulsoup4 pandas

Then I proceed like this: read the table from the Wikipedia list page, turn its links into absolute URLs, and call pd.read_html on each linked city page.

import pandas as pd

# URL of the Wikipedia page
url = "https://de.wikipedia.org/wiki/Liste_der_St%C3%A4dte_in_Deutschland#Liste_der_St%C3%A4dte_in_Deutschland"
table = pd.read_html(url, extract_links='all')[1]
base_url = 'https://de.wikipedia.org'
table = table.apply(lambda col: [v[0] if v[1] is None else f'{base_url}{v[1]}' for v in  col])


links = list(table.iloc[:,0])

for link in links:
    print('\n',link)
    try:
        df = pd.read_html(link)[0]
        print(df)
    except Exception as e:
        print(e)

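Look at what I get: only two records instead of several hundred. By the way, I think the best approach would be to collect everything into a df and/or store it.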

Document is empty

 https://de.wikipedia.org/wiki/Aach_(Hegau)
                                       Wappen  \
0                                         NaN   
1                                         NaN   
2                                  Basisdaten   
3                                Koordinaten:   
4                                 Bundesland:   
5                           Regierungsbezirk:   
6                                  Landkreis:   
7                                       Höhe:   
8                                     Fläche:   
9                                  Einwohner:   
10                        Bevölkerungsdichte:   
11                              Postleitzahl:   
12                                   Vorwahl:   
13                           Kfz-Kennzeichen:   
14                         Gemeindeschlüssel:   
15                                    LOCODE:   
16              Adresse der  Stadtverwaltung:   
17                                   Website:   
18                             Bürgermeister:   
19  Lage der Stadt Aach im Landkreis Konstanz   
20                                      Karte   

                                     Deutschlandkarte  
0                                                 NaN  
1                                                 NaN  
2                                          Basisdaten  
3   47° 51′ N, 8° 51′ OKoordinaten: 47° 51′ N, 8° ...  
4                                   Baden-Württemberg  
5                                            Freiburg  
6                                            Konstanz  
7                                        545 m ü. NHN  
8                                           10,68 km2  
9                             2384 (31. Dez. 2022)[1]  
10                               223 Einwohner je km2  
11                                              78267  
12                                              07774  
13                                            KN, STO  
14                                        08 3 35 001  
15                                             DE AAC  
16                         Hauptstraße 16  78267 Aach  
17                                        www.aach.de  
18                                     Manfred Ossola  
19          Lage der Stadt Aach im Landkreis Konstanz  
20                                              Karte

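Note: there are several hundred records on that page.

Looking at the infobox: I want to fetch the infobox data.

Update: the goal is to fetch the complete results, store them in a df, and include all the data in the infobox (see above), including contact information and so on.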
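One way to turn a two-column infobox table like the one printed above into a single DataFrame row might look like the following sketch. The helper name infobox_to_record and the hand-built sample table are assumptions made for illustration; the sample stands in for pd.read_html(city_url)[0] so that the snippet runs without network access.

```python
import pandas as pd


def infobox_to_record(table: pd.DataFrame) -> dict:
    """Turn a two-column label/value infobox table into one flat dict."""
    record = {}
    for label, value in zip(table.iloc[:, 0], table.iloc[:, 1]):
        # Skip empty rows and section headers where label and value are identical.
        if pd.isna(label) or pd.isna(value) or label == value:
            continue
        record[str(label).rstrip(":").strip()] = str(value).strip()
    return record


# Hand-built stand-in for pd.read_html(city_url)[0]; values copied from the output above.
sample = pd.DataFrame(
    {
        "Wappen": [None, "Basisdaten", "Bundesland:", "Landkreis:", "Postleitzahl:"],
        "Deutschlandkarte": [None, "Basisdaten", "Baden-Württemberg", "Konstanz", "78267"],
    }
)

# One dict per city; pandas fills fields missing on some pages with NaN.
records = [{"City": "Aach (Hegau)", **infobox_to_record(sample)}]
df = pd.DataFrame(records)
print(df.to_string())
```

Appending one such dict per city to a list and building the DataFrame once at the end keeps every infobox field as its own column.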

update2:

Overview page: https://de.wikipedia.org/wiki/Liste_der_St%C3%A4dte_in_Deutschland#Liste_der_St%C3%A4dte_in_Deutschland

It leads to roughly 1000 subpages, such as:

Aach (Hegau): https://de.wikipedia.org/wiki/Aach_(Hegau)
Aachen: https://de.wikipedia.org/wiki/Aachen
Aalen: https://de.wikipedia.org/wiki/Aalen
Bacharach: https://de.wikipedia.org/wiki/Bacharach

Each subpage has a so-called "infobox"; this is the one for Babenhausen (Hessen), https://de.wikipedia.org/wiki/Babenhausen_(Hessen):

+------------------------------+------------------------------------------+
| Koordinaten:                 | 49° 58′ N, 8° 57′ O                      |
| Bundesland:                  | Hessen                                   |
| Regierungsbezirk:            | Darmstadt                                |
| Landkreis:                   | Darmstadt-Dieburg                        |
| Höhe:                        | 124 m ü. NHN                             |
| Fläche:                      | 66,85 km2                                |
| Einwohner:                   | 17.579 (31. Dez. 2023)[1]                |
| Bevölkerungsdichte:          | 263 Einwohner je km2                     |
| Postleitzahl:                | 64832                                    |
| Vorwahl:                     | 06073                                    |
| Kfz-Kennzeichen:             | DA, DI                                   |
| Gemeindeschlüssel:           | 06 4 32 002                              |
| Stadtgliederung:             | 6 Stadtteile                             |
| Adresse der Stadtverwaltung: | Rathaus, Marktplatz 2, 64832 Babenhausen |
| Website:                     | www.babenhausen.de                       |
| Bürgermeister:               | Dominik Stadler (parteilos)              |
+------------------------------+------------------------------------------+

https://de.wikipedia.org/wiki/Backnang

update3:

If I run this code to fetch 300 records, it works well; if I run it to fetch 2400, it fails:

import requests
from bs4 import BeautifulSoup
import pandas as pd


def get_info(city_url: str) -> dict:
    info_data = {}
    response = requests.get(city_url)
    soup = BeautifulSoup(response.text, 'lxml')
    for x in soup.find('tbody').find_all(
            lambda tag: tag.name == 'tr' and tag.get('class') == ['hintergrundfarbe-basis']):
        if not x.get('style'):
            if 'Koordinaten' in x.get_text():
                info_data['Koordinaten'] = x.findNext('span', class_='coordinates').get_text()
            else:
                info_data[x.get_text(strip=True).split(':')[0]] = x.get_text(strip=True).split(':')[-1]
                info_data['Web site'] = soup.find('a', {'title':'Website'}).findNext('a').get('href')
    return info_data


cities = []
response = requests.get('https://de.wikipedia.org/wiki/Liste_der_St%C3%A4dte_in_Deutschland#Liste_der_St%C3%A4dte_in_Deutschland')
soup = BeautifulSoup(response.text, 'lxml')
for city in soup.find_all('dd'):  # append a slice such as [:2500] to limit the number of cities
    city_url = 'https://de.wikipedia.org' + city.findNext('a').get('href')
    result = {'City': city.get_text(), 'URL': 'https://de.wikipedia.org' + city.findNext('a').get('href')}
    result |= get_info(city_url)
    cities.append(result)
df = pd.DataFrame(cities)
print(df.to_string())


------------------------------------

AttributeError                            Traceback (most recent call last)

<ipython-input-28-4391c852fd75> in <cell line: 24>()
     25     city_url = 'https://de.wikipedia.org' + city.findNext('a').get('href')
     26     result = {'City': city.get_text(), 'URL': 'https://de.wikipedia.org' + city.findNext('a').get('href')}
---> 27     result |= get_info(city_url)
     28     cities.append(result)
     29 df = pd.DataFrame(cities)

<ipython-input-28-4391c852fd75> in get_info(city_url)
     15             else:
     16                 info_data[x.get_text(strip=True).split(':')[0]] = x.get_text(strip=True).split(':')[-1]
---> 17                 info_data['Web site'] = soup.find('a', {'title':'Website'}).findNext('a').get('href')
     18     return info_data
     19 

AttributeError: 'NoneType' object has no attribute 'findNext'

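The error occurs because on some pages the code cannot find an <a> tag with the attribute title="Website". In that case soup.find('a', {'title':'Website'}) returns None, and None has no findNext method.

To fix this, add a check that the element was actually found before calling findNext on it: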

import requests
from bs4 import BeautifulSoup
import pandas as pd


def get_info(city_url: str) -> dict:
    info_data = {}
    response = requests.get(city_url)
    soup = BeautifulSoup(response.text, 'lxml')
    for x in soup.find('tbody').find_all(
            lambda tag: tag.name == 'tr' and tag.get('class') == ['hintergrundfarbe-basis']):
        if not x.get('style'):
            if 'Koordinaten' in x.get_text():
                info_data['Koordinaten'] = x.findNext('span', class_='coordinates').get_text()
            else:
                info_data[x.get_text(strip=True).split(':')[0]] = x.get_text(strip=True).split(':')[-1]
                # Check whether the "Website" link was found
                website_link = soup.find('a', {'title':'Website'})
                if website_link:
                    info_data['Web site'] = website_link.findNext('a').get('href')
    return info_data


cities = []
response = requests.get('https://de.wikipedia.org/wiki/Liste_der_St%C3%A4dte_in_Deutschland#Liste_der_St%C3%A4dte_in_Deutschland')
soup = BeautifulSoup(response.text, 'lxml')
for city in soup.find_all('dd'):
    city_url = 'https://de.wikipedia.org' + city.findNext('a').get('href')
    result = {'City': city.get_text(), 'URL': city_url}
    result |= get_info(city_url)
    cities.append(result)
df = pd.DataFrame(cities)
print(df.to_string())

This ensures the href is only extracted when the Website link was found, which prevents the AttributeError.
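The same guard can be isolated in a small helper and tried on inline HTML, without fetching any page. The sketch below is only an illustration: the helper name website_href and the two HTML snippets are assumptions, and real infobox markup may be structured differently. It uses find_next, the current spelling of findNext, and also guards the second lookup, which can return None as well.

```python
from bs4 import BeautifulSoup


def website_href(soup: BeautifulSoup):
    """Return the href following the 'Website' label link, or None if absent."""
    label = soup.find("a", {"title": "Website"})
    if label is None:
        return None  # the page has no Website label
    target = label.find_next("a")
    if target is None:
        return None  # the label exists but no link follows it
    return target.get("href")


# Assumed, simplified markup: one snippet with a Website row, one without.
with_site = BeautifulSoup(
    '<table><tr><td><a title="Website">Website</a>:</td>'
    '<td><a href="http://www.aach.de">www.aach.de</a></td></tr></table>',
    "html.parser",
)
without_site = BeautifulSoup(
    "<table><tr><td>Einwohner:</td><td>2384</td></tr></table>",
    "html.parser",
)

print(website_href(with_site))
print(website_href(without_site))
```

Returning None instead of raising lets the caller store a missing value for that city and continue with the next one.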

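In addition, when processing a large number of pages, consider the following:

Use multithreading or asynchronous requests, for example with concurrent.futures or asyncio, to fetch the data faster.
Add a short delay between requests to avoid sending too many requests to the Wikipedia servers.
Implement error handling, such as retrying failed requests or logging errors for later analysis.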
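The three suggestions above can be combined in a rough sketch like the following, using only the standard library. The names fetch_with_retry, fetch_all and fake_fetch are hypothetical; fake_fetch stands in for get_info so that the example runs offline, and the retry count, delay and worker count are arbitrary assumptions rather than recommended values.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Call fetch(url); on an exception wait `delay` seconds and retry.

    Returns (url, result, None) on success or (url, None, last_error) on failure.
    """
    last_error = None
    for attempt in range(retries):
        try:
            return url, fetch(url), None
        except Exception as exc:  # record the failure instead of aborting the whole run
            last_error = exc
            if attempt < retries - 1:
                time.sleep(delay)
    return url, None, last_error


def fetch_all(fetch, urls, max_workers=4, retries=3, delay=1.0):
    """Fetch all URLs with a small thread pool, keeping the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda u: fetch_with_retry(fetch, u, retries, delay), urls))


# Stand-in for get_info: fails for one URL so the error path is visible.
def fake_fetch(url):
    if url.endswith("/broken"):
        raise ValueError("no infobox found")
    return {"URL": url}


results = fetch_all(fake_fetch, ["https://example.org/a", "https://example.org/broken"], delay=0.0)
ok = [res for _, res, err in results if err is None]
failed = [(url, str(err)) for url, _, err in results if err is not None]
print(ok)
print(failed)
```

In real use, get_info would be passed as fetch, with a small max_workers and a non-zero delay so the Wikipedia servers are not flooded; the failed list can be logged or retried later.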

From: 78804323
