Problems encountered with Python web scrapers

Tags: span, encounter, python, text, title, scraper, find, strip, class

Python scraping: a common mistake when extracting text

If you would rather skip the code, go straight to the bold text after it.

1. First example

import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

movies_list = []

def climb_movies(page):
    url = f'https://movie.douban.com/top250?start={page*25}&filter='
    response = requests.get(url, headers=headers, timeout=10)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        print(f"Fetched page {page+1}")
        # Each movie sits in its own <div class="info">
        movies_all = soup.find_all('div', class_="info")
        # Loop over the entries and read each field from its HTML tag
        # (the loop variable `all` shadows the built-in; the name is kept because the discussion below uses it)
        for all in movies_all:
            name = all.find('span', class_="title").text.strip() if all.find('span', class_="title") else 'unknown'
            rating = all.find('span', class_="rating_num").text.strip() if all.find('span', class_="rating_num") else 'unknown'
            jingdian = all.find('span', class_="inq").text.strip() if all.find('span', class_="inq") else 'unknown'
            # The last <span> in <div class="star"> ends with '人评价' ("people rated");
            # guard both the <div> and its spans before indexing
            star = all.find('div', class_="star")
            spans = star.find_all('span') if star else []
            comment_number = spans[-1].text.strip().replace('人评价', '') if spans else 'unknown'
            print(f"Title: {name}, rating: {rating}, number of ratings: {comment_number}")
            movies_list.append({'Title': name, 'Rating': rating, 'Description': jingdian, 'Number of ratings': comment_number})
        print(f"Finished scraping page {page+1}")
    else:
        print(f"Could not fetch page {page+1}, status code: {response.status_code}")

# Call the function once per page to collect the movie info
for page in range(10):
    climb_movies(page)
    time.sleep(1)

df = pd.DataFrame(movies_list)
df.to_excel('movie.xlsx', index=False, engine='openpyxl')
print("Data saved to movie.xlsx")

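Example:

name=all.find('span',class_="title").text.strip() if all.find('span',class_="title") else 'unknown'

Question: why does the if come after the statement that fetches the tag element (called the "target statement" from here on)?

What they have in common: every target statement ends in .text.strip(), which reads the tag's text content and strips surrounding whitespace.

The line above means: if all.find('span', class_="title") finds a result (that is, it is not None), evaluate all.find('span', class_="title").text.strip() to get the title with leading and trailing whitespace removed; otherwise, if all.find('span', class_="title") returns None (the element was not found), assign 'unknown'.

This form avoids the error raised when .text.strip() is called after all.find('span', class_="title") has returned None. Note: the error (an AttributeError) comes from the tag being missing, not from its text being empty; a tag that exists but has no text simply yields an empty string.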
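To check which case actually raises, here is a minimal sketch that needs no network access. FakeTag is a hypothetical stand-in for the tag object the parser returns (only its .text attribute is modeled), and text_or_default is an illustrative helper name, not part of any library. A missing tag is modeled as None, which is how find() reports "not found".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeTag:
    """Hypothetical stand-in for a parsed tag; only the .text attribute is modeled."""
    text: str


def text_or_default(tag: Optional[FakeTag], default: str = "unknown") -> str:
    # Guard first: a missing tag is None, and None has no .text attribute.
    return tag.text.strip() if tag is not None else default


found = FakeTag("  Some Title  ")
empty = FakeTag("")
missing = None

print(repr(text_or_default(found)))    # 'Some Title'
print(repr(text_or_default(empty)))    # '' (empty text is not an error)
print(repr(text_or_default(missing)))  # 'unknown'

# Without the guard, only the missing case fails:
try:
    missing.text.strip()
except AttributeError as exc:
    print("AttributeError:", exc)
```

Only the None case raises; the tag with empty text just produces an empty string.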

2. A second example

import random
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

book_data = []

def scrape_books(page):
    url = f"https://category.dangdang.com/pg{page}-cp01.01.00.00.00.00.html"
    response = requests.get(url, headers=headers, timeout=10)
    if response.status_code == 200:
        print(f"Fetched page {page}")
        soup = BeautifulSoup(response.text, 'html.parser')
        books = soup.find_all('li', class_="line1")
        for book in books:
            # Each else branch assigns a fallback so the variable is always set
            if book.find('a', class_="pic"):
                # The title lives in the link's title attribute, which may itself be absent
                title = (book.find('a', class_="pic").get('title') or '').strip()
            else:
                title = 'unknown'
            if book.find('p', class_="search_book_author"):
                author = book.find('p', class_="search_book_author").text.strip()
            else:
                author = 'unknown'
            if book.find('span', class_="search_pre_price"):
                price = book.find('span', class_="search_pre_price").text.strip()
            else:
                price = 'unknown'
            if book.find('a', class_="search_comment_num"):
                comments = book.find('a', class_="search_comment_num").text.strip()
            else:
                comments = "0 reviews"
            print(f"Title: {title}, author: {author}, price: {price}, reviews: {comments}")
            book_data.append({'Title': title, 'Author': author, 'Price': price, 'Reviews': comments})
    else:
        print(f"Failed to fetch page {page}, status code: {response.status_code}")

for page in range(1, 3):
    scrape_books(page)
    # Random pause between requests
    time.sleep(random.uniform(1, 3))

df = pd.DataFrame(book_data)
df.to_excel('dangdang_books.xlsx', index=False, engine='openpyxl')
print("Data saved to dangdang_books.xlsx")

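Summary: on this page every tag that the target statements look for happens to be present, so no lookup comes back empty.

if book.find('span',class_="search_pre_price"):
    price=book.find('span',class_="search_pre_price").text.strip()
else:
    price='unknown'

Every if condition is true, so the output is normal. The else branch has to assign the fallback value rather than only print it; otherwise price would be undefined (or left over from the previous book) whenever the tag is missing.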
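The full if/else statement form can be tried without a network request. In this sketch, plain dictionaries are a made-up substitute for parsed book entries, and collect_prices is an illustrative name. Each branch assigns the variable, so a missing field never leaves it undefined or carrying the previous row's value.

```python
def collect_prices(records):
    """Return one price string per record, using 'unknown' when the field is missing."""
    prices = []
    for record in records:
        raw = record.get("price")  # None when the key is absent, much like find()
        if raw is not None:
            price = raw.strip()
        else:
            price = "unknown"      # assign the fallback; printing alone would not set it
        prices.append(price)
    return prices


rows = [{"price": " 25.00 "}, {}, {"price": "30.00"}]  # made-up sample rows
print(collect_prices(rows))  # ['25.00', 'unknown', '30.00']
```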

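Summary: in Python, the form with the if written at the end is a conditional expression, often called a ternary expression. It is a compact way to make a conditional choice on a single line. Its format is:

<expression1> if <condition> else <expression2>

Meaning: if <condition> is true, the whole expression evaluates to <expression1>; otherwise it evaluates to <expression2>.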
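As a small illustration separate from the scraper, the sketch below writes the same choice both as a conditional expression and as an ordinary if/else statement. The function names are made up for the example. It also shows that only the selected branch is evaluated, which is what makes the None guard safe.

```python
def describe(n: int) -> str:
    # Conditional expression: <value if true> if <condition> else <value if false>
    return "even" if n % 2 == 0 else "odd"


def describe_long(n: int) -> str:
    # The same logic as an ordinary if/else statement
    if n % 2 == 0:
        result = "even"
    else:
        result = "odd"
    return result


print(describe(4), describe(7))            # even odd
print(describe_long(4), describe_long(7))  # even odd

# Only the selected branch runs, so len(None) is never attempted here
value = None
length = len(value) if value is not None else 0
print(length)  # 0
```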

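In my code:

name = all.find('span', class_="title").text.strip() if all.find('span', class_="title") else "unknown"

this means:

If all.find('span', class_="title") finds a result (that is, it is not None), evaluate all.find('span', class_="title").text.strip() to get the title with leading and trailing whitespace removed; otherwise assign "unknown".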
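One drawback of this one-liner is that the lookup runs twice, once for the test and once for the value. A possible variation, assuming Python 3.8 or later, binds the result inside the condition with an assignment expression. FakeTag and find_title below are hypothetical stand-ins for the parsed tag and the find() call, so the sketch runs on its own.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeTag:
    """Hypothetical stand-in for a parsed tag; only .text is modeled."""
    text: str


def find_title(entry: dict) -> Optional[FakeTag]:
    """Hypothetical lookup that returns None when there is no title, like find()."""
    raw = entry.get("title")
    return FakeTag(raw) if raw is not None else None


def title_of(entry: dict) -> str:
    # The condition is evaluated first and binds `tag`, so the lookup happens once.
    return tag.text.strip() if (tag := find_title(entry)) is not None else "unknown"


print(title_of({"title": "  Some Title "}))  # Some Title
print(title_of({}))                          # unknown
```

Storing the lookup in an ordinary variable on the line before works just as well and reads more plainly; the assignment expression only keeps it on one line.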

From: https://blog.csdn.net/sxdjdjdb/article/details/143216583
