爬取豆瓣top250影片资料(待修改)
使用BeautifulSoup方法进行操作,用CSS选择器截取html文本内容,对网页进行解析,示例如下。
# 标签: review, list, 爬虫, nth, lst, div, type, 选择器, CSS  From: https://www.cnblogs.com/smith-count/p/17474109.html
import requests
from bs4 import BeautifulSoup
#避免反复获取出现爬取失败
#头请求用于防止访问拒绝,亦可加cookies
def page_request(url, headers, timeout=10):
    """Download *url* and return its HTML as text, with ``<br>`` tags removed.

    headers: dict of request headers (User-Agent / cookies) used to avoid
        the site rejecting the request, as noted in the surrounding post.
    timeout: seconds to wait for the server.  BUG FIX: ``requests.get`` has
        no default timeout, so the original could hang forever on a stalled
        connection; the default keeps callers working unchanged.
    """
    htmltxt = requests.get(url, headers=headers, timeout=timeout).content.decode('utf-8')
    # <br> is an unpaired tag; per the post's note it is stripped by plain
    # string replacement so later get_text() output stays on one line.
    return htmltxt.replace('<br>', "")
def get_Info(htmltxt, ua):
    """Parse one Top250 list page.

    Returns a two-element list:
      [0] one "name|director_actor|score|review_count" string per film,
      [1] the review snippets scraped from each film's detail sub-page
          (via get_comments).
    ua: request headers forwarded to the sub-page fetches.
    """
    soup = BeautifulSoup(htmltxt, 'lxml')
    # Common CSS prefix down to each film's div.info node; the per-field
    # suffixes below are byte-identical to the original selectors.
    base = ('body>div:nth-of-type(3)>div:nth-of-type(1)>div:nth-of-type(1)'
            '>div.article>ol>li>div.item>div.info>')
    name = soup.select(base + 'div.hd>a')
    dir_and_actor = soup.select(base + 'div.bd>p:nth-of-type(1)')
    review_point = soup.select(base + 'div.bd>div.star>span:nth-of-type(2)')
    review_num = soup.select(base + 'div.bd>div.star>span:nth-of-type(4)')
    # BUG FIX: the original re-ran the name selector and then hard-coded
    # range(0, 10), silently dropping 15 of the 25 detail URLs on each page
    # (and raising IndexError on a page with fewer than 10 entries).
    # Collect the href of every matched anchor instead.
    get_review_web = [a.get('href') for a in name]
    # Clean each column's text.
    name = clear_info_name(name)
    dir_and_actor = clear_info_dir(dir_and_actor)
    review_num = clear_info_name(review_num)
    review_point = clear_info_name(review_point)
    # zip stops at the shortest column, so a partially-parsed page can no
    # longer raise IndexError the way the original index loop could.
    list_result = [
        nm + "|" + da + "|" + rp + "|" + rn
        for nm, da, rp, rn in zip(name, dir_and_actor, review_point, review_num)
    ]
    return [list_result, get_comments(get_review_web, ua)]
def clear_info_name(lst):
    """Clean a list of bs4 tags: keep each tag's text up to the first '/',
    with all whitespace removed.

    Named for the film-title column, but also reused by the caller for the
    score and review-count columns (whose text contains no '/').
    Returns a list of cleaned strings.
    """
    cleaned = []  # renamed: the original shadowed the builtin ``list``
    for tag in lst:
        text = tag.get_text().split('/')[0]
        # "".join(text.split()) deletes every whitespace character,
        # which also makes the original trailing .strip() redundant.
        cleaned.append("".join(text.split()))
    return cleaned
def clear_info_dir(lst):
    """Clean the director/actor tags: return each tag's text with every '/'
    separator and all whitespace deleted (one compact string per tag).
    """
    # FIXES vs. original: shadowed the builtin ``list``; reused loop index
    # ``i`` for both the outer and inner loop; built the string with a
    # quadratic ``+=`` loop.  Splitting on '/' and concatenating the pieces
    # is exactly deleting the '/' characters, and "".join(x.split()) then
    # deletes all whitespace — same output, one pass.
    cleaned = []
    for tag in lst:
        cleaned.append("".join(tag.get_text().replace('/', '').split()))
    return cleaned
def get_comments(lst, ua):
    """Visit each film detail URL in *lst* (its "reviews" sub-page) and
    collect the first line of each matched review node.

    ua: request headers forwarded to page_request.
    Returns a flat list of review snippet strings.  (Per the post's note,
    the snippets are truncated — still marked 待修改.)
    """
    output = []
    # BUG FIX: the original used ``i`` as the index of BOTH the outer and
    # the inner loop, shadowing the outer counter — harmless here only by
    # accident of Python's for-loop semantics; distinct names make the
    # intent explicit.
    for link in lst:
        htmltxt = page_request(link + "reviews", ua)
        soup = BeautifulSoup(htmltxt, 'lxml')
        result = soup.select('body>div:nth-of-type(3)>div:nth-of-type(1)>div:nth-of-type(1)>div:nth-of-type(1)>div:nth-of-type(1)>div>div:nth-of-type(1)>div:nth-of-type(1)>div:nth-of-type(1)>div:nth-of-type(1)')
        for node in result:
            record = node.get_text().strip().replace('\n', '|')
            # keep only the text before the first original newline
            output.append(record.split('|')[0])
    return output
#写入到txt
def save_part(lst, path='D:\\info.txt'):
    """Write one film-info record per line to *path* (UTF-8).

    path: output file; the default preserves the original hard-coded
        location, so existing callers are unaffected.
    The ``with`` block guarantees the file is closed even if a write
    fails (the original leaked the handle on error).
    """
    with open(path, 'w', encoding='utf-8') as fp:
        for record in lst:
            fp.write(record + '\n')
def save_review(lst, path='D:\\reviews.txt'):
    """Write one review snippet per line to *path* (UTF-8).

    path: output file; the default preserves the original hard-coded
        location, so existing callers are unaffected.
    Uses ``with`` so the handle is closed even on a write failure
    (the original leaked it on error).
    """
    with open(path, 'w', encoding='utf-8') as fp:
        for review in lst:
            fp.write(review + '\n')
if __name__ == '__main__':
    # Request headers — the post notes none are supplied; fill in a
    # User-Agent / Cookie here to avoid being rejected by the site.
    ua = {}
    list_result = []
    comments = []
    # BUG FIX: the original read and incremented ``time1`` without ever
    # initializing it, so the very first iteration raised NameError.
    start = 0
    # BUG FIX: 250 films / 25 per page = 10 pages; the original
    # range(0, 11) requested an 11th, empty page (start=250).
    # (Also removed an unused ``list = []`` that shadowed the builtin.)
    for _ in range(10):
        url = 'https://movie.douban.com/top250?start=' + str(start) + '&filter='
        page = get_Info(page_request(url, ua), ua)
        list_result += page[0]
        comments += page[1]
        start += 25
    save_review(comments)
    save_part(list_result)
使用python3.6,更高版本会存在方法警告。
程序未给出请求标头。
使用CSS选择器获取的内容只做了初步清洗,建议再次修改完善。
通过获取子网页url再次访问,获取评论,评论未截全(待修改)。
关于<br>非成对标签,通过查询以字符替换的方式解决(即将<br>,替换为“”)。