Web Scraping in Practice

Date: 2024-01-22 23:33:09 | Views: 46

Tags: practice, text, scraping, topic, lst, time, answer, click

Scraping a static web page

As an example, we scrape the Nanning forum data from the static page https://hongdou.gxnews.com.cn/viewforum-21-1.html.
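
One detail worth checking before parsing is the page encoding: the scraper script further down sets res.encoding = 'gb2312' explicitly. The sketch below uses only the standard library to show why that matters. The sample string and the ISO-8859-1 fallback are assumptions chosen for illustration (requests falls back to ISO-8859-1 when a text response declares no charset); this is not output captured from the forum.

```python
# Bytes as a gb2312 server would send them (sample text is illustrative).
raw = "南宁论坛".encode("gb2312")

# Decoding with the wrong charset never fails, it just yields mojibake.
wrong = raw.decode("iso-8859-1")

# Decoding with the right charset recovers the text; this is what
# setting res.encoding = 'gb2312' achieves before reading res.text.
right = raw.decode("gb2312")

# gbk is a superset of gb2312, so it decodes the same bytes as well.
also_right = raw.decode("gbk")

print(repr(wrong))
print(right)
```

If a page contains characters outside gb2312, switching the encoding to gbk may be worth trying.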

Table design (saved as models.py, which the scraper script imports):

from peewee import *

db = MySQLDatabase("spider", host="127.0.0.1", port=3306, user="root", password="123456")


class BaseModel(Model):
    class Meta:
        database = db


class Topic(BaseModel):
    topic_id = IntegerField(primary_key=True)   # topic id, primary key
    title = CharField()                         # topic title
    author = CharField()                        # topic author
    publish_time = DateField()                  # publication date
    click_num = IntegerField(default=0)         # number of clicks
    answer_num = IntegerField(default=0)        # number of replies
    final_answer_author = CharField()           # author of the last reply
    final_answer_time = DateTimeField()         # time of the last reply


if __name__ == '__main__':
    # Create the table
    db.create_tables([Topic])

Single-threaded version. The script is as follows:

import re
import time
from datetime import datetime

import requests
from scrapy import Selector

from models import *

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0'
}


def parse_url(url):
    res = requests.get(url, headers=headers)
    res.encoding = 'gb2312'    # the forum's static pages are encoded in gb2312
    html_text = res.text

    sel = Selector(text=html_text)
    items = sel.xpath('//div[@class="threadbit1"]')
    for item in items:
        # Extract every field up front. A row with a missing field is skipped,
        # so that values from the previous row are never reused by accident.
        title_lst = item.xpath(".//div[@class='thread-row openTitle']/a/font/text()").extract()
        id_lst = item.xpath(".//div[@class='thread-row openTitle']/a/@href").extract()
        author_lst = item.xpath(".//div[4]/a[1]/text()").extract()
        publish_time_lst = item.xpath(".//div[4]/a[2]/text()").extract()
        click_answer_lst = item.xpath(".//div[@style='float:right;width:90px;']/text()").extract()
        final_answer_author_lst = item.xpath(".//div[2]/a[1]/text()").extract()
        final_answer_time_lst = item.xpath(".//div[2]/a[2]/text()").extract()
        if not all([title_lst, id_lst, author_lst, publish_time_lst, click_answer_lst,
                    final_answer_author_lst, final_answer_time_lst]):
            continue

        # The topic id is the first run of digits in the thread link
        id_match = re.search(r'(\d+)', id_lst[0])
        if not id_match:
            continue

        # The counts are rendered as "answers/clicks"; when the part after the
        # slash is empty, the click count is read from a separate <font> element
        click_answer_parts = click_answer_lst[0].strip().split('/')
        answer_num = int(click_answer_parts[0])
        if len(click_answer_parts) < 2 or click_answer_parts[1] == '':
            click_num = int(item.xpath(".//div[3]/font/text()").extract()[0])
        else:
            click_num = int(click_answer_parts[1])

        topic = Topic()
        topic.topic_id = int(id_match.group(1))
        topic.title = title_lst[0].strip()
        topic.author = author_lst[0]
        topic.publish_time = datetime.strptime(publish_time_lst[0].strip(), '%Y-%m-%d')
        topic.click_num = click_num
        topic.answer_num = answer_num
        topic.final_answer_author = final_answer_author_lst[0]
        topic.final_answer_time = datetime.strptime(final_answer_time_lst[0].strip(), '%Y-%m-%d %H:%M')

        # topic_id is assigned by hand, so save() alone would issue an UPDATE;
        # a row that is not in the table yet needs force_insert=True
        if Topic.select().where(Topic.topic_id == topic.topic_id).exists():
            topic.save()
        else:
            topic.save(force_insert=True)

        print("saved topic: " + str(topic.topic_id))

        # Pause for one second after each topic to slow the crawl down
        time.sleep(1)

if __name__ == "__main__":
    # The host is masked as "xxx" in these URLs
    res = requests.get("https://xxx/viewforum-21-1.html", headers=headers)
    res.encoding = 'gb2312'
    html_text = res.text

    sel = Selector(text=html_text)
    # Read the total page count from the pager text, which contains "current/total"
    td_lst = sel.xpath("//div[@class='pagenav']//td[@class='pagenav']/text()").extract()
    match = re.search(r'(\d+)/(\d+)', td_lst[0]) if td_lst else None
    # Fall back to a single page when the pager cannot be parsed
    total_page = int(match.group(2)) if match else 1

    for page in range(1, total_page + 1):
        parse_url("https://xxx/viewforum-21-{0}.html".format(page))
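
The two string-parsing steps in the script (the pager text and the reply/click counts) can be pulled out into small functions that are easy to check without touching the network or the database. This is a minimal sketch, not the original author's code: the names parse_total_pages and parse_counts are hypothetical, and the sample strings are assumed forms of what the forum renders, inferred from the regular expression and the split('/') call in the script.

```python
import re


def parse_total_pages(pager_text, default=1):
    """Return the total page count from text containing 'current/total'.

    Falls back to `default` when no such pattern is found.
    """
    match = re.search(r'(\d+)/(\d+)', pager_text)
    return int(match.group(2)) if match else default


def parse_counts(counts_text):
    """Split 'answers/clicks' into (answer_num, click_num).

    click_num is None when the part after the slash is empty, which is
    the case where the script reads the click count from another element.
    """
    answers, _, clicks = counts_text.strip().partition('/')
    click_num = int(clicks) if clicks.strip() else None
    return int(answers), click_num


# Sample inputs are assumptions, not text captured from the forum.
print(parse_total_pages("1/250"))
print(parse_counts(" 12/345 "))
print(parse_counts("12/"))
```

Keeping these steps separate from the XPath extraction means a change in the page layout and a change in the text format can be debugged independently.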

Source: https://www.cnblogs.com/xiaocer/p/17981394
