How to Quickly Implement News Scraping in Python

Date: 2023-01-04 10:02:01


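As a veteran engineer, I often use crawlers to collect target data quickly. Below is the code I use to scrape news quickly with Python, along with an explanation. I hope it helps.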

Straight to the code:

#!/usr/bin/env python3
# Author: veelion

import re
import time
import requests
import tldextract

def save_to_db(url, html):
    # Save the page to a database; for now we just print some info instead
    print('%s : %s' % (url, len(html)))

def crawl():
    # 1. download baidu news
    hub_url = 'http://news.baidu.com/'
    res = requests.get(hub_url)
    html = res.text

    # 2. extract news links
    ## 2.1 extract all links with 'href'
    links = re.findall(r'href=[\'"]?(.*?)[\'"\s]', html)
    print('find links:', len(links))
    news_links = []
    ## 2.2 filter non-news link
    for link in links:
        if not link.startswith('http'):
            continue
        tld = tldextract.extract(link)
        if tld.domain == 'baidu':
            continue
        news_links.append(link)

    print('find news links:', len(news_links))
    # 3. download news and save to database
    for link in news_links:
        html = requests.get(link).text
        save_to_db(link, html)
    print('works done!')

def main():
    while 1:
        crawl()
        time.sleep(300)

if __name__ == '__main__':
    main()

A brief explanation of the code above:

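1. Use requests to download the Baidu News home page.

2. First use a regular expression to extract the href attributes, that is, the links in the page (the regex matches any href, not only those on a tags). Then pick out the news links, on the assumption that every external, non-Baidu link is a news link.

3. Download each news link found, one by one, and save it to the database. The save-to-database function is a placeholder that just prints the URL and the page length.

4. Repeat steps 1-3 every 300 seconds to pick up newly published news.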
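
To see step 2 in isolation, here is a minimal sketch that runs the same regular expression on a hand-written HTML snippet. The snippet, the helper names `extract_links` and `is_external`, and the `own_domain` parameter are assumptions made up for illustration. The host-suffix check built on the standard library's `urlsplit` is only a rough stand-in for the `tldextract` check in the original code, used here so the example runs without third-party packages.

```python
import re
from urllib.parse import urlsplit

# Same pattern as in crawl(): an optional opening quote, then the shortest
# run of characters up to a quote or whitespace.
HREF_RE = re.compile(r'href=[\'"]?(.*?)[\'"\s]')


def extract_links(html):
    """Return every href value found by the regex, in document order."""
    return HREF_RE.findall(html)


def is_external(link, own_domain='baidu.com'):
    """True for absolute http(s) links whose host is not under own_domain."""
    if not link.startswith('http'):
        return False
    host = urlsplit(link).hostname or ''
    return not (host == own_domain or host.endswith('.' + own_domain))


# Hypothetical page fragment: one Baidu link, one relative link, two external.
sample_html = (
    '<a href="http://news.baidu.com/guonei">domestic</a> '
    "<a href='https://example.com/story/1'>story</a> "
    '<a href="/relative/path">relative</a> '
    '<a href=https://example.org/story/2 target=_blank>story 2</a>'
)

links = extract_links(sample_html)
news_links = [link for link in links if is_external(link)]
print(links)
print(news_links)
```

Only the two external links survive the filter. Unlike `tld.domain == 'baidu'`, the suffix check does not treat Baidu hosts under other top-level domains as internal, which is one reason the original reaches for `tldextract`.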

The code above works, but only just. There is plenty to pick apart, so let's critique and improve this crawler as we go.
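
Two weaknesses can be read directly off the code: a single failed download raises an exception that stops the whole loop, and every 300-second round downloads all links again, including pages already saved. The sketch below is one possible way to address both, not the author's follow-up. The names `crawl_once`, `fake_fetch`, and the `seen` set are hypothetical, and the fetcher is passed in as a parameter so that the example runs without network access.

```python
def crawl_once(links, fetch, save, seen):
    """Download each unseen link once; skip links whose download fails.

    fetch(url) returns the page text or raises an exception on failure.
    save(url, html) stores the page. seen is a set shared across rounds.
    Returns the number of pages saved in this round.
    """
    saved = 0
    for link in links:
        if link in seen:
            continue  # already downloaded in an earlier round
        try:
            html = fetch(link)
        except Exception as exc:  # a real crawler would catch narrower errors
            print('failed: %s (%s)' % (link, exc))
            continue
        save(link, html)
        seen.add(link)
        saved += 1
    return saved


# Demo with a fake fetcher so the example runs without network access.
def fake_fetch(url):
    if 'broken' in url:
        raise IOError('simulated network error')
    return '<html>%s</html>' % url


pages = {}
seen = set()
round_links = ['https://example.com/a', 'https://example.com/broken',
               'https://example.com/a']
first = crawl_once(round_links, fake_fetch, pages.__setitem__, seen)
second = crawl_once(round_links, fake_fetch, pages.__setitem__, seen)
print(first, second, sorted(pages))
```

In the real crawler, `fetch` could be a small wrapper around `requests.get` with a `timeout` argument, and `save` could be the `save_to_db` function from the listing above. Failed links are deliberately left out of `seen`, so they are retried in the next round.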

From: https://blog.51cto.com/u_13488918/5987015
