
News Text Crawling: 央广网 (cnr.cn) as an Example



crawling

crawling1.x

crawling1.0

import requests
from bs4 import BeautifulSoup
import pandas as pd


def get_html_text(url):
    # Fetch the page and return its decoded HTML text; on failure, return the exception object.
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding  # guess the charset from the page content
        return r.text
    except Exception as e:
        return e


def parse_news_page(html):
    # Join the <title> text and the text of every <p> tag into one news string.
    try:
        ilt = []
        soup = BeautifulSoup(html, "html.parser")
        title = soup.find("title").string
        ilt.append(title)
        content = soup.find_all("p")
        for p in content:
            s = p.text.strip()
            ilt.append(s)
        news = "".join(ilt)
        return news
    except Exception as e:
        return e


def main():
    # Crawl a single article and save its text as one row of a CSV file.
    url = "https://military.cnr.cn/jdt/20230203/t20230203_526143652.shtml"
    html = get_html_text(url)
    news = parse_news_page(html)
    print(news)
    news = pd.Series(news)
    news.to_csv("news.csv", header=False, index=False)


main()

Project: given the URL of one news article, crawl the news text.
Approach: use the BeautifulSoup library to find the page's title tag and all p tags, collect their text into a list, and finally save it as a CSV file.
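
As a quick sanity check, the saved file can be read back with pandas; a minimal usage sketch, assuming news.csv was just written by the script above:

import pandas as pd

# The CSV was written without a header row, so read it back the same way.
saved = pd.read_csv("news.csv", header=None)
print(saved.iloc[0, 0][:100])  # first 100 characters of the stored news text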

crawling2.x

crawling2.0

import requests
from bs4 import BeautifulSoup
import pandas as pd


def get_html_text(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except Exception as e:
        return e


def parse_news_page(html):
    try:
        ilt = []
        soup = BeautifulSoup(html, "html.parser")
        title = soup.find("title").string
        ilt.append(title)
        content = soup.find_all("p")
        for p in content:
            s = p.text.strip()
            ilt.append(s)
        news = "".join(ilt)
        return news
    except Exception as e:
        return e


def parse_href_page(html, hrefs):
    # Collect every <a> href that ends with "shtml", i.e. the news article pages.
    soup = BeautifulSoup(html, "html.parser")
    tags = soup.find_all("a")
    for tag in tags:
        href = tag.attrs["href"]
        if "shtml" == href[-5:]:
            hrefs.append(href)
    return hrefs


def main():
    hrefs = []
    newses = []
    url = "http://military.cnr.cn/"
    html = get_html_text(url)
    parse_href_page(html, hrefs)
    for href in hrefs:
        html = get_html_text(href)
        news = parse_news_page(html)
        print(news)
        newses.append(news)
    newses = pd.Series(newses)
    newses.to_csv("newses.csv", header=False, index=False)


main()

Project: crawl the news links from one channel page, then fetch the news text behind each link.
Approach: use the BeautifulSoup library to read the href of every a tag and keep only links satisfying "shtml" == href[-5:].
To be resolved: in the crawled output, the same news text appears twice in a row in several places.
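
As a side note, the suffix test can also be written with str.endswith, which avoids the fixed-length slice and reads a little more clearly; is_news_href below is a hypothetical helper, not part of the original script:

def is_news_href(href):
    # Same filter as "shtml" == href[-5:]: keep only links ending in shtml.
    return href.endswith("shtml")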

crawling2.1

import requests
from bs4 import BeautifulSoup
import pandas as pd


def get_html_text(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except Exception as e:
        return e


def parse_news_page(html):
    try:
        ilt = []
        soup = BeautifulSoup(html, "html.parser")
        title = soup.find("title").string
        ilt.append(title)
        content = soup.find_all("p")
        for p in content:
            s = p.text.strip()
            ilt.append(s)
        news = "".join(ilt)
        return news
    except Exception as e:
        return e


def parse_href_page(html, hrefs):
    soup = BeautifulSoup(html, "html.parser")
    tags = soup.find_all("a")
    for tag in tags:
        href = tag.attrs["href"]
        if "shtml" == href[-5:] and href not in hrefs:
            hrefs.append(href)
    return hrefs


def main():
    hrefs = []
    newses = []
    url = "http://military.cnr.cn/"
    html = get_html_text(url)
    parse_href_page(html, hrefs)
    for href in hrefs:
        html = get_html_text(href)
        news = parse_news_page(html)
        print(news)
        newses.append(news)
    newses = pd.Series(newses)
    newses.to_csv("newses.csv", header=False, index=False)


main()

Improvement: skip news links that are already in hrefs.

if "shtml" == href[-5:] and href not in hrefs:
    hrefs.append(href)
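
With more links, the "href not in hrefs" membership test scans the whole list each time; a set gives constant-time lookups while the list keeps the original order. tag.attrs["href"] also raises a KeyError for an a tag without an href, which tag.get("href") avoids. A hedged sketch of parse_href_page along those lines (the seen set is an addition, not in the original code):

from bs4 import BeautifulSoup


def parse_href_page(html, hrefs):
    # Collect .shtml links without duplicates; a set mirrors hrefs for fast membership tests.
    seen = set(hrefs)
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all("a"):
        href = tag.get("href")  # None instead of KeyError when the tag has no href
        if href and href.endswith("shtml") and href not in seen:
            seen.add(href)
            hrefs.append(href)
    return hrefs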

crawling3.x

crawling3.0

import requests
from bs4 import BeautifulSoup
import pandas as pd


def get_html_text(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except Exception as e:
        return e


def parse_news_page(html):
    try:
        ilt = []
        soup = BeautifulSoup(html, "html.parser")
        title = soup.find("title").string
        ilt.append(title)
        content = soup.find_all("p")
        for p in content:
            s = p.text.strip()
            ilt.append(s)
        news = "".join(ilt)
        return news
    except Exception as e:
        return e


def parse_href_page(html, hrefs):
    soup = BeautifulSoup(html, "html.parser")
    tags = soup.find_all("a")
    for tag in tags:
        href = tag.attrs["href"]
        if "shtml" == href[-5:] and href not in hrefs:
            hrefs.append(href)
    return hrefs


def get_newses(url, newses):
    # Collect the article links on one channel page, then fetch and parse each article.
    hrefs = []
    html = get_html_text(url)
    parse_href_page(html, hrefs)
    for href in hrefs:
        html = get_html_text(href)
        news = parse_news_page(html)
        print(news)
        newses.append(news)


def main():
    newses = []
    urls = ["http://finance.cnr.cn/", "http://tech.cnr.cn/", "http://food.cnr.cn/",
            "http://health.cnr.cn/", "http://edu.cnr.cn/", "http://travel.cnr.cn/",
            "http://military.cnr.cn/", "http://auto.cnr.cn/", "http://house.cnr.cn/",
            "http://gongyi.cnr.cn/"]
    for url in urls:
        print(url)
        get_newses(url, newses)
    newses = pd.Series(newses)
    newses.to_csv("newses.csv", header=False, index=False)


main()

Project: crawl the news links from 10 channel pages, then fetch the news text behind each link.
Approach: loop over the 10 channel URLs.
To be resolved: some of the crawled news text is garbled (wrong encoding); some news links are incomplete; some news text contains spaces, tabs, newlines and other whitespace.
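
The incomplete links are most likely relative URLs on the channel page; urllib.parse.urljoin can resolve them against the channel URL before fetching. A minimal sketch, not part of the original script:

from urllib.parse import urljoin

# Resolve a possibly relative href against the channel page it was found on, e.g.
# absolutize("http://military.cnr.cn/", "/jdt/20230203/t20230203_526143652.shtml")
# -> "http://military.cnr.cn/jdt/20230203/t20230203_526143652.shtml"
def absolutize(base_url, href):
    return urljoin(base_url, href)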

crawling3.1

import requests
from bs4 import BeautifulSoup
import pandas as pd


def get_html_text(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except Exception as e:
        print(e)
        print(url)
        return url


def parse_news_page(html):
    try:
        ilt = []
        soup = BeautifulSoup(html, "html.parser")
        title = soup.find("title").string
        ilt.append(title)
        content = soup.find_all("p")
        for p in content:
            s = p.text.strip()
            ilt.append(s)
        news = "".join(ilt)
        return news
    except Exception as e:
        return e


def parse_href_page(html, hrefs):
    soup = BeautifulSoup(html, "html.parser")
    tags = soup.find_all("a")
    for tag in tags:
        href = tag.attrs["href"]
        if "shtml" == href[-5:] and href not in hrefs:
            hrefs.append(href)
    return hrefs


def get_newses(url, newses):
    hrefs = []
    html = get_html_text(url)
    parse_href_page(html, hrefs)
    for href in hrefs:
        html = get_html_text(href)
        if html == href:
            continue  # get_html_text returned the url itself, i.e. the fetch failed; skip this link
        news = parse_news_page(html)
        # print(news)
        newses.append(news)


def main():
    newses = []
    urls = ["http://finance.cnr.cn/", "http://tech.cnr.cn/", "http://food.cnr.cn/",
            "http://health.cnr.cn/", "http://edu.cnr.cn/", "http://travel.cnr.cn/",
            "http://military.cnr.cn/", "http://auto.cnr.cn/", "http://house.cnr.cn/",
            "http://gongyi.cnr.cn/"]
    for url in urls:
        print(url)
        get_newses(url, newses)
    newses = pd.Series(newses)
    newses.to_csv("newses.csv", header=False, index=False)


main()

Improvement: tried adding encoding="utf-8" (this did not solve the problem; the garbled text remains); to keep the workload manageable, broken news links are now skipped, so the saved file contains only news text.

def get_html_text(url):
    except Exception as e:
        return url
def get_newses(url, newses):
    for href in hrefs:
        html = get_html_text(href)
        if html == href:
            continue
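
For the remaining garbled text, one option worth trying is to hand the raw response bytes to BeautifulSoup and let its own encoding detection (UnicodeDammit) pick the charset, instead of relying on r.apparent_encoding. A hedged sketch; get_html_bytes is a hypothetical variant of get_html_text:

import requests
from bs4 import BeautifulSoup


def get_html_bytes(url):
    # Return the raw bytes; on failure, return the url itself, matching the skip logic above.
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        return r.content
    except Exception as e:
        print(e)
        print(url)
        return url


# BeautifulSoup accepts bytes and decodes them itself:
# soup = BeautifulSoup(get_html_bytes(href), "html.parser")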

crawling3.2

import requests
from bs4 import BeautifulSoup
import pandas as pd


def get_html_text(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except Exception as e:
        print(e)
        print(url)
        return url


def parse_news_page(html):
    try:
        ilt = []
        soup = BeautifulSoup(html, "html.parser")
        title = soup.find("title").string
        ilt.append(title)
        content = soup.find_all("p")
        for p in content:
            s = p.text.strip()
            ilt.append(s)
        news = "".join(ilt)
        return news
    except Exception as e:
        return e


def parse_href_page(html, hrefs):
    soup = BeautifulSoup(html, "html.parser")
    tags = soup.find_all("a")
    for tag in tags:
        href = tag.attrs["href"]
        if "shtml" == href[-5:] and href not in hrefs:
            hrefs.append(href)
    return hrefs


def get_newses(url, newses, labels, count):
    hrefs = []
    html = get_html_text(url)
    parse_href_page(html, hrefs)
    for href in hrefs:
        html = get_html_text(href)
        if html == href:
            continue
        news = parse_news_page(html)
        # print(news)
        newses.append(news)
        labels.append(count)  # the label is the index of the channel URL in urls


def main():
    newses = []
    labels = []
    urls = ["http://finance.cnr.cn/", "http://tech.cnr.cn/", "http://food.cnr.cn/",
            "http://health.cnr.cn/", "http://edu.cnr.cn/", "http://travel.cnr.cn/",
            "http://military.cnr.cn/", "http://auto.cnr.cn/", "http://house.cnr.cn/",
            "http://gongyi.cnr.cn/"]
    count = 0
    for url in urls:
        print(url)
        get_newses(url, newses, labels, count)
        count += 1
    newses = pd.DataFrame({"label": labels, "text": newses})
    newses.to_csv("newses.csv", index=False)


main()

Project: to support the machine-learning task of news text classification, attach a label to each crawled news item.
Question: before labels were added, when the list was converted to a Series and saved as CSV, some news items contained newlines and therefore occupied 3 lines in the file, yet after reading with read_csv() the order of the news items was still correct.
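
That behaviour is expected: to_csv wraps any field containing a newline in quotes, and read_csv treats the quoted field as a single value, so the embedded line breaks never shift the row order. A tiny round-trip demo:

import pandas as pd

s = pd.Series(["line one\nline two", "second item"])
s.to_csv("demo.csv", header=False, index=False)  # the first field is written quoted, spanning two physical lines
back = pd.read_csv("demo.csv", header=None)
print(len(back))        # 2 rows, not 3
print(back.iloc[0, 0])  # "line one\nline two" comes back as one value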

crawling3.3

import requests
from bs4 import BeautifulSoup
import pandas as pd


def get_html_text(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except Exception as e:
        print(e)
        print(url)
        return url


def parse_news_page(html):
    try:
        ilt = []
        soup = BeautifulSoup(html, "html.parser")
        title = soup.find("title").string
        ilt.append(title)
        content = soup.find_all("p")
        for p in content:
            s = p.text.strip()
            s = "".join(s.split("\n"))  # drop newline characters inside the paragraph text
            ilt.append(s)
        news = "".join(ilt)
        return news
    except Exception as e:
        return e


def parse_href_page(html, hrefs):
    soup = BeautifulSoup(html, "html.parser")
    tags = soup.find_all("a")
    for tag in tags:
        href = tag.attrs["href"]
        if "shtml" == href[-5:] and href not in hrefs:
            hrefs.append(href)
    return hrefs


def get_newses(url, newses, labels, count):
    hrefs = []
    html = get_html_text(url)
    parse_href_page(html, hrefs)
    for href in hrefs:
        html = get_html_text(href)
        if html == href:
            continue
        news = parse_news_page(html)
        # print(news)
        newses.append(news)
        labels.append(count)


def main():
    newses = []
    labels = []
    urls = ["http://finance.cnr.cn/", "http://tech.cnr.cn/", "http://food.cnr.cn/",
            "http://health.cnr.cn/", "http://edu.cnr.cn/", "http://travel.cnr.cn/",
            "http://military.cnr.cn/", "http://auto.cnr.cn/", "http://house.cnr.cn/",
            "http://gongyi.cnr.cn/"]
    count = 0
    for url in urls:
        print(url)
        get_newses(url, newses, labels, count)
        count += 1
    newses = pd.DataFrame({"label": labels, "text": newses})
    newses.to_csv("newses.csv", index=False)


main()

Improvement: remove newline characters from the news text with s = "".join(s.split("\n")).
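
If spaces and tabs should go as well (part of the issue noted under crawling3.0), splitting on any whitespace handles all of them at once; a one-line alternative to the version above:

s = "".join(s.split())  # str.split() with no argument splits on any run of whitespace (spaces, tabs, newlines)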

From: https://www.cnblogs.com/yymqdu/p/17094059.html
