04. China Tendering and Bidding Public Service Platform (中国招标投标公众服务平台)

Date: 2022-11-28 09:14:11


China Tendering and Bidding Public Service Platform (中国招标投标公众服务平台): https://bulletin.cebpubservice.com/

import httpx
import warnings
from lxml import etree
from scrapy.exporters import CsvItemExporter
from loguru import logger

warnings.filterwarnings("ignore")


class Spider:

    def __init__(self):
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36 Edg/106.0.1370.52"
        }

        # Set up CSV output; the exporter writes bytes, so the file is opened in binary mode
        self.file = open('data.csv', 'wb')
        self.exporter = CsvItemExporter(file=self.file, include_headers_line=True, encoding='gbk')
        self.exporter.start_exporting()

    def request_url(self):
        for i in range(1, 500 + 1):  # iterate over result pages 1..500
            url = f'https://bulletin.cebpubservice.com/xxfbcmses/search/bulletin.html?searchDate=1997-10-26&dates=300&word=&categoryId=88&industryName=&area=&status=&publishMedia=&sourceInfo=&showStatus=1&page={i}'
            try:
                # Send the request
                response = httpx.get(url=url, headers=self.headers, verify=False, timeout=30)
                self.parse_data(response)
            except Exception as e:
                # Log and move on to the next page; catching Exception rather than
                # BaseException keeps Ctrl+C able to stop the run
                logger.error('page {} failed: {}', i, e)

    def parse_data(self, response):
        # Module-level counter, defined under the __main__ guard at the bottom of the script
        global num
        if response.status_code == 200:
            result = response.content.decode('utf-8')

            html = etree.HTML(result)
            # Every table row except the header row
            data_list = html.xpath('//body/table[@class="table_text"]/tr[position()>1]')

            for data in data_list:
                num += 1  # count of scraped rows
                # Announcement title; the replaced characters cannot be encoded as GBK
                title = ''.join(data.xpath('./td[1]/a/@title')).replace('•', '-').replace('\xb3', '').replace('\u2082', '')
                # Announcement link; the slice strips a fixed-length prefix and suffix around the URL
                title_url = ''.join(data.xpath('./td[1]/a/@href')).replace('\t', '')[20:-2].replace('\n', '')
                # Industry
                bussine = ''.join(data.xpath('./td[2]/span/text()')).replace('\t', '').replace('\n', '').replace('\r', '')
                # Region
                address = ''.join(data.xpath('./td[3]/span/@title'))
                # Source channel
                tools = ''.join(data.xpath('./td[4]/text()')).replace('\t', '').replace('\n', '')
                # Publication time
                times = ''.join(data.xpath('./td[5]/text()')).replace('\t', '').replace('\n', '').replace('\r', '')
                # Bid opening time
                open_time = ''.join(data.xpath('./td[6]/@id'))

                # Assemble the record
                dict_data = {
                    'num': num,
                    'title': title,
                    'title_url': title_url,
                    'bussine': bussine,
                    'address': address,
                    'tools': tools,
                    'times': times,
                    'open_time': open_time,
                }

                logger.info('{}', dict_data)
                self.exporter.export_item(dict_data)

    def __del__(self):
        self.exporter.finish_exporting()
        self.file.close()


if __name__ == '__main__':
    # Row counter, incremented in parse_data via `global num`
    num = 0
    s = Spider()
    s.request_url()
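
The link field above is cut out of the href with a fixed slice, `[20:-2]`, which only works while the prefix and suffix keep exactly those lengths. A minimal sketch of a more tolerant alternative follows. It assumes the href has the shape `javascript:urlOpen('<url>')`, which is inferred from the slice offsets and not stated in the original post; `extract_url` is a hypothetical helper name.

```python
import re

# Assumption (inferred from the [20:-2] slice, not confirmed by the post):
# the href looks like  javascript:urlOpen('<url>')
_QUOTED_ARG = re.compile(r"""\(\s*['"]([^'"]+)['"]\s*\)""")


def extract_url(href: str) -> str:
    """Return the quoted argument inside a javascript:...('...') href, or '' if absent."""
    # Remove tabs, newlines and other whitespace, as the original code does
    cleaned = re.sub(r"\s+", "", href)
    match = _QUOTED_ARG.search(cleaned)
    return match.group(1) if match else ""
```

In `parse_data`, this could stand in for the slice as `title_url = extract_url(''.join(data.xpath('./td[1]/a/@href')))`. Unlike the slice, it returns an empty string instead of a mangled fragment when a row has no link in the expected form.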

Source: https://www.cnblogs.com/modly/p/16931289.html
