
Python crawler: fetching ZCOOL data

Date: 2022-11-27 03:44:31
Tags: get, python, self, crawler, datetime, proxy, https, import, ZCOOL

1. Fetching ZCOOL data

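Two things to note: this article uses an IP proxy, and the detail-page URL is built differently depending on the type of work (works and articles use different paths).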
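As a minimal sketch of that URL difference, the rule can be isolated in a small helper. The function name `build_detail_url` and the sample page URLs are hypothetical; the two base paths and the `'work'` check mirror what the script does with the `pageUrl` and `idStr` fields.

```python
DETAIL_BASE = 'https://www.zcool.com.cn/p1/'


def build_detail_url(page_url, id_str):
    """Return the detail URL for one feed item (hypothetical helper).

    page_url and id_str stand for the item's 'pageUrl' and 'idStr' fields.
    """
    # Items whose page URL contains 'work' are treated as works; everything
    # else is treated as an article, the same check the script uses.
    kind = 'product' if 'work' in page_url else 'article'
    return DETAIL_BASE + kind + '/' + id_str


# The page URLs below are made up for illustration only.
print(build_detail_url('https://www.zcool.com.cn/work/EXAMPLE.html', 'EXAMPLE'))
print(build_detail_url('https://www.zcool.com.cn/article/EXAMPLE.html', 'EXAMPLE'))
```

Keeping the rule in one place means a new content type only needs one extra branch.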

import random
import time
import datetime
import requests
import threading
from lxml import etree
import pymysql

class ZhankunSpider(object):
    def __init__(self):
        self.url = 'https://www.zcool.com.cn/p1/discover/first?p={}&ps=20'
        self.mysql = pymysql.connect(host='localhost', database='tenders', port=3306, user='root',
                                         password='123456')
        self.cur = self.mysql.cursor()
        self.blog = 1  # attempt counter for get_html (at most 3 tries per page)


    def proxy_get(self):
        # Fetch one proxy from the provider API (fill in your own proxy API link).
        proxy_info = requests.get(
            r'put your IP proxy API link here').json()['data'][0]
        proxy = str(proxy_info["ip"]) + ':' + str(proxy_info["port"])
        http = 'http://' + proxy
        https = 'https://' + proxy
        self.proxys = {'http': http,
                           'https': https}
        print(self.proxys)
        # Check the proxy by routing a test request through it; without
        # proxies= the request would not exercise the proxy at all.
        try:
            result = requests.get('https://www.baidu.com/', proxies=self.proxys, timeout=3)
            print(result.status_code)
            ok = result.status_code == 200
        except requests.RequestException as e:
            print(e)
            ok = False
        if not ok:
            # Wait briefly, then ask the provider for another proxy.
            time.sleep(0.2)
            self.proxy_get()

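    def _check_expire(self):
        # Refresh the proxy when none has been fetched yet or the current one
        # is older than 60 seconds. The deadline is only reset after a fetch;
        # resetting it on every check would mean it never expires.
        expire = getattr(self, 'expire_datetime', None)
        if expire is None or datetime.datetime.now() >= expire:
            self.proxy_get()
            self.expire_datetime = datetime.datetime.now() + datetime.timedelta(seconds=60)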

    # Send the request
    def get_html(self, url):
        if self.blog <= 3:
            try:
                datas = {
                        'p': 'i',
                        'column': 5
                }
                headers = {'Cookie': 'cookie obtained after logging in',
                            'User-Agent':'',}
                json_ids = requests.get(url=url, headers=headers, data=datas).json()
                return json_ids
            except Exception as e:
                print(e)
                self.blog += 1
                # Return the retry's result; otherwise a successful retry is lost.
                return self.get_html(url)

    # Parse and extract the data
    def parse_html(self, url):
        json_ids = self.get_html(url)
        self._check_expire()
        if json_ids:
            time.sleep(1)
            for dic in json_ids['datas']:
                titles = dic['content']['title']  # title
                types = dic['content']['typeStr']
                viewCountStrs = dic['content']['viewCountStr']   # view count
                subCateStrs = dic['content']['subCateStr']
                cateStrs = dic['content']['cateStr']
                url13 = 'https://www.zcool.com.cn/p1/product/'+dic['content']['idStr']
                urll = dic['content']['pageUrl']
                headers1 = {
                    'Cookie': '',
                    'User-Agent': '', }

                # self._check_expire()
                # Works and articles use different detail URLs and image fields.
                is_work = 'work' in urll
                if is_work:
                    url2 = 'https://www.zcool.com.cn/p1/product/' + dic['content']['idStr']
                else:
                    url2 = 'https://www.zcool.com.cn/p1/article/' + dic['content']['idStr']
                try:
                    json_idss = requests.get(url=url2, headers=headers1, proxies=self.proxys, timeout=3).json()
                except Exception:
                    # The proxy probably failed: fetch a new one and retry once.
                    self.proxy_get()
                    json_idss = requests.get(url=url2, headers=headers1, proxies=self.proxys, timeout=3).json()
                time.sleep(1)
                datass = ''  # stays empty if the detail response lists no images
                if is_work:
                    for dici in json_idss['data']['productImages']:
                        datass = dici['url']  # only the last image URL is kept
                else:
                    # datass = json_idss['data']['id']
                    for dici in json_idss['data']['creatorObj']['contentCards']:
                        datass = dici['cover1x']  # only the last cover URL is kept

                timeStamp = dic['content']['timeTitleStr']
                # timeArray = time.localtime(timeStamp)  # convert to the corresponding time
                # otherStyleTime = time.strftime("%Y-%m-%d %H:%M:%S", timeArray)  # string
                # date = otherStyleTime
                photo = dic['content']['cover1x']

                data = {
                    'title': titles,
                    'urls': url13,
                    'address': timeStamp,
                    'configuration': types,
                    'grade': viewCountStrs,
                    'collections': subCateStrs,
                    'price': cateStrs,
                    'unit': photo,
                    'photoadress': datass
                    }

                print(data)
                self.save_mysql(data)

    def save_mysql(self, data):
        # str_sql = "insert into ftx values(0, '{}', '{}');".format(data['first_category'],data['second_category'])
        # Use a parameterized query so that quotes in a title cannot break the SQL.
        str_sql = "insert into meituan values(0, %s, %s, %s, %s, %s, %s, %s, %s, %s);"
        params = (
                data['title'],  data['urls'],data['address'], data['configuration'], data['grade'], data['collections'],
                data['price'], data['unit'], data['photoadress'])

        self.cur.execute(str_sql, params)
        self.mysql.commit()

    def __del__(self):
        self.cur.close()
        self.mysql.close()


    # Entry point
    def run(self):
        try:
            for i in range(1,5):
                url = self.url.format(i)
                print(i)
                # self.get_html(url)
                self.parse_html(url)
                time.sleep(random.randint(2, 4))
                # Reset self.blog after each page is crawled
                self.blog = 1
        except Exception as e:
            print('An error occurred', e)


if __name__ == '__main__':
    spider = ZhankunSpider()
    spider.run()
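
The 60-second proxy lifetime that `_check_expire` works with can also be tracked by a small standalone cache. The sketch below is only an illustration under stated assumptions: the class `ExpiringProxy`, the `fake_fetch` stand-in and the placeholder address are hypothetical and not part of the script, and the fetch function and clock are passed in so the logic can be tried without a real proxy provider.

```python
import datetime


class ExpiringProxy:
    """Cache a proxies mapping and refresh it once it is older than the TTL."""

    def __init__(self, fetch, ttl_seconds=60, now=datetime.datetime.now):
        self._fetch = fetch      # callable that returns a proxies dict
        self._ttl = datetime.timedelta(seconds=ttl_seconds)
        self._now = now          # injectable clock, useful for testing
        self._proxies = None
        self._expires_at = None

    def get(self):
        current = self._now()
        if self._proxies is None or current >= self._expires_at:
            self._proxies = self._fetch()
            # The deadline moves only when a new proxy is fetched.
            self._expires_at = current + self._ttl
        return self._proxies


def fake_fetch():
    # Stand-in for the call to the proxy provider; the address is a placeholder.
    return {'http': 'http://127.0.0.1:8080', 'https': 'http://127.0.0.1:8080'}


cache = ExpiringProxy(fake_fetch, ttl_seconds=60)
proxies = cache.get()   # first call fetches a proxy
same = cache.get()      # a call within the TTL reuses the cached mapping
print(proxies is same)
```

In the spider, `cache.get()` could be passed as the `proxies=` argument of each request, with `fake_fetch` replaced by a function that queries the real provider.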

2. Results

 

From: https://www.cnblogs.com/icekele/p/16928912.html
