
Python data scraping: grabbing some content from 星星网

Date: 2023-02-09 14:44:39
Tags: scraping, cookies, python, url, split, file, print, response

Code:

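# coding=utf-8

import os, sys, re
import requests
from webob.exc import strip_tags  # removes HTML tags from a string
from xpinyin import Pinyin

def str2dict(cookie_str):
    """Parse a "k1=v1; k2=v2" cookie string into a dict."""
    result = {}
    for pair in cookie_str.split(";"):
        if not pair.strip():
            continue
        # Split on the first "=" only: a cookie value may itself contain "="
        name, _, value = pair.partition("=")
        result[name.strip()] = value.strip()
    return result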

# Cookie string sent with every request (value as given in the original post)
cookie = 'gr_user_id=e=1675905351'

baseurl = "https://www.xxwolo.com/"
cookies = str2dict(cookie)
p = Pinyin()
#print(cookies)
#sys.exit(0)

# Fetch the entry page and cut out the contents of the dyn_list block
url0 = baseurl+"posts/topic/tsptf23456789011?v=Sun"
response = requests.get(url=url0, cookies=cookies)
strcommon = response.text
jiashus = strcommon.split("<div class=\"dyn_list\">")[1].split("</div>")[0]

# Each match is (id from the /info/ link, inner markup, name inside <span>)
resjiashu = re.findall(r'<a href="/info/(.*?)" class="dyn_item">([\s\S]*?)<span>(.*?)</span>', jiashus)
for resjiashu1 in resjiashu:
    jiashumd5 = resjiashu1[0]
    jiashuname = resjiashu1[2]
    # Only process entries whose name contains "谢谢"
    if "谢谢" not in jiashuname:
        continue
    jiashunameen = p.get_pinyin(jiashuname)  # pinyin form of the name (not used below)
    url = baseurl+"posts/topic/"+jiashumd5+"?v=Sun"
    response = requests.get(url=url, cookies=cookies)
    str0 = response.text
    fenleis = str0.split("<div class=\"rp_words\">")[1].split("</div>")[0]
    #print(fenleis)
    # Category links as (url, title); the page just fetched is added as the '太阳' (Sun) category
    res = re.findall(r'<a href="(.*?)">(.*?)</a>', fenleis)
    res.insert(0, (url, '太阳'))
    
    for r in res:
        url1 = r[0]
        title1 = r[1]
        print("Fetching: "+title1+"...")
        print("URL: "+url1+"...\n")
        
        response = requests.get(url=url1, cookies=cookies)
        str1 = response.text
        str1 = str1.split("<div class=\"rp_list\">")[1].split("<div style=\"text-align:center;margin:2em;\">")[0]
        #print(str1)
        # Each match is (relative path, title attribute, link text, markup in between, summary paragraph)
        res1 = re.findall(r'<a class="tit" href="/(.*?)" title="(.*?)">(.*?)</a>([\s\S]*?)<p class="ellip_clamp">([\s\S]*?)</p>', str1)
        
        for r1 in res1:
            url2 = baseurl + r1[0]
            title2 = r1[2]
            titleen2 = p.get_pinyin(title2)  # pinyin form of the title (not used below)
            content1 = strip_tags(r1[4])
            print("Fetching: "+jiashuname+"->"+title1+"->"+title2+"...")
            print("URL: "+url2+"...\n")
            # A bare "t/" path carries no detail-page id, so keep the summary from the list page
            if url2 == baseurl+'t/':
                res2 = [content1]
            else:
                response = requests.get(url=url2, cookies=cookies)
                str2 = response.text
                str2 = str2.split("<div class=\"ce_content\"  >")[1].split("<!--添加解读-->")[0]
                res2 = re.findall(r'<p class="note">(.*?)</p>', str2)

            filename = jiashuname+"/"+title2+".txt"  # one file per entry
            filename2 = jiashuname+"_汇总.txt"        # combined file ("汇总" = summary)
            dirname = os.path.dirname(filename)
            # exist_ok avoids a failure if the directory is already there
            os.makedirs(dirname, exist_ok=True)
            
            for test1 in res2:
                html2 = ""
                if os.path.exists(filename):
                    with open(filename, 'r', encoding='utf-8') as file:
                        html2 = file.read()
                
                # Skip text that was already saved on an earlier run
                if test1 in html2:
                    continue
                
                try:
                    with open(filename, 'a', encoding='utf-8') as file:
                        file.write(test1+"\n\n")
                except Exception as e:
                    print("Failed to write file ["+filename+"]")
                    print(test1+"\n\n")
                    print(e)
                    sys.exit(1)  # non-zero exit code signals the failure
                
                try:
                    with open(filename2, 'a', encoding='utf-8') as file:
                        file.write(title2+"\n"+test1+"\n\n")
                except Exception as e:
                    print("Failed to write file ["+filename2+"]")
                    print(test1+"\n\n")
                    print(e)
                    sys.exit(1)
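
The script above assumes that every marker passed to `split(...)[1]` is present in the page. If the site changes its markup or the cookie stops working, that lookup fails with an `IndexError`. Below is a minimal sketch of a more defensive extraction step. The helper name `extract_between` and the sample HTML are hypothetical, introduced only for illustration; they are not part of the original script or of the site's real markup.

```python
import re


def extract_between(html, start_marker, end_marker):
    """Return the text between two markers, or None if the start marker is missing.

    Hypothetical helper (not in the original script). It mirrors
    html.split(start_marker)[1].split(end_marker)[0] without raising IndexError.
    """
    _, found, rest = html.partition(start_marker)
    if not found:
        return None
    # If the end marker is absent, everything after the start marker is returned,
    # which is also what split(end_marker)[0] would give.
    return rest.partition(end_marker)[0]


# Made-up sample page, used only to exercise the helper offline.
sample = (
    '<html><div class="rp_words">'
    '<a href="https://example.com/a">First</a>'
    '<a href="https://example.com/b">Second</a>'
    '</div></html>'
)

section = extract_between(sample, '<div class="rp_words">', '</div>')
links = re.findall(r'<a href="(.*?)">(.*?)</a>', section) if section is not None else []
missing = extract_between(sample, '<div class="no_such_block">', '</div>')

print(links)
print(missing)
```

In the scraper, a `None` result could be handled by printing the URL and moving on with `continue`, so that one unexpected page does not abort the whole run.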
    

 

Result: [screenshot not preserved]

From: https://www.cnblogs.com/xuxiaobo/p/17105194.html
