Scraping blog posts with a Python crawler

Date: 2022-10-12 19:37:29

Tags: python, crawler, blog, page, article, post, find, datas, icon

import requests
import json
from pprint import pprint
from bs4 import BeautifulSoup

url = "https://www.cnblogs.com/AggSite/AggSitePostList"
headers = {
  # "content-type": "application/json; charset=UTF-8",
  "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36",
}


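def craw_page(page_index):
  """POST the paging payload for one list page and return the HTML fragment."""
  data = {
    "CategoryType": "SiteHome",
    "ParentCategoryId": 0,
    "CategoryId": 808,
    "PageIndex": page_index,
    "TotalPostCount": 4000,
    "ItemListActionName": "AggSitePostList",
  }
  # resp = requests.post(url, json.dumps(data), headers=headers)
  # json=data lets requests serialize the payload and set the Content-Type header.
  resp = requests.post(url, json=data, headers=headers, timeout=10)
  resp.raise_for_status()  # fail early on an HTTP error instead of parsing an error page
  return resp.text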


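def parse_data(html):
  """Return one [title, href, author, diggs, comments, views] row per post."""
  soup = BeautifulSoup(html, "html.parser")
  # Each post on the list page is an <article class="post-item"> element.
  articles = soup.find_all("article", class_="post-item")
  datas = []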
  for article in articles:
    link = article.find("a", class_="post-item-title")
    title = link.get_text()
    href = link["href"]

    author = article.find("a", class_="post-item-author").get_text()

    icon_digg = 0
    icon_comment = 0
    icon_views = 0
    # for a in article.find_all("a"):
    for a in article.find_all("a", class_="post-meta-item btn"):
      if "icon_digg" in str(a):
        icon_digg = a.find("span").get_text()
      if "icon_comment" in str(a):
        icon_comment = a.find("span").get_text()
      if "icon_views" in str(a):
        icon_views = a.find("span").get_text()

    datas.append([title, href, author, icon_digg, icon_comment, icon_views])
  return datas

if __name__ == "__main__":
  all_datas = []
  for page in range(5):
    print("Crawling page:", page)
    html = craw_page(page)
    datas = parse_data(html)
    all_datas.extend(datas)
    #pprint(all_datas)
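  import pandas as pd
  df = pd.DataFrame(all_datas, columns=["title", "href", "author", "icon_digg", "icon_comment", "icon_views"])
  # Writing .xlsx needs an Excel engine such as openpyxl installed.
  # The file name means "cnblogs, 200 pages of article info"; the loop above only crawls 5 pages.
  df.to_excel("./博客园200页文章信息.xlsx", index=False)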
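
One detail in the rows built by parse_data: the three counters default to the integer 0, but any value that is actually scraped comes from get_text() and is a string, so the columns end up with mixed types. The sketch below is one possible way to normalize those columns before saving, using only the standard library. The helper names (to_int, normalize_rows, rows_to_csv) and the sample rows are hypothetical and are not part of the original script, and CSV is swapped in for Excel here only to keep the example dependency-free.

```python
import csv
import io


def to_int(value):
    """Convert a scraped count such as ' 12 ' to int.

    Assumption: anything that is not a plain integer falls back to 0.
    """
    try:
        return int(str(value).strip())
    except ValueError:
        return 0


def normalize_rows(rows):
    """Return copies of the rows with the last three fields converted to int."""
    return [list(row[:3]) + [to_int(v) for v in row[3:]] for row in rows]


def rows_to_csv(rows, columns):
    """Serialize the header and rows to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical sample rows in the same shape that parse_data returns.
    sample = [
        ["Post A", "https://example.com/a", "alice", "3", " 12 ", "456"],
        ["Post B", "https://example.com/b", "bob", 0, 0, "n/a"],
    ]
    columns = ["title", "href", "author", "icon_digg", "icon_comment", "icon_views"]
    print(rows_to_csv(normalize_rows(sample), columns))
```

In the script above, normalize_rows(all_datas) could be applied just before the DataFrame is built, so the three counter columns hold integers rather than a mix of strings and zeros.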

From： https://blog.51cto.com/u_14044882/5751342
