
Scraping Concert Information from Damai (damai.cn)

Posted: 2022-10-04 09:11:18
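This post shares a small Selenium scraper that searches damai.cn for concerts (演唱会), walks through the first five result pages, and stores each show's title, artist, venue, and date in MongoDB. Chrome runs headlessly, the rendered page is parsed with PyQuery, and the connection settings live in settings.py.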

main.py

from time import sleep
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from pyquery import PyQuery
from pymongo.errors import PyMongoError
import pymongo
from settings import *
import re

# Run Chrome headless so no browser window is opened.
options = webdriver.ChromeOptions()
options.add_argument('--headless')
browser = webdriver.Chrome(options=options)
wait = WebDriverWait(browser, 10)

client = pymongo.MongoClient(MONGO_URL, MONGO_PORT)
db = client[MONGO_DB]

def search_page():
    """Open damai.cn, search for "演唱会" (concerts), and scrape page 1."""
    try:
        browser.get("https://www.damai.cn/")
        search_input = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "body > div.dm-header-wrap > div > div.search-header > input")))
        submit = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "body > div.dm-header-wrap > div > div.search-header > div.btn-search")))
        search_input.send_keys("演唱会")
        submit.click()
        total = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "body > div.search-box > div.search-box-top > span.search-box-keyword")))
        print("Found " + total.text + " results")
        get_products()
        return True
    except TimeoutException:
        # Selenium raises TimeoutException, not the builtin TimeoutError; retry.
        return search_page()

def next_page(index):
    """Click pagination entry `index`, confirm the switch, then scrape it."""
    try:
        page_css_id = "body > div.search-box > div.search-box-flex > div.search-main > div.search__itemlist > div.pagination > div > ul > li:nth-child(" + str(index) + ")"
        switch_page = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, page_css_id)))
        switch_page.click()
        # Read the current-page indicator to confirm the click took effect.
        now_page = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "body > div.search-box > div.search-box-flex > div.search-main > div.search-sort.search-main-sort > div.pagination-top.search-sort_fr > div > span:nth-child(1)")))
        if str(index) == now_page.text:
            print("Switched to page " + str(index))
            get_products()
        else:
            next_page(index)
    except TimeoutException:
        next_page(index)

def get_products():
    """Parse the current result page with PyQuery and save every item."""
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "body > div.search-box > div.search-box-flex > div.search-main > div.search__itemlist > div.item__main > div")))
    html = browser.page_source
    doc = PyQuery(html)
    items = doc('body > div.search-box > div.search-box-flex > div.search-main > div.search__itemlist > div.item__main > div')
    for item in items.children().items():
        course_name = item.find("div > div.items__txt__title > a").text()
        # The second line of each item is either the artist ("艺人:...") or the venue.
        line2 = item.find("div > div:nth-child(2)").text()
        if "艺人:" in line2:
            people_name = re.sub("艺人:", "", line2)
            address = ""
        else:
            address = line2
            people_name = ""
        # When an artist line is present, venue and date shift down one slot.
        if not address:
            address = item.find("div > div:nth-child(3)").text()
            course_date = item.find("div > div:nth-child(4)").text()
        else:
            course_date = item.find("div > div:nth-child(3)").text()
        product = {
            "name": course_name,    # concert title
            "artist": people_name,  # performer or band
            "venue": address,
            "date": course_date
        }
        print(product)
        save_to_mongo(product)

def save_to_mongo(result):
    # insert_one() raises PyMongoError on failure rather than returning a
    # falsy value, so failures must be caught, not tested with if/else.
    try:
        db[MONGO_TABLE].insert_one(result)
        print("Saved to MongoDB:", result)
    except PyMongoError:
        print("Failed to save:", result)

def main():
    search_page()
    # Scrape result pages 2 through 5.
    for num in range(2, 6):
        next_page(num)
        sleep(2)

if __name__ == "__main__":
    main()
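
To spot-check the results, the stored documents can be read back out of MongoDB. The snippet below (a hypothetical check.py, not part of the original post) is a minimal sketch: it assumes the scraper above has already run and that the field names match the product dict built in get_products().

check.py

from pymongo import MongoClient
from settings import MONGO_URL, MONGO_PORT, MONGO_DB, MONGO_TABLE

client = MongoClient(MONGO_URL, MONGO_PORT)
collection = client[MONGO_DB][MONGO_TABLE]

# Total number of records scraped so far.
print(collection.count_documents({}))

# Show a few entries to verify the fields were parsed correctly.
for doc in collection.find().limit(5):
    print(doc["name"], "|", doc["venue"], "|", doc["date"])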

settings.py

MONGO_URL = 'localhost'
MONGO_PORT = 27017
MONGO_DB = 'damai'
MONGO_TABLE = 'yanchanghui'  # collection name ("yanchanghui" = concert)

# Leftover PhantomJS arguments; not used by the Chrome driver above.
SERVICE_ARGS = ['--load-images=false', '--disk-cache=true']
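
With a local MongoDB instance running (for example, started with mongod), the scraper is launched with python main.py; each record is printed to the console as it is inserted.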

From: https://www.cnblogs.com/z5onk0/p/16753192.html
