
Xiaohongshu Scraping Tips: Fashion Outfit Inspiration Made Easy

Date: 2024-03-20 12:02:17

Hi everyone! Today I'd like to share how to collect fashion outfit inspiration from Xiaohongshu with a Python scraper. Xiaohongshu is one of the largest fashion and lifestyle communities in China, full of style influencers and trending outfit posts; if you want the latest fashion inspiration, this simple and effective scraping approach is worth knowing.

In this post, I'll walk through building a scraper that pulls outfit inspiration from Xiaohongshu using Python's Selenium and BeautifulSoup libraries, by simulating user actions and parsing the page content. Without further ado, let's get started!

Prerequisites

Before we start, we need the following tools and environment:

  • Python 3.x
  • The Selenium library
  • The BeautifulSoup library
  • The Chrome browser
  • ChromeDriver (matching your Chrome version)

Make sure Python and the required libraries are installed, and put ChromeDriver somewhere the scripts can find it.
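If any of these libraries are missing, they can usually be installed with pip (`beautifulsoup4` is the PyPI name for BeautifulSoup; `lxml` is the parser backend used later, and `tqdm` provides the progress bars). Note that the `bag` module imported in the scripts is the author's own helper library and is not covered by this command:

```shell
pip install selenium beautifulsoup4 lxml tqdm
```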

Collecting Post Links

First, we need to collect post links from Xiaohongshu so we can scrape each post's content later. This step is handled by a script called GetUrl.py, which uses Selenium to simulate user actions: it scrolls the feed and harvests the post links.

Here is the main code of GetUrl.py:

#!/usr/bin/env python3
# coding:utf-8
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import StaleElementReferenceException
import bag
import time
import random


def main():
    result = []  # unique post links, in discovery order
    count = []   # history of result sizes; used to detect a stalled feed
    while True:
        try:
            WebDriverWait(web, 30).until(
                EC.presence_of_all_elements_located((By.XPATH, r'//*[@id="exploreFeeds"]')))
            # Press END to jump to the bottom of the feed and trigger lazy loading.
            web.find_element(By.TAG_NAME, r'body').send_keys(Keys.END)
            # implicitly_wait only sets the element-lookup timeout, so one call is enough.
            web.implicitly_wait(10)
            time.sleep(2)  # give the newly loaded cards a moment to render
            links = web.find_elements(By.XPATH, r'//*[@id="exploreFeeds"]/section/div/div/a')
            for link in links:
                url = link.get_attribute('href')
                if 'user' in url:
                    continue  # skip links to user profiles
                if url not in result:
                    result.append(url)  # de-duplicate while keeping order
            count.append(len(result))
            print(len(result))
            # Stop once the total has failed to grow for more than 30 rounds.
            if count.count(count[-1]) > 30:
                break
        except StaleElementReferenceException:
            # The feed re-rendered mid-read; back off, then nudge the scroll position.
            time.sleep(5)
            web.execute_script(
                "window.scrollTo(0, document.body.scrollHeight / {});".format(random.randint(2, 3)))
        if len(result) >= 100:
            break  # enough links collected
    # bag.Bag.save_json(result, path)
    return result


if __name__ == '__main__':
    web = bag.Bag.web_debug()  # `bag` is the author's personal helper library
    # path = r'./小红书(穿搭).json'
    # if os.path.isfile(path):
    #     result = bag.Bag.read_json(path)
    # else:
    #     result = []
    main()

The code above scrolls the page by simulating the END key, locates the post links via XPath, and finally returns a list of all collected links.
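The loop's stopping logic (de-duplicate, skip profile links, then stop once the total stalls or reaches 100) can be isolated and tested without a browser. In this sketch, `fetch_batch` is a hypothetical stand-in for the Selenium scroll-and-collect step:

```python
def collect_until_stable(fetch_batch, stall_limit=30, target=100):
    """Keep collecting until `target` links are gathered or the total
    stops growing for more than `stall_limit` consecutive rounds."""
    result = []  # unique links, in discovery order
    counts = []  # history of totals, used to detect a stalled feed
    while True:
        for url in fetch_batch():
            if 'user' in url:
                continue  # skip profile links
            if url not in result:
                result.append(url)  # de-duplicate
        counts.append(len(result))
        if counts.count(counts[-1]) > stall_limit:
            break  # the feed stopped yielding new posts
        if len(result) >= target:
            break  # collected enough links
    return result
```

This mirrors `main()` above but takes the page interaction as a parameter, which makes the exit conditions easy to unit-test.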

Below is a sample of the saved link list, 小红书(穿搭).json:

[
    "https://www.xiaohongshu.com/explore/647456080000000013001d32",
    "https://www.xiaohongshu.com/explore/64756d910000000013008834",
    "https://www.xiaohongshu.com/explore/6476d212000000001300223c",
    "https://www.xiaohongshu.com/explore/6475ca17000000001303489f",
    "https://www.xiaohongshu.com/explore/646724320000000013003afe",
    "https://www.xiaohongshu.com/explore/6468ac66000000001300c40c",
    "https://www.xiaohongshu.com/explore/646ff3420000000013004c63",
    "https://www.xiaohongshu.com/explore/6465f68b0000000013014649",
    "https://www.xiaohongshu.com/explore/6477436f000000001203de52",
    "https://www.xiaohongshu.com/explore/647427420000000012032c28",
    "https://www.xiaohongshu.com/explore/6472ef100000000011010a28",
    "https://www.xiaohongshu.com/explore/646ec809000000002701106f",
    "https://www.xiaohongshu.com/explore/646f734e000000002702833b",
    "https://www.xiaohongshu.com/explore/6470a18c0000000013005e41",
    "https://www.xiaohongshu.com/explore/6470920a0000000013035a3c",
    "https://www.xiaohongshu.com/explore/64687088000000001300b2e8",
    "https://www.xiaohongshu.com/explore/6468b6c60000000014024c64",
    "https://www.xiaohongshu.com/explore/646899030000000013035b16",
    "https://www.xiaohongshu.com/explore/646ae0a3000000000800d11c",
    "https://www.xiaohongshu.com/explore/6468fe10000000001300a508",
    "https://www.xiaohongshu.com/explore/6475dafd0000000027010763",
    "https://www.xiaohongshu.com/explore/6475da7e0000000013000f19",
    "https://www.xiaohongshu.com/explore/6478388f000000001300d5ca",
    "https://www.xiaohongshu.com/explore/6477258d000000001303e0c0",
    "https://www.xiaohongshu.com/explore/646996de000000000703a943",
    "https://www.xiaohongshu.com/explore/646b02bb0000000027012d49",
    "https://www.xiaohongshu.com/explore/64709647000000000703ae75",
    "https://www.xiaohongshu.com/explore/647087c00000000027011878",
    "https://www.xiaohongshu.com/explore/64757cbc0000000013014a00",
    "https://www.xiaohongshu.com/explore/647348af0000000013037ab2",
    "https://www.xiaohongshu.com/explore/646617240000000027002540",
    "https://www.xiaohongshu.com/explore/646ca2ba0000000013037a15",
    "https://www.xiaohongshu.com/explore/64685572000000000703836c",
    "https://www.xiaohongshu.com/explore/6475e92d000000001300e694",
    "https://www.xiaohongshu.com/explore/64773b3f0000000013035447",
    "https://www.xiaohongshu.com/explore/646f7fa80000000013030224",
    "https://www.xiaohongshu.com/explore/646620300000000007039fcd",
    "https://www.xiaohongshu.com/explore/6469b2960000000013032102",
    "https://www.xiaohongshu.com/explore/64686d450000000013012004",
    "https://www.xiaohongshu.com/explore/6468268a000000001300f9a0",
    "https://www.xiaohongshu.com/explore/64683cd9000000002702b38e",
    "https://www.xiaohongshu.com/explore/646c80b20000000011011f67",
    "https://www.xiaohongshu.com/explore/64775be9000000002702a98f",
    "https://www.xiaohongshu.com/explore/6475f76a000000001300f020",
    "https://www.xiaohongshu.com/explore/6469a73d0000000027012cde",
    "https://www.xiaohongshu.com/explore/646dc5b500000000110113ce",
    "https://www.xiaohongshu.com/explore/646e0cb2000000001203d2d5",
    "https://www.xiaohongshu.com/explore/6476bf690000000013030d9a",
    "https://www.xiaohongshu.com/explore/6477194b000000000800ec38",
    "https://www.xiaohongshu.com/explore/64747523000000001300c8c4",
    "https://www.xiaohongshu.com/explore/6474b5b8000000001203c201",
    "https://www.xiaohongshu.com/explore/6470a79d0000000014026113",
    "https://www.xiaohongshu.com/explore/6471a4270000000013005ee3",
    "https://www.xiaohongshu.com/explore/646c8bad000000000703a8eb",
    "https://www.xiaohongshu.com/explore/646f27110000000007038aba",
    "https://www.xiaohongshu.com/explore/646f2a24000000001303e883",
    "https://www.xiaohongshu.com/explore/646f467100000000130376ee",
    "https://www.xiaohongshu.com/explore/6464ec9f0000000013015f91",
    "https://www.xiaohongshu.com/explore/6472225700000000130090e9",
    "https://www.xiaohongshu.com/explore/646b6d7a000000000800e6f2",
    "https://www.xiaohongshu.com/explore/644a82930000000027010550",
    "https://www.xiaohongshu.com/explore/644a4893000000001203eb72",
    "https://www.xiaohongshu.com/explore/6462206d0000000013000279",
    "https://www.xiaohongshu.com/explore/646175ad00000000270285ea",
    "https://www.xiaohongshu.com/explore/645af993000000001402417e",
    "https://www.xiaohongshu.com/explore/645b603b0000000027010c66",
    "https://www.xiaohongshu.com/explore/644016420000000013035988",
    "https://www.xiaohongshu.com/explore/644f54c90000000013008fe8",
    "https://www.xiaohongshu.com/explore/644fa915000000001303542f",
    "https://www.xiaohongshu.com/explore/6450fd6c000000001303130f",
    "https://www.xiaohongshu.com/explore/6450d8ff0000000013014dcb",
    "https://www.xiaohongshu.com/explore/6450c126000000001203f238",
    "https://www.xiaohongshu.com/explore/6450d23d000000000800d4f5",
    "https://www.xiaohongshu.com/explore/64562297000000001303e2db",
    "https://www.xiaohongshu.com/explore/645f627c0000000011012c97",
    "https://www.xiaohongshu.com/explore/644caf47000000001301486f",
    "https://www.xiaohongshu.com/explore/644bca75000000002702aa2c",
    "https://www.xiaohongshu.com/explore/645a12b900000000270289dd",
    "https://www.xiaohongshu.com/explore/645ccbf3000000001300a1fb",
    "https://www.xiaohongshu.com/explore/645c585a0000000013037e4b",
    "https://www.xiaohongshu.com/explore/644cdca60000000027003411",
    "https://www.xiaohongshu.com/explore/64585b5c0000000014026a6b",
    "https://www.xiaohongshu.com/explore/6458c86c0000000013030ee7",
    "https://www.xiaohongshu.com/explore/6458c4af0000000013005945",
    "https://www.xiaohongshu.com/explore/6458f2a30000000027028641",
    "https://www.xiaohongshu.com/explore/6458cb76000000000800d190",
    "https://www.xiaohongshu.com/explore/6457a2af00000000130090c3",
    "https://www.xiaohongshu.com/explore/64630cc80000000027029630",
    "https://www.xiaohongshu.com/explore/645a2db100000000110101e7",
    "https://www.xiaohongshu.com/explore/6461f9380000000012032775",
    "https://www.xiaohongshu.com/explore/64512005000000001300e298",
    "https://www.xiaohongshu.com/explore/64551a9b000000000800dc92",
    "https://www.xiaohongshu.com/explore/64534539000000000800d044",
    "https://www.xiaohongshu.com/explore/644d3c550000000027000537",
    "https://www.xiaohongshu.com/explore/644cd7190000000012030bf9",
    "https://www.xiaohongshu.com/explore/64539ffd00000000130017f0",
    "https://www.xiaohongshu.com/explore/645454b7000000002700368d",
    "https://www.xiaohongshu.com/explore/644f4a06000000001203f270",
    "https://www.xiaohongshu.com/explore/644c7f960000000013008576",
    "https://www.xiaohongshu.com/explore/6449225d000000001300e292",
    "https://www.xiaohongshu.com/explore/64525bfc0000000013012bb4",
    "https://www.xiaohongshu.com/explore/645b32fb0000000027003867",
    "https://www.xiaohongshu.com/explore/645b29700000000014025eb5",
    "https://www.xiaohongshu.com/explore/645234780000000027001908",
    "https://www.xiaohongshu.com/explore/6450f8d5000000001203c476",
    "https://www.xiaohongshu.com/explore/6451c5fb000000001303fd11",
    "https://www.xiaohongshu.com/explore/644a7879000000001303eb12"
]

Scraping Post Content

Next, we use the collected links to scrape each post's title and body. This step is handled by a script called xiaohongshu.py, which uses BeautifulSoup to parse the page and extract the title and body text.
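Before looking at the full script, the core parsing step can be exercised offline against a static snippet. This sketch assumes the note page exposes the title in a `div.title` element and the body in one or more `div.desc` elements, the same selectors the script uses (the stdlib `html.parser` backend stands in for `lxml` here):

```python
from bs4 import BeautifulSoup

# A minimal stand-in for a fetched note page (hypothetical content).
SAMPLE = """<html><body>
  <div class="title">春日通勤穿搭</div>
  <div class="desc">第一段</div>
  <div class="desc">第二段</div>
</body></html>"""

def parse_note(html_text):
    soup = BeautifulSoup(html_text, 'html.parser')
    # First (and only) div.title holds the note title.
    title = soup.find('div', class_='title').text
    # The body may be split across several div.desc blocks; join them.
    body = '\n'.join(d.text.strip() for d in soup.find_all('div', class_='desc'))
    return title, body
```

Testing the selectors against a saved page like this is a quick way to catch layout changes before running a full crawl.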

Here is the main code of xiaohongshu.py:

#!/usr/bin/env python3
# coding:utf-8
from bs4 import BeautifulSoup
from tqdm import tqdm
import bag
import time
from concurrent.futures import ProcessPoolExecutor
import GetUrl


session = bag.session.create_session()  # `bag` is the author's helper library
session.cookies[''] = r'你的cookies'  # paste your own Xiaohongshu cookies here


def main():
    result = []
    # Fan the downloads out across 20 worker processes.
    with ProcessPoolExecutor(max_workers=20) as pro:
        tasks = [pro.submit(get_data, url) for url in tqdm(urls)]
        for task in tqdm(tasks):
            result.extend(task.result())
    bag.Bag.save_excel(result, r'./小红书(穿搭).xlsx')


def get_data(url):
    resp = session.get(url)
    resp.encoding = 'utf8'
    html = BeautifulSoup(resp.text, 'lxml')
    resp.close()
    time.sleep(2)  # throttle the workers a little
    # The title sits in div.title; the body in one or more div.desc blocks.
    title = html.find_all('div', class_='title')[0].text
    paragraphs = [block.text.strip() for block in html.find_all('div', class_="desc")]
    # One row per post: category, title, body, source URL.
    return [['穿搭', title, '\n'.join(paragraphs), url]]


if __name__ == '__main__':
    web = bag.Bag.web_debug()
    GetUrl.web = web  # GetUrl.main() reads a module-level `web` driver
    urls = GetUrl.main()
    main()

The code above fetches the posts concurrently with a process pool, parses each page for its title and body, and saves the results to an Excel file.
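The submit-then-collect pattern in `main()` can be expressed as a small reusable helper. Since the work is network-bound rather than CPU-bound, a thread pool is usually the lighter-weight choice over a process pool; `fetch` below is a hypothetical stand-in for `get_data`:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(urls, fetch, max_workers=20):
    """Run `fetch` on every URL concurrently and flatten the
    per-URL row lists, preserving the original URL order."""
    rows = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything first, then collect results in submit order.
        tasks = [pool.submit(fetch, u) for u in urls]
        for task in tasks:
            rows.extend(task.result())
    return rows
```

Iterating the futures in submission order (rather than as they complete) keeps the output rows aligned with the input link list, which matters if you want the Excel file in feed order.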

Results

Conclusion

With a Python scraper, it's easy to collect fashion outfit inspiration from Xiaohongshu. In this post, we used the Selenium and BeautifulSoup libraries to simulate user actions and parse page content, building a scraper that pulls outfit inspiration from Xiaohongshu.

I hope this post helps anyone looking for fashion inspiration. If you have any questions or feedback, leave a comment and I'll reply as soon as I can. Thanks for reading!

From: https://blog.csdn.net/FLK_9090/article/details/136870512

    今天是第二堂课,我们将继续学习爬虫技术。在上一节课中,我们已经学会了如何爬取干饭教程。正如鲁迅所说(我没说过),当地吃完饭就去外地吃,这启发了我去爬取城市天气信息,并顺便了解当地美食。这个想法永远是干饭人的灵魂所在。今天我们的目标是学习如何爬取城市天气信息,因为要计划去哪里......