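Scraping Kugou Music Albums with Python and Selenium (Source Code Included)

Tags: album, title, Python, song, Selenium, url, source code, get

In this post, I show how to use Python and Selenium to scrape song information from the Kugou Music website. Selenium renders the search page, BeautifulSoup parses the resulting HTML, and we extract the song and album details from it.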

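Dependencies

requests
beautifulsoup4
selenium

Setup

First, install the required libraries:

pip install requests beautifulsoup4 selenium

Steps

Step 1: Initialize the browser options

We use Options to run Chrome in headless mode and set a few extra flags so that the browser also starts reliably in a server environment (no display, restricted sandbox, small /dev/shm).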

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

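Step 2: URL-encode the search keyword

We use urllib.parse.quote to percent-encode the artist name entered by the user so that it can be embedded safely in the search URL.

import urllib.parse

keyword = input('Enter an artist name: ')
search_url = f'https://www.kugou.com/yy/html/search.html#searchType=song&searchKeyWord={urllib.parse.quote(keyword)}'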
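
To make the encoding step concrete, here is a small sketch of what urllib.parse.quote produces for a Chinese artist name and for a name containing a slash. The helper build_search_url and the constant SEARCH_BASE are names made up for this illustration; the URL itself is the one used in this post.

```python
from urllib.parse import quote, unquote

# Base of the search URL used in this post; the keyword goes in the fragment.
SEARCH_BASE = 'https://www.kugou.com/yy/html/search.html#searchType=song&searchKeyWord='


def build_search_url(keyword: str) -> str:
    """Return the search URL with the keyword percent-encoded as UTF-8."""
    return SEARCH_BASE + quote(keyword)


encoded = quote('周杰伦')
print(encoded)                       # %E5%91%A8%E6%9D%B0%E4%BC%A6
print(unquote(encoded) == '周杰伦')   # True
print(quote('AC/DC'))                # AC/DC  ('/' is left alone by default)
print(quote('AC/DC', safe=''))       # AC%2FDC
print(build_search_url('周杰伦'))
```

By default quote leaves '/' unescaped. That is harmless here, but passing safe='' is an option if a keyword should be encoded completely.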

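Step 3: Open the page with Selenium

We open the Kugou Music search page with Selenium. The keyword sits in the URL fragment (the part after #), which the browser does not send to the server, so the results are presumably filled in by JavaScript after the initial load and a real browser is needed. Note that implicitly_wait only sets a timeout for element lookups; it does not pause before page_source is read. We therefore wait explicitly until at least one song link (class song_name, the same class parsed in the next step) is present, and always close the browser afterwards.

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome(options=chrome_options)
try:
    driver.get(search_url)
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, 'song_name')))
    html_content = driver.page_source
finally:
    driver.quit()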
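
Waiting for dynamically loaded content generally comes down to polling a condition until it holds or a timeout expires, which is roughly how Selenium's WebDriverWait behaves. The sketch below shows that idea in plain Python without Selenium. The names wait_until and results_loaded are hypothetical, and the "page" is only a counter standing in for a real browser.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f'condition not met within {timeout} seconds')
        time.sleep(poll_interval)


# Stand-in for a page whose results only show up on the third check.
checks = {'count': 0}


def results_loaded():
    checks['count'] += 1
    return checks['count'] >= 3


wait_until(results_loaded, timeout=2.0, poll_interval=0.01)
print(checks['count'])  # 3
```

The point of the design is that the script proceeds as soon as the condition holds, rather than always sleeping for a fixed number of seconds.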

Step 4: Parse the HTML

We parse the page source with BeautifulSoup and collect the links that carry the album_name and song_name classes.

from bs4 import BeautifulSoup as be

soup = be(html_content, 'html.parser')
albums = soup.find_all('a', class_='album_name')
songs = soup.find_all('a', class_='song_name')
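
To see what this kind of class-based lookup matches without launching a browser, the following sketch runs a similar extraction on a small invented HTML fragment. It deliberately swaps BeautifulSoup for the standard library's html.parser so that it has no dependencies. SAMPLE_HTML, LinkCollector and find_links are made up for illustration, and the real Kugou markup may well differ.

```python
from html.parser import HTMLParser

# Invented fragment imitating two search result rows; not copied from the real page.
SAMPLE_HTML = '''
<ul>
  <li>
    <a class="song_name" title="Song A" href="https://example.com/song/1">Song A</a>
    <a class="album_name" title="Album X" href="https://example.com/album/9">Album X</a>
  </li>
  <li>
    <a class="song_name" title="Song B" href="https://example.com/song/2">Song B</a>
    <a class="album_name" title="" href=""></a>
  </li>
</ul>
'''


class LinkCollector(HTMLParser):
    """Collect the attributes of every <a> tag that carries a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get('class') or '').split()
        if tag == 'a' and self.css_class in classes:
            self.links.append(attr_map)


def find_links(html, css_class):
    """Return one attribute dict per matching <a> tag, in document order."""
    collector = LinkCollector(css_class)
    collector.feed(html)
    return collector.links


sample_songs = find_links(SAMPLE_HTML, 'song_name')
sample_albums = find_links(SAMPLE_HTML, 'album_name')
print([link['title'] for link in sample_songs])   # ['Song A', 'Song B']
print([link['title'] for link in sample_albums])  # ['Album X', '']
```

The second row shows why an empty album title has to be handled later: the link exists, but its title and href are empty.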

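Step 5: Print the results

We iterate over the extracted songs and albums in parallel and print each song's name, album, and album link. For every album that has a link, we then request the album page and print the src of any <audio> elements in it. Because that page is fetched with requests rather than a browser, only <audio> tags present in the static HTML will be found.

import requests

# zip() silently stops at the shorter list, so fail early if the lists are misaligned
assert len(songs) == len(albums)

for song, album in zip(songs, albums):
    song_title = song.get('title')
    album_title = album.get('title')
    album_url = album.get('href')

    if not album_title:
        album_title = 'No album'

    print(f'Song: {song_title}, album: {album_title}, url: {album_url}')

    # Without a link there is no album page to request
    if not album_url:
        continue

    album_response = requests.get(album_url, timeout=10)
    album_soup = be(album_response.text, 'html.parser')
    audio_elements = album_soup.find_all('audio')

    for audio in audio_elements:
        mp3_url = audio.get('src')
        if mp3_url:
            print(f'Audio link: {mp3_url}')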
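
One way to keep the fallback logic out of the loop is to turn each song/album pair into a small record first. The sketch below assumes only that both objects offer a dict-style get method, which BeautifulSoup tags and plain dicts both do. Track and make_track are hypothetical names, and 'No album' is simply the placeholder text chosen for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Track:
    song_title: str
    album_title: str
    album_url: Optional[str]


def make_track(song, album) -> Track:
    """Build one record from two objects that support .get(), with fallbacks for empty values."""
    return Track(
        song_title=song.get('title') or '',
        album_title=album.get('title') or 'No album',
        album_url=album.get('href') or None,
    )


# Plain dicts stand in for the parsed <a> tags here.
sample_songs = [{'title': 'Song A'}, {'title': 'Song B'}]
sample_albums = [
    {'title': 'Album X', 'href': 'https://example.com/album/9'},
    {'title': '', 'href': ''},
]

tracks = [make_track(s, a) for s, a in zip(sample_songs, sample_albums)]
with_album_page = [t for t in tracks if t.album_url]

for track in tracks:
    print(track)
print(len(with_album_page))  # 1
```

Only the records in with_album_page would need a follow-up request, so songs without an album never reach the network call.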

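Full code

Here is the complete script:

import urllib.parse

import requests
from bs4 import BeautifulSoup as be
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Step 1: initialize the browser options
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

# Step 2: URL-encode the search keyword
keyword = input('Enter an artist name: ')
search_url = f'https://www.kugou.com/yy/html/search.html#searchType=song&searchKeyWord={urllib.parse.quote(keyword)}'

# Step 3: open the page with Selenium
driver = webdriver.Chrome(options=chrome_options)
try:
    driver.get(search_url)

    # Wait up to 10 seconds for the search results to be rendered
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, 'song_name')))

    # Grab the rendered page source
    html_content = driver.page_source
finally:
    # Always close the browser, even if the wait times out
    driver.quit()

# Step 4: parse the HTML to extract the song information
soup = be(html_content, 'html.parser')
albums = soup.find_all('a', class_='album_name')
songs = soup.find_all('a', class_='song_name')

# Step 5: print the results
# zip() silently stops at the shorter list, so make sure both have the same length
assert len(songs) == len(albums)

# Iterate over songs and albums in parallel
for song, album in zip(songs, albums):
    song_title = song.get('title')
    album_title = album.get('title')
    album_url = album.get('href')

    # Use a placeholder when the album title is empty
    if not album_title:
        album_title = 'No album'

    print(f'Song: {song_title}, album: {album_title}, url: {album_url}')

    # Without a link there is no album page to request
    if not album_url:
        continue

    # Request the album page
    album_response = requests.get(album_url, timeout=10)
    album_soup = be(album_response.text, 'html.parser')

    # Look for audio file links in the album page
    audio_elements = album_soup.find_all('audio')

    for audio in audio_elements:
        mp3_url = audio.get('src')
        if mp3_url:
            print(f'Audio link: {mp3_url}')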

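Code walkthrough:

Initialize the browser options: we use Options to run Chrome in headless mode and set extra flags so that the browser starts reliably in a server environment.
URL-encode the keyword: we use urllib.parse.quote to percent-encode the artist name so that it can be used in the search URL.
Open the page with Selenium: we open the Kugou Music search page and wait for the search results to render before reading the page source.
Parse the HTML: we parse the page source with BeautifulSoup and extract the song and album links.
Print the results: we iterate over the extracted songs and albums and print each song's name, album, and link.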

Extension 1: Save the song information to a CSV file

The scraped song information can be written to a CSV file for later analysis and processing. This snippet reuses the songs and albums lists from the main script.

import csv

with open('songs.csv', 'w', newline='', encoding='utf-8') as csvfile:
    fieldnames = ['song', 'album', 'url']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

    writer.writeheader()
    for song, album in zip(songs, albums):
        song_title = song.get('title')
        album_title = album.get('title')
        song_url = song.get('href')

        if not album_title:
            album_title = 'No album'

        writer.writerow({'song': song_title, 'album': album_title, 'url': song_url})
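
As a quick check that this kind of output can be read back unchanged, the sketch below writes a couple of invented rows to an in-memory buffer instead of a file and parses them again with csv.DictReader. The column names and sample rows are assumptions made for illustration only.

```python
import csv
import io

# Invented rows; the second title contains a comma to exercise CSV quoting.
rows = [
    {'song': 'Song A', 'album': 'Album X', 'url': 'https://example.com/song/1'},
    {'song': 'Song, with comma', 'album': 'No album', 'url': ''},
]

# newline='' leaves line endings to the csv module, as with open(..., newline='').
buffer = io.StringIO(newline='')
writer = csv.DictWriter(buffer, fieldnames=['song', 'album', 'url'])
writer.writeheader()
writer.writerows(rows)

# Rewind and parse the same text again.
buffer.seek(0)
read_back = list(csv.DictReader(buffer))
print(read_back == rows)  # True
```

The comma inside the second title survives the round trip because the csv module quotes that field automatically.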

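Extension 2: Multithreaded scraping

To speed things up, the per-song requests can run in parallel threads. This example starts one thread per song, so output from different threads may interleave, and a long result list means many simultaneous requests.

import threading

def fetch_song_info(song, album):
    song_title = song.get('title')
    album_title = album.get('title')
    song_url = song.get('href')

    if not album_title:
        album_title = 'No album'

    print(f'Song: {song_title}, album: {album_title}, url: {song_url}')

    # Nothing to download if the song has no link
    if not song_url:
        return

    song_response = requests.get(song_url, timeout=10)
    song_soup = be(song_response.text, 'html.parser')
    lyrics = song_soup.find('div', class_='lyrics')

    if lyrics:
        print(f'Lyrics: {lyrics.text}')

# Start one thread per song, then wait for all of them to finish
threads = []
for song, album in zip(songs, albums):
    thread = threading.Thread(target=fetch_song_info, args=(song, album))
    threads.append(thread)
    thread.start()

for thread in threads:
    thread.join()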
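
If starting one thread per song turns out to be too aggressive, a bounded thread pool is a common alternative. The following is a minimal sketch using concurrent.futures from the standard library. The fetch function is a stand-in that sleeps instead of making a real request, and the example.com URLs are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    """Stand-in for a network request; sleeps briefly instead of calling requests.get."""
    time.sleep(0.05)
    return f'content of {url}'


urls = [f'https://example.com/song/{i}' for i in range(8)]

# At most four calls run at once; map() yields results in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    pages = list(pool.map(fetch, urls))

for url, page in zip(urls, pages):
    print(url, '->', page)
```

Because map() preserves input order, the results can be printed or stored after the pool finishes, which also avoids interleaved output from concurrent print calls.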

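Extension 3: Using a proxy

To reduce the risk of the site blocking your IP address, requests can be routed through a proxy. Replace your_proxy:port with the address of a proxy you are allowed to use.

# Most proxies accept plain HTTP connections, even when tunnelling HTTPS traffic
proxies = {
    'http': 'http://your_proxy:port',
    'https': 'http://your_proxy:port',
}

for song, album in zip(songs, albums):
    song_title = song.get('title')
    album_title = album.get('title')
    song_url = song.get('href')

    if not album_title:
        album_title = 'No album'

    print(f'Song: {song_title}, album: {album_title}, url: {song_url}')

    # Nothing to download if the song has no link
    if not song_url:
        continue

    song_response = requests.get(song_url, proxies=proxies, timeout=10)
    song_soup = be(song_response.text, 'html.parser')
    lyrics = song_soup.find('div', class_='lyrics')

    if lyrics:
        print(f'Lyrics: {lyrics.text}')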
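
Rather than hard-coding the proxy address, one option is to read it from the environment. The sketch below assumes a hypothetical environment variable named SCRAPER_PROXY and a hypothetical helper build_proxies. The resulting mapping has the shape that the proxies argument of requests.get accepts, and None means no proxy is passed.

```python
import os


def build_proxies(env=os.environ):
    """Build a proxies mapping from SCRAPER_PROXY (hypothetical), or None if it is unset."""
    proxy_url = env.get('SCRAPER_PROXY')
    if not proxy_url:
        return None
    # The same proxy endpoint is used for both plain HTTP and HTTPS traffic.
    return {'http': proxy_url, 'https': proxy_url}


# Plain dicts stand in for the environment in this demonstration.
print(build_proxies({'SCRAPER_PROXY': 'http://127.0.0.1:8080'}))
print(build_proxies({}))  # None
```

The return value can be passed straight through, for example requests.get(song_url, proxies=build_proxies(), timeout=10).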

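Summary

In this project, you learned:

How to open a web page with Selenium and get its page source.
How to parse HTML with BeautifulSoup.
How to extract and print song and album information.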

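Conclusion

In this project, you saw how to use Python and Selenium to scrape song information from the Kugou Music website and how to parse the resulting HTML with BeautifulSoup. I hope it is useful to you. Feel free to leave a comment, keep exploring, and good luck on your web scraping journey!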

From: https://blog.csdn.net/m0_74972192/article/details/140834632
