Automated Crawling of Project Links from Behance (Optimized Version)

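### What the code does:

This script automates the collection of project links from Behance search results.

1. **Multi-keyword search**: You can list several keywords at once in the script's `keywords` list, and the program crawls the requested number of project links for each one.
2. **Automatic scrolling**: Selenium drives a real browser, and the program repeatedly scrolls the results page to the bottom so that more links load.
3. **Command-line interaction**: While the program is running, you can type `stop` in the terminal to skip the rest of the current keyword.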
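
The `stop` command in item 3 can be sketched with a `threading.Event` shared between a listener thread and the crawl loop. This is a minimal illustration rather than the script's exact code: `listen_for_stop`, `start_listener`, and the `read_line` parameter are hypothetical names, and accepting `STOP` with surrounding spaces is an added convenience, not something the original does.

```python
import threading


def listen_for_stop(stop_event, read_line=input):
    """Set stop_event every time a line equal to 'stop' is read.

    read_line is a hypothetical injectable reader, so the loop can be
    exercised without a real terminal. It must raise EOFError at end of input.
    """
    while True:
        try:
            command = read_line()
        except EOFError:
            return  # input closed, nothing more to listen for
        if command.strip().lower() == "stop":
            stop_event.set()


def start_listener(stop_event):
    """Run the listener in a daemon thread so it never blocks program exit."""
    thread = threading.Thread(
        target=listen_for_stop, args=(stop_event,), daemon=True
    )
    thread.start()
    return thread
```

With this shape, the crawl loop would check `stop_event.is_set()` on each pass and call `stop_event.clear()` before starting the next keyword, so one `stop` skips exactly one keyword.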

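### What it achieves:

1. **Efficiency**: The program automates browsing Behance and collecting project links, which removes the tedium of searching and copying links by hand.
2. **Flexibility**: You choose which keywords to crawl and how many links to collect, and you can cut a keyword short at any time.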

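### Usage notes:

1. **Environment setup**: Before running the code, make sure the Selenium library and ChromeDriver are installed correctly.
2. **Logging in to Behance**: After the program opens the browser, it first loads the search page for the first keyword. At this point you can log in to Behance manually if needed, then type `stop` in the terminal to let the crawl continue.
3. **Save path**: When prompted for a folder, enter a valid directory path. If the directory does not exist, the program creates it.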
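
The save-path handling in item 3 can be sketched as a small helper. This is an illustration under assumptions: `output_path_for` is a hypothetical name, and the set of characters it replaces (those Windows forbids in file names) is broader than the two slashes the script itself replaces.

```python
import os
import re


def output_path_for(directory, keyword):
    """Create `directory` if needed and return the .txt path for `keyword`."""
    # exist_ok=True makes this safe whether or not the folder already exists.
    os.makedirs(directory, exist_ok=True)
    # Replace characters that are invalid in Windows file names.
    safe_name = re.sub(r'[\\/:*?"<>|]', "_", keyword).strip() or "untitled"
    return os.path.join(directory, f"{safe_name}.txt")
```

For example, the keyword `washer/dryer` would be saved as `washer_dryer.txt` inside the chosen folder.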

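### Common pitfalls:

1. **Selenium environment**: The code fails if Selenium and ChromeDriver are not set up correctly. Make sure both are installed and configured.
2. **Page structure changes**: Behance may change its page structure. When that happens, the CSS selector may stop matching, and the code must be updated for the new structure.
3. **Rate limiting**: Sending too many requests in a short time may cause Behance to restrict your IP address. Avoid running the program too frequently.
4. **Scroll delay**: The delay between scrolls (currently 3 seconds) will not suit every machine and network. If it is too short, new content may not finish loading before the next scroll; if it is too long, the crawl wastes time.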
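
One possible way to soften the fixed-delay problem in item 4 is to poll until the number of links grows, with a timeout as the upper bound. The sketch below uses only the standard library; `wait_for_growth` and its parameters are hypothetical names, and in the crawler `get_count` would be something like a function that counts the elements matched by the CSS selector.

```python
import time


def wait_for_growth(get_count, previous_count, timeout=10.0, poll_interval=0.5):
    """Poll get_count() until it exceeds previous_count or timeout elapses.

    Returns the last observed count, so the caller can tell whether
    anything new was loaded.
    """
    deadline = time.monotonic() + timeout
    while True:
        current = get_count()
        if current > previous_count or time.monotonic() >= deadline:
            return current
        time.sleep(poll_interval)
```

On a fast connection this returns as soon as new items appear, and on a slow one it waits up to `timeout` instead of giving up after a fixed 3 seconds. Selenium's own `WebDriverWait` offers comparable polling for conditions on the page.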

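import os
import threading
import time
from urllib.parse import quote_plus

from selenium import webdriver
from selenium.webdriver.common.by import By

# Set by the input thread when the user types 'stop'; cleared before each keyword.
stop_current_keyword = threading.Event()

# CSS selector for project cover links; update it if Behance changes its markup.
PROJECT_LINK_SELECTOR = ".ProjectCoverNeue-coverLink-U39.e2e-ProjectCoverNeue-link"
SCROLL_DELAY = 3          # seconds to wait after each scroll
NO_NEW_LINKS_LIMIT = 10   # stop a keyword after this many seconds without new links


def input_thread():
    """Background thread that listens for 'stop' during the whole run."""
    while True:
        try:
            cmd = input("Type 'stop' to skip current keyword: ")
        except EOFError:
            break  # stdin closed; nothing more to listen for
        if cmd.strip() == "stop":
            # Keep looping afterwards so later keywords can be skipped too.
            stop_current_keyword.set()


def get_behance_links(driver, url, target_count):
    """Scroll the search results at `url` and collect up to about `target_count` links."""
    links = set()
    driver.get(url)
    time.sleep(5)  # give the initial page time to load

    no_new_links_duration = 0

    try:
        while len(links) < target_count and not stop_current_keyword.is_set():
            previous_link_count = len(links)

            a_elements = driver.find_elements(By.CSS_SELECTOR, PROJECT_LINK_SELECTOR)
            for a in a_elements:
                href = a.get_attribute("href")
                if href:
                    links.add(href)

            if len(links) == previous_link_count:
                no_new_links_duration += SCROLL_DELAY  # time spent without new links
            else:
                no_new_links_duration = 0

            if no_new_links_duration >= NO_NEW_LINKS_LIMIT:
                break  # results exhausted, or the page has stopped loading

            if len(links) < target_count:
                driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
                time.sleep(SCROLL_DELAY)  # scroll once every SCROLL_DELAY seconds
    except Exception as e:
        print(f"Error: {e}")

    return list(links)


if __name__ == "__main__":
    keywords = [
        "washing machine",
        "refrigerator",
        # ... more keywords ...
    ]

    target_links_count = int(input("Enter the number of links to crawl per keyword: "))

    directory = input(
        "Enter the folder in which to save the links "
        "(e.g. C:/Users/YourName/Desktop/behance-product-links/): "
    ).strip()
    os.makedirs(directory, exist_ok=True)  # create the folder if it does not exist

    # Daemon thread, so a pending input() cannot keep the program alive at exit.
    thread = threading.Thread(target=input_thread, daemon=True)
    thread.start()

    driver = webdriver.Chrome()

    try:
        for index, keyword in enumerate(keywords):
            stop_current_keyword.clear()

            safe_keyword = keyword.replace("/", "_").replace("\\", "_")
            # URL-encode the keyword so spaces and symbols survive in the query string.
            search_url = (
                "https://www.behance.net/search/projects"
                f"?search={quote_plus(keyword)}&field=industrial+design"
            )

            if index == 0:
                # Pause on the first search page so the user can log in if needed.
                driver.get(search_url)
                print("Log in to Behance if needed, then input 'stop' to start crawling.")
                while not stop_current_keyword.is_set():
                    time.sleep(1)
                # Reset the flag so the first keyword is crawled as well
                # (the original version skipped it after the login pause).
                stop_current_keyword.clear()

            print(f"Crawling for keyword '{keyword}'. Input 'stop' to move to the next keyword.")
            links = get_behance_links(driver, search_url, target_links_count)
            filename = os.path.join(directory, f"{safe_keyword}.txt")
            with open(filename, "w", encoding="utf-8") as f:
                for link in links:
                    f.write(link + "\n")
            print(f"Found {len(links)} links and saved them to '{filename}'.")

            if stop_current_keyword.is_set():
                print(f"Skipped the rest of '{keyword}' on user command.")
    finally:
        driver.quit()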
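
The script above gathers links in a `set`, so the saved file has no particular order and can hold more links than requested. If first-seen order and an exact cap matter, one option is a list plus a membership set. The helper below is a sketch under that assumption; `add_new_links` is a hypothetical name and not part of the original code.

```python
def add_new_links(collected, hrefs, target_count):
    """Append unseen hrefs to `collected` in order, never exceeding target_count.

    Returns the number of links added, which the caller can use to
    detect a pass that produced nothing new.
    """
    seen = set(collected)  # rebuilt on each call; fine for a few thousand links
    added = 0
    for href in hrefs:
        if len(collected) >= target_count:
            break
        if href and href not in seen:
            collected.append(href)
            seen.add(href)
            added += 1
    return added
```

In the crawl loop, `hrefs` would be the `href` attributes read from the matched elements on each pass, and a return value of `0` would play the role of the "no new links" check.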

From: https://www.cnblogs.com/zly324/p/17746130.html
