
[Web Scraping] Project series: scraping Tmall "Huawei phone" product reviews with selenium and requests

Posted: 2024-05-07 14:24:19


Using selenium

from selenium.webdriver import Chrome,ChromeOptions
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import pymongo
import time
import random

class GetCookies():
    def GetCookies(self):
        username="ho"
        password="hong5"

        #login_url="https://login.tmall.com/"
        login_url="https://login.taobao.com/member/login.jhtml?redirectURL=http%3A%2F%2Flist.tmall.com%2Fsearch_product.htm%3Fq%3D%25E8%258B%25B9%25E6%259E%259C%26type%3Dp%26vmarket%3D%26spm%3D875.7931836%252FB.a2227oh.d100%26from%3Dmallfp..pc_1_searchbutton&uuid=9b1b940679de3c3820589302ff75920b"
        driver.get(login_url)
        # wait for the login form to load, then fill in the username
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.XPATH, '//*[@id="fm-login-id"]')))
        login_name=driver.find_element(By.XPATH, '//*[@id="fm-login-id"]')
        login_name.click()
        login_name.send_keys(username)

        # random pauses between actions to look less like a bot
        time.sleep(random.randrange(5,7))

        login_passwd=driver.find_element(By.XPATH, '//*[@id="fm-login-password"]')
        login_passwd.click()
        login_passwd.send_keys(password)
        time.sleep(random.randrange(5,7))

        # submit the login form, then open the product page
        driver.find_element(By.XPATH, '//*[@id="login-form"]/div[4]/button').click()
        search_url="https://detail.tmall.com/item.htm?spm=a220m.1000858.1000725.1.193a74a7Gjoc16&id=656168531109&skuId=4902176242110&user_id=1917047079&cat_id=2&is_b=1&rn=96d6ce4c6e59b759d99176e5933c5e1f"
        driver.get(search_url)
        return driver.get_cookies()
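Since the browser login above is slow and may trigger a captcha, the cookie list returned by driver.get_cookies() can be cached on disk so the login only has to happen once. A minimal sketch (the save_cookies/load_cookies helpers and the cookies.json filename are illustrative, not part of the original post):

```python
import json

def save_cookies(cookies, path="cookies.json"):
    """Persist the list of cookie dicts returned by driver.get_cookies()."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cookies, f, ensure_ascii=False)

def load_cookies(path="cookies.json"):
    """Load the cookie dicts back from disk."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

To reuse a saved session, first open any page on the target domain, then call driver.add_cookie(c) for each loaded dict and refresh the page.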


class TamllComment():
    def GetCommentData(self):
        goods_url="https://detail.tmall.com/item.htm?spm=a220m.1000858.1000725.1.193a74a7Gjoc16&id=656168531109&skuId=4902176242110&user_id=1917047079&cat_id=2&is_b=1&rn=96d6ce4c6e59b759d99176e5933c5e1f"
        driver.get(goods_url)

        username="honey5730"
        password="hong12345"
        # wait for the login form, then fill in the credentials
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.XPATH, '//*[@id="fm-login-id"]')))
        login_name=driver.find_element(By.XPATH, '//*[@id="fm-login-id"]')
        login_name.click()
        login_name.send_keys(username)

        time.sleep(random.randrange(5,7))

        login_passwd=driver.find_element(By.XPATH, '//*[@id="fm-login-password"]')
        login_passwd.click()
        login_passwd.send_keys(password)
        time.sleep(random.randrange(5,7))

        driver.find_element(By.XPATH, '//*[@id="login-form"]/div[4]/button').click()



    def SaveAsMongo(self):
        pass

if __name__ == '__main__':

    options = ChromeOptions()
    options.add_argument(
        'user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36"')
    driver = Chrome(options=options)
    driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
        "source": """
        Object.defineProperty(navigator, 'webdriver', {
          get: () => undefined
        })
      """
    })
    # get cookies (the hard-coded list below was captured from a logged-in
    # session; the second cookie's value is truncated in the original post)
    #cookies=GetCookies().GetCookies()
    cookies=[{'domain': '.taobao.com', 'expiry': 1653840540, 'httpOnly': False, 'name': 'tfstk', 'path': '/', 'secure': False, 'value': 'cQnABVw_LQA0_Bw-LqLoflTjbiphayBYstNOXLwWsq2FZdsOfs2mxDCKIEwaTSpR.'}, {'domain': '.taobao.com', 'expiry': 2269008541, 'httpOnly': False, 'name': 'cna', 'path': '/', 'sameSite': 'None', 'secure': True, 'value': '...'}]

Using requests

import requests
from fake_useragent import UserAgent
import time
import random
import json
import redis
import openpyxl

def get_comment_data(start_page,end_page):

    url="https://rate.tmall.com/list_detail_rate.htm?"
    headers={
        'user-agent':UserAgent().ie,
        'cookie':'miid=4159704271564039423; cna=PMcSGlbGqDMCAXBvBSW1loSM; lid=honey5; t=d33712a517f185cc4bc07f7e794e1c6a; tracknick=honey5; .',
        'referer':'https://detail.tmall.com/item.htm?spm=a220m.1000858.1000725.1.47d174a7SwNZMX&id=656168531109&skuId=4902176242110&areaId=350100&user_id=1917047079&cat_id=2&is_b=1&rn=2a13cc7d543f8f0ff8e7d9492fc4d3b9'
    }

    while start_page<=end_page:
        params={
            'itemId':'656168531109',
            'sellerId':'1917047079',
            'order':'3',
            'currentPage':start_page
        }
        source=requests.get(url,headers=headers,params=params).text
        #print(source)

        # parse the response data
        parse_comment_data(source)
        # with open('iphone%d.txt'%start_page,'w',encoding='utf-8') as file:
        #     file.write(source)
        time.sleep(random.randint(5, 8))
        start_page+=1
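requests builds the query string from the params dict automatically; the resulting request URL can be previewed with the standard library (the itemId/sellerId/order/currentPage values are copied from the code above):

```python
from urllib.parse import urlencode

# the same query parameters get_comment_data passes to requests.get
params = {"itemId": "656168531109", "sellerId": "1917047079",
          "order": "3", "currentPage": 1}
url = "https://rate.tmall.com/list_detail_rate.htm?" + urlencode(params)
print(url)
```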

def parse_comment_data(source):

    # strip the JSONP wrapper, e.g. jsonp128(...), keeping only the JSON body;
    # removing only the leading callback and the final ")" avoids mangling
    # parentheses that appear inside review text
    comment_data = source[source.find("(") + 1:source.rfind(")")]
    comment_data = json.loads(comment_data)
    for data in comment_data["rateDetail"]["rateList"]:

        # reviewer's nickname
        username=data['displayUserNick']
        # product variant (SKU)
        goods_type=data['auctionSku']
        # review text
        content=data['rateContent']
        # review date
        date=data['rateDate']
        # follow-up review and its date, if present
        try:
            add_content = data['appendComment']['content']
            add_content_date = data['appendComment']['commentTime']
        except (KeyError, TypeError):
            add_content = ""
            add_content_date=""
        print(username,goods_type,content,date,add_content,add_content_date)
        datalist.append([username,goods_type,content,date,add_content,add_content_date])
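The JSONP payload that parse_comment_data expects looks roughly like the mock below (the field names are the real keys used in the code above; the sample values are invented for illustration):

```python
import json

# a tiny mock of the list_detail_rate.htm response body
sample = ('jsonp128({"rateDetail": {"rateList": [{'
          '"displayUserNick": "t***1", "auctionSku": "color:black", '
          '"rateContent": "good (really)", "rateDate": "2022-05-01 10:00:00", '
          '"appendComment": null}]}})')

# strip the jsonp128(...) wrapper, then parse the JSON body
body = sample[sample.find("(") + 1:sample.rfind(")")]
data = json.loads(body)
row = data["rateDetail"]["rateList"][0]
print(row["displayUserNick"], row["rateContent"])
```

Note that the review text itself can contain parentheses, which is why only the first "(" and the last ")" are stripped.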


def save_as_redis(datalist):
    client = redis.Redis(host="localhost", port=6379, decode_responses=True, db=0)
    for data in datalist:
        data_dict = dict(zip(colnames, data))
        # Redis list values must be strings/bytes, so serialize each row as JSON
        client.rpush('Tmall_iphone', json.dumps(data_dict, ensure_ascii=False))
    client.close()
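Because Redis list entries can only hold strings or bytes, each row is pushed as a JSON string; a consumer reads them back with json.loads. A runnable sketch of that round trip (no Redis server needed; the column names and row values here are illustrative):

```python
import json

colnames = ['username', 'product type', 'review', 'date',
            'follow-up review', 'follow-up date']
row = ['t***1', 'color:black', 'Fast shipping', '2022-05-01', '', '']

# what save_as_redis pushes onto the list...
serialized = json.dumps(dict(zip(colnames, row)), ensure_ascii=False)
# ...and what a consumer does with each entry from
# client.lrange('Tmall_iphone', 0, -1)
restored = json.loads(serialized)
print(restored['review'])
```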

def save_as_excel():
    wb = openpyxl.Workbook()
    ws = wb.active
    ws.append(colnames)
    for data in datalist:
        ws.append(data)
    wb.save('Tmall_iphone.xlsx')
    wb.close()

if __name__ == '__main__':
    datalist=[]
    colnames=['username','product type','review','date','follow-up review','follow-up date']
    # scrape pages 1-7 of the reviews
    get_comment_data(1,7)
    print(datalist)
    save_as_excel()


From: https://www.cnblogs.com/Gimm/p/18116447
