目录
reference: 爬取携程景点评论数据
本博客记录一个爬取携程景点评论并制作词云的例子,并且可以很轻易地拓展到多个景点。
爬取景点评论
准备工作
确定需要爬取的景点,得到其网址和对应ID。
# Endpoint of Ctrip's mobile REST API that returns the collapsed comment list for a POI.
postUrl="https://m.ctrip.com/restapi/soa2/13444/json/getCommentCollapseList"
# Each entry is [poiId, attraction name]; append more pairs to scrape additional attractions.
urls = [['93527','思溪延村']]
获取HTML页面
这个JSON字典里的参数并不总是一样,一般要修改commentTagId;可以打开浏览器开发者工具的 Network 面板,查看实际请求中携带的参数值。
# Pick the first attraction: id[0] is the poiId, id[1] the display name.
# NOTE(review): `id` shadows the builtin; kept because later code relies on this name.
id = urls[0]
# Probe payload: requests page 1 only, so we can read totalCount before crawling.
# commentTagId is not always the same per attraction — confirm it against real traffic.
data_pre = {
    "arg": {
        "channelType": 2,
        "collapseType": 0,
        "commentTagId": -19,
        "pageIndex": 1,      # first page only; the crawl loop iterates the rest
        "pageSize": 10,
        "poiId": id[0],
        "sourceType": 1,
        "sortType": 3,
        "starType": 0
    },
    "head": {
        "cid": "09031073210610739886",
        "ctok": "",
        "cver": "1.0",
        "lang": "01",
        "sid": "8888",
        "syscode": "09",
        "auth": "",
        "xsid": "",
        "extension": []
    }
}
# Use `json=` so requests serializes the body AND sets the JSON Content-Type header
# (the original's data=json.dumps(...) sent no content type), and add a timeout so
# a stalled connection cannot hang the script forever.
html = requests.post(postUrl, json=data_pre, timeout=30).text
html = json.loads(html)
解析处理
# Determine the total page count: totalCount comments at 10 per page, rounded up.
total_page = int(html['result']['totalCount']) / 10
total_page = int(math.ceil(total_page))
print("总页数:", total_page, "爬取中")
# Cap the crawl at 30 pages. The original unconditionally overwrote the computed
# count with 30, which would request non-existent pages for short comment lists.
total_page = min(total_page, 30)
# Output file named after the attraction, e.g. "思溪延村.txt".
path = str(id[1]) + '.txt'
index = 0  # running sequence number across all written comments
with open(path, 'w', newline='', encoding='utf-8') as f:
    for page in range(1, total_page+1):
        # Same payload shape as the probe request, walking pageIndex through every page.
        data = {
            "arg": {
                "channelType": 2,
                "collapseType": 0,
                "commentTagId": -19,
                "pageIndex": page,
                "pageSize": 10,
                "poiId": id[0],
                "sourceType": 1,
                "sortType": 3,
                "starType": 0
            },
            "head": {
                "cid": "09031073210610739886",
                "ctok": "",
                "cver": "1.0",
                "lang": "01",
                "sid": "8888",
                "syscode": "09",
                "auth": "",
                "xsid": "",
                "extension": []
            }
        }
        # json= sets the proper Content-Type; timeout prevents an indefinite hang.
        html = requests.post(postUrl, json=data, timeout=30).text
        html = json.loads(html)
        # Each page carries at most 10 comments.
        for j in range(0, 10):
            try:
                result = html['result']['items'][j]['content']
                f.write('{} {}\n'.format(index, result))
                index += 1
            # Narrowed from a bare `except:` — only the expected failure modes:
            # the last page may hold fewer than 10 items (IndexError), or the
            # response may lack the expected keys/structure (KeyError/TypeError).
            except (KeyError, IndexError, TypeError):
                print("Error raised! Please look for page {} review {}.".format(page, j))
wordcloud(制作词云)
# Build a word cloud of the scraped comments, masked to the shape in circle.jpg.
mask = imageio.imread('circle.jpg')
# Read the whole comment file. `with` guarantees the handle is closed even if
# read() raises (the original open/read/close leaked the handle on error).
with open('思溪延村.txt', 'r', encoding='utf-8') as f:
    t = f.read()
# Tokenize with jieba, then blank out single-character tokens and common filler
# words so they do not dominate the cloud (same replacement rules as before:
# single chars become ' ', listed filler words become '').
ls = jieba.lcut(t)
ls = [' ' if len(w) == 1 else ('' if w in ['有点', '一些', '几个', '这个', '一个'] else w)
      for w in ls]
txt = ' '.join(ls)
w = wordcloud.WordCloud(font_path="msyh.ttc", mask=mask,
                        width=1000, height=700, background_color='white', max_words=60)
w.generate(txt)
w.to_file('sixiyancun.png')
标签:旅游景点,Python,爬虫,爬取,html,ls,total,data,page
From: https://www.cnblogs.com/coco02/p/16901855.html