Python Exercise 10

Posted: 2023-05-16 11:36:22
Tags: comment, 10, python, exercise, regex, words, print, line

豆瓣图书评论数据分析与可视化.py (Douban book review data analysis and visualization)

import re
from collections import Counter

import requests
from lxml import etree
import pandas as pd
import jieba
import matplotlib.pyplot as plt
from wordcloud import WordCloud

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36 Edg/101.0.1210.39"
}

comments = []
words = []


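def regex_change(line):
    # Numeric prefix of the form "<digits>::" at the start of the line
    username_regex = re.compile(r"^\d+::")
    # URLs; [a-zA-Z0-9] is used instead of \w so that Chinese characters are not filtered out
    url_regex = re.compile(r"""
        (https?://)?
        ([a-zA-Z0-9]+)
        (\.[a-zA-Z0-9]+)
        (\.[a-zA-Z0-9]+)*
        (/[a-zA-Z0-9]+)*
    """, re.VERBOSE | re.IGNORECASE)
    # Date markers: year, month, day, and weekday names
    data_regex = re.compile(u"""  # utf-8 encoded
        年 |
        月 |
        日 |
        (周一) |
        (周二) |
        (周三) |
        (周四) |
        (周五) |
        (周六)
    """, re.VERBOSE)
    # Digit runs; the pattern also consumes the one non-letter character in front of them
    decimal_regex = re.compile(r"[^a-zA-Z]\d+")
    # Whitespace
    space_regex = re.compile(r"\s+")
    # Characters to strip: newlines, curly quotes, "|", commas, semicolons, quotes, "!",
    # spaces, "。", and the common characters 的/了/是. Add any other character to remove here.
    regEx = "[\n”“|,,;;''! 。的了是]"
    line = re.sub(regEx, "", line)
    line = username_regex.sub(r"", line)
    line = url_regex.sub(r"", line)
    line = data_regex.sub(r"", line)
    line = decimal_regex.sub(r"", line)
    line = space_regex.sub(r"", line)
    return line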


def getComments(url):
    score = 0
    resp = requests.get(url, headers=headers).text
    html = etree.HTML(resp)
    comment_list = html.xpath(".//div[@class='comment']")
    for comment in comment_list:
        status = ""
        name = comment.xpath(".//span[@class='comment-info']/a/text()")[0]  # username
        content = comment.xpath(".//p[@class='comment-content']/span[@class='short']/text()")[0]  # short review text
        content = str(content).strip()
        word = jieba.cut(content, cut_all=False, HMM=False)  # segment the review into words
        time = comment.xpath(".//span[@class='comment-info']/a/text()")[1]  # review time
        mark = comment.xpath(".//span[@class='comment-info']/span/@title")  # rating label
        if len(mark) == 0:
            score = 0  # the reviewer gave no rating
        else:
            # Map the Douban rating label to a 1-5 score
            for i in mark:
                status = str(i)
                if status == "力荐":
                    score = 5
                elif status == "推荐":
                    score = 4
                elif status == "还行":
                    score = 3
                elif status == "较差":
                    score = 2
                elif status == "很差":
                    score = 1
        good = comment.xpath(".//span[@class='comment-vote']/span[@class='vote-count']/text()")[0]  # like ("useful") count
        comments.append([str(name), content, str(time), score, int(good)])
        # Keep only cleaned words that are at least two characters long
        for i in word:
            if len(regex_change(i)) >= 2:
                words.append(regex_change(i))


def getWordCloud(words):
    # Print the most frequent words, then build the word cloud
    all_words = []
    all_words += [word for word in words]
    dict_words = dict(Counter(all_words))
    bow_words = sorted(dict_words.items(), key=lambda d: d[1], reverse=True)
    print("Top 10 hot words:")
    for item in bow_words[:10]:  # slicing avoids an IndexError when fewer than 10 words exist
        print(item)
    text = ' '.join(words)

    w = WordCloud(background_color='white',
                  width=1000,
                  height=700,
                  font_path='simhei.ttf',
                  margin=10).generate(text)
    plt.imshow(w)  # draw the image before showing the figure
    plt.show()
    w.to_file('wordcloud.png')


print("Please choose one of the following options:")
print(" 1. Most popular reviews")
print(" 2. Latest reviews")
info = int(input())
print("First 10 short reviews:")
title = ['Username', 'Review', 'Time', 'Rating', 'Likes']
if info == 1:
    comments = []
    words = []
    for i in range(0, 60, 20):
        # First 3 pages of short reviews (most popular)
        url = "https://book.douban.com/subject/10517238/comments/?start={}&limit=20&status=P&sort=new_score".format(i)
        getComments(url)
    df = pd.DataFrame(comments, columns=title)
    print(df.head(10))
    print("Top 10 short reviews by likes:")
    df = df.sort_values(by='Likes', ascending=False)
    print(df.head(10))
    getWordCloud(words)
elif info == 2:
    comments = []
    words = []
    for i in range(0, 60, 20):
        # First 3 pages of short reviews (latest)
        url = "https://book.douban.com/subject/10517238/comments/?start={}&limit=20&status=P&sort=time".format(i)
        getComments(url)
    df = pd.DataFrame(comments, columns=title)
    print(df.head(10))
    print("Top 10 short reviews by likes:")
    df = df.sort_values(by='Likes', ascending=False)
    print(df.head(10))
    getWordCloud(words)
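
The rating lookup and the hot-word count in the script above can be tried without any network access. The following is only a minimal sketch, not the original author's code: the names `RATING_SCORES`, `label_to_score`, and `top_words` are hypothetical, and it assumes each review carries at most one rating label, so returning the first recognised label is good enough.

```python
from collections import Counter

# Hypothetical lookup table; the labels are the ones the script compares against.
RATING_SCORES = {"力荐": 5, "推荐": 4, "还行": 3, "较差": 2, "很差": 1}


def label_to_score(labels):
    """Return the score of the first recognised label, or 0 if none matches."""
    for label in labels:
        if label in RATING_SCORES:
            return RATING_SCORES[label]
    return 0


def top_words(words, n=10):
    """Return up to n (word, count) pairs, most frequent first."""
    return Counter(words).most_common(n)


if __name__ == "__main__":
    print(label_to_score(["力荐"]))  # 5
    print(label_to_score([]))        # 0 (no rating given)
    print(top_words(["小说", "故事", "小说", "推理", "小说", "故事"], n=2))
    # [('小说', 3), ('故事', 2)]
```

A dictionary keeps the label-to-score table in one place, and `Counter.most_common` returns at most `n` items, so a short word list needs no extra bounds check.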

 

From: https://www.cnblogs.com/yunbianshangdadun/p/17404410.html
