Visual Mining and Analysis of Customer Review Data for HEYTEA Stores in Guangzhou (羊城)
1 Background
HEYTEA, formerly known as 皇茶 (Royal Tea), is a Chinese chain tea brand operated by Shenzhen Meixixi Catering Management Co., Ltd. Founded by Nie Yunchen in 2012 in a small alley in Jiangbianli, it later registered the trademark 喜茶 HEYTEA to distinguish itself from a flood of copycat brands. The company is headquartered at Aerospace Technology Plaza in Nanshan District, Shenzhen, and owns the tea brand 喜茶 (HEYTEA) and the bakery brand 喜茶热麦.
The company relies on word-of-mouth marketing on social media to reduce advertising spending. Young consumers are its main customers, and most stores sit in busy shopping malls in first- and second-tier cities. Its signature product is cheese-foam tea (芝士奶盖), of which HEYTEA is the originator. Since its founding, HEYTEA has focused on presenting high-quality teas from around the world and breathing new life into this ancient tea culture.
Given how well known the brand is across major Chinese cities, it is natural to ask how consumers actually rate it, and such ratings are also a valuable reference for prospective customers. Since a company cannot be judged by the product quality of a single store, this project randomly samples customer reviews from HEYTEA stores across Guangzhou (羊城) and analyzes them, hoping the conclusions will serve as a useful reference for future consumers.
2 Environment Setup
!pip install pyLDAvis
!pip install wordcloud
# paddlenlp is also imported below; if it is not preinstalled in your environment:
# !pip install paddlenlp
# Data loading / preprocessing
import pandas as pd
import numpy as np
import time
# Visualization tools
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud, ImageColorGenerator # word-cloud rendering
# Chinese word segmentation
import jieba
import jieba.posseg as psg
import re
# Text feature extraction
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, HashingVectorizer
from sklearn.decomposition import LatentDirichletAllocation
# LDA topic-model visualization
import pyLDAvis
import pyLDAvis.sklearn
# Sentiment analysis
import paddlenlp
The source data file is shown below. It contains two fields: the review text (comment) and the review time (comment_time, stored as a millisecond Unix timestamp).
3 Data Loading and Preprocessing
3.1 Data Loading
data = pd.read_excel("data/data155443/xicha_meituan_data.xlsx")
# Paths to the user dictionary (so domain-specific terms are segmented correctly) and the stopword list
dic_file = r"file/dict.txt"
stop_file = r"file/cn_stopwords.txt"
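For reference, jieba's user-dictionary file uses one term per line, optionally followed by a frequency and a part-of-speech tag. The entries below are purely illustrative (the actual dict.txt is not shown in the source):
芝士奶盖 10 n
多肉葡萄 10 n
喜茶 10 nz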
3.2 Data Preprocessing
data.info()
# Drop reviews with no content (NaN rows)
data = data.dropna()
data.head(5)
3.2.1 Preprocessing the Review Text
# Tokenization: segment a review with jieba and drop stopwords/punctuation
def chinese_word_cut(mytext):
    jieba.load_userdict(dic_file)  # load the custom dictionary so domain terms stay intact
    jieba.initialize()
    stop_list = []
    try:
        with open(stop_file, encoding='utf-8') as stopword_list:
            for line in stopword_list:
                stop_list.append(re.sub(u'\n|\\r', '', line))
    except OSError:
        print("error in stop_file")
    flag_list = ['n', 'nz', 'vn', 'a', 'an', 'ad', 'i']  # POS tags of interest (see commented filter below)
    word_list = []
    # POS-tagged segmentation (paddle mode requires paddlepaddle to be installed)
    seg_list = psg.cut(mytext, use_paddle=True)
    for seg_word in seg_list:
        word = seg_word.word
        find = 0
        if seg_word.flag == 'x':  # 'x' marks punctuation and other non-words
            find = 1
        for stop_word in stop_list:
            if stop_word == word:
                find = 1
                break
        if find == 0:  # and seg_word.flag in flag_list
            word_list.append(word)
    return " ".join(word_list)
# Count word frequencies over the segmented texts
def count_frequencies(word_list):
    freq = dict()
    for sentence in word_list:
        for w in sentence.split(' '):
            if w not in freq:
                freq[w] = 1
            else:
                freq[w] += 1
    return freq
data["cutted_comment"] = data.comment.apply(chinese_word_cut)
freq = count_frequencies(data["cutted_comment"])
freq = sorted(freq.items(),key=lambda x:x[1],reverse=True)
freq[:15]
len(freq)
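The same tally can be written more compactly with collections.Counter; a minimal equivalent sketch:
from collections import Counter
# Flatten all segmented comments into one token stream and count occurrences
freq_counter = Counter(w for sentence in data["cutted_comment"] for w in sentence.split(' '))
freq_counter.most_common(15)  # same content as freq[:15]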
# Drop comments that are too short (5 characters or fewer after segmentation)
t_data = data[data['cutted_comment'].str.len() > 5]
t_data = t_data.reset_index(drop=True)
3.2.2 Preprocessing the Review-Time Field
# Example: comment_time is a millisecond Unix timestamp, so divide by 1000 before converting
t = time.localtime(1650933910610/1000)
time.strftime("%Y-%m-%d %H:%M:%S", t).split(' ')
_date=[]
_date_month=[]
_date_day=[]
_time=[]
_time_hour=[]
for i in t_data['comment_time']:
    arr = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(i/1000)).split(' ')
    _date.append(arr[0])
    _date_month.append(arr[0].split('-')[1])
    _date_day.append(arr[0].split('-')[2])
    _time.append(arr[1])
    _time_hour.append(arr[1].split(':')[0])
t_data.insert(2,value=_date_month,column='_date_month')
t_data.insert(3,value=_date_day,column='_date_day')
t_data.insert(4,value=_time_hour,column='_time_hour')
t_data.head(5)
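The same three columns can also be derived in a vectorized way with pandas. A sketch, with one caveat: pd.to_datetime with unit='ms' yields UTC times, whereas time.localtime above uses the machine's local timezone, so the hours may differ:
ts = pd.to_datetime(t_data['comment_time'], unit='ms')
month_col = ts.dt.strftime('%m')  # equivalent of _date_month
day_col = ts.dt.strftime('%d')    # equivalent of _date_day
hour_col = ts.dt.strftime('%H')   # equivalent of _time_hour (up to timezone)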
3.2.3 Sentiment Analysis with the paddlenlp Taskflow
sa = paddlenlp.Taskflow('sentiment_analysis')
sentiment_analysis_data = sa(t_data['comment'].to_list())
sentiment_analysis_data_df = pd.DataFrame(sentiment_analysis_data)
sentiment_analysis_data_df.head(5)
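Each element returned by the Taskflow is a dict with the input text, the predicted label, and a confidence score, roughly of the following form (illustrative values; the exact fields can vary across PaddleNLP versions):
# [{'text': '芝士葡萄很好喝', 'label': 'positive', 'score': 0.99}, ...]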
t_data['label'] = sentiment_analysis_data_df['label']
t_data['score'] = sentiment_analysis_data_df['score']
# Save the enriched dataframe (recent pandas versions no longer accept an encoding argument in to_excel)
t_data.to_excel('file/data.xlsx', index=False)
t_data.head(5)
t_data.info()
4 Visualizing the Review Data from the Sampled Stores
4.1 Word-Cloud Visualization of In-Store Reviews
def plt_imshow(x, ax=None, show=True):
    if ax is None:
        fig, ax = plt.subplots()
    ax.imshow(x)
    ax.axis("off")
    if show:
        plt.show()
    return ax
wcd = WordCloud(background_color='white',
max_words=200,
font_path = 'file/font/SIMHEI.TTF',
mode="RGBA",
max_font_size=180,
width = 1300,
height = 800,
scale=10
)
wcd.generate_from_frequencies(dict(freq))
ax=plt_imshow(wcd)
ax.figure.savefig('file/pic/wordcloud0.png',dpi=1000)
From the word cloud built with WordCloud over the processed review data, we can see that words such as 好喝 (tasty), 不错 (not bad), and 味道 (flavor) appear very frequently, indicating that customers perceive the service at HEYTEA stores quite positively.
Words like 芝士 (cheese), 葡萄 (grape), 芒果 (mango), and 草莓 (strawberry) also occur often, which suggests the fruit-based drinks are varied and leave a strong impression on customers.
Finally, words such as 环境 (environment), 口味 (taste), 味道 (flavor), and 清爽 (refreshing) show that customers also pay attention to the stores' supporting facilities and explicit services.
plt.figure(figsize=(19,6))
ax = sns.countplot(
    x=t_data['_date_month'],  # pass as a keyword; positional data arguments are deprecated in newer seaborn
    saturation=1,
    palette=sns.color_palette(palette='OrRd_d', desat=0.9, n_colors=12),
)
plt.title('Number of monthly reviews of HEYTEA', fontsize=30)
plt.xlabel('Month', fontsize=30)
plt.ylabel("Count", fontsize=30)
plt.savefig('file/pic/每月评价数.png', dpi=1000)
From this chart we can see that most HEYTEA reviews were posted between March and June, with far fewer in the second half of the year. This suggests that spring is the peak season for tea drinks, and that from midsummer through winter these products are noticeably less popular with consumers.
plt.figure(figsize=(19,6))
ax = sns.countplot(
    x=t_data['_date_day'],
    saturation=1,
    palette=sns.color_palette(palette='PuBu', desat=1, n_colors=10)
)
plt.title('Number of daily reviews of HEYTEA', fontsize=30)
plt.xlabel('Day', fontsize=30)
plt.ylabel("Count", fontsize=30)
plt.savefig('file/pic/一个月中每日评价数.png', dpi=1000)
Counting reviews by day of the month across all twelve months gives the daily-count chart. As shown, the 12th, 18th, and 29th draw noticeably more reviews, while the 1st, 11th, and 19th draw fewer; adjusting ingredient stocks and staffing to match this rhythm could be one way to improve operating efficiency.
plt.figure(figsize=(19,6))
ax = sns.countplot(
    x=t_data['_time_hour'],
    saturation=1,
    palette=sns.color_palette(palette='OrRd_d', desat=1)
)
plt.title('Number of hourly reviews of HEYTEA', fontsize=30)
plt.xlabel('Hour of the day', fontsize=30)
plt.ylabel("Count", fontsize=30)
plt.savefig('file/pic/一个天中24小时评价数.png', dpi=1000)
# Swarm plot: distribution of positive vs. negative reviews across the hours of the day
plt.figure(figsize=(12,9))
sns.swarmplot(y=t_data['_time_hour'], x=t_data['label'])
plt.title('HEYTEA distribution of good and bad reviews across hours of the day', fontsize=15)
plt.xlabel('Label', fontsize=20)
plt.ylabel('Hour of day', fontsize=20)
plt.savefig('file/pic/一天内各个时段好差评数据分布.png', dpi=1000)
# Swarm plot: positive vs. negative reviews by month
plt.figure(figsize=(12,9))
sns.swarmplot(y=t_data['_date_month'], x=t_data['label'])
# Swarm plot: positive vs. negative reviews by day of the month
plt.figure(figsize=(12,9))
sns.swarmplot(y=t_data['_date_day'], x=t_data['label'])
# Overall count of positive vs. negative reviews
sns.countplot(x=t_data['label'])
plt.title("HEYTEA negative vs. positive reviews", fontsize=15)
plt.ylabel('Count', fontsize=15)
plt.xlabel('HEYTEA comment label', fontsize=15)
plt.savefig('file/pic/喜茶评价积极与消极比较.png', dpi=1000)
# Compare total vs. unique comments to gauge how much duplication there is
ax = sns.barplot(
    y=[len(t_data['comment']), len(t_data['comment'].unique())],
    x=['all_comment', 'unique_comment'],
)
plt.title("HEYTEA comments' uniqueness", fontsize=15)
plt.ylabel('Count', fontsize=15)
plt.savefig('file/pic/喜茶评价唯一性.png', dpi=1000)
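To see which texts are repeated (for example templated or copy-pasted reviews), the most frequent comments can be inspected directly; a small sketch:
print(t_data['comment'].value_counts().head(10))  # counts greater than 1 indicate duplicates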
print('Share of negative reviews:', sum(t_data['label'] == 'negative') / len(t_data))
# Word cloud of negative reviews
freq_negative = count_frequencies(t_data[t_data['label']=='negative']['cutted_comment'])
freq_negative = sorted(freq_negative.items(),key=lambda x:x[1],reverse=True)
freq_negative[:15]
wcd = WordCloud(background_color='white',
max_words=150,
font_path = 'file/font/SIMHEI.TTF',
mode="RGBA",
max_font_size=160,
width = 1000,
height = 500,
scale=10)
wcd.generate_from_frequencies(dict(freq_negative[4:]))  # skip the 4 most frequent words
ax=plt_imshow(wcd)
ax.figure.savefig('file/pic/wordcloud_negative.png',dpi=1000)
5 Topic Mining with LDA
5.1 Extracting Text Features
n_features = 1000  # extract at most 1000 feature words
cv_vectorizer = CountVectorizer(max_df=0.90,
                                max_features=n_features,
                                min_df=5)
cv = cv_vectorizer.fit_transform(t_data.cutted_comment)
# TF-IDF text features
tfidf_vectorizer = TfidfVectorizer(max_df=0.95,
                                   max_features=500)
tfidf = tfidf_vectorizer.fit_transform(t_data.cutted_comment)
# Hashing features (note: the LDA below is trained on the count features cv)
hash_vectorizer = HashingVectorizer()
hashing = hash_vectorizer.fit_transform(t_data.cutted_comment)
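A couple of quick sanity checks on the resulting document-term matrix (illustrative; note that get_feature_names() was renamed get_feature_names_out() in scikit-learn 1.0 and removed in 1.2):
print(cv.shape)  # (number of comments, vocabulary size)
print(cv_vectorizer.get_feature_names()[:10])  # first few vocabulary terms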
5.2 LDA Perplexity and Choosing the Number of Topics
plexs = []
n_max_topics = 50
for i in range(1, n_max_topics):
    print(i, end='..')
    lda = LatentDirichletAllocation(n_components=i,
                                    doc_topic_prior=1/i,
                                    topic_word_prior=1/i,
                                    learning_method='batch')
    lda.fit(cv)
    plexs.append(lda.perplexity(cv))
n_t = 50  # right edge of the plotted range; must not exceed n_max_topics
x = list(range(1, n_t))
plt.plot(x, plexs[:n_t], linestyle='-.')
plt.xlabel("number of topics")
plt.ylabel("perplexity")
plt.legend(['perplexity_value'])  # a list, not a bare string, so the legend shows one entry
plt.title("LDA Model Perplexity Value")
plt.savefig('file/pic/PerplexityValue.png', dpi=1000)
plt.show()
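To read the lowest-perplexity topic count straight off the sweep (bearing in mind that perplexity measured on training data tends to keep decreasing, so the elbow of the curve is often a better guide than the minimum):
best_k = plexs.index(min(plexs)) + 1  # +1 because the sweep starts at 1 topic
print("Topic count with lowest perplexity:", best_k)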
5.3 LDA Training and Results
n_topics = 4
lda = LatentDirichletAllocation(n_components=n_topics,max_iter=100,
doc_topic_prior=1/n_topics,
topic_word_prior=0.01,
learning_method='batch')
lda.fit(cv)
def print_top_words(model, feature_names, n_top_words):
    tword = []
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        topic_w = " ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]])
        tword.append(topic_w)
        print(topic_w)
    return tword
# Print each topic and its top words
n_top_words = 15
cv_vectorizer_name = cv_vectorizer.get_feature_names()  # on scikit-learn >= 1.0, use get_feature_names_out()
topic_word = print_top_words(lda, cv_vectorizer_name, n_top_words)
# The fitted LDA model can also be used to assign each comment its dominant topic
topics = lda.transform(cv)
topic = []
for t in topics:
    topic.append("Topic #" + str(list(t).index(np.max(t))))
t_data['概率最大的主题序号'] = topic      # index of the most probable topic
t_data['每个主题对应概率'] = list(topics)  # full per-topic probability vector
t_data.to_excel("data_topic_test0.xlsx", index=False)
5.4 Visualizing the LDA Results
pyLDAvis.enable_notebook()
pic = pyLDAvis.sklearn.prepare(lda, cv, cv_vectorizer)
pyLDAvis.display(pic)
pyLDAvis.save_html(pic, 'file/pic/lda_pass' + str(n_topics) + '.html')
# The saved HTML file is written to the working directory
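One portability caveat, stated as an assumption about library versions: pyLDAvis 3.4 renamed its sklearn integration module, so on newer installs the prepare call would look like this instead:
# pyLDAvis >= 3.4 (pyLDAvis.sklearn was renamed to pyLDAvis.lda_model)
# import pyLDAvis.lda_model
# pic = pyLDAvis.lda_model.prepare(lda, cv, cv_vectorizer)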
Open the generated HTML file in a browser to explore the interactive result, as shown in the figure:
5.5 Findings
Combining the LDA topic-mining results with the sentiment-analysis statistics, we can conclude that the main operational problem of HEYTEA's stores in Guangzhou is excessively long queueing times, and that this problem appears to peak around 13:00 and 17:00.