
Recommending Similar Hotels Based on Hotel Text Descriptions

Date: 2024-03-14 15:34:58

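Choosing the right hotel is an important decision in trip planning, but with so many options it can be hard to find one that matches personal preferences. This post shows how to build a hotel recommender based on the similarity of description text: using the Seattle_Hotels dataset, it recommends, for a given hotel, the top 10 other hotels with the most similar descriptions.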

I. Steps for Building a Recommender System

1. About the dataset

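The Seattle_Hotels dataset contains data on hotels in Seattle (the original post links to a download location). It has three fields: hotel name, address, and description. Each row represents one hotel; the exact format is shown in the code implementation section.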

2. Data preprocessing

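Before building the recommender, the data needs preprocessing: handling missing values, cleaning the data, converting data types, and so on. We use the Pandas library in Python to load and process the Seattle_Hotels dataset and to check that the data is complete and consistent.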
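As a rough illustration of the kind of cleanup described here, the sketch below uses only the standard library on a few made-up rows in the same three-column layout (name, address, desc). The rows, the `load_clean_rows` helper, and the rule of de-duplicating by name are assumptions for illustration; with Pandas the same steps would typically be done on the loaded DataFrame.

```python
import csv
import io

# Made-up rows in the same layout as the dataset (name, address, desc).
raw_csv = """name,address,desc
Hotel A,"1 First Ave, Seattle",  Close to the waterfront.
Hotel B,"2 Second Ave, Seattle",
Hotel A,"1 First Ave, Seattle",Close to the waterfront.
"""

def load_clean_rows(text):
    """Strip whitespace, drop rows without a description, and drop repeated names."""
    rows, seen = [], set()
    for row in csv.DictReader(io.StringIO(text)):
        cleaned = {key: (value or "").strip() for key, value in row.items()}
        if not cleaned["desc"]:        # missing description: nothing to compare
            continue
        if cleaned["name"] in seen:    # assumption: one row per hotel name
            continue
        seen.add(cleaned["name"])
        rows.append(cleaned)
    return rows

rows = load_clean_rows(raw_csv)
print(rows)
```

Rows without a description are dropped because a text-based recommender has nothing to compare them on.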

3. Feature engineering

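To build a recommender, we need to extract meaningful features from the hotel data. In general these can include location, star rating, amenities, and so on, processed with techniques such as one-hot encoding and standardization so that the later similarity computation and recommendation algorithm work correctly. In this post the features come from the description text.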
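One-hot encoding and standardization can be sketched in a few lines of plain Python. The `area` and `stars` attributes below are hypothetical (the Seattle_Hotels dataset only has name, address, and desc), and the helper names are chosen for this example.

```python
from statistics import mean, pstdev

# Hypothetical structured attributes, not part of the Seattle_Hotels dataset.
hotels = [
    {"name": "Hotel A", "area": "downtown", "stars": 4.0},
    {"name": "Hotel B", "area": "airport", "stars": 3.0},
    {"name": "Hotel C", "area": "downtown", "stars": 5.0},
]

def one_hot(values):
    """Map each categorical value to a 0/1 vector over the sorted categories."""
    categories = sorted(set(values))
    vectors = [[1 if v == c else 0 for c in categories] for v in values]
    return vectors, categories

def standardize(values):
    """Rescale numbers to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

area_vectors, area_names = one_hot([h["area"] for h in hotels])
star_scores = standardize([h["stars"] for h in hotels])

# One numeric feature vector per hotel: one-hot area followed by the scaled rating.
features = [vec + [score] for vec, score in zip(area_vectors, star_scores)]
print(area_names, features)
```

Standardizing the numeric column keeps it on a scale comparable to the 0/1 columns, so no single feature dominates a distance computation.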

4. Similarity computation

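A similarity-based recommender relies on computing the similarity between hotels. Common measures include Euclidean distance and cosine similarity, both of which are straightforward to compute in Python. With these similarities we can find hotels close to what the user likes.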
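The two measures named above can be written directly from their definitions. This is a minimal sketch on made-up vectors; the function names are chosen for this example.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors (0 means identical)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_sim(a, b):
    """Cosine of the angle between two vectors; ignores vector length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention used here for all-zero vectors
    return dot / (norm_a * norm_b)

u = [1, 2, 0]
v = [2, 4, 0]   # same direction as u, twice as long
w = [0, 0, 3]   # shares no non-zero component with u

print(euclidean_distance(u, v), cosine_sim(u, v), cosine_sim(u, w))
```

The pair `u`, `v` shows why cosine similarity is popular for text: a long description and a short one that use words in the same proportions point in the same direction, even though their Euclidean distance is non-zero.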

5. Recommendation algorithm

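After computing the similarity between hotels, a recommendation algorithm can combine it with the user's preferences and past behavior to produce a personalized list of hotels. Commonly used algorithms include user-based and item-based collaborative filtering, both of which can be implemented in Python.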
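For contrast with the content-based approach used later in this post, here is a minimal sketch of item-based collaborative filtering. The ratings are invented (the Seattle_Hotels dataset contains no user ratings), and the helper names are assumptions for this example.

```python
import math

# Hypothetical ratings: user -> {hotel: rating}. Not part of the dataset.
ratings = {
    "alice": {"Hotel A": 5, "Hotel B": 4},
    "bob":   {"Hotel A": 4, "Hotel B": 5, "Hotel C": 1},
    "carol": {"Hotel B": 1, "Hotel C": 5},
}

def item_vectors(ratings):
    """Turn user -> item ratings into item -> {user: rating} vectors."""
    items = {}
    for user, user_ratings in ratings.items():
        for item, score in user_ratings.items():
            items.setdefault(item, {})[user] = score
    return items

def item_similarity(a, b):
    """Cosine similarity of two sparse item vectors (missing ratings count as 0)."""
    dot = sum(a[u] * b[u] for u in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def similar_items(target, ratings, top_n=2):
    """Rank the other items by how similarly users rated them."""
    items = item_vectors(ratings)
    scores = [(other, item_similarity(items[target], vec))
              for other, vec in items.items() if other != target]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]

ranked = similar_items("Hotel A", ratings)
print(ranked)
```

Item-based filtering needs rating or behavior data. When only descriptions are available, as in this dataset, similarity has to come from the text instead.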

6. Evaluation and improvement

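Once the recommender is built, it needs to be evaluated and improved. Common metrics such as precision and recall measure its performance. If the system performs poorly, possible improvements include latent factor models and deep learning.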
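Precision and recall for a top-k recommendation list can be computed as below. This is a sketch on made-up data: it assumes a set of hotels known to be relevant to the user, which the dataset in this post does not provide.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision@k: share of the top k that is relevant.
    Recall@k: share of all relevant items that appears in the top k."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical ranked recommendations and ground-truth relevant hotels.
recommended = ["Hotel B", "Hotel D", "Hotel A", "Hotel E"]
relevant = {"Hotel A", "Hotel B", "Hotel C"}

p, r = precision_recall_at_k(recommended, relevant, k=3)
print(p, r)
```

Here two of the top three recommendations are relevant, and two of the three relevant hotels were found, so both values are 2/3.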

II. Python Implementation: Recommending Similar Hotels from Text Descriptions

Import the required packages

import pandas as pd
import numpy as np
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import re
import random
# Optional: cufflinks + plotly add the interactive .iplot() method to DataFrames
# import cufflinks
# from plotly.offline import iplot
# cufflinks.go_offline()

Load the data and take a look

df = pd.read_csv('Seattle_Hotels.csv', encoding="latin-1")
df.head()

name	address	desc
0	Hilton Garden Seattle Downtown	1821 Boren Avenue, Seattle Washington 98101 USA	Located on the southern tip of Lake Union, the...
1	Sheraton Grand Seattle	1400 6th Avenue, Seattle, Washington 98101 USA	Located in the city's vibrant core, the Sherat...
2	Crowne Plaza Seattle Downtown	1113 6th Ave, Seattle, WA 98101	Located in the heart of downtown Seattle, the ...
3	Kimpton Hotel Monaco Seattle	1101 4th Ave, Seattle, WA98101	What?s near our hotel downtown Seattle locatio...
4	The Westin Seattle	1900 5th Avenue, Seattle, Washington 98101 USA	Situated amid incredible shopping and iconic a...

Check the shape of the data and look at one full description

df.shape
# (152, 3)
df['desc'][100]
'On a budget in Seattle or looking for something different? The historic charm and "home away from home" atmosphere of The Baroness will be sure to make you feel like one of the family. Conveniently located on First Hill, we are proud to be part of the Virginia Mason Hospital campus and only minutes from Harborview Medical Center and Swedish Hospital. The Baroness Hotel is a great option for short or long term medical, patient or family stays. Whether you are visiting the area\'s world-class medical facilities or on a budget vacation, our goal is to ensure a wonderful stay. Guest Amenities: Complimentary Internet access, Two twin, one or two queen studios with mini fridge and microwave, Two twin or one queen suites with full kitchens, Laundry facilities available, Flat screen cable television with HBO, Complimentary local calls, Ice and vending machines located in the lobby, Coffee maker and hairdryers in all guestrooms, Room service available seven days a week from the Rhododendron Cafe, Limited wheelchair accessibility, Guest library and business center, Printing & fax services available, 100% non-smoking and pet free, Rooms are not air conditioned - fans are available, Self-parking available at Virginia Mason hospital for a fee.'

Look at the main words used in the hotel descriptions
Apply CountVectorizer() to all descriptions to obtain the text feature matrix bag_of_words.

vec = CountVectorizer().fit(df['desc'])
bag_of_words = vec.transform(df['desc'])

bag_of_words.toarray()
array([[0, 1, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 1, 0, 0]], dtype=int64)

bag_of_words.shape
(152, 3200)
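To make the shape (152, 3200) concrete: each row is one description and each column is one vocabulary word. The sketch below reproduces the idea in plain Python on two made-up sentences. It lowercases the text and uses the same token pattern as CountVectorizer's default (tokens of two or more word characters); the function name and sample sentences are assumptions for this example.

```python
import re
from collections import Counter

# Same pattern as CountVectorizer's default token_pattern: 2+ word characters.
TOKEN = re.compile(r"\b\w\w+\b")

def bag_of_words_matrix(docs):
    """Return (sorted vocabulary, count matrix) with one row per document."""
    tokenized = [TOKEN.findall(doc.lower()) for doc in docs]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    column = {tok: i for i, tok in enumerate(vocabulary)}
    matrix = []
    for toks in tokenized:
        row = [0] * len(vocabulary)
        for tok, count in Counter(toks).items():
            row[column[tok]] = count
        matrix.append(row)
    return vocabulary, matrix

docs = ["Free parking and free WiFi", "Downtown hotel with parking"]
vocab, counts = bag_of_words_matrix(docs)
print(vocab)
print(counts)
```

Summing this matrix over its rows gives the total count of each word across all documents.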

Count how many times each word appears

sum_words = bag_of_words.sum(axis=0)
sum_words
matrix([[ 1, 11, 11, ...,  2,  6,  2]], dtype=int64)
words_freq = [(word,sum_words[0,idx]) for word,idx in vec.vocabulary_.items()]
words_freq
[('located', 108),
 ('on', 129),
 ('the', 1258),
 ('southern', 1),
 ('tip', 1),
 ('of', 536),
 ('lake', 41),
 ('union', 33),
 ('hilton', 12),
 ('garden', 11),
 ('inn', 89),
 ('seattle', 533),
 ('downtown', 133),
 ('hotel', 295),
 ('is', 271),
 ('perfectly', 6),
 ('for', 216),
 ('business', 87),
 ('and', 1062),
 ('leisure', 18),
 ('neighborhood', 35),
 ('home', 57),
 ('to', 471),
 ('numerous', 1),
 ('major', 12),
...
 ('driving', 3),
 ('those', 4),
 ('coming', 2),
 ('tac', 15),
 ...]

Sort the word-frequency results

words_freq = sorted(words_freq,key=lambda x:x[1],reverse=True)
words_freq
[('the', 1258),
 ('and', 1062),
 ('of', 536),
 ('seattle', 533),
 ('to', 471),
 ('in', 449),
 ('our', 359),
 ('you', 304),
 ('hotel', 295),
 ('with', 280),
 ('is', 271),
 ('at', 231),
 ('from', 224),
 ('for', 216),
 ('your', 186),
 ('or', 161),
 ('center', 151),
 ('are', 136),
 ('downtown', 133),
 ('on', 129),
 ('we', 128),
 ('free', 123),
 ('as', 117),
 ('located', 108),
 ('rooms', 106),
...
 ('outdoors', 3),
 ('fans', 3),
 ('athletic', 3),
 ('begin', 3),
 ...]
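
Wrap the steps above in a function that returns the n most frequent words

def get_top_n_words(corpus, n=None):
    # Build the vocabulary and the document-term count matrix
    vec = CountVectorizer().fit(corpus)
    bag_of_words = vec.transform(corpus)
    # Total count of each word over all documents
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    # Most frequent first
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]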

Get the 20 most frequent words

common_words=get_top_n_words(df['desc'],20)
common_words
[('the', 1258),
 ('and', 1062),
 ('of', 536),
 ('seattle', 533),
 ('to', 471),
 ('in', 449),
 ('our', 359),
 ('you', 304),
 ('hotel', 295),
 ('with', 280),
 ('is', 271),
 ('at', 231),
 ('from', 224),
 ('for', 216),
 ('your', 186),
 ('or', 161),
 ('center', 151),
 ('are', 136),
 ('downtown', 133),
 ('on', 129)]

Convert the word-frequency results to a DataFrame

df1 = pd.DataFrame(common_words,columns=['desc','count'])
df1.head()

desc	count
0	the	1258
1	and	1062
2	of	536
3	seattle	533
4	to	471
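
Plot the 20 most frequent words before stop-word removal

# Horizontal bar chart with pandas' built-in .plot() (matplotlib backend)
df1.groupby('desc').sum()['count'].sort_values().plot(kind='barh', title='top 20 before removing stopwords')

Redefine the function so that English stop words are removed

def get_top_n_words(corpus, n=None):
    # stop_words='english' uses scikit-learn's built-in English stop-word list
    vec = CountVectorizer(stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]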

Top 20 word frequencies after removing English stop words

common_words = get_top_n_words(df['desc'], 20)
df2 = pd.DataFrame(common_words, columns=['desc', 'count'])
# Horizontal bar chart with pandas' built-in .plot() (matplotlib backend)
df2.groupby('desc').sum()['count'].sort_values().plot(kind='barh', title='top 20 after removing stopwords')

Count 1- to 3-word n-grams after removing English stop words

def get_top_n_words(corpus, n=None):
    # ngram_range=(1, 3) counts single words, word pairs, and word triples
    vec = CountVectorizer(stop_words='english', ngram_range=(1, 3)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words = get_top_n_words(df['desc'], 20)
df3 = pd.DataFrame(common_words, columns=['desc', 'count'])
df3.groupby('desc').sum()['count'].sort_values().plot(kind='barh', title='top 20 after removing stopwords, ngram_range=(1,3)')
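What ngram_range=(1,3) adds to the vocabulary can be seen with a small stand-alone sketch. It starts from tokens that have already had stop words removed, since scikit-learn builds n-grams from the tokens that remain after stop-word filtering. The function name and sample tokens are assumptions for this example.

```python
def word_ngrams(tokens, min_n=1, max_n=3):
    """All contiguous word n-grams with min_n <= n <= max_n, joined by spaces."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

tokens = ["pike", "place", "market"]   # e.g. what is left after stop-word removal
grams = word_ngrams(tokens)
print(grams)
```

Multi-word features such as "pike place market" let two descriptions match on a phrase rather than only on individual words, at the cost of a much larger vocabulary.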

Some statistics on the descriptions

df['word_count']=df['desc'].apply(lambda x:len(str(x).split()))

df.head()

name	address	desc	word_count
0	Hilton Garden Seattle Downtown	1821 Boren Avenue, Seattle Washington 98101 USA	Located on the southern tip of Lake Union, the...	184
1	Sheraton Grand Seattle	1400 6th Avenue, Seattle, Washington 98101 USA	Located in the city's vibrant core, the Sherat...	152
2	Crowne Plaza Seattle Downtown	1113 6th Ave, Seattle, WA 98101	Located in the heart of downtown Seattle, the ...	147
3	Kimpton Hotel Monaco Seattle	1101 4th Ave, Seattle, WA98101	What?s near our hotel downtown Seattle locatio...	150
4	The Westin Seattle	1900 5th Avenue, Seattle, Washington 98101 USA	Situated amid incredible shopping and iconic a...	151

Visualize the distribution of description word counts

df['word_count'].plot(kind='hist',bins=50)

Text cleaning

# Keep only lowercase letters, digits, spaces, and the characters # + _
sub_replace = re.compile('[^0-9a-z #+_]')
# custom_stopwords = set(stopwords.words('english'))  # alternative: the full NLTK list
custom_stopwords = ['the', 'a', 'an', 'in']

def clean_txt(text):
    # str.lower() returns a new string, so the result must be assigned back
    text = text.lower()
    text = sub_replace.sub('', text)
    # Drop the stop words and re-join the remaining words
    text = ' '.join(word for word in text.split() if word not in custom_stopwords)
    return text

df['desc_clean'] = df['desc'].apply(clean_txt)
df.head()

If the result of text.lower() is not assigned back, the regex strips every capital letter, and if the joined result is discarded, the stop words stay in place. The output of the original run shows exactly this:

name	address	desc	word_count	desc_clean
0	Hilton Garden Seattle Downtown	1821 Boren Avenue, Seattle Washington 98101 USA	Located on the southern tip of Lake Union, the...	184	ocated on the southern tip of ake nion the ilt...
1	Sheraton Grand Seattle	1400 6th Avenue, Seattle, Washington 98101 USA	Located in the city's vibrant core, the Sherat...	152	ocated in the citys vibrant core the heraton r...
2	Crowne Plaza Seattle Downtown	1113 6th Ave, Seattle, WA 98101	Located in the heart of downtown Seattle, the ...	147	ocated in the heart of downtown eattle the awa...
3	Kimpton Hotel Monaco Seattle	1101 4th Ave, Seattle, WA98101	What?s near our hotel downtown Seattle locatio...	150	hats near our hotel downtown eattle location h...
4	The Westin Seattle	1900 5th Avenue, Seattle, Washington 98101 USA	Situated amid incredible shopping and iconic a...	151	ituated amid incredible shopping and iconic at...

Compare the raw and the cleaned text of the first hotel

df['desc'][0]
df['desc_clean'][0]

Similarity computation

df.set_index('name',inplace = True)
tf=TfidfVectorizer(analyzer='word',ngram_range=(1,3),stop_words='english')
tfidf_matrix=tf.fit_transform(df['desc_clean'])
tfidf_matrix.shape
(152, 27976)
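# TfidfVectorizer L2-normalizes each row by default, so the dot product (linear kernel) equals the cosine similarity
cosine_similarity = linear_kernel(tfidf_matrix, tfidf_matrix)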
cosine_similarity.shape
(152, 152)
cosine_similarity[0]
array([1.00000000e+00, 1.07618507e-02, 2.39000494e-02, 5.46873017e-03,
       2.64161143e-02, 1.05158253e-02, 1.70265099e-02, 1.26932177e-02,
       6.55905011e-03, 1.89826340e-02, 1.01682769e-02, 5.81427763e-03,
       8.97164751e-03, 5.11332703e-03, 6.98081551e-03, 1.46651716e-02,
       1.01506328e-02, 3.48428336e-02, 1.05628890e-02, 2.03920044e-02,
       2.31715424e-02, 8.66803402e-03, 4.19927749e-03, 1.25464260e-02,
       1.35516385e-02, 1.90864472e-02, 2.92211862e-02, 5.29767659e-03,
       2.34027898e-02, 1.84009370e-02, 1.11063777e-02, 3.24877554e-02,
       1.59088468e-02, 2.03903610e-02, 3.34542421e-02, 2.08424726e-02,
       6.37061770e-03, 7.22769959e-03, 1.76879937e-02, 3.40610778e-02,
       1.39733856e-02, 7.16109150e-03, 1.40189178e-02, 3.08597799e-02,
       3.31898710e-02, 1.32485388e-02, 3.49498978e-02, 1.03401842e-02,
       2.91144195e-02, 1.41758154e-02, 2.22237640e-02, 1.64940308e-02,
       3.11683463e-02, 1.59544326e-02, 2.61636177e-02, 1.26140542e-02,
       2.14668363e-02, 2.62642643e-02, 4.91030598e-03, 2.78596805e-02,
       1.96779398e-02, 9.81505558e-03, 3.88536015e-02, 2.78932747e-02,
       1.53453198e-02, 9.00494748e-03, 2.90988366e-02, 7.52572710e-03,
       1.50339228e-02, 7.23229675e-03, 2.08907559e-02, 1.46102170e-02,
       2.38744140e-02, 2.08593020e-02, 2.05556244e-02, 5.08364922e-02,
       2.49582978e-03, 1.22351607e-02, 9.69353352e-03, 2.47634675e-02,
       6.16721807e-03, 1.28568641e-02, 8.52080157e-04, 4.25496742e-03,
       1.19408976e-02, 3.78787891e-02, 8.76879249e-03, 2.78619543e-03,
       6.72632425e-03, 1.21664341e-02, 7.22174485e-03, 6.21120314e-03,
       9.28807898e-03, 5.01326402e-03, 1.47909582e-02, 1.18810730e-02,
       5.55255877e-03, 1.46679942e-02, 1.23004765e-02, 2.59809457e-02,
...
       1.49672160e-02, 1.59649598e-02, 2.58764614e-02, 5.00635020e-03,
       2.27410363e-02, 9.26581208e-03, 1.35304359e-02, 1.40490270e-02,
       1.66688259e-02, 2.27161327e-02, 2.78165984e-02, 3.70680069e-03,
       3.48439660e-03, 2.76986975e-03, 1.85339056e-02, 7.80938853e-03,
       3.97319010e-03, 8.70843653e-03, 2.53198268e-03, 7.08322188e-03])
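The reason a plain dot product is enough here can be checked on a toy corpus. The sketch below computes TF-IDF by hand using the smoothed formula that is scikit-learn's default, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization. The three tiny documents and the helper names are made up for this example.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_docs):
    """TF-IDF with smoothed idf and L2-normalized rows."""
    n_docs = len(tokenized_docs)
    vocab = sorted({t for doc in tokenized_docs for t in doc})
    # Document frequency: in how many documents each term occurs
    df = Counter(t for doc in tokenized_docs for t in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in raw))
        vectors.append([x / norm if norm else 0.0 for x in raw])
    return vectors

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

docs = [["lake", "union", "view"], ["lake", "view", "suite"], ["airport", "shuttle"]]
vecs = tfidf_vectors(docs)

# Rows have unit length, so the dot product is already the cosine similarity.
similarity = [[dot(a, b) for b in vecs] for a in vecs]
print(similarity)
```

The diagonal is 1 (each document compared with itself), documents that share words get a positive score, and documents with no words in common get 0, which matches the pattern of the first row printed above.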
indices = pd.Series(df.index)
indices[:5]
0    Hilton Garden Seattle Downtown
1            Sheraton Grand Seattle
2     Crowne Plaza Seattle Downtown
3     Kimpton Hotel Monaco Seattle 
4                The Westin Seattle
Name: name, dtype: object
def recommendations(name, cosine_similarity):
    # Row position of the query hotel in the similarity matrix
    idx = indices[indices == name].index[0]
    # Similarity of the query hotel to every hotel, highest first
    score_series = pd.Series(cosine_similarity[idx]).sort_values(ascending=False)
    # Position 0 is the query hotel itself (score 1.0), so take positions 1 to 10
    top_10_indexes = list(score_series.iloc[1:11].index)
    # Map row positions back to hotel names
    return [df.index[i] for i in top_10_indexes]

recommendations('Hilton Garden Seattle Downtown', cosine_similarity)
['Staybridge Suites Seattle Downtown - Lake Union',
 'Silver Cloud Inn - Seattle Lake Union',
 'Residence Inn by Marriott Seattle Downtown/Lake Union',
 'MarQueen Hotel',
 'Embassy Suites by Hilton Seattle Tacoma International Airport',
 'Silver Cloud Hotel - Seattle Broadway',
 'The Loyal Inn',
 'Homewood Suites by Hilton Seattle Downtown',
 'Inn at Queen Anne',
 'SpringHill Suites Seattle\xa0Downtown']
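The function above skips the first sorted entry on the assumption that it is the query hotel itself. A variant that excludes the query by its index instead of by its position stays correct even when another hotel ties with a score of 1.0 (for example, a duplicated description). The sketch below uses made-up names and scores; `top_n_similar` is a hypothetical helper, not part of the original code.

```python
def top_n_similar(names, similarity_row, target_index, n=10):
    """Names of the n most similar items, excluding the query item by index."""
    candidates = [(names[i], score)
                  for i, score in enumerate(similarity_row)
                  if i != target_index]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in candidates[:n]]

names = ["Hotel A", "Hotel B", "Hotel C", "Hotel D"]
row_for_a = [1.0, 0.2, 1.0, 0.5]   # made-up scores; Hotel C ties with the query itself

result = top_n_similar(names, row_for_a, target_index=0, n=2)
print(result)
```

With the real data this would be called with the list of hotel names, one row of the similarity matrix, and that row's index.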

This completes the recommendation workflow. The full code files will be packaged and uploaded later.

From: https://blog.csdn.net/qq_38614074/article/details/136669684
