前言
今天读的一篇论文题目为《协同过滤技术综述》(A Survey of Collaborative Filtering Techniques),文章发表于《人工智能研究进展》(Advances in Artificial Intelligence)。
要引用这篇论文,请使用下述格式:
Xiaoyuan Su, Taghi M. Khoshgoftaar, "A Survey of Collaborative Filtering Techniques", Advances in Artificial Intelligence, vol. 2009, Article ID 421425, 19 pages, 2009. https://doi.org/10.1155/2009/421425
摘要
As one of the most successful approaches to building recommender systems, collaborative filtering (CF) uses the known preferences of a group of users to make recommendations or predictions of the unknown preferences for other users. In this paper, we first introduce CF tasks and their main challenges, such as data sparsity, scalability, synonymy, gray sheep, shilling attacks, privacy protection, etc., and their possible solutions. We then present three main categories of CF techniques: memory-based, model-based, and hybrid CF algorithms (that combine CF with other recommendation techniques), with examples for representative algorithms of each category, and analysis of their predictive performance and their ability to address the challenges. From basic techniques to the state-of-the-art, we attempt to present a comprehensive survey for CF techniques, which can be served as a roadmap for research and practice in this area.
协同过滤(CF)是构建推荐系统最成功的方法之一,它利用一组用户的已知偏好为其他用户提供推荐或预测未知偏好。本文首先介绍了CF任务及其面临的主要挑战,如数据稀疏性、可扩展性、同义词、灰羊、先令攻击、隐私保护等,以及可能的解决方案。然后,我们介绍了CF技术的三个主要类别:基于内存的、基于模型的和混合CF算法(将CF与其他推荐技术结合起来),并提供了每个类别的代表性算法的示例,并分析了它们的预测性能和解决挑战的能力。从基本技术到最先进的技术,我们试图对CF技术进行全面的调查,这可以作为该领域研究和实践的路线图。
引言
In everyday life, people rely on recommendations from other people by spoken words, reference letters, news reports from news media, general surveys, travel guides, and so forth. Recommender systems assist and augment this natural social process to help people sift through available books, articles, webpages, movies, music, restaurants, jokes, grocery products, and so forth to find the most interesting and valuable information for them. The developers of one of the first recommender systems, Tapestry[1] (other earlier recommendation systems include rule-based recommenders and user-customization), coined the phrase “collaborative filtering (CF),” which has been widely adopted regardless of the facts that recommenders may not explicitly collaborate with recipients and recommendations may suggest particularly interesting items, in addition to indicating those that should be filtered out[2]. The fundamental assumption of CF is that if users and rate items similarly, or have similar behaviors (e.g., buying, watching, listening), and hence will rate or act on other items similarly[3].
在日常生活中,人们依靠别人的口头推荐、推荐信、新闻媒体的新闻报道、普查、旅游指南等。推荐系统帮助和增强这种自然的社会过程,帮助人们从可用的书籍、文章、网页、电影、音乐、餐馆、笑话、杂货店产品等中筛选,以找到对他们来说最有趣和最有价值的信息。最早的推荐系统之一:Tapestry(其他早期的推荐系统包括基于规则的推荐系统和用户定制系统)的开发人员创造了“协同过滤(CF)”这个短语,这个短语已经被广泛采用,不管推荐者可能不明确地与收件人协作,推荐可能建议特别有趣的项目,除了指出应该过滤掉的项目之外。CF的基本假设是,如果用户对物品的评分相似,或具有相似的行为(例如,购买、观看、收听),因此会对其他物品进行相似的评分或行为。
CF techniques use a database of preferences for items by users to predict additional topics or products a new user might like. In a typical CF scenario, there is a list of users and a list of n items , and each user, , has a list of items, , which the user has rated, or about which their preferences have been inferred through their behaviors. The ratings can either be explicit indications, and so forth, on a 1–5 scale, or implicit indications, such as purchases or click-throughs[4]. For example, we can convert the list of people and the movies they like or dislike (Table 1(a)) to a user-item ratings matrix (Table 1(b)), in which Tony is the active user that we want to make recommendations for. There are missing values in the matrix where users did not give their preferences for certain items.
CF技术使用用户对物品的偏好数据库来预测新用户可能喜欢的其他主题或产品。在典型的CF场景中,有一个用户列表和一个包含n个物品的列表,每个用户都有一个物品列表,该用户对这些物品进行了评分,或者通过他们的行为推断出他们的偏好。评级可以是明确的指示,等等,在1-5的范围内,也可以是隐性的指示,例如购买或点击。例如,我们可以将人以及他们喜欢或不喜欢的电影列表(表1(a))转换为一个用户-物品评分矩阵(表1(b)),其中Tony是我们想要推荐的活跃用户。矩阵中存在缺失值,即用户没有给出他们对某些项目的偏好。
Table 1(a):
Table 1(b):
There are many challenges for collaborative filtering tasks (Section 2). CF algorithms are required to have the ability to deal with highly sparse data, to scale with the increasing numbers of users and items, to make satisfactory recommendations in a short time period, and to deal with other problems like synonymy (the tendency of the same or similar items to have different names), shilling attacks, data noise, and privacy protection problems.
协同过滤任务有许多挑战(第2节)。CF算法需要能够处理高度稀疏的数据,随着用户和项目数量的增加而扩展,在短时间内做出令人满意的推荐,并处理其他问题,如同义词(相同或类似的项目有不同名称的趋势),先令攻击,数据噪声和隐私保护问题。
Early generation collaborative filtering systems, such as GroupLens [5], use the user rating data to calculate the similarity or weight between users or items and make predictions or recommendations according to those calculated similarity values. The so-called memory-based CF methods (Section 3) are notably deployed into commercial systems such as http://www.amazon.com/ (see an example in Figure 1) and Barnes and Noble, because they are easy-to-implement and highly effective [6, 7]. Customization of CF systems for each user decreases the search effort for users. It also promises a greater customer loyalty, higher sales, more advertising revenues, and the benefit of targeted promotions [8].
早期的协同过滤系统,如GroupLens[5],使用用户评分数据来计算用户或项目之间的相似度或权重,并根据计算出的相似度值进行预测或推荐。所谓的基于内存的CF方法(第3节)被显著地部署到商业系统中,例如http://www.amazon.com/ (参见图1中的示例)和Barnes and Noble,因为它们易于实现且高效[6,7]。为每个用户定制CF系统可以减少用户的搜索工作量。它还承诺更高的客户忠诚度,更高的销售额,更多的广告收入,以及有针对性的促销[8]。
However, there are several limitations for the memory-based CF techniques, such as the fact that the similarity values are based on common items and therefore are unreliable when data are sparse and the common items are therefore few. To achieve better prediction performance and overcome shortcomings of memory-based CF algorithms, model-based CF approaches have been investigated. Model-based CF techniques (Section 4) use the pure rating data to estimate or learn a model to make predictions [9]. The model can be a data mining or machine learning algorithm. Well-known model-based CF techniques include Bayesian belief nets (BNs) CF models [9–11], clustering CF models [12, 13], and latent semantic CF models [7]. An MDP (Markov decision process)-based CF system [14] produces a much higher profit than a system that has not deployed the recommender.
然而,基于内存的CF技术存在一些限制,例如,相似度值是基于公共项的,因此在数据稀疏且公共项较少的情况下是不可靠的。为了获得更好的预测性能并克服基于内存的CF算法的不足,研究了基于模型的CF方法。基于模型的CF技术(第4节)使用纯评级数据来估计或学习模型以进行预测[9]。该模型可以是数据挖掘或机器学习算法。众所周知的基于模型的CF技术包括贝叶斯信念网(BNs) CF模型[9-11]、聚类CF模型[12,13]和潜在语义CF模型[7]。基于MDP(马尔可夫决策过程)的CF系统[14]比没有部署推荐器的系统产生更高的利润。
Besides collaborative filtering, content-based filtering is another important class of recommender systems. Content-based recommender systems make recommendations by analyzing the content of textual information and finding regularities in the content. The major difference between CF and content-based recommender systems is that CF only uses the user-item ratings data to make predictions and recommendations, while content-based recommender systems rely on the features of users and items for predictions [15]. Both content-based recommender systems and CF systems have limitations. While CF systems do not explicitly incorporate feature information, content-based systems do not necessarily incorporate the information in preference similarity across individuals [8].
除了协同过滤,基于内容的过滤是另一类重要的推荐系统。基于内容的推荐系统通过分析文本信息的内容,发现内容中的规律来进行推荐。CF与基于内容的推荐系统的主要区别在于CF只使用用户-项目的评分数据进行预测和推荐,而基于内容的推荐系统则依靠用户和项目的特征进行预测[15]。基于内容的推荐系统和协同过滤系统都有其局限性。虽然CF系统没有显式地纳入特征信息,但基于内容的系统并不一定纳入个体间偏好相似性信息[8]。
Hybrid CF techniques, such as the content-boosted CF algorithm [16] and Personality Diagnosis (PD) [17], combine CF and content-based techniques, hoping to avoid the limitations of either approach and thereby improve recommendation performance (Section 5).
混合协同过滤技术,如基于内容的协同过滤算法[16]和基于内容的协同过滤算法[17],希望能够避免两者的局限性,从而提高推荐性能(第5节)。
表2描述了CF技术的简要概述。
Table2:
CF categories | Representative techniques | Main advantages | Main shortcomings |
---|---|---|---|
Memory-based CF | eighbor-based CF (item-based/user-based CF algorithms with Pearson/vector cosine correlation) | asy implementation | *are dependent on human ratings |
ew data can be added easily and incrementally | *performance decrease when data are sparse | ||
tem-based/user-based top- recommendations | eed not consider the content of the items being recommended | *cannot recommend for new users and items | |
*scale well with co-rated items | *have limited scalability for large datasets | ||
Model-based CF | ayesian belief nets CF | *better address the sparsity, scalability and other problems | *expensive model-building |
lustering CF | |||
DP-based CF | *improve prediction performance | *have trade-off between prediction performance and scalability | |
atent semantic CF | |||
parse factor analysis | *give an intuitive rationale for recommendations | *lose useful information for dimensionality reduction techniques | |
F using dimensionality reduction techniques, for example, SVD, PCA | |||
Hybrid recommenders | ontent-based CF recommender, for example, Fab | *overcome limitations of CF and content-based or other recommenders | *have increased complexity and expense for implementation |
ontent-boosted CF | *improve prediction performance | *need external information that usually not available | |
ybrid CF combining memory-based and model-based CF algorithms, for example, Personality Diagnosis | *overcome CF problems such as sparsity and gray sheep |
协同过滤的特点与挑战(Characteristics and Challenges of Collaborative Filtering)
E-commerce recommendation algorithms often operate in a challenging environment, especially for large online shopping companies like eBay and Amazon. Usually, a recommender system providing fast and accurate recommendations will attract the interest of customers and bring benefits to companies. For CF systems, producing high-quality predictions or recommendations depends on how well they address the challenges, which are characteristics of CF tasks as well.
电子商务推荐算法通常在具有挑战性的环境中运行,特别是对于像eBay和Amazon这样的大型在线购物公司。通常,推荐系统提供快速准确的推荐会吸引客户的兴趣,为企业带来效益。对于CF系统,产生高质量的预测或推荐取决于它们如何解决挑战,这也是CF任务的特点。
2.1. 数据稀疏 Data Sparsity
In practice, many commercial recommender systems are used to evaluate very large product sets. The user-item matrix used for collaborative filtering will thus be extremely sparse and the performances of the predictions or recommendations of the CF systems are challenged.
在实践中,许多商业推荐系统被用于评估非常大的产品集。因此,用于协同过滤的用户-项目矩阵将非常稀疏,从而影响协同过滤系统的预测或推荐性能。
The data sparsity challenge appears in several situations, specifically, the cold start problem occurs when a new user or item has just entered the system, it is difficult to find similar ones because there is not enough information (in some literature, the cold start problem is also called the new user problem or new item problem [21, 22]). New items cannot be recommended until some users rate it, and new users are unlikely given good recommendations because of the lack of their rating or purchase history. Coverage can be defined as the percentage of items that the algorithm could provide recommendations for. The reduced coverage problem occurs when the number of users’ ratings may be very small compared with the large number of items in the system, and the recommender system may be unable to generate recommendations for them. Neighbor transitivity refers to a problem with sparse databases, in which users with similar tastes may not be identified as such if they have not both rated any of the same items. This could reduce the effectiveness of a recommendation system which relies on comparing users in pairs and therefore generating predictions.
数据稀疏性挑战出现在多种情况下,具体来说,冷启动问题发生在新用户或物品刚刚进入系统时,由于信息不足,很难找到相似的用户(在一些文献中,冷启动问题也被称为新用户问题或新物品问题[22,22])。新商品只有在有用户评分时才能被推荐,而新用户不太可能得到好的推荐,因为他们没有评分或购买历史记录。覆盖率可以定义为算法可以提供推荐的项目的百分比。当用户的评分数量相对于系统中的项目数量可能非常少,推荐系统可能无法为其产生推荐时,就会出现覆盖率降低的问题。邻居传递性指的是稀疏数据库中的一个问题,在这种情况下,如果品味相似的用户没有对相同的物品进行评分,那么他们可能无法被识别出来。这可能会降低推荐系统的有效性,因为推荐系统依赖于成对地比较用户,从而产生预测。
To alleviate the data sparsity problem, many approaches have been proposed. Dimensionality reduction techniques, such as Singular Value Decomposition (SVD) [23], remove unrepresentative or insignificant users or items to reduce the dimensionalities of the user-item matrix directly. The patented Latent Semantic Indexing (LSI) used in information retrieval is based on SVD [24, 25], in which similarity between users is determined by the representation of the users in the reduced space. Goldberg et al. [3] developed eigentaste, which applies Principle Component Analysis (PCA), a closely-related factor analysis technique first described by Pearson in 1901 [26], to reduce dimensionality. However, when certain users or items are discarded, useful information for recommendations related to them may get lost and recommendation quality may be degraded [6, 27].
为了缓解数据稀疏性问题,人们提出了许多方法。降维技术,如奇异值分解(SVD)[23],通过去除不具有代表性或不重要的用户或物品,直接降低用户-物品矩阵的维数。信息检索中使用的专利潜在语义索引(LSI)是基于SVD的[24,25],其中用户之间的相似性由用户在缩减空间中的表示来决定。Goldberg等人[3]开发了eigentaste,它应用主成分分析(PCA)来降低维度,PCA是Pearson在1901年首次描述的一种密切相关的因子分析技术。然而,当某些用户或物品被丢弃时,与他们相关的有用推荐信息可能会丢失,从而导致推荐质量下降[6,27]。
Hybrid CF algorithms, such as the content-boosted CF algorithm [16], are found helpful to address the sparsity problem, in which external content information can be used to produce predictions for new users or new items. In Ziegler et al. [28], a hybrid collaborative filtering approach was proposed to exploit bulk taxonomic information designed for exact product classification to address the data sparsity problem of CF recommendations, based on the generation of profiles via inference of super-topic score and topic diversification [28]. Schein et al. proposed the aspect model latent variable method for cold start recommendation, which combines both collaborative and content information in model fitting [29]. Kim and Li proposed a probabilistic model to address the cold start problem, in which items are classified into groups and predictions are made for users considering the Gaussian distribution of user ratings [30].
混合协同过滤算法,如[16]算法,可以利用外部内容信息对新用户或新物品进行预测,有助于解决稀疏性问题。在Ziegler等人的论文[28]中,针对协同过滤推荐的数据稀疏性问题,提出了一种基于超主题分数推理和主题多样化生成profile的混合协同过滤方法[28]。Schein等人提出了面向冷启动推荐的方面模型隐变量方法,该方法在模型拟合[29]时结合了协同信息和内容信息。Kim和Li提出了一种概率模型来解决冷启动问题,利用用户评分[30]的高斯分布对项目进行分组并对用户进行预测。
Model-based CF algorithms, such as TAN-ELR (tree augmented naïve Bayes optimized by extended logistic regression) [11, 31], address the sparsity problem by providing more accurate predictions for sparse data. Some new model-based CF techniques that tackle the sparsity problem include the association retrieval technique, which applies an associative retrieval framework and related spreading activation algorithms to explore transitive associations among users through their rating and purchase history [32]; Maximum margin matrix factorizations (MMMF), a convex, infinite dimensional alternative to low-rank approximations and standard factor models [33, 34]; ensembles of MMMF [35]; multiple imputation-based CF approaches [36]; and imputation-boosted CF algorithms [37].
基于模型的CF算法,如TAN-ELR (tree augmented naïve Bayes optimized by extended logistic regression)[11,31],通过为稀疏数据提供更准确的预测来解决稀疏问题。一些新的基于模型的协同过滤技术解决了稀疏性问题,包括关联检索技术,它应用关联检索框架和相关的激活传播算法,通过用户的评分和购买历史[32]探索用户之间的传递性关联;最大间隔矩阵分解(MMMF),一种低秩近似和标准因子模型的凸无限维替代方案[33,34];MMMF[35]集成;基于多插补的协同过滤算法[36];以及插补增强的CF算法[37]。
2.2. 可伸缩性 Scalability
When numbers of existing users and items grow tremendously, traditional CF algorithms will suffer serious scalability problems, with computational resources going beyond practical or acceptable levels. For example, with tens of millions of customers and millions of distinct catalog items , a CF algorithm with the complexity of is already too large. As well, many systems need to react immediately to online requirements and make recommendations for all users regardless of their purchases and ratings history, which demands a high scalability of a CF system [6].
当现有用户和物品数量急剧增长时,传统的协同过滤算法会出现严重的可扩展性问题,导致计算资源超出实际或可接受的水平。例如,面对数以千万计的客户和数以百万计的不同商品,复杂度为的协同过滤算法已经非常庞大。此外,许多系统需要对在线需求立即做出反应,并为所有用户提供推荐,而不管他们的购买记录和评分记录如何,这对[6]系统的可扩展性提出了更高的要求。
Dimensionality reduction techniques such as SVD can deal with the scalability problem and quickly produce good quality recommendations, but they have to undergo expensive matrix factorization steps. An incremental SVD CF algorithm [38] precomputes the SVD decomposition using existing users. When a new set of ratings are added to the database, the algorithm uses the folding-in projection technique [25, 39] to build an incremental system without re-computing the low-dimensional model from scratch. Thus it makes the recommender system highly scalable.
SVD等降维技术可以解决可扩展性问题,并快速产生高质量的推荐,但它们必须经历昂贵的矩阵分解步骤。增量式SVD CF算法[38]利用已有用户预先计算SVD分解。当一组新的评分添加到数据库时,该算法使用折叠投影技术[25,39]来构建一个增量系统,而无需从头重新计算低维模型。从而使推荐系统具有高度的可扩展性。
Memory-based CF algorithms, such as the item-based Pearson correlation CF algorithm can achieve satisfactory scalability. Instead of calculating similarities between all pairs of items, item-based Pearson CF calculates the similarity only between the pair of co-rated items by a user [6, 40]. A simple Bayesian CF algorithm tackles the scalability problem by making predictions based on observed ratings [41]. Model-based CF algorithms, such as clustering CF algorithms, address the scalability problem by seeking users for recommendation within smaller and highly similar clusters instead of the entire database [13, 42–44], but there are tradeoffs between scalability and prediction performance.
基于内存的协同过滤算法,如基于物品的皮尔逊相关协同过滤算法,可以达到令人满意的可扩展性。基于项目的Pearson CF只计算用户共同评分的项目对之间的相似度[6,40],而不是计算所有项目对之间的相似度。一个简单的贝叶斯协同过滤算法通过根据观察到的评分[41]进行预测来解决可扩展性问题。基于模型的协同过滤算法,如聚类协同过滤算法,通过在较小且高度相似的集群中寻找用户进行推荐,而不是在整个数据库中解决可扩展性问题[13,42 - 44],但在可扩展性和预测性能之间存在权衡。
2.3. 同义词 Synonymy
Synonymy refers to the tendency of a number of the same or very similar items to have different names or entries. Most recommender systems are unable to discover this latent association and thus treat these products differently. For example, the seemingly different items “children movie” and “children film” are actual the same item, but memory-based CF systems would find no match between them to compute similarity. Indeed, the degree of variability in descriptive term usage is greater than commonly suspected. The prevalence of synonyms decreases the recommendation performance of CF systems.
同义词是指许多相同或非常相似的项目有不同名称或条目的趋势。大多数推荐系统无法发现这种潜在的关联,因此对这些产品的处理方式不同。例如,看似不同的项目“儿童电影”和“儿童电影”实际上是相同的项目,但基于内存的CF系统不会在它们之间找到匹配来计算相似度。事实上,描述性术语使用的可变性程度比通常认为的要大。同义词的流行降低了CF系统的推荐性能。
Previous attempts to solve the synonymy problem depended on intellectual or automatic term expansion, or the construction of a thesaurus. The drawback for fully automatic methods is that some added terms may have different meanings from intended, thus leading to rapid degradation of recommendation performance [45].
以前解决同义词问题的尝试依赖于智能或自动术语扩展,或构建同义词典。全自动方法的缺点是,一些添加的术语可能与预期的含义不同,从而导致推荐性能的快速下降[45]。
The SVD techniques, particularly the Latent Semantic Indexing (LSI) method, are capable of dealing with the synonymy problems. SVD takes a large matrix of term-document association data and construct a semantic space where terms and documents that are closely associated are placed closely to each other. SVD allows the arrangement of the space to reflect the major associative patterns in the data, and ignore the smaller, less important ones. The performance of LSI in addressing the synonymy problem is impressive at higher recall levels where precision is ordinarily quite low, thus representing large proportional improvements. However, the performance of the LSI method at the lowest levels of recall is poor [25].
SVD技术,特别是潜在语义索引(LSI)方法,能够处理同义词问题。SVD采用一个大的术语-文档关联数据矩阵,并构建一个语义空间,其中紧密关联的术语和文档彼此紧密放置。SVD允许空间的排列反映数据中的主要关联模式,而忽略较小的、不太重要的关联模式。LSI在解决同义词问题方面的表现在更高的召回水平上令人印象深刻,而精确度通常很低,因此代表了很大的比例改进。然而,大规模集成电路方法在最低召回水平下的性能很差[25]。
The LSI method gives only a partial solution to the polysemy problem, which refers to the fact that most words have more than one distinct meaning [25].
LSI方法只能部分解决多义问题,多义问题指的是大多数单词都有不止一个不同的意思[25]。
文章还列出了几点,在这儿就不一一赘述了。
结尾
今天的论文就先读到这了,精力有限,时间有限,以后有时间再读!明天见喽。
标签:Filtering,based,users,Collaborative,items,CF,用户,Survey,systems From: https://www.cnblogs.com/wephilos/p/18119927