Reinforcement Learning Recommendation Algorithm Based on Label Value Distribution

Tags: Based, Scholar, Algorithm, recommendation, user, Reinforcement, learning, algorithm, reinforcement

Preface

Day three of reading papers. Keep going.
"Slow down; it's faster in the end." (Tang Chi)

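This post is based on the paper "Reinforcement Learning Recommendation Algorithm Based on Label Value Distribution", published in MATHEMATICS on 28 June 2023. The paper proposes a feature engineering method based on label distribution learning and a recommendation algorithm based on value distribution reinforcement learning. It designs the stochastic process underlying recommendation and describes the user's state during interaction. Through a hybrid recommendation strategy, it makes full use of user information to produce high-quality recommendations. Finally, experiments show that the proposed algorithm has advantages in accuracy, data utilization, robustness, model convergence speed, and stability.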


Abstract

Reinforcement learning is an important machine learning method and has become a popular research topic in recent years. The combination of reinforcement learning and recommendation systems is a very important application scenario and has always received close attention from researchers in all sectors of society. In this paper, we first propose a feature engineering method based on label distribution learning, in which historical behavior is analyzed and feature vectors are constructed for users and products via label distribution learning. Then, a recommendation algorithm based on value distribution reinforcement learning is proposed. We first designed the stochastic process of the recommendation process, described the user's state in the interaction process (including information on their explicit and implicit states), and dynamically generated product recommendations through user feedback. Next, by studying hybrid recommendation strategies, we combined the user's dynamic and static information to fully utilize their information and achieve high-quality recommendation algorithms. Finally, the algorithm was designed and validated, and various relevant baseline models were compared to demonstrate the effectiveness of the algorithm in this study. With this study, we empirically tested the remarkable advantages of relevant design models based on nonlinear expectations compared to other homogeneous individual models. The use of recommendation systems with nonlinear expectations has considerably increased the accuracy, data utilization, robustness, model convergence speed, and stability of the systems. In this study, we incorporated the idea of nonlinear expectations into the design and implementation process of recommendation systems.
The main practical value of the improved recommendation model is that its performance is more accurate than that of other recommendation models at the same level of computing power. Moreover, because the enhanced model contains more information, it provides theoretical support and a basis for algorithms that achieve high-quality recommendation services, and it has many application prospects.


Reinforcement Learning

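Reinforcement learning is a machine learning method in which an intelligent system learns, and makes decisions autonomously, by interacting with its environment. In reinforcement learning the intelligent system is called the agent. It observes the state of the environment, takes an action, receives a reward signal, and adjusts its decision policy according to that reward so as to maximize the long-term cumulative reward.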
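"Long-term cumulative reward" is usually formalized as the discounted return G = r_0 + γ·r_1 + γ²·r_2 + …. A minimal sketch of that computation for one episode follows; the discount factor γ and the reward sequence are illustrative assumptions, not values from the paper.

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G_0 = r_0 + gamma*r_1 + gamma^2*r_2 + ... for one episode."""
    g = 0.0
    # Walk backwards using the recursion G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Hypothetical episode: the user clicks only on the third recommendation
print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))
```

With γ < 1, a click that happens later is worth less than an immediate one (here 0.9² ≈ 0.81), which is what pushes the agent toward recommendations that pay off sooner.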

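The core idea of reinforcement learning is to improve the system's performance through trial and error. From the environment's feedback, the agent can judge whether its behavior was good or bad, and it adjusts its policy according to the reward signal so as to maximize the long-term cumulative reward. The goal is for the agent, through interaction with the environment, to learn the optimal policy, i.e. the one that obtains the largest reward in the given environment.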
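The trial-and-error loop described above can be sketched with tabular Q-learning and ε-greedy exploration. The environment is a hypothetical toy chain (move left or right; reaching the rightmost state pays reward 1 and ends the episode), and all hyperparameters are assumptions chosen for illustration. This is a generic RL example, not the paper's algorithm.

```python
import random

def q_learning(n_states=5, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning on a toy chain: action 0 = left, 1 = right.
    Reaching the rightmost state gives reward 1 and ends the episode."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            # epsilon-greedy: explore (trial) with prob. epsilon, else exploit
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                a = 0 if q[s][0] > q[s][1] else 1  # ties go right
            s_next = max(0, s - 1) if a == 0 else s + 1
            r = 1.0 if s_next == n_states - 1 else 0.0
            # the terminal state has no future value
            future = 0.0 if s_next == n_states - 1 else max(q[s_next])
            # move Q(s, a) toward the bootstrapped target r + gamma * max Q(s', .)
            q[s][a] += alpha * (r + gamma * future - q[s][a])
            s = s_next
    return q

q = q_learning()
greedy_policy = [0 if row[0] > row[1] else 1 for row in q[:-1]]
print(greedy_policy)
```

After training, the greedy policy moves right in every non-terminal state: the reward signal alone, with no labels, has taught the agent which action is better.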

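Reinforcement learning differs from supervised and unsupervised learning. Supervised learning learns from labeled training data, while unsupervised learning learns from unlabeled data. Reinforcement learning, by contrast, learns through interaction with the environment; there are no explicit right/wrong labels during learning, and the agent's learning direction is guided by reward signals instead.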

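Reinforcement learning is applied in many fields, such as robot control, game AI, and autonomous driving. Through reinforcement learning, an intelligent system can keep improving its abilities from its interactions with the environment and gradually achieve autonomous decision-making and optimized performance.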

Recommendation Algorithms

The introduction of the paper describes the development of recommendation algorithms in detail:

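We live in an era of exploding Internet information. To improve users' information-retrieval experience and reduce the confusion among choices caused by information overload, recommender systems have been widely adopted and have brought considerable convenience to people's lives.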

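Internet usage keeps growing rapidly, and large volumes of behavioral logs and other data about that usage have been recorded. However, the stock of online information is so large that it is hard to use effectively, which leads to information overload.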

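The purpose of a recommender system is to analyze users' behavioral characteristics and help them filter out the product information they are likely to be interested in, using data on the interactions between users and products.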

The first large-scale recommender systems were user-based collaborative filtering systems, such as the Tapestry email filtering system of the 1990s. References:

Goldberg, D.; Nichols, D.; Oki, B.M.; Terry, D. Using collaborative filtering to weave an information tapestry. ACM 1992, 35, 61–70.
Resnick, P.; Iacovou, N.; Suchak, M.; Bergstrom, P.; Riedl, J. GroupLens:

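In 2003, Amazon introduced item-to-item collaborative filtering, which solved the high time complexity of user-based collaborative filtering. Reference: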

Linden, G.; Smith, B.; York, J. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Comput. 2003, 7, 76–80.
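To illustrate the item-to-item idea, here is a minimal sketch that scores items by the cosine similarity of their rating columns in a user × item matrix. The ratings are hypothetical, 0 means "not rated", and this is a textbook formulation rather than Amazon's production system.

```python
import math

def cosine(u, v):
    """Cosine similarity between two rating vectors (0 = not rated)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def similar_items(ratings, item, k=2):
    """Rank the other items by cosine similarity of their columns
    in a user x item rating matrix; return the top k (item, score) pairs."""
    n_items = len(ratings[0])
    columns = [[row[j] for row in ratings] for j in range(n_items)]
    scores = [(j, cosine(columns[item], columns[j]))
              for j in range(n_items) if j != item]
    scores.sort(key=lambda p: p[1], reverse=True)
    return scores[:k]

# Hypothetical ratings: 4 users x 4 items
ratings = [
    [5, 4, 0, 1],
    [4, 5, 0, 0],
    [0, 0, 5, 4],
    [1, 0, 4, 5],
]
print(similar_items(ratings, item=0))
```

Because item-item similarities change slowly, they can be precomputed offline, so serving a recommendation only needs a lookup over the items the user already interacted with instead of a search over all users.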

Breese et al. (1998) proposed a model-based collaborative filtering method, which is widely used because it is domain-independent. Reference:

Breese, J.S.; Heckerman, D.E.; Kadie, C.M. Empirical analysis of predictive algorithms for collaborative filtering. Uncertain. Artif. Intell. 1998, 98052, 43–52.

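Deep learning can also handle recommendation tasks efficiently and accurately, and it has become an important direction in recommender-system research. Reference: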

Zhang, S.; Yao, L.; Sun, A.; Tay, Y. Deep learning based recommender system. ACM Comput. Surv. (CSUR) 2019, 52, 1–38.

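On the use of neural networks in recommender systems, He et al. (2017) proposed neural collaborative filtering, which can capture nonlinear relationships between users and items. Reference: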

He, X.; Liao, L.; Zhang, H.; Nie, L.; Chua, T.S. Neural collaborative filtering. In Proceedings of the International World Wide Web Conferences Steering Committee, Perth, Australia, 3–7 April 2017; pp. 173–182.
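A minimal forward-pass sketch of an NCF-style scorer shows where the "nonlinear relationship" comes from. It follows the general NeuMF shape (a GMF branch with an element-wise product plus an MLP branch with a ReLU layer, fused and passed through a sigmoid). For brevity it shares one embedding table between the two branches, whereas He et al. use separate ones. Sizes and weights are hypothetical and untrained.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, dim, hidden = 100, 50, 8, 16

# Randomly initialised (untrained) parameters, for illustration only
user_emb = rng.normal(0.0, 0.1, (n_users, dim))
item_emb = rng.normal(0.0, 0.1, (n_items, dim))
W1 = rng.normal(0.0, 0.1, (2 * dim, hidden))
b1 = np.zeros(hidden)
w_out = rng.normal(0.0, 0.1, dim + hidden)

def ncf_score(u, i):
    """Predicted interaction probability for user u and item i."""
    p, q = user_emb[u], item_emb[i]
    gmf = p * q                                          # GMF branch: element-wise product
    mlp = np.maximum(0.0, np.concatenate([p, q]) @ W1 + b1)  # MLP branch: ReLU layer
    logit = np.concatenate([gmf, mlp]) @ w_out           # fuse both branches
    return 1.0 / (1.0 + np.exp(-logit))                  # sigmoid

print(ncf_score(3, 7))
```

In practice these parameters would be trained on observed interactions, e.g. with a binary cross-entropy loss; the sketch only shows the architecture.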

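Hidasi et al. (2016) proposed a session-based RNN recommendation model that takes item encodings as input and, from the user's browsing history, predicts how likely the user is to click each item. Reference: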

Hidasi, B.; Karatzoglou, A.; Baltrunas, L.; Tikk, D. Session-based recommendations with recurrent neural networks. In Proceedings of the 4th International Conference on Learning Representations, San Juan, Puerto Rico, 2–4 May 2016; pp. 498–506.

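Combining deep learning with recommendation technology has broad prospects, and deep-learning-based recommender systems keep emerging. For recommender systems that are sensitive to data freshness, however, static methods rarely give satisfactory results. In such cases, generating recommendations with reinforcement learning is clearly more effective than using purely static methods.
A commonly used method is the upper confidence bound (UCB) algorithm. Its drawback is that it is context-free: it ignores information about the user and the item. Reference: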

Pang, J.; Hegde, V.; Chalupsky, J. Upper Confidence Bound Algorithm for Oilfield Logic. U.S. Patent US20210150440A1, 20 May 2021.
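To make "context-free" concrete, here is a minimal sketch of the classic UCB1 rule on Bernoulli "click" rewards. Each arm's score uses only its own pull count and mean reward, never any user features. The click-through rates and number of rounds are hypothetical.

```python
import math
import random

def ucb1(true_ctr, rounds=5000, seed=0):
    """UCB1: pick the arm maximising mean reward + sqrt(2 ln t / n_arm).
    Returns how many times each arm was pulled."""
    rng = random.Random(seed)
    k = len(true_ctr)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(1, rounds + 1):
        if t <= k:
            arm = t - 1  # play every arm once so each count is non-zero
        else:
            # context-free score: only counts and mean rewards are used
            arm = max(range(k), key=lambda i: sums[i] / counts[i]
                      + math.sqrt(2 * math.log(t) / counts[i]))
        reward = 1.0 if rng.random() < true_ctr[arm] else 0.0  # simulated click
        counts[arm] += 1
        sums[arm] += reward
    return counts

# Hypothetical click-through rates of three items
counts = ucb1([0.05, 0.10, 0.30])
print(counts)
```

The confidence bonus shrinks as an arm is pulled more, so exploration fades and most pulls go to the best item. Every user, however, sees the same ranking, which is exactly the limitation the text points out.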

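Yahoo researchers addressed this problem by applying the LinUCB algorithm to Yahoo's news recommendation, taking user and item features into account. Because it uses more information, it performs substantially better than UCB. Reference: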

Prashanth, L.A.; Korda, N.; Munos, R. Concentration bounds for temporal difference learning with linear function approximation: The case of batch data and uniform sampling. Mach. Learn. 2021, 110, 559–618.
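Below is a minimal sketch of the standard disjoint LinUCB model: one ridge-regression estimate per arm, plus a confidence bonus computed from the context vector x. The two-segment simulation (one-hot user context, each segment prefers a different item) is hypothetical and is only meant to show how context changes the choice.

```python
import numpy as np

class LinUCB:
    """Disjoint LinUCB: score_a = theta_a . x + alpha * sqrt(x^T A_a^{-1} x)."""

    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_arms)]    # ridge design matrices
        self.b = [np.zeros(dim) for _ in range(n_arms)]  # reward-weighted contexts

    def select(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                            # ridge-regression estimate
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

# Hypothetical simulation: segment 0 likes item 0, segment 1 likes item 1
rng = np.random.default_rng(0)
model = LinUCB(n_arms=2, dim=2, alpha=0.5)
for _ in range(2000):
    x = np.eye(2)[rng.integers(2)]                       # one-hot user context
    arm = model.select(x)
    reward = 1.0 if arm == int(np.argmax(x)) else 0.0
    model.update(arm, x, reward)
print(model.select(np.array([1.0, 0.0])), model.select(np.array([0.0, 1.0])))
```

Unlike UCB1, the learned policy picks a different item for each user segment, because the score depends on x.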

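In recent years, reinforcement learning has received growing attention from researchers, and its combination with recommender systems has attracted wide interest. This combination has therefore become an important research direction for recommender systems.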

The optimization objectives of recommender systems fall mainly into two classes: rating regression and ranking. References:

Kiruthika, N.S.; Thailambal, G. Dynamic light weight recommendation system for social networking analysis using a hybrid lstm-svm classifier algorithm. Opt. Mem. Neural Netw. 2022, 31, 59–75.
Candès, E.J.; Tao, T. The power of convex relaxation: Near-optimal matrix completion. IEEE Trans. Inf. Theory 2010, 56, 2053–2080.
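The two objectives can disagree. In this minimal sketch with hypothetical ratings, model A has the lower RMSE (the rating-regression view) but orders the top two items wrongly, while model B has the larger errors but a perfect ordering (the ranking view, measured here as pairwise ordering accuracy).

```python
import math

def rmse(pred, true):
    """Rating-regression view: root-mean-square error of predicted ratings."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))

def pairwise_accuracy(pred, true):
    """Ranking view: fraction of item pairs whose predicted order matches the truth."""
    correct = total = 0
    n = len(true)
    for i in range(n):
        for j in range(i + 1, n):
            if true[i] == true[j]:
                continue  # tied true ratings carry no ordering information
            total += 1
            if (pred[i] - pred[j]) * (true[i] - true[j]) > 0:
                correct += 1
    return correct / total if total else 0.0

true = [5, 4, 1]
model_a = [4.4, 4.5, 1.5]   # small errors, but swaps the top two items
model_b = [3.0, 2.5, 0.0]   # large errors, but the order is right
print(rmse(model_a, true), pairwise_accuracy(model_a, true))
print(rmse(model_b, true), pairwise_accuracy(model_b, true))
```

For a top-N recommendation list, only the order matters, which is why many systems optimize a ranking objective instead of (or alongside) rating error.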

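However, although machine learning and big-data methods have greatly advanced recommender systems, problems with model interpretability, cold start, and static models remain. References: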

Mcnee, S.M.; Riedl, J.; Konstan, J.A. Being accurate is not enough: How accuracy metrics have hurt recommender systems. In Proceedings of the Extended Abstracts 2006 Conference on Human Factors in Computing Systems, Montréal, QC, Canada, 22–27 April 2006; pp. 1097–1101.
Hz, A.; Cp, B.; Bm, A.; Tl, A.; Hv, A. New technique to alleviate the cold start problem in recommender systems using information from social media and random decision forests. Inf. Sci. 2020, 536, 156–170.
Liu, K.; Wei, L.; Chen, X. A new preference-based model to solve the cold start problem in a recommender system. In Proceedings of the 2nd International Conference on Electromechanical Control Technology and Transportation, Zhuhai, China, 14–15 January 2017; pp. 121–126.
Kvr, A.; Sjm, B.; Tb, C. Model-Driven Approach Running Route two-level SVD with Context Information and Feature Entities in Recommender System. Comput. Stand. Interfaces 2022, 82, 103627.
Xu, R.; Li, J.; Li, G.; Pan, P.; Zhou, Q.; Wang, C. Sdnn: Symmetric deep neural networks with lateral connections for recommender systems. Inf. Sci. 2022, 595, 217–230.

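Traditional recommender systems increasingly fail to meet people's diverse needs, and more recommendation models that can handle nonlinearity have emerged. One important research direction is the application of reinforcement learning to recommender systems. Previous work has concentrated on handling nonlinear information, for example by using tree models, deep learning, or kernel methods to deal with nonlinearity in the data: the nonlinear data are mapped into a linear space, and a linear model is then used for classification or regression. This approach works when complete information is available. In real recommendation scenarios, however, the model must be built from incomplete information, and then the expectation the model has to handle is itself nonlinear. Our focus is therefore on how to introduce such nonlinear expectations into reinforcement learning algorithms, and on exploring innovations from the perspectives of recommender systems and engineering practice. Using data and user features from an e-commerce simulation environment, together with value distribution reinforcement learning and label distribution learning, we design a recommendation algorithm that balances multiple factors and satisfies user needs as fully as possible, providing users with high-quality recommendation services. Our main contribution is integrating nonlinear expectations into the design and implementation of recommender systems, which lets them make effective use of user behavior information.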
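The paper's exact definition of nonlinear expectation is not reproduced here. One standard way to make the idea concrete is Peng's sublinear expectation, which can be represented as the largest linear expectation over a family of candidate probability models: Ê[X] = max over θ of E_θ[X]. The sketch below evaluates it for a hypothetical reward when several user-behaviour models are all plausible because the user's state is only partly observed. The outcomes, probabilities, and function names are made up for illustration.

```python
def linear_expectation(values, probs):
    """Ordinary (linear) expectation under one probability model."""
    return sum(v * p for v, p in zip(values, probs))

def sublinear_expectation(values, prob_models):
    """Upper expectation: max of the linear expectations over a set of models.
    Because a max of sums <= a sum of maxes, E[X + Y] <= E[X] + E[Y]."""
    return max(linear_expectation(values, p) for p in prob_models)

# Hypothetical rewards of one recommendation: no click / click / purchase
rewards = [0.0, 1.0, 5.0]
# Hypothetical candidate user-behaviour models (incomplete information)
models = [
    [0.80, 0.15, 0.05],
    [0.70, 0.20, 0.10],
    [0.90, 0.08, 0.02],
]
upper = sublinear_expectation(rewards, models)
lower = -sublinear_expectation([-r for r in rewards], models)
print(lower, upper)
```

A single linear expectation would collapse this range to one number; the gap between the lower value -Ê[-X] and the upper value Ê[X] keeps track of how uncertain the model is about the user.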

Methods Used in the Paper

Fuzzy Mahalanobis-Distance Clustering Enhancement

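Fuzzy Mahalanobis-metric clustering is used to enhance FCM (fuzzy c-means) clustering. The membership matrix and centroids computed by the fuzzy Mahalanobis clustering algorithm can serve as the foundational part of the model, giving the sampling a more intuitive and quantifiable validity. Compared with the Euclidean distance, the advantage of the Mahalanobis distance lies in how it handles dimensions: it accounts for the scale of each feature and the correlations between features. As a result, the algorithm can capture the distribution of user features, which the plain Euclidean distance cannot. For the same reason, sublinear encoding computed with this algorithm is more effective than with the Euclidean distance, because the clustering result can be used directly to compute the sublinear encoding.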
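The paper does not give its exact update rules. The sketch below therefore swaps in the standard FCM updates and uses a single Mahalanobis metric built from the data's pooled covariance. That metric is an assumption; per-cluster covariances, as in the Gustafson-Kessel variant, are another common choice. It only shows how the membership matrix U and centroids V mentioned above are computed. The function name and the data are hypothetical.

```python
import numpy as np

def fcm_mahalanobis(X, c=2, m=2.0, iters=100, seed=0, eps=1e-9):
    """Fuzzy c-means with distances d(x, v)^2 = (x - v)^T S^{-1} (x - v),
    where S is the covariance of the whole data set (an assumption)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)              # memberships of each user sum to 1
    for _ in range(iters):
        W = U ** m                                  # fuzzified memberships
        V = (W.T @ X) / W.sum(axis=0)[:, None]      # weighted centroids, shape (c, d)
        diff = X[:, None, :] - V[None, :, :]        # shape (n, c, d)
        d2 = np.einsum("ncd,de,nce->nc", diff, S_inv, diff) + eps
        # standard FCM update: u_ik proportional to d_ik^(-2/(m-1)) = (d_ik^2)^(-1/(m-1))
        inv = d2 ** (-1.0 / (m - 1))
        U = inv / inv.sum(axis=1, keepdims=True)
    return U, V

# Hypothetical user features whose two dimensions have very different scales
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0.0, 0.0], [1.0, 0.1], (50, 2)),
               rng.normal([0.0, 3.0], [1.0, 0.1], (50, 2))])
U, V = fcm_mahalanobis(X, c=2)
print(U.shape, V.shape)
```

Each row of U gives one user's soft membership in every cluster, which is the kind of quantity the text says can feed the rest of the model.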

Sublinear Encoding Enhancement

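SVD can decompose each user's existing encoding to obtain its characteristic roots (singular values and the associated latent factors), which, combined with the user's sociological attributes, form the user's linear encoding. However, the sublinear encoding computed with distribution learning and nonlinear expectation methods represents users' interests and preferences in the recommender system more accurately. An enhanced sublinear encoding method is therefore designed to strengthen the description of users' static information.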
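A minimal sketch of the SVD-based linear encoding described above follows. A truncated SVD of a user × item matrix gives k latent coordinates per user, which are then concatenated with the users' attributes. The rating matrix, the attribute columns (a scaled age and a binary flag), and the function name are hypothetical; the paper's enhanced sublinear encoding is not reproduced here.

```python
import numpy as np

def user_linear_encoding(R, attrs, k=2):
    """Truncated SVD of the user x item matrix R gives k latent coordinates
    per user; these are concatenated with the users' attribute vectors."""
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    latent = U[:, :k] * s[:k]        # equals R @ Vt[:k].T (projection onto top-k directions)
    return np.hstack([latent, attrs])

R = np.array([[5, 4, 0, 1],
              [4, 5, 0, 0],
              [0, 0, 5, 4],
              [1, 0, 4, 5]], dtype=float)
# Hypothetical normalised attributes, e.g. age / 100 and a binary flag
attrs = np.array([[0.25, 1], [0.31, 0], [0.42, 1], [0.19, 0]], dtype=float)
enc = user_linear_encoding(R, attrs, k=2)
print(enc.shape)  # (4, 4): 2 latent factors + 2 attributes per user
```

This encoding is linear in the interaction data, which is the baseline that the paper's sublinear encoding is meant to improve on.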

Contextual Enhancement of the Quantile Regression Reinforcement Learning Model

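The quantile regression reinforcement learning model is context-free. If the algorithm is applied to a recommender system as-is, it cannot use the contextual information of the recommendation scenario: the policy for presenting items would be the same for every user, which does not meet the personalization requirement of a recommender system. The quantile regression reinforcement learning model therefore needs a contextual enhancement.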

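The enhanced contextual quantile regression reinforcement learning model combines contextual information, quantile regression, and reinforcement learning to improve the model's learning ability and decision quality. With this enhancement, the model becomes more accurate and effective on data that come with contextual conditions.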
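Quantile regression RL (as in QR-DQN) represents the return distribution by N quantile estimates at the midpoints τ_i = (2i − 1)/(2N), trained with the quantile Huber loss. The sketch below shows only that loss and fits five quantiles to hypothetical sampled returns by gradient descent. How the paper injects context (for example by feeding user features into the network that outputs the quantiles) is not specified in this post, so that part is left out. All names and hyperparameters are assumptions.

```python
import numpy as np

def quantile_huber_loss(theta, targets, kappa=1.0):
    """Loss between N predicted quantiles theta and M sampled target returns."""
    n = theta.shape[0]
    tau = (2 * np.arange(n) + 1) / (2.0 * n)              # quantile midpoints
    u = targets[None, :] - theta[:, None]                 # pairwise errors, shape (N, M)
    huber = np.where(np.abs(u) <= kappa, 0.5 * u ** 2,
                     kappa * (np.abs(u) - 0.5 * kappa))
    weight = np.abs(tau[:, None] - (u < 0).astype(float)) # asymmetric quantile weight
    return (weight * huber / kappa).mean(axis=1).sum()

def fit_quantiles(samples, n=5, lr=0.05, steps=2000, kappa=0.1):
    """Gradient descent on the quantile Huber loss with respect to theta."""
    theta = np.zeros(n)
    tau = (2 * np.arange(n) + 1) / (2.0 * n)
    for _ in range(steps):
        u = samples[None, :] - theta[:, None]
        weight = np.abs(tau[:, None] - (u < 0).astype(float))
        dhuber = np.clip(u, -kappa, kappa)                # derivative of Huber w.r.t. u
        grad = -(weight * dhuber / kappa).mean(axis=1)    # d loss / d theta
        theta -= lr * grad
    return theta

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, 2000)      # hypothetical sampled returns
theta = fit_quantiles(samples, n=5)
tau = (2 * np.arange(5) + 1) / 10.0
print(np.round(theta, 2), np.round(np.quantile(samples, tau), 2))
```

The fitted θ values land close to the empirical quantiles. This is how a value-distribution agent keeps a whole return distribution instead of a single expected value.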

Takeaways

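Machine learning algorithms fall into three families: supervised learning, unsupervised learning, and reinforcement learning.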

Supervised learning learns from labeled training data.

Unsupervised learning learns from unlabeled data.

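Reinforcement learning learns mainly through interaction with an environment. There are no explicit right/wrong labels during learning; reward signals guide the agent's learning instead.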

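Most previous research has focused on handling nonlinear information, for example with deep learning: the nonlinear information is mapped into a linear space and then fitted with a linear regression model. This works when the information is complete. In real recommendation scenarios, however, the model is built from incomplete information, so the expectation it must handle is also nonlinear, and this limits the usefulness of such recommendation algorithms.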

From: https://www.cnblogs.com/wephiles/p/17966805