PbRL
2024-11-20
PbRL | Christiano's seminal 2017 paper, plus Preference PPO / PrefPPO
PrefPPO first (?) appeared in PEBBLE, as one of PEBBLE's baselines; it uses PPO to reproduce the PbRL algorithm of Christiano et al. (2017). "For evaluation, we compare to Christiano et al. (2017), which is the current state-of-the-art approach using the same type of feedback. The primary dif…"
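As a reminder of what PrefPPO is reproducing: the core of Christiano et al. (2017) is a Bradley-Terry preference model over segment pairs, with the reward model trained by cross-entropy and the policy then trained (here with PPO) on the learned reward. A minimal sketch of the objective, in my own notation:

$$
P_\psi[\sigma^1 \succ \sigma^0] = \frac{\exp\sum_t \hat r_\psi(s^1_t, a^1_t)}{\exp\sum_t \hat r_\psi(s^0_t, a^0_t) + \exp\sum_t \hat r_\psi(s^1_t, a^1_t)},
\qquad
\mathcal L(\psi) = -\,\mathbb E_{(\sigma^0,\sigma^1,y)}\Big[(1-y)\log P_\psi[\sigma^0 \succ \sigma^1] + y\log P_\psi[\sigma^1 \succ \sigma^0]\Big].
$$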
2024-07-25
RIME: using the size of the cross-entropy loss to tell whether a preference label is correct + pretraining the reward model with intrinsic reward
Paper title: RIME: Robust Preference-based Reinforcement Learning with Noisy Preferences, ICML 2024 Spotlight, scores 3/6/8 (?). pdf: https://arxiv.org/pdf/2402.17257 html: https://arxiv.org/html/2402.17257v3 or https://ar5iv.labs.arxiv.org/html/2402.17257v3 GitHub: https://g…
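A rough sketch of the filtering idea (the thresholds and the optional flipping rule below are placeholders of mine, not RIME's exact mechanism): compute the per-sample cross-entropy loss of the current reward model on each preference label, and treat samples with unusually large loss as suspicious.

```python
import torch
import torch.nn.functional as F

def filter_noisy_preferences(logits, labels, tau_filter, tau_flip=None):
    """Denoising step in the spirit of RIME (simplified sketch).

    logits     : (B,) predicted preference logits, i.e. log-odds of P[sigma^1 > sigma^0]
    labels     : (B,) 0/1 preference labels from the (possibly noisy) teacher
    tau_filter : samples with per-sample CE loss above this are dropped
    tau_flip   : optional; samples with loss above this get their label flipped
                 (both thresholds are placeholders, not RIME's exact schedule)
    """
    per_sample_ce = F.binary_cross_entropy_with_logits(
        logits, labels.float(), reduction="none")
    keep = per_sample_ce <= tau_filter          # small loss: label consistent with the model
    if tau_flip is not None:
        flip = per_sample_ce >= tau_flip        # extremely large loss: likely a wrong label
        labels = torch.where(flip, 1 - labels, labels)
        keep = keep | flip
    return labels[keep], keep
```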
2024-03-06
PbRL | Preference Transformer: anyway, transformers just feel very powerful
Paper title: Preference Transformer: Modeling Human Preferences using Transformers for RL, ICLR 2023 (scores 5/6/6/8), poster. pdf: https://arxiv.org/pdf/2303.00957.pdf html: https://ar5iv.labs.arxiv.org/html/2303.00957 openreview: https://openreview.net/forum?id=Peot1SFDX0 …
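For context, my paraphrase of what the Preference Transformer changes relative to the standard Bradley-Terry model: both the per-step reward and an importance weight are produced by a transformer that attends over the whole segment, so the preference score becomes a weighted, non-Markovian sum (notation mine):

$$
P[\sigma^1 \succ \sigma^0] = \frac{\exp\big(\sum_t w^1_t\,\hat r^1_t\big)}{\exp\big(\sum_t w^0_t\,\hat r^0_t\big) + \exp\big(\sum_t w^1_t\,\hat r^1_t\big)}.
$$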
2024-02-27
offline RL · RLHF · PbRL | OPPO: an offline hindsight transformer for the PbRL setting
Paper title: Beyond Reward: Offline Preference-guided Policy Optimization, ICML 2023, scores 3/3/6/8, rejected. (I have already forgotten why I put it on the reading list; maybe because the abstract looked too flashy? Let's treat it as a lesson learned…) Materials: pdf version: https://arxiv.org/pdf/2305.16217.pdf html version: https://ar5iv.labs…
2023-12-17
offline RL | Pessimistic Bootstrapping (PBRL): penalize uncertainty in the Q update to pull down OOD Q values
Paper title: Pessimistic Bootstrapping for Uncertainty-Driven Offline Reinforcement Learning, ICLR 2022, scores 6/6/8/8, spotlight. pdf version: https://arxiv.org/abs/2202.11566 html version: https://ar5iv.labs.arxiv.org/html/2202.11566 openreview: https://openreview.net/forum?id=Y4c…
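A rough sketch of the idea (my shorthand, not the paper's exact estimator): with an ensemble of $K$ Q-networks, the TD target is penalized by the ensemble's disagreement, and actions sampled outside the dataset get a similarly penalized pseudo-target, which is what pulls OOD Q-values down:

$$
y_{\mathrm{in}} \approx r + \gamma\Big(Q_{\bar\theta}(s',a') - \beta_{\mathrm{in}}\operatorname{Std}_{k\le K} Q_{\bar\theta_k}(s',a')\Big),
\qquad
y_{\mathrm{ood}} \approx Q_{\theta}(s,a^{\mathrm{ood}}) - \beta_{\mathrm{ood}}\operatorname{Std}_{k\le K} Q_{\theta_k}(s,a^{\mathrm{ood}}).
$$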
2023-12-17
RLHF · PbRL | selecting near on-policy queries to speed up the convergence of policy learning
Paper title: Query-Policy Misalignment in Preference-Based Reinforcement Learning, ICML 2023 Workshop "The Many Facets of Preference-Based Learning". (To be honest I am not entirely sure what a workshop paper counts as…) pdf version: https://arxiv.org/abs/2305.17400 html version: https://ar5iv.labs.arxiv.or…
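The fix for query-policy misalignment, as I understand it, is simply to bias query selection toward segments the current policy actually visits. A minimal sketch, where the window size and the pairing rule are placeholders of mine rather than the paper's scheme:

```python
import random

def sample_near_onpolicy_queries(segment_buffer, num_pairs, recent_fraction=0.1):
    """Sample segment pairs for preference queries from the most recent part of the
    buffer, so queries stay close to the current policy's state distribution.
    segment_buffer : list of stored trajectory segments, oldest first
    recent_fraction: fraction of the buffer treated as "near on-policy" (placeholder)"""
    n = len(segment_buffer)
    recent = segment_buffer[int(n * (1 - recent_fraction)):]  # near on-policy segments
    pairs = [tuple(random.sample(recent, 2)) for _ in range(num_pairs)]
    return pairs
```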
2023-11-30
RLHF · PBRL | B-Pref: generating diverse irrational preferences to build a PBRL benchmark
Paper title: B-Pref: Benchmarking Preference-Based Reinforcement Learning, NeurIPS 2021 Datasets and Benchmarks Track, scores 7/7/8. openreview: https://openreview.net/forum?id=ps95-mkHF_ pdf version: https://arxiv.org/pdf/2111.03026.pdf html version: https://ar5iv.labs.arxiv.org/ht…
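The "diverse irrational preferences" come from scripted teachers that perturb a Boltzmann-rational labeler in different ways (mistakes, skipping, declaring ties, etc.). A simplified sketch of such a labeler; parameter names and defaults are mine, not B-Pref's exact teacher configurations:

```python
import numpy as np

def scripted_teacher(ret0, ret1, beta=1.0, eps_mistake=0.1,
                     skip_thresh=None, equal_thresh=None):
    """Simulated (possibly irrational) labeler in the spirit of B-Pref's scripted teachers.
    ret0, ret1: ground-truth returns of the two segments.
    Returns 0.0 / 1.0 (preference), 0.5 ("equally preferable"), or None (query skipped)."""
    if skip_thresh is not None and max(ret0, ret1) < skip_thresh:
        return None                                    # teacher refuses uninformative pairs
    if equal_thresh is not None and abs(ret0 - ret1) < equal_thresh:
        return 0.5                                     # near-tie labeled as equally preferable
    p1 = 1.0 / (1.0 + np.exp(-beta * (ret1 - ret0)))   # Boltzmann-rational preference for seg 1
    label = float(np.random.rand() < p1)
    if np.random.rand() < eps_mistake:
        label = 1.0 - label                            # random mistake
    return label
```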
2023-11-13
RLHF · PBRL | finding that some D4RL tasks are not suitable as benchmarks for offline reward learning
Paper title: Benchmarks and Algorithms for Offline Preference-Based Reward Learning, published in TMLR on 2023-01-03. openreview: https://openreview.net/forum?id=TGuXXlbKsn pdf version: https://arxiv.org/pdf/2301.01392.pdf html version: https://ar5iv.labs.arxiv.org/html/2301.01392 …
2023-11-11
RLHF · PBRL | SURF: semi-supervised learning plus data augmentation on labeled segment pairs
Paper title: SURF: Semi-supervised reward learning with data augmentation for feedback-efficient preference-based reinforcement learning, ICLR 2022, scores 6/6/6, accepted; yet another paper from Pieter Abbeel's group (ugh). (Lately the reading list is full of papers from their group; I have already read PEBBLE and RUNE and written reading…
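The two tricks in miniature: pseudo-labeling unlabeled segment pairs with the current preference predictor when it is confident, and temporal cropping as augmentation. This is only a sketch under my own assumptions; in particular `preference_prob` is an assumed helper on the reward model, and the confidence threshold is a placeholder:

```python
import torch

def pseudo_label(reward_model, seg0, seg1, conf_thresh=0.95):
    """SURF-style pseudo-labeling sketch: label an unlabeled segment pair with the
    current preference predictor only when it is confident enough."""
    with torch.no_grad():
        p1 = reward_model.preference_prob(seg0, seg1)   # assumed helper: P[seg1 > seg0]
    if p1 > conf_thresh:
        return 1.0
    if p1 < 1.0 - conf_thresh:
        return 0.0
    return None                                         # too uncertain, discard the pair

def temporal_crop(segment, min_len):
    """Temporal-cropping augmentation: take a random contiguous sub-segment."""
    T = segment.shape[0]
    length = torch.randint(min_len, T + 1, (1,)).item()
    start = torch.randint(0, T - length + 1, (1,)).item()
    return segment[start:start + length]
```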
2023-11-09
RLHF · PBRL | PEBBLE: learning a reward model from human preferences
Paper title: PEBBLE: Feedback-Efficient Interactive Reinforcement Learning via Relabeling Experience and Unsupervised Pre-training; apparently an ICML 2021 paper. This post is a reading note and [cannot replace] the work of reading the original paper. The original is also well written, in the typical style of a top AI conference, and relatively easy to follow. Reading materials: p…
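The reward-learning step in code form; a minimal sketch rather than PEBBLE's actual implementation (PEBBLE additionally does unsupervised pre-training of the policy and relabels the replay buffer whenever the reward model is updated, both omitted here):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_net, seg0, seg1, labels):
    """Bradley-Terry cross-entropy over segment pairs (simplified sketch).
    seg0, seg1 : (B, T, obs_dim + act_dim) segment pairs
    labels     : (B,) 1.0 if seg1 is preferred, 0.0 if seg0 is preferred
    reward_net : maps (B*T, obs_dim + act_dim) -> (B*T, 1) per-step rewards
    """
    B, T, D = seg0.shape
    r0 = reward_net(seg0.reshape(B * T, D)).reshape(B, T).sum(dim=1)  # return of segment 0
    r1 = reward_net(seg1.reshape(B * T, D)).reshape(B, T).sum(dim=1)  # return of segment 1
    logits = r1 - r0                     # log-odds that segment 1 is preferred
    return F.binary_cross_entropy_with_logits(logits, labels)
```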