干逼视频无码免费网站,日韩专区欧美专区亚洲福利,国产A级免费无码播放,欧洲黄色片二区三区,在线免费看中文字幕

We provide a theoretical framework for Reinforcement Learning with Human Feedback (RLHF). Our analysis shows that when the true reward function is linear, the widely used maximum likelihood estimator (MLE) converges under both the Bradley-Terry-Luce (BTL) model and the Plackett-Luce (PL) model. However, we show that when training a policy based on the learned reward model, MLE fails while a pessimistic MLE provides policies with improved performance under certain coverage assumptions. Additionally, we demonstrate that under the PL model, the true MLE and an alternative MLE that splits the $K$-wise comparison into pairwise comparisons both converge. Moreover, the true MLE is asymptotically more efficient. Our results validate the empirical success of existing RLHF algorithms in InstructGPT and provide new insights for algorithm design. Furthermore, our results unify the problem of RLHF and max-entropy Inverse Reinforcement Learning (IRL), and provide the first sample complexity bound for max-entropy IRL.

相關內容

極大似然估計

關注 5

極大似然估計方法（Maximum Likelihood Estimate，MLE）也稱為最大概似估計或最大似然估計，是求估計的另一種方法，最大概似是1821年首先由德國數學家高斯（C. F. Gauss）提出，但是這個方法通常被歸功于英國的統計學家羅納德·費希爾（R. A. Fisher）它是建立在極大似然原理的基礎上的一個統計方法，極大似然原理的直觀想法是，一個隨機試驗如有若干個可能的結果A，B，C，... ，若在一次試驗中，結果A出現了，那么可以認為實驗條件對A的出現有利，也即出現的概率P(A)較大。極大似然原理的直觀想法我們用下面例子說明。設甲箱中有99個白球，1個黑球；乙箱中有1個白球．99個黑球。現隨機取出一箱，再從抽取的一箱中隨機取出一球，結果是黑球，這一黑球從乙箱抽取的概率比從甲箱抽取的概率大得多，這時我們自然更多地相信這個黑球是取自乙箱的。一般說來，事件A發生的概率與某一未知參數theta有關， theta取值不同，則事件A發生的概率P(A/theta)也不同，當我們在一次試驗中事件A發生了，則認為此時的theta值應是t的一切可能取值中使P(A/theta)達到最大的那一個，極大似然估計法就是要選取這樣的t值作為參數t的估計值，使所選取的樣本在被選的總體中出現的可能性為最大。

語言模型化 · 機器人 · MoDELS · Learning · INTERACT ·

2023 年 5 月 9 日

TidyBot: Personalized Robot Assistance with Large Language Models

Jimmy Wu,Rika Antonova,Adam Kan,Marion Lepert,Andy Zeng,Shuran Song,Jeannette Bohg,Szymon Rusinkiewicz,Thomas Funkhouser

from arxiv, Project page: //tidybot.cs.princeton.edu

For a robot to personalize physical assistance effectively, it must learn user preferences that can be generally reapplied to future scenarios. In this work, we investigate personalization of household cleanup with robots that can tidy up rooms by picking up objects and putting them away. A key challenge is determining the proper place to put each object, as people's preferences can vary greatly depending on personal taste or cultural background. For instance, one person may prefer storing shirts in the drawer, while another may prefer them on the shelf. We aim to build systems that can learn such preferences from just a handful of examples via prior interactions with a particular person. We show that robots can combine language-based planning and perception with the few-shot summarization capabilities of large language models (LLMs) to infer generalized user preferences that are broadly applicable to future interactions. This approach enables fast adaptation and achieves 91.2% accuracy on unseen objects in our benchmark dataset. We also demonstrate our approach on a real-world mobile manipulator called TidyBot, which successfully puts away 85.0% of objects in real-world test scenarios.

話題模型 · Learning · MoDELS · Performer · 強化學習 ·

2023 年 5 月 8 日

Reinforcement Learning for Topic Models

Jeremy Costello,Marek Z. Reformat

from arxiv, 18 pages, 6 figures, Findings of ACL2023, code available at //github.com/jeremy-costello/rl-for-topic-models

We apply reinforcement learning techniques to topic modeling by replacing the variational autoencoder in ProdLDA with a continuous action space reinforcement learning policy. We train the system with a policy gradient algorithm REINFORCE. Additionally, we introduced several modifications: modernize the neural network architecture, weight the ELBO loss, use contextual embeddings, and monitor the learning process via computing topic diversity and coherence for each training step. Experiments are performed on 11 data sets. Our unsupervised model outperforms all other unsupervised models and performs on par with or better than most models using supervised labeling. Our model is outperformed on certain data sets by a model using supervised labeling and contrastive learning. We have also conducted an ablation study to provide empirical evidence of performance improvements from changes we made to ProdLDA and found that the reinforcement learning formulation boosts performance.

蒙特卡羅 · 期望回報 · Learning · Agent · 總回報 ·

2023 年 5 月 7 日

Truncating Trajectories in Monte Carlo Reinforcement Learning

Riccardo Poiani,Alberto Maria Metelli,Marcello Restelli

In Reinforcement Learning (RL), an agent acts in an unknown environment to maximize the expected cumulative discounted sum of an external reward signal, i.e., the expected return. In practice, in many tasks of interest, such as policy optimization, the agent usually spends its interaction budget by collecting episodes of fixed length within a simulator (i.e., Monte Carlo simulation). However, given the discounted nature of the RL objective, this data collection strategy might not be the best option. Indeed, the rewards taken in early simulation steps weigh exponentially more than future rewards. Taking a cue from this intuition, in this paper, we design an a-priori budget allocation strategy that leads to the collection of trajectories of different lengths, i.e., truncated. The proposed approach provably minimizes the width of the confidence intervals around the empirical estimates of the expected return of a policy. After discussing the theoretical properties of our method, we make use of our trajectory truncation mechanism to extend Policy Optimization via Importance Sampling (POIS, Metelli et al., 2018) algorithm. Finally, we conduct a numerical comparison between our algorithm and POIS: the results are consistent with our theory and show that an appropriate truncation of the trajectories can succeed in improving performance.

Agent · 賭博機/老虎機 · 估計/估計量 · ARM · 設計 ·

2023 年 5 月 7 日

Repeated Principal-Agent Games with Unobserved Agent Rewards and Perfect-Knowledge Agents

Ilgin Dogan,Zuo-Jun Max Shen,Anil Aswani

from arxiv, 50 pages, 4 figures

Motivated by a number of real-world applications from domains like healthcare and sustainable transportation, in this paper we study a scenario of repeated principal-agent games within a multi-armed bandit (MAB) framework, where: the principal gives a different incentive for each bandit arm, the agent picks a bandit arm to maximize its own expected reward plus incentive, and the principal observes which arm is chosen and receives a reward (different than that of the agent) for the chosen arm. Designing policies for the principal is challenging because the principal cannot directly observe the reward that the agent receives for their chosen actions, and so the principal cannot directly learn the expected reward using existing estimation techniques. As a result, the problem of designing policies for this scenario, as well as similar ones, remains mostly unexplored. In this paper, we construct a policy that achieves a low regret (i.e., square-root regret up to a log factor) in this scenario for the case where the agent has perfect-knowledge about its own expected rewards for each bandit arm. We design our policy by first constructing an estimator for the agent's expected reward for each bandit arm. Since our estimator uses as data the sequence of incentives offered and subsequently chosen arms, the principal's estimation can be regarded as an analogy of online inverse optimization in MAB's. Next we construct a policy that we prove achieves a low regret by deriving finite-sample concentration bounds for our estimator. We conclude with numerical simulations demonstrating the applicability of our policy to real-life setting from collaborative transportation planning.

寬度 · 優化器 · 縮放 · 示例 · 約束 ·

2023 年 5 月 7 日

DPMS: An ADD-Based Symbolic Approach for Generalized MaxSAT Solving

Anastasios Kyrillidis,Moshe Y. Vardi,Zhiwei Zhang

Boolean MaxSAT, as well as generalized formulations such as Min-MaxSAT and Max-hybrid-SAT, are fundamental optimization problems in Boolean reasoning. Existing methods for MaxSAT have been successful in solving benchmarks in CNF format. They lack, however, the ability to handle 1) (non-CNF) hybrid constraints, such as XORs and 2) generalized MaxSAT problems natively. To address this issue, we propose a novel dynamic-programming approach for solving generalized MaxSAT problems with hybrid constraints -- called \emph{Dynamic-Programming-MaxSAT} or DPMS for short -- based on Algebraic Decision Diagrams (ADDs). With the power of ADDs and the (graded) project-join-tree builder, our versatile framework admits many generalizations of CNF-MaxSAT, such as MaxSAT, Min-MaxSAT, and MinSAT with hybrid constraints. Moreover, DPMS scales provably well on instances with low width. Empirical results indicate that DPMS is able to solve certain problems quickly, where other algorithms based on various techniques all fail. Hence, DPMS is a promising framework and opens a new line of research that invites more investigation in the future.

強化學習 · MoDELS · 學成 · contrastive · 最優化 ·

2021 年 12 月 8 日

Recent Advances in Reinforcement Learning in Finance

Ben Hambly,Renyuan Xu,Huining Yang

from arxiv, 60 pages, 1 figure

The rapid changes in the finance industry due to the increasing amount of data have revolutionized the techniques on data processing and data analysis and brought new theoretical and computational challenges. In contrast to classical stochastic control theory and other analytical approaches for solving financial decision-making problems that heavily reply on model assumptions, new developments from reinforcement learning (RL) are able to make full use of the large amount of financial data with fewer model assumptions and to improve decisions in complex financial environments. This survey paper aims to review the recent developments and use of RL approaches in finance. We give an introduction to Markov decision processes, which is the setting for many of the commonly used RL approaches. Various algorithms are then introduced with a focus on value and policy based methods that do not require any model assumptions. Connections are made with neural networks to extend the framework to encompass deep RL algorithms. Our survey concludes by discussing the application of these RL algorithms in a variety of decision-making problems in finance, including optimal execution, portfolio optimization, option pricing and hedging, market making, smart order routing, and robo-advising.

主動學習 · 自由能 · Extensibility · 學成 · TAP ·

2021 年 12 月 2 日

Active Learning for Domain Adaptation: An Energy-based Approach

Binhui Xie,Longhui Yuan,Shuang Li,Chi Harold Liu,Xinjing Cheng,Guoren Wang

from arxiv, Accepted by AAAI 2022. Code is available at //github.com/BIT-DA/EADA

Unsupervised domain adaptation has recently emerged as an effective paradigm for generalizing deep neural networks to new target domains. However, there is still enormous potential to be tapped to reach the fully supervised performance. In this paper, we present a novel active learning strategy to assist knowledge transfer in the target domain, dubbed active domain adaptation. We start from an observation that energy-based models exhibit free energy biases when training (source) and test (target) data come from different distributions. Inspired by this inherent mechanism, we empirically reveal that a simple yet efficient energy-based sampling strategy sheds light on selecting the most valuable target samples than existing approaches requiring particular architectures or computation of the distances. Our algorithm, Energy-based Active Domain Adaptation (EADA), queries groups of targe data that incorporate both domain characteristic and instance uncertainty into every selection round. Meanwhile, by aligning the free energy of target data compact around the source domain via a regularization term, domain gap can be implicitly diminished. Through extensive experiments, we show that EADA surpasses state-of-the-art methods on well-known challenging benchmarks with substantial improvements, making it a useful option in the open world. Code is available at //github.com/BIT-DA/EADA.

學成 · 強化學習 · 回合 · INTERACT · Next ·

2020 年 3 月 10 日

Curriculum Learning for Reinforcement Learning Domains: A Framework and Survey

Sanmit Narvekar,Bei Peng,Matteo Leonetti,Jivko Sinapov,Matthew E. Taylor,Peter Stone

Reinforcement learning (RL) is a popular paradigm for addressing sequential decision tasks in which the agent has only limited environmental feedback. Despite many advances over the past three decades, learning in many domains still requires a large amount of interaction with the environment, which can be prohibitively expensive in realistic scenarios. To address this problem, transfer learning has been applied to reinforcement learning such that experience gained in one task can be leveraged when starting to learn the next, harder task. More recently, several lines of research have explored how tasks, or data samples themselves, can be sequenced into a curriculum for the purpose of learning a problem that may otherwise be too difficult to learn from scratch. In this article, we present a framework for curriculum learning (CL) in reinforcement learning, and use it to survey and classify existing CL methods in terms of their assumptions, capabilities, and goals. Finally, we use our framework to find open problems and suggest directions for future RL curriculum learning research.

注意力機制 · 機器閱讀理解 · Extensibility · state-of-the-art · MoDELS ·

2018 年 4 月 25 日

Reinforced Mnemonic Reader for Machine Reading Comprehension

Minghao Hu,Yuxing Peng,Zhen Huang,Xipeng Qiu,Furu Wei,Ming Zhou

from arxiv, Published in 26th International Joint Conference on Artificial Intelligence (IJCAI), 2018

In this paper, we introduce the Reinforced Mnemonic Reader for machine reading comprehension tasks, which enhances previous attentive readers in two aspects. First, a reattention mechanism is proposed to refine current attentions by directly accessing to past attentions that are temporally memorized in a multi-round alignment architecture, so as to avoid the problems of attention redundancy and attention deficiency. Second, a new optimization approach, called dynamic-critical reinforcement learning, is introduced to extend the standard supervised method. It always encourages to predict a more acceptable answer so as to address the convergence suppression problem occurred in traditional reinforcement learning algorithms. Extensive experiments on the Stanford Question Answering Dataset (SQuAD) show that our model achieves state-of-the-art results. Meanwhile, our model outperforms previous systems by over 6% in terms of both Exact Match and F1 metrics on two adversarial SQuAD datasets.

2018 年 1 月 5 日

Deep Reinforcement Learning for List-wise Recommendations

Xiangyu Zhao,Liang Zhang,Zhuoye Ding,Dawei Yin,Yihong Zhao,Jiliang Tang

Recommender systems play a crucial role in mitigating the problem of information overload by suggesting users' personalized items or services. The vast majority of traditional recommender systems consider the recommendation procedure as a static process and make recommendations following a fixed strategy. In this paper, we propose a novel recommender system with the capability of continuously improving its strategies during the interactions with users. We model the sequential interactions between users and a recommender system as a Markov Decision Process (MDP) and leverage Reinforcement Learning (RL) to automatically learn the optimal strategies via recommending trial-and-error items and receiving reinforcements of these items from users' feedbacks. In particular, we introduce an online user-agent interacting environment simulator, which can pre-train and evaluate model parameters offline before applying the model online. Moreover, we validate the importance of list-wise recommendations during the interactions between users and agent, and develop a novel approach to incorporate them into the proposed framework LIRD for list-wide recommendations. The experimental results based on a real-world e-commerce dataset demonstrate the effectiveness of the proposed framework.