Value iteration is a fixed-point iteration technique used to obtain the optimal value function and policy in a discounted-reward Markov Decision Process (MDP). It repeatedly applies a contraction operator until it arrives at the optimal solution. Because value iteration is a first-order method, it may take a large number of iterations to converge to the optimal solution. Successive relaxation is a popular technique for solving fixed-point equations, and it has been shown in the literature that, under a special structure of the MDP, the successive over-relaxation technique computes the optimal value function faster than standard value iteration. In this work, we propose a second-order value iteration procedure obtained by applying the Newton-Raphson method to the successive relaxation value iteration scheme. We prove the asymptotic global convergence of our algorithm to the optimal solution and show its second-order convergence. Through experiments, we demonstrate the effectiveness of the proposed approach.
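A minimal sketch of the first-order scheme the abstract builds on: value iteration with successive relaxation, v ← w·T(v) + (1 − w)·v, where T is the Bellman optimality operator. It is not the authors' second-order (Newton-Raphson) procedure. The two-state, two-action MDP, `GAMMA = 0.9`, and the helper names are hypothetical. With w = 1 the loop is standard value iteration. For w > 1 (over-relaxation) we rely on an assumed structural condition: every self-transition probability is at least p_min = 0.4, so any w ≤ 1/(1 − γ·p_min) ≈ 1.56 keeps the relaxed operator a contraction with the same fixed point.

```python
from typing import List, Tuple

# Hypothetical two-state, two-action MDP used only for illustration.
# P[a][s][t]: probability of moving from state s to state t under action a.
# R[a][s]:    expected one-step reward for taking action a in state s.
P = [
    [[0.9, 0.1], [0.2, 0.8]],  # action 0
    [[0.5, 0.5], [0.6, 0.4]],  # action 1
]
R = [
    [1.0, 0.0],  # action 0
    [0.5, 2.0],  # action 1
]
GAMMA = 0.9


def bellman(v: List[float]) -> List[float]:
    """Bellman optimality operator: (T v)(s) = max_a [R(s,a) + gamma * sum_t P(t|s,a) v(t)]."""
    n = len(v)
    return [
        max(R[a][s] + GAMMA * sum(P[a][s][t] * v[t] for t in range(n)) for a in range(len(P)))
        for s in range(n)
    ]


def relaxed_value_iteration(w: float = 1.0, tol: float = 1e-10,
                            max_iter: int = 10_000) -> Tuple[List[float], int]:
    """Iterate v <- w * T(v) + (1 - w) * v until successive iterates differ by less than tol.

    w = 1 is standard value iteration; w > 1 is over-relaxation.
    Returns the final value vector and the number of iterations performed.
    """
    v = [0.0] * len(R[0])
    for k in range(1, max_iter + 1):
        tv = bellman(v)
        new_v = [w * x + (1.0 - w) * y for x, y in zip(tv, v)]
        if max(abs(x - y) for x, y in zip(new_v, v)) < tol:
            return new_v, k
        v = new_v
    return v, max_iter


v_std, n_std = relaxed_value_iteration(w=1.0)
v_sor, n_sor = relaxed_value_iteration(w=1.5)
print(n_std, n_sor)
```

Under the assumed self-loop bound, the relaxed operator's contraction modulus is 1 − w(1 − γ), which is 0.85 for w = 1.5 versus 0.9 for plain value iteration. This is only an upper bound, and actual iteration counts depend on the MDP.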

Related content

Value iteration

Processing (programming language) · Facebook AI Research · Generalization theory · State space · MoDELS

November 3, 2021

Fair Mutual Exclusion for N Processes (extended version)

Yousra Hafidi, Jeroen J. A. Keiren, Jan Friso Groote

From arXiv. To appear in TMPA'21.

Peterson's mutual exclusion algorithm for two processes has been generalized to $N$ processes in various ways. As far as we know, no such generalization is starvation free without making any fairness assumptions. In this paper, we study the generalization of Peterson's algorithm to $N$ processes using a tournament tree. Using the mCRL2 language and toolset we prove that it is not starvation free unless weak fairness assumptions are incorporated. Inspired by the counterexample for starvation freedom, we propose a fair $N$-process generalization of Peterson's algorithm. We use model checking to show that our new algorithm is correct for small $N$. For arbitrary $N$, model checking is infeasible due to the state space explosion problem, and instead, we present a general proof that, for $N \geq 4$, when a process requests access to the critical section, other processes can enter first at most $(N-1)(N-2)$ times.
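The paper starts from Peterson's two-process algorithm. The sketch below is not the paper's mCRL2 model. It is a hypothetical, hand-rolled explicit-state check that enumerates every interleaving of the two processes and confirms they can never both be in the critical section. Each step is assumed to be atomic and sequentially consistent, and the wait condition reads `flag[j]` and `turn` in a single step (a simplification). Passing `flag_first=False` swaps the two entry writes, a well-known broken variant.

```python
from collections import deque
from typing import List, Tuple

# State: (pc0, pc1, flag0, flag1, turn). For process i (other process j = 1 - i):
#   pc 0: first entry write, pc 1: second entry write,
#   pc 2: wait while flag[j] and turn == j, pc 3: in the critical section.
State = Tuple[int, int, bool, bool, int]


def step(state: State, i: int, flag_first: bool) -> State:
    """Execute one atomic step of process i."""
    pcs: List[int] = [state[0], state[1]]
    flags: List[bool] = [state[2], state[3]]
    turn = state[4]
    j = 1 - i
    pc = pcs[i]
    if pc == 0:
        if flag_first:
            flags[i] = True  # Peterson: announce interest first
        else:
            turn = j         # broken variant: yield the turn first
        pcs[i] = 1
    elif pc == 1:
        if flag_first:
            turn = j
        else:
            flags[i] = True
        pcs[i] = 2
    elif pc == 2:
        if not (flags[j] and turn == j):
            pcs[i] = 3       # enter the critical section
        # otherwise busy-wait: the state is unchanged
    else:                    # pc == 3: leave the critical section
        flags[i] = False
        pcs[i] = 0
    return (pcs[0], pcs[1], flags[0], flags[1], turn)


def mutual_exclusion_holds(flag_first: bool = True) -> bool:
    """Breadth-first search over all reachable states; False if both processes reach pc 3."""
    start: State = (0, 0, False, False, 0)
    seen = {start}
    queue = deque([start])
    while queue:
        s = queue.popleft()
        if s[0] == 3 and s[1] == 3:
            return False
        for i in (0, 1):
            t = step(s, i, flag_first)
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return True
```

Starvation freedom, the property the paper studies for N processes, is a liveness property and needs fairness-aware checking beyond this safety search.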

Linear · Processing (programming language) · Tempering · Noise · Orthogonal

November 3, 2021

Generalized Stochastic Processes: Linear Relations to White Noise and Orthogonal Representations

R. Carrizo Vergara

We present two linear relations between an arbitrary (real, tempered, second-order) generalized stochastic process over $\mathbb{R}^{d}$ and White Noise processes over $\mathbb{R}^{d}$. The first is that any generalized stochastic process can be obtained as a linear transformation of a White Noise. The second indicates that, under dimensional compatibility conditions, a generalized stochastic process can be linearly transformed into a White Noise. The arguments rely on the regularity theorem for tempered distributions, which is used to obtain a mean-square continuous stochastic process that is then expressed as a Karhunen-Loève expansion with respect to a convenient Hilbert space. The first linear relation also allows us to conclude that any generalized stochastic process has an orthogonal representation as a series expansion of deterministic tempered distributions weighted by uncorrelated random variables with summable variances. This representation is then used to derive the second linear relation.
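The "deterministic functions weighted by uncorrelated random variables" representation is easiest to see on a classical example that does not come from the paper: the Karhunen-Loève expansion of standard Brownian motion on [0, 1]. Its covariance min(s, t) has eigenvalues 1/((k − 1/2)²π²) and orthonormal eigenfunctions √2·sin((k − 1/2)πt). The sketch samples truncated paths and checks that the series variance recovers Var W(t) = t. The truncation level and function names are illustrative assumptions.

```python
import math
import random
from typing import List, Optional, Sequence, Tuple


def kl_eigenpair(k: int, t: float) -> Tuple[float, float]:
    """k-th (k >= 1) eigenvalue and eigenfunction value e_k(t) for Brownian motion on [0, 1]."""
    omega = (k - 0.5) * math.pi
    return 1.0 / omega**2, math.sqrt(2.0) * math.sin(omega * t)


def brownian_path_kl(ts: Sequence[float], n_terms: int = 200,
                     rng: Optional[random.Random] = None) -> List[float]:
    """Sample W(t) ~ sum_k sqrt(lambda_k) * Z_k * e_k(t) with i.i.d. standard normal Z_k."""
    rng = rng or random.Random()
    zs = [rng.gauss(0.0, 1.0) for _ in range(n_terms)]  # uncorrelated coefficients
    path = []
    for t in ts:
        total = 0.0
        for k, z in enumerate(zs, start=1):
            lam, e = kl_eigenpair(k, t)
            total += math.sqrt(lam) * z * e
        path.append(total)
    return path


def truncated_variance(t: float, n_terms: int) -> float:
    """Variance of the truncated series, sum_k lambda_k * e_k(t)^2, which tends to t."""
    return sum(lam * e * e for lam, e in (kl_eigenpair(k, t) for k in range(1, n_terms + 1)))
```

Because the coefficients Z_k are uncorrelated with variances λ_k, the variance of the series is the sum of λ_k·e_k(t)². The λ_k are summable, which mirrors the summable-variances condition in the abstract.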

Continuity · Discretization · Optimization · CASE · Optimizer

November 1, 2021

Minimax Optimization: The Case of Convex-Submodular

Arman Adibi, Aryan Mokhtari, Hamed Hassani

Minimax optimization has been central in addressing various applications in machine learning, game theory, and control theory. Prior literature has thus far mainly focused on studying such problems in the continuous domain; e.g., convex-concave minimax optimization is now understood to a significant extent. Nevertheless, minimax problems extend far beyond the continuous domain to mixed continuous-discrete domains or even fully discrete domains. In this paper, we study mixed continuous-discrete minimax problems where the minimization is over a continuous variable belonging to Euclidean space and the maximization is over subsets of a given ground set. We introduce the class of convex-submodular minimax problems, where the objective is convex with respect to the continuous variable and submodular with respect to the discrete variable. Even though such problems appear frequently in machine learning applications, little is known about how to address them from algorithmic and theoretical perspectives. For such problems, we first show that obtaining saddle points is hard up to any approximation, and thus introduce new notions of (near-)optimality. We then provide several algorithmic procedures for solving convex and monotone-submodular minimax problems and characterize their convergence rates, computational complexity, and quality of the final solution according to our notions of optimality. Our proposed algorithms are iterative and combine tools from both discrete and continuous optimization. Finally, we provide numerical experiments to showcase the effectiveness of our proposed methods.
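A standard discrete-optimization building block for the maximization side is greedy maximization of a monotone submodular function under a cardinality constraint. For such functions, greedy is known to achieve a (1 − 1/e) approximation. The sketch below shows only that building block on a hypothetical coverage function. It is not one of the authors' minimax procedures, which also interleave continuous updates on the minimizing variable.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple

# Hypothetical ground set: each candidate item covers some elements.
COVERS: Dict[str, Set[int]] = {
    "a": {1, 2, 3},
    "b": {3, 4},
    "c": {4, 5, 6, 7},
    "d": {1, 7},
}


def coverage(selection: Iterable[str]) -> int:
    """Monotone submodular set function: number of elements covered by the selected items."""
    covered: Set[int] = set()
    for item in selection:
        covered |= COVERS[item]
    return len(covered)


def greedy_max(k: int) -> Tuple[List[str], int]:
    """Greedy for max f(S) subject to |S| <= k: repeatedly add the item with the largest marginal gain."""
    chosen: List[str] = []
    for _ in range(k):
        best: Optional[str] = None
        best_gain = 0
        for item in COVERS:
            if item in chosen:
                continue
            gain = coverage(chosen + [item]) - coverage(chosen)
            if gain > best_gain:  # strict '>' keeps the first item among ties
                best, best_gain = item, gain
        if best is None:  # no remaining item adds value
            break
        chosen.append(best)
    return chosen, coverage(chosen)
```

Submodularity here is the diminishing-returns property: adding an item to a larger selection never gains more than adding it to a smaller one.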

Regularization term · Value function · Linear · Learned · Policy evaluation

November 1, 2021

Policy Mirror Descent for Regularized Reinforcement Learning: A Generalized Framework with Linear Convergence

Wenhao Zhan, Shicong Cen, Baihe Huang, Yuxin Chen, Jason D. Lee, Yuejie Chi

Policy optimization, which learns the policy of interest by maximizing the value function via large-scale optimization techniques, lies at the heart of modern reinforcement learning (RL). Beyond value maximization, other practical considerations commonly arise as well, including the need to encourage exploration and to ensure certain structural properties of the learned policy because of safety, resource, and operational constraints. These considerations can often be accounted for by resorting to regularized RL, which augments the target value function with a structure-promoting regularization term. Focusing on an infinite-horizon discounted Markov decision process, this paper proposes a generalized policy mirror descent (GPMD) algorithm for solving regularized RL. As a generalization of policy mirror descent Lan (2021), the proposed algorithm accommodates a general class of convex regularizers as well as a broad family of Bregman divergences cognizant of the regularizer in use. We demonstrate that our algorithm converges linearly over an entire range of learning rates, in a dimension-free fashion, to the global solution, even when the regularizer lacks strong convexity and smoothness. In addition, this linear convergence feature is provably stable in the face of inexact policy evaluation and imperfect policy updates. Numerical experiments are provided to corroborate the applicability and appealing performance of GPMD.
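The linear-convergence claim can be seen in a simple special case. This is not the paper's code, and the setup is an assumption: a single state (so Q equals the reward vector r), a negative-entropy regularizer with weight τ, and KL divergence as the Bregman divergence. Each mirror-descent step max_p η(⟨r, p⟩ + τH(p)) − KL(p‖π) then has the closed form p(a) ∝ π(a)^{1/(1+ητ)}·exp(η r(a)/(1+ητ)). In log space the error to the regularized optimum softmax(r/τ) shrinks by the factor 1/(1+ητ) per step, for every learning rate η > 0.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log of the softmax distribution."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def pmd_entropy_bandit(rewards: List[float], tau: float, eta: float, iters: int) -> List[float]:
    """Policy mirror descent (KL divergence, entropy regularizer) on a single-state problem."""
    n = len(rewards)
    log_pi = [-math.log(n)] * n  # uniform initial policy
    c = 1.0 / (1.0 + eta * tau)
    for _ in range(iters):
        # Closed-form update: log p(a) = c * log pi(a) + eta * c * r(a) - normalizer.
        log_pi = log_softmax([c * lp + eta * c * r for lp, r in zip(log_pi, rewards)])
    return [math.exp(lp) for lp in log_pi]


def regularized_optimum(rewards: List[float], tau: float) -> List[float]:
    """Maximizer of <r, p> + tau * H(p) over the simplex: softmax(r / tau)."""
    return [math.exp(lp) for lp in log_softmax([r / tau for r in rewards])]
```

Working with log-probabilities avoids taking the logarithm of probabilities that underflow to zero when η is large.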

Objective function · Functional · Less · Smooth · MoDELS

October 31, 2021

Second order semi-smooth Proximal Newton methods in Hilbert spaces

Bastian Pötzl, Anton Schiela, Patrick Jaap

From arXiv; 31 pages, 4 figures.

We develop a globalized Proximal Newton method for composite and possibly non-convex minimization problems in Hilbert spaces. Compared with existing theory, we impose less restrictive assumptions on the composite objective functional regarding differentiability and convexity. As far as the differentiability of the smooth part of the objective function is concerned, we introduce the notion of second order semi-smoothness and discuss why it constitutes an adequate framework for our Proximal Newton method. Nevertheless, both global convergence and local acceleration still hold in our setting. Finally, the convergence properties of our algorithm are illustrated by solving a toy model problem in function space.
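The paper works in Hilbert spaces, under second-order semi-smoothness, with a globalization strategy, and none of that is reproduced here. As a finite-dimensional sketch only, the code applies undamped proximal Newton steps to a hypothetical separable composite problem: minimize Σᵢ(exp(xᵢ) − bᵢxᵢ) + λ‖x‖₁. The smooth part has a diagonal Hessian, so each scaled proximal subproblem is solved exactly by soft-thresholding. The closed-form minimizer is log(bᵢ − λ) if bᵢ > 1 + λ, log(bᵢ + λ) if bᵢ < 1 − λ, and 0 otherwise, which gives a check.

```python
import math
from typing import List


def soft_threshold(z: float, kappa: float) -> float:
    """Proximal operator of kappa * |x|."""
    if z > kappa:
        return z - kappa
    if z < -kappa:
        return z + kappa
    return 0.0


def prox_newton_separable(b: List[float], lam: float, iters: int = 50) -> List[float]:
    """Undamped proximal Newton for sum_i (exp(x_i) - b_i x_i) + lam * ||x||_1.

    Each step solves  min_z g_i (z - x_i) + h_i (z - x_i)^2 / 2 + lam |z|  exactly,
    i.e. soft-thresholds the Newton point x_i - g_i / h_i at level lam / h_i.
    """
    x = [0.0] * len(b)
    for _ in range(iters):
        x_next = []
        for xi, bi in zip(x, b):
            g = math.exp(xi) - bi  # gradient of the smooth part
            h = math.exp(xi)       # diagonal Hessian entry (always > 0)
            x_next.append(soft_threshold(xi - g / h, lam / h))
        x = x_next
    return x
```

No line search is used. That is an assumption that happens to be safe for this convex, separable toy problem. The paper's globalization is what handles the general, possibly non-convex case.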

Generalized function · Functional · Sample complexity · Episode · Transition kernel

October 30, 2021

Towards General Function Approximation in Zero-Sum Markov Games

Baihe Huang, Jason D. Lee, Zhaoran Wang, Zhuoran Yang

This paper considers two-player zero-sum finite-horizon Markov games with simultaneous moves. The study focuses on the challenging settings where the value function or the model is parameterized by general function classes. Provably efficient algorithms are developed for both the decoupled and the coordinated settings. In the decoupled setting, where the agent controls a single player and plays against an arbitrary opponent, we propose a new model-free algorithm. Its sample complexity is governed by the Minimax Eluder dimension, a new dimension of the function class in Markov games. As a special case, this method improves on the regret of the state-of-the-art algorithm by a $\sqrt{d}$ factor when the reward function and transition kernel are parameterized with $d$-dimensional linear features. In the coordinated setting, where both players are controlled by the agent, we propose a model-based algorithm and a model-free algorithm. For the model-based algorithm, we prove that the sample complexity can be bounded by a generalization of Witness rank to Markov games. The model-free algorithm enjoys a $\sqrt{K}$-regret upper bound, where $K$ is the number of episodes.
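The paper's algorithms use general function approximation. The sketch below shows only the tabular object they approximate: backward induction for a finite-horizon, two-player zero-sum Markov game with simultaneous moves, where each stage solves a matrix game. To keep it self-contained, it makes hypothetical simplifications: two actions per player and deterministic transitions. With those, each stage game has the closed-form 2x2 minimax value.

```python
from typing import Dict, List

Matrix = List[List[float]]  # 2x2 payoffs to the maximizing (row) player


def value_2x2(a: Matrix) -> float:
    """Minimax value of a 2x2 zero-sum matrix game."""
    lower = max(min(row) for row in a)                    # max_i min_j a[i][j]
    upper = min(max(a[0][j], a[1][j]) for j in range(2))  # min_j max_i a[i][j]
    if lower == upper:  # pure-strategy saddle point
        return lower
    (p, q), (r, s) = a  # no saddle point: both players mix
    return (p * s - q * r) / (p + s - q - r)


def finite_horizon_values(reward: Dict[str, Matrix],
                          next_state: Dict[str, List[List[str]]],
                          horizon: int) -> Dict[str, float]:
    """Backward induction: V_h(s) = value(reward[s] + V_{h+1}(next_state[s][i][j])), V_H = 0."""
    v = {s: 0.0 for s in reward}
    for _ in range(horizon):
        v = {
            s: value_2x2([[reward[s][i][j] + v[next_state[s][i][j]] for j in range(2)]
                          for i in range(2)])
            for s in reward
        }
    return v
```

For example, the stage game [[3, 1], [2, 4]] has no saddle point and value (3·4 − 1·2)/(3 + 4 − 1 − 2) = 2.5, with both players mixing equally.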

Markov process · Processing (programming language) · Oscillation · Category · INTERACT

October 30, 2021

Non-reversible processes: GENERIC, Hypocoercivity and fluctuations

Manh Hong Duong, Michela Ottobre

From arXiv; 51 pages.

We consider two approaches to studying non-reversible Markov processes, namely Hypocoercivity Theory (HT) and GENERIC (General Equations for Non-Equilibrium Reversible-Irreversible Coupling); the basic idea behind both is to split the process into a reversible component and a non-reversible one, and then to quantify the way in which they interact. We compare these theories and provide explicit formulas to pass from one formulation to the other; as a by-product, we give a simple proof of the link between reversibility of the dynamics and the gradient flow structure of the associated Fokker-Planck equation. We do this both for linear Markov processes and for a class of nonlinear Markov processes. We then characterize the structure of the large deviation functional of generalised-reversible processes, a class of non-reversible processes of considerable relevance in applications. Finally, we show how our results apply to two classes of Markov processes, namely non-reversible diffusion processes and a class of Piecewise Deterministic Markov Processes (PDMPs), which have recently attracted the attention of the statistical sampling community. In particular, for the PDMPs we consider, we prove entropy decay.
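The shared idea of splitting the dynamics into a reversible and a non-reversible part has a simple discrete-time analogue. It is offered as an illustration, not taken from the paper, which works with generators of continuous-time processes. For a finite chain with transition matrix P and stationary distribution π, the time reversal is P*(x, y) = π(y)P(y, x)/π(x). Then S = (P + P*)/2 satisfies detailed balance, and A = (P − P*)/2 is the antisymmetric, non-reversible remainder. The example chain is hypothetical: a doubly stochastic "rotating" chain, so π is uniform but detailed balance fails.

```python
from typing import List, Tuple

Matrix = List[List[float]]

# Hypothetical doubly stochastic chain: uniform stationary law, biased rotation 0 -> 1 -> 2 -> 0.
P_ROT: Matrix = [[0.5, 0.4, 0.1], [0.1, 0.5, 0.4], [0.4, 0.1, 0.5]]
PI_UNIFORM: List[float] = [1 / 3, 1 / 3, 1 / 3]


def time_reversal(p: Matrix, pi: List[float]) -> Matrix:
    """P*(x, y) = pi(y) P(y, x) / pi(x): the adjoint of P in L^2(pi)."""
    n = len(p)
    return [[pi[y] * p[y][x] / pi[x] for y in range(n)] for x in range(n)]


def split_kernel(p: Matrix, pi: List[float]) -> Tuple[Matrix, Matrix]:
    """Return (S, A) with P = S + A, S = (P + P*)/2 reversible, A = (P - P*)/2 antisymmetric."""
    rev = time_reversal(p, pi)
    n = len(p)
    s = [[(p[x][y] + rev[x][y]) / 2 for y in range(n)] for x in range(n)]
    a = [[(p[x][y] - rev[x][y]) / 2 for y in range(n)] for x in range(n)]
    return s, a


def satisfies_detailed_balance(p: Matrix, pi: List[float], tol: float = 1e-12) -> bool:
    """Check pi(x) P(x, y) == pi(y) P(y, x) for all pairs, up to tol."""
    n = len(p)
    return all(abs(pi[x] * p[x][y] - pi[y] * p[y][x]) <= tol
               for x in range(n) for y in range(n))
```

S is again a valid transition matrix because it averages two stochastic matrices. The rows of A sum to zero, so A only redistributes probability flux around the cycle.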

Policy improvement · Optimizer · Sample · Performer · Critic

October 29, 2021

Generalized Proximal Policy Optimization with Sample Reuse

James Queeney, Ioannis Ch. Paschalidis, Christos G. Cassandras

From arXiv. To appear in the 35th Conference on Neural Information Processing Systems (NeurIPS 2021).

In real-world decision making tasks, it is critical for data-driven reinforcement learning methods to be both stable and sample efficient. On-policy methods typically generate reliable policy improvement throughout training, while off-policy methods make more efficient use of data through sample reuse. In this work, we combine the theoretically supported stability benefits of on-policy algorithms with the sample efficiency of off-policy algorithms. We develop policy improvement guarantees that are suitable for the off-policy setting, and connect these bounds to the clipping mechanism used in Proximal Policy Optimization. This motivates an off-policy version of the popular algorithm that we call Generalized Proximal Policy Optimization with Sample Reuse. We demonstrate both theoretically and empirically that our algorithm delivers improved performance by effectively balancing the competing goals of stability and sample efficiency.
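The paper's contribution is an off-policy generalization of PPO's clipping mechanism, whose exact form is given in the paper and is not reproduced here. For reference, this sketch computes the standard on-policy PPO clipped surrogate that the abstract connects its bounds to. The inputs are assumed to be precomputed probability ratios πθ(a|s)/π_old(a|s) and advantage estimates. The function names and the default ε = 0.2 are illustrative.

```python
from typing import Sequence


def clip(x: float, lo: float, hi: float) -> float:
    """Clamp x into [lo, hi]."""
    return max(lo, min(x, hi))


def ppo_clipped_surrogate(ratios: Sequence[float], advantages: Sequence[float],
                          eps: float = 0.2) -> float:
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A); this objective is maximized."""
    terms = [min(r * a, clip(r, 1.0 - eps, 1.0 + eps) * a)
             for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)
```

Taking the minimum makes the objective a pessimistic bound. Moving the ratio outside [1 − ε, 1 + ε] can never increase the objective, which limits how far a single update moves the policy.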

Decoding · Suboptimal · Inference · Image captioning · Performer

February 28, 2019

Insertion-based Decoding with automatically Inferred Generation Order

Jiatao Gu, Qi Liu, Kyunghyun Cho

From arXiv. New version with clearer formulations and extended pages. Work in progress.

Conventional neural autoregressive decoding commonly assumes a fixed left-to-right generation order, which may be sub-optimal. In this work, we propose a novel decoding algorithm, InDIGO, which supports flexible sequence generation in arbitrary orders through insertion operations. We extend Transformer, a state-of-the-art sequence generation model, to efficiently implement the proposed approach, enabling it to be trained with either a pre-defined generation order or adaptive orders obtained from beam search. Experiments on four real-world tasks, including word order recovery, machine translation, image captioning and code generation, demonstrate that our algorithm can generate sequences following arbitrary orders while achieving competitive or even better performance compared with conventional left-to-right generation. The generated sequences show that InDIGO adopts adaptive generation orders based on input information.
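The paper's Transformer extension and its position representation for insertions are not shown here. This sketch illustrates only the core idea, under a simplifying assumption: positions are absolute slot indices in the current partial sequence. Under that assumption, any generation order (a permutation of target positions) can be expressed as a series of insertions that rebuilds the same final sequence. All names are hypothetical.

```python
from itertools import permutations
from typing import List, Sequence, Tuple

Insertion = Tuple[int, str]  # (slot in the current partial sequence, token)


def apply_insertions(ops: Sequence[Insertion]) -> List[str]:
    """Build a sequence by inserting each token at a slot of the current partial sequence."""
    seq: List[str] = []
    for slot, token in ops:
        seq.insert(slot, token)
    return seq


def insertions_for_order(target: Sequence[str], order: Sequence[int]) -> List[Insertion]:
    """Turn a generation order into insertions: the slot for position p is the number of
    already-generated positions to the left of p."""
    generated: List[int] = []
    ops: List[Insertion] = []
    for pos in order:
        slot = sum(1 for q in generated if q < pos)
        ops.append((slot, target[pos]))
        generated.append(pos)
    return ops


def every_order_reaches(target: Sequence[str]) -> bool:
    """Check that all n! generation orders rebuild the same target sequence."""
    return all(apply_insertions(insertions_for_order(target, order)) == list(target)
               for order in permutations(range(len(target))))
```

For instance, generating "the cat sat" in the order cat, sat, the corresponds to the insertions (0, "cat"), (1, "sat"), (0, "the"). Left-to-right decoding is the special case where each new token goes into the last slot.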

Few-shot learning · Learned · Zero-shot learning · Directed · Performer

October 26, 2017

A Unified approach for Conventional Zero-shot, Generalized Zero-shot and Few-shot Learning

Shafin Rahman, Salman H. Khan, Fatih Porikli

Prevalent techniques in zero-shot learning do not generalize well to other related problem scenarios. Here, we present a unified approach for conventional zero-shot, generalized zero-shot and few-shot learning problems. Our approach is based on a novel Class Adapting Principal Directions (CAPD) concept that allows multiple embeddings of image features into a semantic space. Given an image, our method produces one principal direction for each seen class. It then learns how to combine these directions to obtain the principal direction for each unseen class such that the CAPD of the test image is aligned with the semantic embedding of the true class and opposite to the other classes. This allows efficient and class-adaptive information transfer from seen to unseen classes. In addition, we propose an automatic process for selecting the most useful seen classes for each unseen class to achieve robustness in zero-shot learning. Our method can update the unseen CAPD by taking advantage of a few unseen images, which lets it work in a few-shot learning scenario. Furthermore, our method can generalize the seen CAPDs by estimating seen-unseen diversity, which significantly improves the performance of generalized zero-shot learning. Our extensive evaluations demonstrate that the proposed approach consistently achieves superior performance in zero-shot, generalized zero-shot and few/one-shot learning problems.